Chaudhary, D and Kahali, B and Simmhan, Y (2019) An empirical study on efficient storage of human genome data. In: 26th IEEE International Conference on High Performance Computing Workshops, HiPCW 2019, 17-20 December 2019, Hyderabad, pp. 87-92.
Abstract
Next-generation sequencing (NGS) has become affordable and fast, facilitating large-scale, population-level Whole Genome Sequencing (WGS) studies. NGS and its processing pipeline generate hundreds of gigabytes of data per human subject, which can grow to petabytes for large studies such as the upcoming GenomeIndia program. At these scales, affordable and reliable storage of data becomes a challenge. Here, we propose a preliminary data management architecture for storing and querying data from the GenomeIndia project. In this initial empirical study, we focus on existing generic and domain-specific compression techniques for reducing the storage space of genome sequence data, and we compare erasure coding and replication for providing reliability on commodity hardware. We report the time and space complexity of these approaches, which will inform the future design of our architecture.
Item Type: | Conference Paper |
---|---|
Publication: | Proceedings - 26th IEEE International Conference on High Performance Computing Workshops, HiPCW 2019 |
Publisher: | Institute of Electrical and Electronics Engineers Inc. |
Additional Information: | Cited by 0; Conference: 26th IEEE International Conference on High Performance Computing Workshops, HiPCW 2019; Conference Date: 17 December 2019 through 20 December 2019; Conference Code: 157942 |
Department/Centre: | Division of Interdisciplinary Sciences > Computational and Data Sciences Others |
Date Deposited: | 18 Aug 2020 07:30 |
Last Modified: | 18 Aug 2020 07:30 |
URI: | http://eprints.iisc.ac.in/id/eprint/65002 |