ePrints@IISc

Implementation of human whole genome sequencing data analysis: A containerized framework for sustained and enhanced throughput

Panda, A and Subramanian, K and Kahali, B (2021) Implementation of human whole genome sequencing data analysis: A containerized framework for sustained and enhanced throughput. In: Informatics in Medicine Unlocked, 25.

PDF: inf_med_unl_25_2021.pdf - Published Version (8MB)
Archive (ZIP): ScienceDirect_files_09Sep2021_09-38-23.575.zip - Published Supplemental Material (4MB)
Official URL: https://doi.org/10.1016/j.imu.2021.100684

Abstract

Whole Genome Sequencing (WGS) provides information for each base of the entire 3.2 billion base pairs of the diploid human genome. Therefore, WGS plays an important role in identifying genetic variations for populations and understanding disease signatures in cohort studies or cases with rare genetic disorders. Nonetheless, discoveries from high throughput WGS depend on efficiently processing, analyzing, and storing this enormous amount of genomic sequencing data, often in the scale of petabytes. Although there has been a significant reduction in genome sequencing costs in recent years, high-performance computation costs have not decreased in a directly proportional fashion. The objective of the present work is to develop a Docker-based container method for human whole genome sequencing data processing and analysis for detecting genetic variations from paired-end WGS short reads. Our method provides an approach to simultaneously process multiple genomes within a single compute system while guaranteeing sustained and stable handling of the memory requirements for the genomic data processing and ensuring no unwanted termination of the currently running parallel jobs. This method also achieves a 40% reduction in execution time. To encourage widespread adoption and ease of WGS analysis, our containerized pipeline will be made publicly available. We have tested this approach for human genome data from Illumina WGS platforms and report the benchmark metrics in two different workstation environments in this communication. Compared to truth sets, our approach calls variants with 99% precision and recall. © 2021 The Authors

Item Type: Journal Article
Publication: Informatics in Medicine Unlocked
Publisher: Elsevier Ltd
Additional Information: The copyright for this article belongs to the Authors.
Department/Centre: Autonomous Societies / Centres > Centre for Brain Research
Date Deposited: 29 Nov 2021 11:16
Last Modified: 29 Nov 2021 11:16
URI: http://eprints.iisc.ac.in/id/eprint/69978
