Coverage-preserving sparsification of overlap graphs for long-read assembly

Jain, C (2023) Coverage-preserving sparsification of overlap graphs for long-read assembly. In: Bioinformatics (Oxford, England), 39 (3).

bio_39-3_2023.pdf - Published Version

Official URL: https://doi.org/10.1093/bioinformatics/btad124


MOTIVATION: Read-overlap-based graph data structures play a central role in computing de novo genome assemblies. Most long-read assemblers use Myers's string graph model to sparsify overlap graphs. Graph sparsification improves assembly contiguity by removing spurious and redundant connections. However, a graph model must be coverage-preserving, i.e. it must ensure that there exist walks in the graph that spell all chromosomes, given sufficient sequencing coverage. This property becomes even more important for diploid genomes, polyploid genomes, and metagenomes, where there is a risk of losing haplotype-specific information.

RESULTS: We develop a novel theoretical framework under which the coverage-preserving properties of a graph model can be analyzed. We first prove that de Bruijn graph and overlap graph models are guaranteed to be coverage-preserving. We next show that the standard string graph model lacks this guarantee. The latter result is consistent with prior work suggesting that removal of contained reads, i.e. reads that are substrings of other reads, can lead to coverage gaps during string graph construction. Our experiments using simulated long reads from the HG002 human diploid genome show that 50 coverage gaps are introduced on average by ignoring contained reads from nanopore datasets. To remedy this, we propose practical heuristics that are well supported by our theoretical results and help decide which contained reads should be retained to avoid coverage gaps. Our method retains a small fraction of contained reads (1-2%) and closes the majority of the coverage gaps.

AVAILABILITY AND IMPLEMENTATION: Source code is available through GitHub (https://github.com/at-cg/ContainX) and Zenodo (doi: 10.5281/zenodo.7687543).

© The Author(s) 2023. Published by Oxford University Press.
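The following toy sketch illustrates how dropping a contained read can open a coverage gap on one haplotype. It is not the paper's framework or the ContainX implementation: the two 20-base haplotypes, the four reads, the function names `contained_reads` and `spellable`, and the minimum overlap of 3 are all assumptions made for illustration, and the reads are taken to be error-free and on a single strand.

```python
# Two toy haplotypes that differ only at positions 2 and 17 (hypothetical data).
hap_a = "ACGTTAGCCATGGACTAAGT"
hap_b = "ACTTTAGCCATGGACTACGT"

a1 = hap_a[0:20]   # one long read spanning all of haplotype A
b1 = hap_b[0:9]    # haplotype B read covering the first variant site
b2 = hap_b[6:14]   # haplotype B read lying between the two variant sites
b3 = hap_b[11:20]  # haplotype B read covering the second variant site
reads = [a1, b1, b2, b3]
MIN_OVERLAP = 3


def contained_reads(reads):
    """Return the reads that are substrings of some other, different read."""
    return {r for r in reads if any(r != s and r in s for s in reads)}


def spellable(target, reads, min_overlap):
    """Check whether a chain of reads spells `target`.

    Every read in the chain must match `target` exactly, the chain must
    begin at position 0, and consecutive reads must overlap by at least
    `min_overlap` bases.
    """
    hits = []
    for read in reads:
        start = target.find(read)
        while start != -1:
            hits.append((start, start + len(read)))
            start = target.find(read, start + 1)
    reach = 0  # furthest position reached by a valid chain so far
    for start, end in sorted(hits):
        if start == 0 or start <= reach - min_overlap:
            reach = max(reach, end)
    return reach == len(target)


contained = contained_reads(reads)
kept = [r for r in reads if r not in contained]

print("contained reads:", contained)
print("haplotype B spellable, all reads:", spellable(hap_b, reads, MIN_OVERLAP))
print("haplotype B spellable, contained reads removed:",
      spellable(hap_b, kept, MIN_OVERLAP))
```

In this toy case `b2` falls entirely inside the region shared by both haplotypes, so it is a substring of the haplotype A read `a1` and is flagged as contained. It is also the only read linking `b1` to `b3`, so haplotype B can be spelled with all four reads but not once `b2` is discarded, while haplotype A remains spellable either way.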

Item Type: Journal Article
Publication: Bioinformatics (Oxford, England)
Publisher: NLM (Medline)
Additional Information: The copyright for this article belongs to the Authors.
Keywords: algorithm; DNA sequence; high throughput sequencing; human; human genome; metagenome; procedures; software, Algorithms; Genome, Human; High-Throughput Nucleotide Sequencing; Humans; Metagenome; Sequence Analysis, DNA; Software
Department/Centre: Division of Interdisciplinary Sciences > Computational and Data Sciences
Date Deposited: 13 Apr 2023 10:21
Last Modified: 13 Apr 2023 10:21
URI: https://eprints.iisc.ac.in/id/eprint/81325
