Telomere-to-telomere assembly by preserving contained reads

Kamath, SS and Bindra, M and Pal, D and Jain, C (2024) Telomere-to-telomere assembly by preserving contained reads. In: Genome Research, 34 (11). pp. 1908-1918.

Gen_Res_2024.pdf - Published Version

Official URL: https://doi.org/10.1101/gr.279311.124

Abstract

Automated telomere-to-telomere (T2T) de novo assembly of diploid and polyploid genomes remains a formidable task. A string graph is a commonly used assembly graph representation in the assembly algorithms. The string graph formulation employs graph simplification heuristics, which drastically reduce the count of vertices and edges. One of these heuristics involves removing the reads contained in longer reads. In practice, this heuristic occasionally introduces gaps in the assembly by removing all reads that cover one or more genome intervals. The factors contributing to such gaps remain poorly understood. In this work, we mathematically derived the frequency of observing a gap near a germline and a somatic heterozygous variant locus. Our analysis shows that (1) an assembly gap due to contained read deletion is an order of magnitude more frequent in Oxford Nanopore Technologies (ONT) reads than Pacific Biosciences high-fidelity (PacBio HiFi) reads due to differences in their read-length distributions, and (2) this frequency decreases with an increase in the sequencing depth. Drawing cues from these observations, we addressed the weakness of the string graph formulation by developing the repeat-aware fragmenting tool (RAFT) assembly algorithm. RAFT addresses the issue of contained reads by fragmenting reads and producing a more uniform read-length distribution. The algorithm retains spanned repeats in the reads during the fragmentation. We empirically demonstrate that RAFT significantly reduces the number of gaps using simulated data sets. Using real ONT and PacBio HiFi data sets of the HG002 human genome, we achieved a twofold increase in the contig NG50 and the number of haplotype-resolved T2T contigs compared to hifiasm. © 2024 Kamath et al.
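The contained-read heuristic described in the abstract can be illustrated with a toy diploid example. The sketch below is not RAFT or any assembler's code; it is a minimal model under stated assumptions: reads are intervals on one of two haplotypes, the haplotypes differ only at the listed heterozygous sites, and a read counts as contained when its interval lies inside a strictly longer read whose sequence is identical over that interval. The names `Read`, `is_contained`, `drop_contained`, and `uncovered`, and all coordinates, are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class Read(NamedTuple):
    name: str
    hap: int    # haplotype of origin; known only because this is a toy simulation
    start: int  # inclusive start coordinate
    end: int    # exclusive end coordinate


def is_contained(a: Read, b: Read, het_sites: Sequence[int]) -> bool:
    """True if read a would look contained in read b from sequence alone."""
    inside = b.start <= a.start and a.end <= b.end
    strictly_shorter = (a.end - a.start) < (b.end - b.start)
    # Assumption: reads from different haplotypes are distinguishable
    # only if the shorter read spans at least one heterozygous site.
    spans_het = any(a.start <= s < a.end for s in het_sites)
    same_sequence = a.hap == b.hap or not spans_het
    return inside and strictly_shorter and same_sequence


def drop_contained(reads: Sequence[Read], het_sites: Sequence[int]) -> List[Read]:
    """Apply the heuristic: discard every read contained in a longer read."""
    return [a for a in reads
            if not any(is_contained(a, b, het_sites) for b in reads)]


def uncovered(reads: Sequence[Read], hap: int, length: int) -> List[int]:
    """Positions of one haplotype not covered by any read sampled from it."""
    covered = [False] * length
    for r in reads:
        if r.hap == hap:
            for i in range(r.start, r.end):
                covered[i] = True
    return [i for i, c in enumerate(covered) if not c]


het_sites = [5, 50]          # hypothetical heterozygous positions
reads = [
    Read("A1", 0, 0, 35),    # spans site 5
    Read("A2", 0, 30, 48),   # spans no heterozygous site
    Read("A3", 0, 45, 100),  # spans site 50
    Read("B1", 1, 0, 60),    # spans sites 5 and 50
    Read("B2", 1, 40, 100),  # spans site 50
]

kept = drop_contained(reads, het_sites)
gap = uncovered(kept, hap=0, length=100)
print([r.name for r in kept])  # A2 is dropped because it sits inside B1
print(gap)                     # haplotype 0 loses positions 35..44
```

In this toy input, read A2 is the only haplotype-0 read covering positions 35 to 44, yet it is indistinguishable from the interior of the longer haplotype-1 read B1, so the heuristic removes it and haplotype 0 is left with a coverage gap. Haplotype 1 stays fully covered.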

Item Type: Journal Article
Publication: Genome Research
Publisher: Cold Spring Harbor Laboratory Press
Additional Information: The copyright for this article belongs to authors.
Department/Centre: Division of Interdisciplinary Sciences > Computational and Data Sciences
Date Deposited: 12 Dec 2024 19:07
Last Modified: 12 Dec 2024 19:07
URI: http://eprints.iisc.ac.in/id/eprint/87029
