Kamath, SS and Bindra, M and Pal, D and Jain, C (2024) Telomere-to-telomere assembly by preserving contained reads. In: Genome Research, 34 (11). pp. 1908-1918.
Abstract
Automated telomere-to-telomere (T2T) de novo assembly of diploid and polyploid genomes remains a formidable task. A string graph is a commonly used assembly graph representation in the assembly algorithms. The string graph formulation employs graph simplification heuristics, which drastically reduce the count of vertices and edges. One of these heuristics involves removing the reads contained in longer reads. In practice, this heuristic occasionally introduces gaps in the assembly by removing all reads that cover one or more genome intervals. The factors contributing to such gaps remain poorly understood. In this work, we mathematically derived the frequency of observing a gap near a germline and a somatic heterozygous variant locus. Our analysis shows that (1) an assembly gap due to contained read deletion is an order of magnitude more frequent in Oxford Nanopore Technologies (ONT) reads than Pacific Biosciences high-fidelity (PacBio HiFi) reads due to differences in their read-length distributions, and (2) this frequency decreases with an increase in the sequencing depth. Drawing cues from these observations, we addressed the weakness of the string graph formulation by developing the repeat-aware fragmenting tool (RAFT) assembly algorithm. RAFT addresses the issue of contained reads by fragmenting reads and producing a more uniform read-length distribution. The algorithm retains spanned repeats in the reads during the fragmentation. We empirically demonstrate that RAFT significantly reduces the number of gaps using simulated data sets. Using real ONT and PacBio HiFi data sets of the HG002 human genome, we achieved a twofold increase in the contig NG50 and the number of haplotype-resolved T2T contigs compared to hifiasm. © 2024 Kamath et al.
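The gap mechanism described in the abstract can be illustrated with a toy model. The sketch below is not RAFT and not the authors' derivation. It assumes reads are half-open intervals `(start, end, haplotype)` on a shared coordinate system, that a read is indistinguishable from the other haplotype whenever it covers no heterozygous site, and that a "gap" is any interval of a haplotype left uncovered by that haplotype's retained reads. The names `remove_contained`, `uncovered_intervals`, and `het_sites` are hypothetical.

```python
from typing import List, Tuple

Read = Tuple[int, int, str]  # (start, end, haplotype), half-open interval


def spans_het_site(read: Read, het_sites: List[int]) -> bool:
    """True if the read covers at least one heterozygous position."""
    start, end, _ = read
    return any(start <= p < end for p in het_sites)


def is_contained(i: int, j: int, reads: List[Read], het_sites: List[int]) -> bool:
    """True if reads[i] is treated as contained in reads[j] (toy assumption)."""
    (s1, e1, h1), (s2, e2, h2) = reads[i], reads[j]
    if i == j or not (s2 <= s1 and e1 <= e2):
        return False
    if (s1, e1) == (s2, e2) and i < j:
        return False  # tie-break so identical intervals do not delete each other
    # Across haplotypes, containment is only detectable when the shorter
    # read carries no heterozygous site that would distinguish it.
    return h1 == h2 or not spans_het_site(reads[i], het_sites)


def remove_contained(reads: List[Read], het_sites: List[int]) -> List[Read]:
    """Drop every read that is contained in some other read."""
    n = len(reads)
    return [
        reads[i]
        for i in range(n)
        if not any(is_contained(i, j, reads, het_sites) for j in range(n))
    ]


def uncovered_intervals(reads: List[Read], hap: str, genome_len: int) -> List[Tuple[int, int]]:
    """Intervals of [0, genome_len) not covered by any read of haplotype hap."""
    gaps, pos = [], 0
    for start, end, _ in sorted(r for r in reads if r[2] == hap):
        if start > pos:
            gaps.append((pos, start))
        pos = max(pos, end)
    if pos < genome_len:
        gaps.append((pos, genome_len))
    return gaps


# One long haplotype-A read and three shorter haplotype-B reads.
reads = [(0, 100, "A"), (0, 40, "B"), (30, 70, "B"), (60, 100, "B")]
het_sites = [20, 80]
kept = remove_contained(reads, het_sites)
gaps_b = uncovered_intervals(kept, "B", 100)
```

In this example the middle haplotype-B read covers no heterozygous site, so it is discarded as contained in the long haplotype-A read, and haplotype B loses coverage between the two remaining reads. This is the kind of situation in which a broad read-length distribution makes containment-induced gaps more likely.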
| Item Type: | Journal Article |
|---|---|
| Publication: | Genome Research |
| Publisher: | Cold Spring Harbor Laboratory Press |
| Additional Information: | The copyright for this article belongs to authors. |
| Department/Centre: | Division of Interdisciplinary Sciences > Computational and Data Sciences |
| Date Deposited: | 12 Dec 2024 19:07 |
| Last Modified: | 12 Dec 2024 19:07 |
| URI: | http://eprints.iisc.ac.in/id/eprint/87029 |