Chandra, G and Gibney, D and Jain, C (2024) Haplotype-Aware Sequence Alignment to Pangenome Graphs. In: 28th International Conference on Research in Computational Molecular Biology, 29 April 2024 through 2 May 2024, Cambridge, pp. 381-384.
Abstract
Modern pangenome graphs are built using haplotype-resolved genome assemblies. While mapping reads to a pangenome graph, prioritizing alignments that are consistent with the known haplotypes has been shown to improve genotyping accuracy. However, the existing rigorous formulations for sequence-to-graph co-linear chaining and alignment problems do not consider the haplotype paths in a pangenome graph. This often leads to spurious read alignments to those paths that are unlikely recombinations of the known haplotypes. We present novel formulations and algorithms for haplotype-aware sequence alignment to directed acyclic graphs (DAGs). We consider both sequence-to-DAG chaining and sequence-to-DAG alignment problems. Drawing inspiration from the commonly used models for genotype imputation, we assume that a query sequence is an imperfect mosaic of the reference haplotypes. Accordingly, we extend previous chaining and alignment formulations by introducing a recombination penalty for a haplotype switch. First, we solve haplotype-aware sequence-to-DAG alignment in O(|Q||E||H|) time where Q is the query sequence, E is the set of edges, and H is the set of haplotypes represented in the graph. To complement our solution, we prove that an algorithm significantly faster than O(|Q||E||H|) is impossible under the Strong Exponential Time Hypothesis (SETH). Second, we propose a haplotype-aware chaining algorithm that runs in O(|H|N log |H|N) time after graph preprocessing, where N is the count of input anchors. We then establish that a chaining algorithm significantly faster than O(|H|N) is impossible under SETH. As a proof-of-concept of our algorithmic solutions, we implemented the chaining algorithm in the Minichain aligner (https://github.com/at-cg/minichain). We demonstrate the advantage of the algorithm by aligning sequences sampled from human major histocompatibility complex (MHC) to a pangenome graph of 60 MHC haplotypes. The proposed algorithm offers better consistency with ground-truth recombinations when compared to a haplotype-agnostic algorithm.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
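The recombination-penalty idea in the abstract can be illustrated with a toy dynamic program over (query position, vertex, haplotype) states. The sketch below is not the authors' algorithm or the Minichain implementation. It assumes single-character vertex labels, vertices numbered in topological order, unit edit costs, haplotypes given as vertex paths, and a switch that is allowed at any vertex shared by two haplotypes. The function name `haplotype_aware_align` and all parameter names are hypothetical.

```python
from math import inf

def haplotype_aware_align(query, labels, haplotypes, switch_cost):
    """Return min (edit cost + switch_cost * number of haplotype switches).

    Toy model (all assumptions): labels[v] is the character of vertex v,
    vertices are numbered in topological order, and each haplotype is a
    list of vertices forming a path. The aligned path may start and end
    at any vertex; the whole query must be aligned.
    """
    n, m, k = len(labels), len(query), len(haplotypes)
    # prev[h][v] = predecessor of v on haplotype h (None for the first vertex)
    prev = [dict(zip(p, [None] + list(p[:-1]))) for p in haplotypes]
    old = None
    for i in range(m + 1):
        # row[v][h]: best cost of aligning query[:i] to a path ending at v on h
        row = [[inf] * k for _ in range(n)]
        for v in range(n):
            for h in range(k):
                if v not in prev[h]:
                    continue  # v is not on haplotype h
                u = prev[h][v]
                # Predecessor is either a fresh start (i insertions so far)
                # or the previous vertex on the same haplotype.
                same_i = min(i, row[u][h] if u is not None else inf)
                cost = same_i + 1  # delete the character of v
                if i >= 1:
                    before = min(i - 1, old[u][h] if u is not None else inf)
                    # match or mismatch query[i-1] against v
                    cost = min(cost, before + (query[i - 1] != labels[v]))
                    # insert query[i-1] after v
                    cost = min(cost, old[v][h] + 1)
                row[v][h] = cost
            # Haplotype switch at v: pay switch_cost to change haplotype
            low = min(row[v], default=inf)
            for h in range(k):
                if v in prev[h]:
                    row[v][h] = min(row[v][h], low + switch_cost)
        old = row
    # Aligning the query to an empty path costs m insertions
    return min([m] + [c for r in old for c in r])

# Two haplotypes through two bubbles: "ACCTAAC" and "AGGTTTC"
labels = "ACCGGTAATTC"
hap0 = [0, 1, 2, 5, 6, 7, 10]
hap1 = [0, 3, 4, 5, 8, 9, 10]
query = "ACCTTTC"  # hap0 up to vertex 5, then hap1
cheap = haplotype_aware_align(query, labels, [hap0, hap1], switch_cost=1)
costly = haplotype_aware_align(query, labels, [hap0, hap1], switch_cost=5)
```

With a low penalty the recombinant path is preferred (one switch, no edits); with a high penalty the best alignment stays on a single haplotype and pays two mismatches. The loop visits every (query position, vertex, haplotype) triple, so this sketch makes no attempt at the engineering needed for real pangenome graphs.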
Item Type: | Conference Paper |
---|---|
Publication: | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
Publisher: | Springer Science and Business Media Deutschland GmbH |
Additional Information: | The copyright for this article belongs to Springer Science and Business Media Deutschland GmbH. |
Keywords: | Bioinformatics; Directed graphs; Genes; Pattern matching; Acyclic graphs; Alignment Problems; Genome sequencing; Haplotypes; Major histocompatibility complex; Pangenome; Pattern-matching; Query sequence; Sequence alignments; Strong exponential time hypothesis; Alignment |
Department/Centre: | Division of Interdisciplinary Sciences > Computational and Data Sciences |
Date Deposited: | 13 Aug 2024 05:51 |
Last Modified: | 13 Aug 2024 05:51 |
URI: | http://eprints.iisc.ac.in/id/eprint/85274 |