
Haplotype-Aware Sequence Alignment to Pangenome Graphs

Chandra, G and Gibney, D and Jain, C (2024) Haplotype-Aware Sequence Alignment to Pangenome Graphs. In: 28th International Conference on Research in Computational Molecular Biology, 29 April 2024 through 2 May 2024, Cambridge, pp. 381-384.

Official URL: https://doi.org/10.1007/978-1-0716-3989-4_36

Abstract

Modern pangenome graphs are built using haplotype-resolved genome assemblies. While mapping reads to a pangenome graph, prioritizing alignments that are consistent with the known haplotypes has been shown to improve genotyping accuracy. However, the existing rigorous formulations for sequence-to-graph co-linear chaining and alignment problems do not consider the haplotype paths in a pangenome graph. This often leads to spurious read alignments to those paths that are unlikely recombinations of the known haplotypes. We present novel formulations and algorithms for haplotype-aware sequence alignment to directed acyclic graphs (DAGs). We consider both sequence-to-DAG chaining and sequence-to-DAG alignment problems. Drawing inspiration from the commonly used models for genotype imputation, we assume that a query sequence is an imperfect mosaic of the reference haplotypes. Accordingly, we extend previous chaining and alignment formulations by introducing a recombination penalty for a haplotype switch. First, we solve haplotype-aware sequence-to-DAG alignment in O(|Q||E||H|) time where Q is the query sequence, E is the set of edges, and H is the set of haplotypes represented in the graph. To complement our solution, we prove that an algorithm significantly faster than O(|Q||E||H|) is impossible under the Strong Exponential Time Hypothesis (SETH). Second, we propose a haplotype-aware chaining algorithm that runs in O(|H|N log |H|N) time after graph preprocessing, where N is the count of input anchors. We then establish that a chaining algorithm significantly faster than O(|H|N) is impossible under SETH. As a proof-of-concept of our algorithmic solutions, we implemented the chaining algorithm in the Minichain aligner (https://github.com/at-cg/minichain). We demonstrate the advantage of the algorithm by aligning sequences sampled from the human major histocompatibility complex (MHC) to a pangenome graph of 60 MHC haplotypes. The proposed algorithm offers better consistency with ground-truth recombinations when compared to a haplotype-agnostic algorithm.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
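
To make the idea of a recombination penalty concrete, the following is a minimal Python sketch of a haplotype-aware alignment recurrence on a small character-labelled DAG. It is an illustration under stated assumptions, not the formulation or code from the paper or from Minichain: the function name haplotype_aware_align, the unit edit costs, the free start and end of the alignment in the graph, the rule that staying on a haplotype is free only along an edge of that haplotype, and the toy graph are all hypothetical choices made for this example. Taking the minimum over haplotypes once per vertex keeps the work per query position and edge proportional to |H|, which gives this sketch an O(|Q||E||H|)-style running time.

```python
from math import inf


def haplotype_aware_align(query, labels, hap_paths, switch_cost):
    """Cheapest alignment of a non-empty `query` to a path in the DAG.

    Cost = unit-cost edits + `switch_cost` for every haplotype switch.
    Assumption: vertex ids 0..n-1 are already in topological order.
    """
    n, m, num_haps = len(labels), len(query), len(hap_paths)
    hap_edges = [set(zip(p, p[1:])) for p in hap_paths]  # edges per haplotype
    hap_vertices = [set(p) for p in hap_paths]           # vertices per haplotype
    preds = [set() for _ in range(n)]                     # graph = union of paths
    for edges in hap_edges:
        for u, v in edges:
            assert u < v, "vertex ids must be topologically ordered"
            preds[v].add(u)

    # D[i][v][h]: best cost with query[:i] consumed, ending at v labelled h.
    D = [[[inf] * num_haps for _ in range(n)] for _ in range(m + 1)]
    # best[i][v] = min over h of D[i][v][h]; avoids an |H|^2 inner loop.
    best = [[inf] * n for _ in range(m + 1)]

    def enter(i, u, v, h):
        """Cost of arriving at (v, h) from u with query[:i] consumed."""
        stay = D[i][u][h] if (u, v) in hap_edges[h] else inf
        return min(stay, best[i][u] + switch_cost)

    for i in range(m + 1):
        for v in range(n):
            for h in range(num_haps):
                if v not in hap_vertices[h]:
                    continue  # haplotype h does not pass through v
                cost = inf
                if i >= 1:
                    sub = int(query[i - 1] != labels[v])
                    # Start at v after inserting the first i-1 query characters.
                    cost = (i - 1) + sub
                    # Insert query[i-1] after v.
                    cost = min(cost, D[i - 1][v][h] + 1)
                for u in preds[v]:
                    if i >= 1:  # match or substitute query[i-1] at v
                        cost = min(cost, enter(i - 1, u, v, h) + sub)
                    # Delete v (traverse it without consuming a query character).
                    cost = min(cost, enter(i, u, v, h) + 1)
                D[i][v][h] = cost
            best[i][v] = min(D[i][v])
    return min(best[m])


# Toy graph with two bubbles: vertex k carries labels[k].
labels = "ACGTCGA"
hap_paths = [[0, 1, 3, 4, 6],   # haplotype 0 spells ACTCA
             [0, 2, 3, 5, 6]]   # haplotype 1 spells AGTGA

exact = haplotype_aware_align("ACTCA", labels, hap_paths, switch_cost=5)      # 0
free_switch = haplotype_aware_align("ACTGA", labels, hap_paths, switch_cost=0)  # 0
penalised = haplotype_aware_align("ACTGA", labels, hap_paths, switch_cost=5)    # 1
```

In this toy example the query ACTGA is a mosaic of the two haplotypes. With a zero penalty the recombinant path is free, whereas with a penalty of 5 the cheapest explanation becomes a single mismatch against one known haplotype, which is the kind of preference for haplotype-consistent alignments that the abstract describes.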

Item Type: Conference Paper
Publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer Science and Business Media Deutschland GmbH
Additional Information: The copyright for this article belongs to Springer Science and Business Media Deutschland GmbH.
Keywords: Bioinformatics; Directed graphs; Genes; Pattern matching; Acyclic graphs; Alignment Problems; Genome sequencing; Haplotypes; Major histocompatibility complex; Pangenome; Pattern-matching; Query sequence; Sequence alignments; Strong exponential time hypothesis; Alignment
Department/Centre: Division of Interdisciplinary Sciences > Computational and Data Sciences
Date Deposited: 13 Aug 2024 05:51
Last Modified: 13 Aug 2024 05:51
URI: http://eprints.iisc.ac.in/id/eprint/85274
