Haplotype-Aware Sequence Alignment to Pangenome Graphs

Chandra, G and Gibney, D and Jain, C (2024) Haplotype-Aware Sequence Alignment to Pangenome Graphs. In: 28th International Conference on Research in Computational Molecular Biology, 29 April 2024 through 2 May 2024, Cambridge, pp. 381-384.


Official URL: https://doi.org/10.1007/978-1-0716-3989-4_36

Abstract

Modern pangenome graphs are built using haplotype-resolved genome assemblies. While mapping reads to a pangenome graph, prioritizing alignments that are consistent with the known haplotypes has been shown to improve genotyping accuracy. However, the existing rigorous formulations for sequence-to-graph co-linear chaining and alignment problems do not consider the haplotype paths in a pangenome graph. This often leads to spurious read alignments to those paths that are unlikely recombinations of the known haplotypes. We present novel formulations and algorithms for haplotype-aware sequence alignment to directed acyclic graphs (DAGs). We consider both sequence-to-DAG chaining and sequence-to-DAG alignment problems. Drawing inspiration from the commonly used models for genotype imputation, we assume that a query sequence is an imperfect mosaic of the reference haplotypes. Accordingly, we extend previous chaining and alignment formulations by introducing a recombination penalty for a haplotype switch. First, we solve haplotype-aware sequence-to-DAG alignment in O(|Q||E||H|) time where Q is the query sequence, E is the set of edges, and H is the set of haplotypes represented in the graph. To complement our solution, we prove that an algorithm significantly faster than O(|Q||E||H|) is impossible under the Strong Exponential Time Hypothesis (SETH). Second, we propose a haplotype-aware chaining algorithm that runs in O(|H|N log(|H|N)) time after graph preprocessing, where N is the count of input anchors. We then establish that a chaining algorithm significantly faster than O(|H|N) is impossible under SETH. As a proof-of-concept of our algorithmic solutions, we implemented the chaining algorithm in the Minichain aligner (https://github.com/at-cg/minichain). We demonstrate the advantage of the algorithm by aligning sequences sampled from human major histocompatibility complex (MHC) to a pangenome graph of 60 MHC haplotypes.
The proposed algorithm offers better consistency with ground-truth recombinations when compared to a haplotype-agnostic algorithm. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
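
The idea of charging a recombination penalty for a haplotype switch can be illustrated with a small dynamic program. The sketch below is not the authors' formulation or the Minichain implementation; it is a toy written under several assumptions that do not come from the abstract: each vertex carries a single character, vertex ids are already in topological order, edit costs are unit, the alignment may start and end at any vertex, and continuing on haplotype h across an edge is free only when h itself traverses that edge (any other edge use pays `switch_penalty`). The function name, parameters, and the example graph are all hypothetical.

```python
from math import inf

def haplotype_aware_alignment(query, labels, edges, haplotypes, switch_penalty):
    """Toy edit-distance DP over states (query prefix, vertex, haplotype).

    Assumptions (illustrative only): one character per vertex, vertex ids in
    topological order, unit edit costs, free start/end anywhere in the DAG,
    and at least one haplotype path.
    """
    n_v, n_h = len(labels), len(haplotypes)
    preds = [[] for _ in range(n_v)]
    for u, v in edges:
        preds[v].append(u)
    on_hap = [[False] * n_h for _ in range(n_v)]   # does haplotype h visit v?
    hap_edges = []                                 # edges traversed by each haplotype
    for h, path in enumerate(haplotypes):
        for v in path:
            on_hap[v][h] = True
        hap_edges.append(set(zip(path, path[1:])))

    def reach(row, best, start_cost, v, h):
        # Cheapest way to arrive at v on haplotype h: start the path here,
        # continue on h along an edge h traverses, or pay the penalty.
        cost = start_cost
        for u in preds[v]:
            cost = min(cost, best[u] + switch_penalty)
            if (u, v) in hap_edges[h]:
                cost = min(cost, row[u][h])
        return cost

    prev_row = prev_best = None
    for i in range(len(query) + 1):
        row = [[inf] * n_h for _ in range(n_v)]
        best = [inf] * n_v                         # min over haplotypes, per vertex
        for v in range(n_v):
            for h in range(n_h):
                if not on_hap[v][h]:
                    continue
                # Deletion: v is on the path but consumes no query character.
                cost = reach(row, best, i, v, h) + 1
                if i > 0:
                    sub = 0 if query[i - 1] == labels[v] else 1
                    cost = min(cost,
                               prev_row[v][h] + 1,  # insertion after v
                               reach(prev_row, prev_best, i - 1, v, h) + sub)
                row[v][h] = cost
            best[v] = min(row[v])
        prev_row, prev_best = row, best
    return min(prev_best)

# Hypothetical two-bubble graph with two haplotypes spelling ACTCA and AGTGA.
labels = ["A", "C", "G", "T", "C", "G", "A"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 6), (5, 6)]
haplotypes = [[0, 1, 3, 4, 6], [0, 2, 3, 5, 6]]

print(haplotype_aware_alignment("ACTCA", labels, edges, haplotypes, 3))
print(haplotype_aware_alignment("ACTGA", labels, edges, haplotypes, 3))
print(haplotype_aware_alignment("ACTGA", labels, edges, haplotypes, 0.5))
```

In this toy, the query ACTGA is a recombinant of the two haplotypes. With a high penalty the cheapest alignment stays on one haplotype and accepts a mismatch; with a low penalty it switches haplotypes instead. Keeping the per-vertex minimum in `best` means each edge is examined once per haplotype for every query position, which is in the spirit of the O(|Q||E||H|) bound quoted in the abstract.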

Item Type:	Conference Paper
Publication:	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher:	Springer Science and Business Media Deutschland GmbH
Additional Information:	The copyright for this article belongs to Springer Science and Business Media Deutschland GmbH.
Keywords:	Bioinformatics; Directed graphs; Genes; Pattern matching, Acyclic graphs; Alignment Problems; Genome sequencing; Haplotypes; Major histocompatibility complex; Pangenome; Pattern-matching; Query sequence; Sequence alignments; Strong exponential time hypothesis, Alignment
Department/Centre:	Division of Interdisciplinary Sciences > Computational and Data Sciences
Date Deposited:	13 Aug 2024 05:51
Last Modified:	13 Aug 2024 05:51
URI:	http://eprints.iisc.ac.in/id/eprint/85274
