Co-linear Chaining with Overlaps and Gap Costs

Jain, C and Gibney, D and Thankachan, SV (2022) Co-linear Chaining with Overlaps and Gap Costs. In: 26th International Conference on Research in Computational Molecular Biology, RECOMB 2022, 22 - 25 May 2022, San Diego, pp. 246-262.

Springer_RECOMB_13278_246-262_2022.pdf - Published Version

Official URL: https://doi.org/10.1007/978-3-031-04749-7_15

Abstract

Co-linear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, efficient subquadratic-time algorithms for the general case where chains support anchor overlaps and gap costs are not currently known. We present algorithms to solve the co-linear chaining problem with anchor overlaps and gap costs in Õ(n) time, where n denotes the count of anchors. We also establish the first theoretical connection between co-linear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal ‘anchored’ edit distance equals the optimal co-linear chaining cost. Finally, we demonstrate experimentally that optimal co-linear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and achieves correlation coefficient above 0.9 with edit distance for closely as well as distantly related sequences.
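
For orientation, the chaining problem described in the abstract can be illustrated with a simple quadratic-time dynamic program. This is only a minimal sketch under stated assumptions: it is not the Õ(n) algorithm of the paper, it does not allow anchor overlaps, and its scoring (anchor length minus the larger of the two gaps) is an illustrative choice rather than the paper's cost function. The names `Anchor` and `chain_score` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical anchor representation: an exact match of `length` characters
# starting at `query_start` in the query and `ref_start` in the reference.
Anchor = Tuple[int, int, int]  # (query_start, ref_start, length)


def chain_score(anchors: List[Anchor]) -> int:
    """Return the best chain score using an O(n^2) dynamic program.

    Assumptions (illustrative, not taken from the paper):
    - anchors in a chain must not overlap in either sequence;
    - joining two anchors costs max(query gap, reference gap);
    - each anchor contributes its length to the score.
    """
    anchors = sorted(anchors)  # order by query start, then reference start
    best = [0] * len(anchors)  # best[j] = best score of a chain ending at j
    for j, (qj, rj, lj) in enumerate(anchors):
        best[j] = lj  # chain consisting of anchor j alone
        for i in range(j):
            qi, ri, li = anchors[i]
            gap_q = qj - (qi + li)  # unmatched query characters in between
            gap_r = rj - (ri + li)  # unmatched reference characters in between
            if gap_q < 0 or gap_r < 0:
                continue  # anchors overlap or are not co-linear
            gap_cost = max(gap_q, gap_r)
            best[j] = max(best[j], best[i] + lj - gap_cost)
    return max(best, default=0)


if __name__ == "__main__":
    example = [(0, 0, 5), (7, 7, 5), (3, 20, 4)]
    print(chain_score(example))
```

In the example, the anchors at (0, 0) and (7, 7) are co-linear with a gap of 2 in both sequences, while the anchor at (3, 20) cannot be combined with either of them under the no-overlap assumption, so it is left out of the best chain.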

Item Type: Conference Paper
Publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer Science and Business Media Deutschland GmbH
Additional Information: The copyright for this article belongs to Springer Science and Business Media Deutschland GmbH.
Keywords: Bioinformatics; Cost functions; DNA sequences; Optimization; Co-linear chaining; Correlation coefficient; Cost distances; Cost-function; Edit distance; Fixed sets; Genome assembly; Near-optimal alignments; Orders of magnitude; Time algorithms; Anchors
Department/Centre: Division of Interdisciplinary Sciences > Computational and Data Sciences
Date Deposited: 30 Jun 2022 09:44
Last Modified: 30 Jun 2022 09:44
URI: https://eprints.iisc.ac.in/id/eprint/74086
