
Algorithms for Colinear Chaining with Overlaps and Gap Costs

Jain, C and Gibney, D and Thankachan, SV (2022) Algorithms for Colinear Chaining with Overlaps and Gap Costs. In: Journal of computational biology : a journal of computational molecular cell biology, 29 (11). pp. 1237-1251.

Official URL: https://doi.org/10.1089/cmb.2022.0266

Abstract

Colinear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, efficient subquadratic time algorithms for the general case where chains support anchor overlaps and gap costs are not currently known. We present algorithms to solve the colinear chaining problem with anchor overlaps and gap costs in Õ(n) time, where n denotes the count of anchors. The degree of the polylogarithmic factor depends on the type of anchors used (e.g., fixed-length anchors) and the type of precedence an optimal anchor chain is required to satisfy. We also establish the first theoretical connection between colinear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal "anchored" edit distance equals the optimal colinear chaining cost. The anchored edit distance for two sequences and a set of anchors is only a slight generalization of the standard edit distance. It adds a cost of one to an alignment of two matching symbols that are not supported by any anchor. Finally, we demonstrate experimentally that the optimal colinear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and achieves a correlation coefficient >0.9 with edit distance for both closely and distantly related sequences.
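
To make the anchored edit distance described in the abstract concrete, the following is a minimal Python sketch of a quadratic-time dynamic program, not the authors' algorithm. It assumes unit costs for insertions, deletions, and substitutions, and charges a cost of one for a match that no anchor supports. The function name `anchored_edit_distance`, the `(i, j, length)` anchor representation, and the rule that an anchor supports only the matches on its own diagonal are assumptions made for illustration; the record itself does not specify them.

```python
def anchored_edit_distance(s, t, anchors):
    """Edit distance in which a match is free only if an anchor supports it.

    anchors: iterable of (i, j, length), 0-based, each meaning that
    s[i:i+length] is matched to t[j:j+length] (assumed representation).
    """
    # Assumption: an anchor supports exactly the position pairs on its diagonal.
    supported = set()
    for i, j, length in anchors:
        for k in range(length):
            supported.add((i + k, j + k))

    n, m = len(s), len(t)
    prev = list(range(m + 1))  # row 0: insert the first j symbols of t
    for i in range(1, n + 1):
        cur = [i] + [0] * m    # column 0: delete the first i symbols of s
        for j in range(1, m + 1):
            # A diagonal step is free only for a supported match; substitutions
            # and unsupported matches both cost one.
            free = s[i - 1] == t[j - 1] and (i - 1, j - 1) in supported
            diagonal = prev[j - 1] + (0 if free else 1)
            cur[j] = min(diagonal, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[m]


print(anchored_edit_distance("ACGT", "ACGT", [(0, 0, 4)]))  # 0: every match is supported
print(anchored_edit_distance("ACGT", "ACGT", [(0, 0, 2)]))  # 2: the last two matches are unsupported
print(anchored_edit_distance("ACGT", "ACGT", []))           # 4: no match is supported
```

When every matching pair of positions is covered by some anchor, this sketch reduces to the standard edit distance, which is consistent with the abstract's description of the anchored variant as a slight generalization.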

Item Type: Journal Article
Publication: Journal of computational biology : a journal of computational molecular cell biology
Publisher: NLM (Medline)
Additional Information: The copyright for this article belongs to NLM (Medline).
Keywords: algorithm, Algorithms
Department/Centre: Division of Interdisciplinary Sciences > Computational and Data Sciences
Date Deposited: 03 Jan 2023 04:56
Last Modified: 03 Jan 2023 04:56
URI: https://eprints.iisc.ac.in/id/eprint/78663
