
Suite of tools for statistical N-gram language modeling for pattern mining in whole genome sequences

Ganapathiraju, Madhavi K and Mitchell, Asia D and Thahir, Mohamed and Motwani, Kamiya and Ananthasubramanian, Seshan (2012) Suite of tools for statistical N-gram language modeling for pattern mining in whole genome sequences. In: Journal of Bioinformatics and Computational Biology, 10 (6). p. 1250016.

Official URL: http://dx.doi.org/10.1142/S0219720012500163

Abstract

Genome sequences contain a number of patterns that have biomedical significance. Repetitive sequences of various kinds are a primary component of most of these patterns. We extended the suffix-array based Biological Language Modeling Toolkit to compute n-gram frequencies, as well as n-gram language-model based perplexity, in windows over the whole genome sequence in order to find biologically relevant patterns. We present the suite of tools and their application to the analysis of the whole human genome sequence.
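
The windowed perplexity computation mentioned in the abstract can be illustrated with a minimal sketch. This is not the toolkit's implementation: the toolkit is suffix-array based, whereas the sketch below uses plain dictionary counts, and the function names, the add-one smoothing, the four-letter alphabet, and the window and step sizes are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def ngram_counts(seq, n):
    """Count every overlapping n-gram in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def window_perplexity(window, n, ngrams, contexts, alphabet_size=4):
    """Perplexity of one window under an add-one-smoothed n-gram model.

    ngrams and contexts are counts taken over the whole sequence;
    alphabet_size=4 assumes the sequence contains only A, C, G and T.
    """
    log_prob = 0.0
    count = 0
    for i in range(len(window) - n + 1):
        gram = window[i:i + n]
        # P(last symbol | preceding n-1 symbols), add-one smoothed
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + alphabet_size)
        log_prob += math.log2(p)
        count += 1
    return 2 ** (-log_prob / count) if count else float("nan")


def sliding_perplexity(seq, n=3, window=20, step=10):
    """Return (start, perplexity) for each window slid over seq."""
    ngrams = ngram_counts(seq, n)
    # Context count = number of n-grams that begin with that (n-1)-gram
    contexts = Counter()
    for gram, c in ngrams.items():
        contexts[gram[:-1]] += c
    return [
        (start, window_perplexity(seq[start:start + window], n, ngrams, contexts))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: a varied 20-mer followed by a 40-base homopolymer run
    toy = "ACGTTGCAAGCTTAGGCATC" + "A" * 40
    for start, ppl in sliding_perplexity(toy):
        print(start, round(ppl, 3))
```

In this toy input the windows that fall inside the homopolymer run score a much lower perplexity than the varied opening window, which is the kind of contrast a windowed scan could use to flag repetitive regions.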

Item Type: Journal Article
Publication: Journal of Bioinformatics and Computational Biology
Publisher: World Scientific Publishing Company
Additional Information: Copyright of this article belongs to World Scientific Publishing Company.
Keywords: Statistical Language Modeling; N-Gram Analysis; Genome Sequence Analysis
Department/Centre: Division of Interdisciplinary Sciences > Supercomputer Education & Research Centre
Date Deposited: 15 Feb 2013 10:28
Last Modified: 15 Feb 2013 10:28
URI: http://eprints.iisc.ac.in/id/eprint/45363
