Ganapathiraju, Madhavi K and Mitchell, Asia D and Thahir, Mohamed and Motwani, Kamiya and Ananthasubramanian, Seshan (2012) Suite of tools for statistical N-gram language modeling for pattern mining in whole genome sequences. In: Journal of Bioinformatics and Computational Biology, 10 (6). p. 1250016.
Abstract
Genome sequences contain a number of patterns that have biomedical significance. Repetitive sequences of various kinds are a primary component of most of these genomic sequence patterns. We extended the suffix-array-based Biological Language Modeling Toolkit to compute n-gram frequencies, as well as n-gram language-model-based perplexity, in windows over the whole genome sequence to find biologically relevant patterns. We present the suite of tools and their application to the analysis of the whole human genome sequence.
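The windowed computation described in the abstract can be illustrated with a minimal sketch. It is not the toolkit's implementation: the toolkit is suffix-array based, whereas this sketch counts n-grams with a plain dictionary, and the add-one smoothing, the four-letter alphabet, and all function and parameter names (`ngram_counts`, `context_counts`, `window_perplexity`, `sliding_perplexity`, `window`, `step`) are assumptions made for illustration only.

```python
import math
from collections import Counter


def ngram_counts(seq, n):
    """Count every overlapping n-gram of length n in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def context_counts(counts_n):
    """Sum n-gram counts over their (n-1)-length prefixes (the contexts)."""
    ctx = Counter()
    for gram, count in counts_n.items():
        ctx[gram[:-1]] += count
    return ctx


def window_perplexity(window, n, counts_n, counts_ctx, alphabet_size=4):
    """Perplexity of one window under an add-one smoothed n-gram model.

    Assumption: add-one smoothing over a 4-letter DNA alphabet; the paper's
    actual smoothing scheme is not stated in the abstract.
    """
    log_prob = 0.0
    num_grams = 0
    for i in range(len(window) - n + 1):
        gram = window[i:i + n]
        prob = (counts_n[gram] + 1) / (counts_ctx[gram[:-1]] + alphabet_size)
        log_prob += math.log2(prob)
        num_grams += 1
    if num_grams == 0:
        return float("nan")
    return 2 ** (-log_prob / num_grams)


def sliding_perplexity(seq, n=3, window=20, step=10):
    """Return (start, perplexity) for each window slid across seq.

    The model is estimated from the whole sequence, then each window is
    scored against it.
    """
    counts_n = ngram_counts(seq, n)
    counts_ctx = context_counts(counts_n)
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        results.append((start, window_perplexity(chunk, n, counts_n, counts_ctx)))
    return results
```

Under this sketch, windows whose n-grams are frequent in the sequence as a whole (for example, highly repetitive stretches) receive low perplexity, while windows with rare n-grams receive high perplexity. For whole-genome input, a dictionary of counts would likely be impractical at large n, which is presumably one motivation for the suffix-array approach the authors describe.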
Item Type: | Journal Article |
---|---|
Publication: | Journal of Bioinformatics and Computational Biology |
Publisher: | World Scientific Publishing Company |
Additional Information: | Copyright of this article belongs to World Scientific Publishing Company. |
Keywords: | Statistical Language Modeling; N-Gram Analysis; Genome Sequence Analysis |
Department/Centre: | Division of Interdisciplinary Sciences > Supercomputer Education & Research Centre |
Date Deposited: | 15 Feb 2013 10:28 |
Last Modified: | 15 Feb 2013 10:28 |
URI: | http://eprints.iisc.ac.in/id/eprint/45363 |