
An efficient incremental protein sequence clustering algorithm

Vijaya, PA and Murty, Narasimha M and Subramanian, DK (2003) An efficient incremental protein sequence clustering algorithm. In: Conference on Convergent Technologies for Asia-Pacific Region TENCON 2003, 15-17 October, Bangalore, India, Vol. 1, pp. 409-413.




Clustering is the division of data into groups of similar objects. The main objective of this unsupervised learning technique is to find a natural grouping or meaningful partition by using a distance or similarity function. Clustering is mainly used for dimensionality reduction, prototype selection/abstraction for pattern classification, data reorganization and indexing, and for detecting outliers and noisy patterns. Clustering techniques are applied in pattern classification schemes, bioinformatics, data mining, web mining, biometrics, document processing, remote-sensed data analysis, biomedical data analysis, etc., in which the data size is very large. In this paper, an efficient incremental clustering algorithm, 'leaders-subleaders', an extension of the leader algorithm suitable for protein sequences in bioinformatics, is proposed for effective clustering and prototype selection for pattern classification. It is another simple and efficient technique for generating a hierarchical structure that exposes the subgroups/subclusters within each cluster, which may be used to find the superfamily, family and subfamily relationships of protein sequences. The experimental results (classification accuracy using the prototypes obtained, and computation time) of the proposed algorithm are compared with those of the leader-based and nearest neighbour classifier (NNC) methods. The proposed algorithm is found to be computationally efficient when compared to NNC. Classification accuracy obtained using the representatives generated by the leaders-subleaders method is found to be better than that obtained using leaders as representatives, and it approaches that of NNC if sequential search is used on the sequences from the selected subcluster.
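
The two-level idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the distance measure or the threshold values, so the Hamming-fraction `distance`, the function names, and the two thresholds below are all assumptions chosen for illustration. The first pass is the standard single-scan leader algorithm; the second pass re-runs it inside each cluster with a tighter threshold to obtain subleaders.

```python
def distance(a, b):
    """Toy dissimilarity in [0, 1] for equal-length strings (assumption:
    a stand-in for whatever sequence similarity score is actually used)."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def leader_clusters(items, threshold, dist=distance):
    """Single-pass leader clustering; returns a list of (leader, members)."""
    clusters = []
    for item in items:
        for leader, members in clusters:
            if dist(item, leader) <= threshold:
                members.append(item)  # join the first leader that is close enough
                break
        else:
            clusters.append((item, [item]))  # no leader matched: item becomes a leader
    return clusters


def leaders_subleaders(items, threshold, sub_threshold, dist=distance):
    """Two-level structure: each leader's cluster is re-clustered with a
    tighter threshold, giving (leader, [(subleader, members), ...])."""
    return [
        (leader, leader_clusters(members, sub_threshold, dist))
        for leader, members in leader_clusters(items, threshold, dist)
    ]


# Hypothetical toy sequences, not data from the paper.
sequences = ["AAAAAAAA", "AAAAAAAC", "AAAACCCC", "GGGGGGGG", "GGGGGGGT"]
tree = leaders_subleaders(sequences, threshold=0.5, sub_threshold=0.2)
for leader, subclusters in tree:
    print(leader, [sub for sub, _ in subclusters])
```

Because each sequence is compared only against the current leaders, and then only against the subleaders of one cluster, the structure is built in a single scan of the data. As with any leader-style method, the result depends on the order in which the sequences are presented.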

Item Type: Conference Paper
Publisher: IEEE
Additional Information: Copyright 1990 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Department/Centre: Division of Electrical Sciences > Computer Science & Automation
Date Deposited: 05 Jan 2006
Last Modified: 19 Sep 2010 04:22
URI: http://eprints.iisc.ac.in/id/eprint/4845
