ePrints@IISc

Efficient median based clustering and classification techniques for protein sequences

Vijaya, PA and Murty, Narasimha M and Subramanian, DK (2006) Efficient median based clustering and classification techniques for protein sequences. In: Pattern Analysis & Applications, 9 (2-3). pp. 243-255.




This paper presents an efficient K-medians clustering (unsupervised) algorithm for prototype selection and a Supervised K-medians (SKM) classification technique for protein sequences. For sequence data sets, a median string/sequence can serve as the cluster/group representative. The K-medians clustering technique generates a desired number of clusters, K, each represented by a median sequence, and these median sequences are used as prototypes for classifying a new/test sequence. The SKM classification technique instead determines the median sequence of each group/class of labelled protein sequences and uses the set of median sequences as prototypes for classification. For the data sets used, the K-medians clustering technique outperforms the leader-based technique, and the SKM classification technique performs better than the motif-based approach. We further use a simple technique to reduce time and space requirements during protein sequence clustering and classification: during the training and testing phases, the similarity score between a pair of sequences is computed on a selected portion of each sequence instead of the entire sequence. This is analogous to selecting a subset of features for sequence data sets. Experimental results for this method with the K-medians, SKM and Nearest Neighbour Classifier (NNC) techniques show that the Classification Accuracy (CA) obtained with the prototypes generated/used does not degrade much, while training and testing times are reduced significantly. The results thus indicate that the similarity score need not be calculated over the entire length of the sequence to achieve a good CA. The space requirement is also reduced during both training and classification.
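
The SKM idea described in the abstract (one set-median prototype per class, nearest-prototype classification, and scoring on only a portion of each sequence) might be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names, the `portion` parameter (here, a leading prefix length) and the toy sequences are all hypothetical, and the longest-common-subsequence length is used purely as a stand-in for whatever alignment-based similarity score the paper actually employs. The unsupervised K-medians clustering step is not shown.

```python
from collections import defaultdict


def similarity(a, b, portion=None):
    """Stand-in similarity score: longest-common-subsequence length.

    ASSUMPTION: the paper's actual score is not specified here; LCS is
    only a placeholder. If `portion` is given, only the first `portion`
    residues of each sequence are compared (hypothetical choice of portion).
    """
    if portion is not None:
        a, b = a[:portion], b[:portion]
    prev = [0] * (len(b) + 1)  # DP row for the previous character of `a`
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def set_median(seqs, portion=None):
    """Return the member of `seqs` with the highest total similarity
    to all other members (ties go to the earliest member)."""
    def total(i):
        return sum(similarity(seqs[i], t, portion)
                   for j, t in enumerate(seqs) if j != i)
    return seqs[max(range(len(seqs)), key=total)]


def skm_train(labelled, portion=None):
    """Compute one set-median prototype per class from (sequence, label) pairs."""
    groups = defaultdict(list)
    for seq, label in labelled:
        groups[label].append(seq)
    return {label: set_median(seqs, portion) for label, seqs in groups.items()}


def classify(seq, prototypes, portion=None):
    """Assign `seq` the label of its most similar prototype."""
    return max(prototypes,
               key=lambda label: similarity(seq, prototypes[label], portion))
```

For example, calling `skm_train` on a list of `(sequence, label)` pairs and then `classify(new_seq, prototypes, portion=3)` scores only the first three residues of each sequence, which reduces the work per comparison at the possible cost of some accuracy.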

Item Type: Journal Article
Publication: Pattern Analysis & Applications
Publisher: Springer London
Additional Information: The copyright belongs to Springer London
Keywords: Clustering; Protein sequences; Median strings/sequences; Set median; Prototypes; Feature selection; Classification accuracy
Department/Centre: Division of Electrical Sciences > Computer Science & Automation
Date Deposited: 06 Sep 2007
Last Modified: 19 Sep 2010 04:39
URI: http://eprints.iisc.ac.in/id/eprint/11810
