Cluster labeling for multilingual scatter/gather using comparable corpora

Tholpadi, Goutham and Das, Mrinal Kanti and Bhattacharyya, Chiranjib and Shevade, Shirish (2012) Cluster labeling for multilingual scatter/gather using comparable corpora. In: ECIR 2012, 34th European Conference on IR Research, April 1-5, 2012, Barcelona, Spain.

Official URL: http://dx.doi.org/10.1007/978-3-642-28997-2_33

Abstract

Scatter/Gather systems are becoming increasingly useful for browsing document corpora. The usability of present-day systems is restricted to monolingual corpora, and their methods for clustering and labeling do not easily extend to the multilingual setting, especially in the absence of dictionaries or machine translation. In this paper, we study the cluster labeling problem for multilingual corpora in the absence of machine translation, but using comparable corpora. Using a variational approach, we show that multilingual topic models can effectively handle the cluster labeling problem, which in turn allows us to design a novel Scatter/Gather system, ShoBha. Experimental results on three datasets, namely the Canadian Hansards corpus, the entire overlapping Wikipedia of English, Hindi and Bengali articles, and a trilingual news corpus containing 41,000 articles, confirm the utility of the proposed system.

Item Type:	Conference Paper
Publisher:	Springer
Additional Information:	Copyright of this article belongs to Springer.
Keywords:	Cluster Labeling; Multilingual; Scatter/Gather; Comparable Corpora; Topic Models
Department/Centre:	Division of Electrical Sciences > Computer Science & Automation
Date Deposited:	22 Nov 2013 11:38
Last Modified:	22 Nov 2013 11:38
URI:	http://eprints.iisc.ac.in/id/eprint/47817


