
Chronological text similarity with pretrained embedding and edit distance

Shree Charran, R and Jain, S and Dubey, RK (2022) Chronological text similarity with pretrained embedding and edit distance. [Book Chapter]

Full text not available from this repository.
Official URL: https://doi.org/10.1016/B978-0-12-824054-0.00014-9

Abstract

Text similarity algorithms are core building blocks for several natural language processing (NLP) applications. Question answering, search engines, plagiarism checking, and summarization are a few of the many applications built on text similarity. Over the last two decades, the NLP and computer science communities have conducted extensive research to develop more accurate algorithms for detecting semantic and syntactic similarities. More often than not, however, these works have overlooked the chronological order of the text, which is essential in cases such as a manufacturing-process text, where we try to find the order in which ingredients are added, processed, and so on. When word order similarity is captured as well, the final text similarity tends to be more accurate. In this chapter, we present a simple ensemble method to capture semantic, syntactic, and chronological similarities between two sentences. The ensemble consists of a proportioned modified edit distance score and a cosine similarity score from a pretrained state-of-the-art embedding. The embedding captures the semantic similarity, and the modified edit distance helps capture the word order, thus making the model more robust. Further, we have studied the effects of various ensemble ratios to help achieve maximum accuracy. The proposed model was demonstrated on the standard STS-B dataset and a custom dataset. The results show that the ensemble achieves far better accuracy than the individual models when measuring similarity between sentences.
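The ensemble idea in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the abstract does not specify how the edit distance is modified, which embedding is used, or which ratio works best. The sketch assumes a plain word-level Levenshtein distance normalized by sentence length, uses bag-of-words count vectors as a stand-in for a pretrained embedding, and exposes a hypothetical `weight` parameter for the ensemble ratio. All function names are illustrative.

```python
import math
from collections import Counter


def word_edit_distance(a, b):
    """Levenshtein distance computed over word tokens instead of characters."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def order_similarity(s1, s2):
    """Word-order score in [0, 1]: 1 minus the length-normalized edit distance."""
    a, b = s1.lower().split(), s2.lower().split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(a, b) / longest


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def bow_vectors(s1, s2):
    """ASSUMPTION: count vectors stand in for a pretrained sentence embedding."""
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    vocab = sorted(set(c1) | set(c2))
    return [c1[w] for w in vocab], [c2[w] for w in vocab]


def ensemble_similarity(s1, s2, weight=0.5):
    """Weighted mix of embedding cosine similarity and word-order similarity.

    `weight` is a hypothetical ensemble ratio: 1.0 uses only the cosine
    score, 0.0 uses only the order score.
    """
    u, v = bow_vectors(s1, s2)
    return weight * cosine_similarity(u, v) + (1 - weight) * order_similarity(s1, s2)


s1 = "add flour then add water"
s2 = "add water then add flour"
print(order_similarity(s1, s2))
print(ensemble_similarity(s1, s2, weight=0.5))
```

The two example sentences contain the same words in a different order, so the order-insensitive cosine score is about 1.0 while the order score drops to 0.6, and an equal-weight ensemble lands at about 0.8. In practice the count vectors would be replaced by vectors from a pretrained model, and `weight` would be tuned on held-out data.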

Item Type: Book Chapter
Publication: Artificial Intelligence and Machine Learning for EDGE Computing
Publisher: Elsevier
Additional Information: The copyright for this article belongs to Elsevier.
Keywords: Chronology; Edit distance; Embedding; Text similarity
Department/Centre: Division of Interdisciplinary Sciences > Management Studies
Date Deposited: 19 Oct 2022 09:45
Last Modified: 19 Oct 2022 09:45
URI: https://eprints.iisc.ac.in/id/eprint/77456
