Kumar, HRS and Madhavaraj, A and Ramakrishnan, AG (2019) Splitting merged characters of Kannada benchmark dataset using simplified paired-valleys and L-cut. In: 25th National Conference on Communications, NCC 2019, 20 - 23 February 2019, Bangalore.
Abstract
We reduce the computational complexity of the paired-valley algorithm for splitting merged characters from Θ(N²) to Θ(N), where N is the number of symbols merged. We also propose an effective method, the L-cut algorithm, for separating merged half-consonants (known in Kannada as ottus) from the base symbols. We have created a benchmark dataset of 4033 sub-word images in Kannada, each comprising two or more merged characters. We test the recognition accuracy of Tesseract OCR on this benchmark dataset before and after applying our technique. The accuracy of Tesseract v3 OCR on the dataset rises from 61.6% to 81.7%, a gain of about 20 percentage points, after the characters are split by our method. The algorithm's scalability to other scripts has been explored through limited experiments on Telugu and Tamil.
Item Type: | Conference Paper |
---|---|
Publication: | 25th National Conference on Communications, NCC 2019 |
Publisher: | Institute of Electrical and Electronics Engineers Inc. |
Additional Information: | The copyright for this article belongs to the Institute of Electrical and Electronics Engineers Inc. |
Keywords: | Computational complexity; Landforms; Optical character recognition; Kannada; Merged characters; Old books; Ottu; Paired valleys; Printed texts; Tamil; Telugu; Tesseract; Statistical tests |
Department/Centre: | Division of Electrical Sciences > Electrical Engineering |
Date Deposited: | 29 Nov 2022 05:31 |
Last Modified: | 29 Nov 2022 05:31 |
URI: | https://eprints.iisc.ac.in/id/eprint/78061 |