ePrints@IISc

CTC-Based End-To-End ASR for the Low Resource Sanskrit Language with Spectrogram Augmentation

Anoop, CS and Ramakrishnan, AG (2021) CTC-Based End-To-End ASR for the Low Resource Sanskrit Language with Spectrogram Augmentation. In: 27th National Conference on Communications, NCC 2021, 27-30 Jul 2021, Kanpur.

PDF: IEEE_NCC_2021.pdf - Published Version (restricted to registered users)
Official URL: https://doi.org/10.1109/NCC52529.2021.9530162

Abstract

Sanskrit is one of the Indian languages that fare poorly with regard to the development of language-based tools. In this work, we build a connectionist temporal classification (CTC) based end-to-end large-vocabulary continuous speech recognition system for Sanskrit. To our knowledge, this is the first time an end-to-end framework has been used for automatic speech recognition in Sanskrit. A Sanskrit speech corpus with around 5.5 hours of speech data is used to train a neural network with a CTC objective. 80-dimensional mel-spectrograms, together with their delta and delta-delta coefficients, are used as the input features. Spectrogram augmentation techniques are used to effectively increase the amount of training data. The trained CTC acoustic model is assessed in terms of character error rate (CER) under greedy decoding. Weighted finite-state transducer (WFST) decoding is used to obtain word-level transcriptions from the character-level probability distributions at the output of the CTC network. The decoder WFST, which maps the CTC output characters to words in the lexicon, is constructed by composing three individual finite-state transducers (FSTs): token, lexicon and grammar. Trigram models trained on a text corpus of 262,338 sentences are used for language modeling in the grammar FST. The system achieves a word error rate (WER) of 7.64% and a sentence error rate (SER) of 32.44% on the Sanskrit test set of 558 utterances with spectrogram augmentation and WFST decoding. Spectrogram augmentation alone provides an absolute improvement of 13.86% in WER. © 2021 IEEE.
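Two of the techniques named in the abstract — spectrogram augmentation by time/frequency masking, and greedy decoding of the CTC output — can be sketched in NumPy as below. This is a minimal illustrative sketch, not the authors' implementation: the function names, mask counts, and mask widths are assumptions, since the abstract does not specify the augmentation policy.

```python
import numpy as np

def spec_augment(mel, num_freq_masks=2, freq_width=8,
                 num_time_masks=2, time_width=20, rng=None):
    """SpecAugment-style masking on a (frames, mel_bins) spectrogram.

    Mask counts and widths here are illustrative assumptions; the
    paper's exact augmentation parameters are not given in the abstract.
    """
    rng = rng or np.random.default_rng()
    aug = mel.copy()
    n_frames, n_bins = aug.shape
    # Zero out a few randomly placed frequency bands.
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_bins - w)))
        aug[:, f0:f0 + w] = 0.0
    # Zero out a few randomly placed spans of frames.
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_width + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        aug[t0:t0 + w, :] = 0.0
    return aug

def ctc_greedy_decode(char_probs, blank=0):
    """Greedy CTC decoding: take the argmax character per frame,
    collapse consecutive repeats, then drop blank symbols."""
    best = char_probs.argmax(axis=1)
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:
            out.append(int(idx))
        prev = int(idx)
    return out
```

Greedy decoding like this yields the character-level hypotheses used for CER; the word-level results in the paper instead come from composing token, lexicon and grammar FSTs and searching that decoding graph.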

Item Type: Conference Paper
Publication: 2021 National Conference on Communications, NCC 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Additional Information: The copyright for this article belongs to Institute of Electrical and Electronics Engineers Inc.
Keywords: Continuous speech recognition; Errors; Modeling languages; Probability distributions; Spectrographs; Speech; Transducers; ASR; Connectionist temporal classification; Sanskrit; Spectrogram augmentation; Spectrograms; Temporal classification; Weighted finite-state transducer decoding; Weighted finite-state transducers; Decoding
Department/Centre: Division of Electrical Sciences > Electrical Engineering
Date Deposited: 07 Dec 2021 10:22
Last Modified: 07 Dec 2021 10:22
URI: http://eprints.iisc.ac.in/id/eprint/70377
