ePrints@IISc

Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi

Bhanushali, A and Bridgman, G and Deekshitha, G and Ghosh, P and Kumar, P and Kumar, S and Kolladath, AR and Ravi, N and Seth, A and Seth, A and Singh, A and Sukhadia, VN and Umesh, S and Udupa, S and Durga Prasad, LVSV (2022) Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 18 - 22 September 2022, Incheon, pp. 3548-3552.

Full text: INTERSPEECH_2022.pdf (PDF, 233 kB), Published Version, restricted to registered users only.
Official URL: https://doi.org/10.21437/Interspeech.2022-11371

Abstract

This paper describes the corpus and baseline systems for the Gram Vaani Automatic Speech Recognition (ASR) challenge in regional variations of Hindi. The corpus for this challenge comprises spontaneous telephone speech recordings collected by Gram Vaani, a social technology enterprise. The regional variations of Hindi, together with the spontaneity of the speech, natural backgrounds, and transcriptions of variable accuracy due to crowdsourcing, make it a unique corpus for ASR on spontaneous telephone speech. Around 1108 hours of real-world spontaneous speech recordings have been released as part of the challenge: 1000 hours of unlabelled training data, 100 hours of labelled training data, 5 hours of development data, and 3 hours of evaluation data. The efficacy of both the training and test sets is validated on different ASR systems, in both the traditional time-delay neural network-hidden Markov model (TDNN-HMM) framework and a fully neural end-to-end (E2E) setup. For a TDNN model trained on the 100 hours of labelled data, the word error rate (WER) and character error rate (CER) on the eval set are 29.7% and 15.1%, respectively. In the E2E setup, a conformer model trained on 100 hours of data achieves a WER of 32.9% and a CER of 19.0% on the eval set.
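The baselines are reported as WER and CER. Both metrics are conventionally computed the same way: take the Levenshtein edit distance (substitutions + deletions + insertions) between reference and hypothesis, sum it over the corpus, and divide by the total number of reference tokens. The sketch below illustrates this under stated assumptions. It is not the challenge's official scoring script. The function names are hypothetical. Text normalization is not handled. Whitespace is ignored for CER, and each Unicode code point counts as one character rather than each Devanagari grapheme cluster.

```python
from typing import Callable, Iterable, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions + deletions + insertions."""
    # prev[j] holds the distance between the previous ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion of ref[i-1]
                          curr[j - 1] + 1,     # insertion of hyp[j-1]
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1]


def corpus_error_rate(pairs: Iterable[tuple[str, str]],
                      tokenize: Callable[[str], list[str]]) -> float:
    """Total edits divided by total reference tokens, as a percentage."""
    pairs = list(pairs)
    errors = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in pairs)
    ref_len = sum(len(tokenize(r)) for r, _ in pairs)
    return 100.0 * errors / ref_len


def words(text: str) -> list[str]:
    # Word tokens for WER: split on whitespace
    return text.split()


def chars(text: str) -> list[str]:
    # Assumption: whitespace is ignored and each Unicode code point is one character
    return list("".join(text.split()))
```

For example, with the romanized pair `("mera naam ram hai", "mera naam shyam hai")`, `corpus_error_rate(pairs, words)` gives 25.0 (one substituted word out of four). Dividing by the total reference length over the whole set, rather than averaging per-utterance rates, keeps long utterances from being under-weighted.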

Item Type: Conference Paper
Publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association
Additional Information: The copyright for this article belongs to International Speech Communication Association.
Keywords: Audio recordings; Neural networks; Speech communication; Speech recognition; Telephone sets, Automatic speech recognition; Gram vaani; Hindi speech data; Real-world; Real-world automatic speech recognition challenge; Speech data; Speech recording; Spontaneous telephone speech data; Telephone speech, Hidden Markov models
Department/Centre: Division of Electrical Sciences > Electrical Engineering
Date Deposited: 10 Nov 2022 06:24
Last Modified: 10 Nov 2022 06:24
URI: https://eprints.iisc.ac.in/id/eprint/77859
