A Robust Speaking Rate Estimator Using a CNN-BLSTM Network

Srinivasan, A and Singh, D and Yarra, C and Illa, A and Ghosh, PK (2021) A Robust Speaking Rate Estimator Using a CNN-BLSTM Network. In: Circuits, Systems, and Signal Processing.


Official URL: https://doi.org/10.1007/s00034-021-01754-1

Abstract

Direct acoustic feature-based speaking rate estimation is useful in applications including pronunciation assessment, dysarthria detection and automatic speech recognition. Most existing work on speaking rate estimation relies on heuristically designed steps. In contrast, this work proposes a data-driven approach in which a convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) network jointly optimizes all steps of speaking rate estimation within a single framework. Also, unlike existing deep learning-based methods for speaking rate estimation, the proposed approach estimates the speaking rate for an entire speech utterance in one go instead of considering segments of a fixed duration. The traditional 19 sub-band energy (SBE) contours serve as the low-level input features of the proposed CNN-BLSTM network; the state-of-the-art direct acoustic feature-based speaking rate estimation techniques are likewise built on 19 SBEs. Experiments are performed separately on three native English speech corpora (Switchboard, TIMIT and CTIMIT) and one non-native English speech corpus (ISLE). Of these, TIMIT and Switchboard are used for training the network. Testing, however, is carried out on all four corpora, as well as on TIMIT and Switchboard with additive noise, namely white, car, high-frequency-channel, cockpit and babble noise, at signal-to-noise ratios of 20, 10 and 0 dB. The proposed CNN-BLSTM approach outperforms the best of the existing techniques in clean as well as noisy conditions for all four corpora. © 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
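
The abstract does not say how the noisy test sets were generated. As an illustration only, the sketch below shows one common way to add a noise recording to clean speech at a target signal-to-noise ratio: scale the noise so that the ratio of mean speech power to mean noise power matches the requested value in dB. The function names `add_noise_at_snr` and `measured_snr_db`, the synthetic sine "speech" and the 16 kHz sampling rate are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the target SNR in dB."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Repeat or trim the noise so that it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain g so that speech_power / (g**2 * noise_power) = 10**(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


def measured_snr_db(clean, noisy):
    """SNR in dB, computed from the clean signal and the added residual."""
    residual = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))


# Synthetic stand-ins: a 200 Hz tone as "speech" and white noise as "noise".
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)
noise = rng.standard_normal(4000)

# The three SNR levels named in the abstract.
for snr_db in (20, 10, 0):
    noisy = add_noise_at_snr(speech, noise, snr_db)
    print(snr_db, round(measured_snr_db(speech, noisy), 2))
```

Measuring power over the whole utterance is the simplest convention; evaluations that measure speech power only over active (non-silent) regions would give different scaling for the same nominal SNR.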

Item Type:	Journal Article
Publication:	Circuits, Systems, and Signal Processing
Publisher:	Birkhauser
Additional Information:	The copyright for this article belongs to Birkhauser
Keywords:	Acoustic noise; Additive noise; Deep learning; Electric switchboards; Signal to noise ratio; Speech; Speech recognition; Well testing; Acoustic features; Automatic speech recognition; Data-driven approach; High frequency channels; Learning-based methods; Low-level features; Pronunciation assessment; State of the art; Convolutional neural networks
Department/Centre:	Division of Electrical Sciences > Electrical Engineering
Date Deposited:	30 Aug 2021 06:21
Last Modified:	30 Aug 2021 06:21
URI:	http://eprints.iisc.ac.in/id/eprint/69556


