
Acoustic and articulatory feature based speech rate estimation using a convolutional dense neural network

Mannem, R and Mallela, J and Illa, A and Ghosh, PK (2019) Acoustic and articulatory feature based speech rate estimation using a convolutional dense neural network. In: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019, 15 - 19 September 2019, Graz, pp. 929-933.

Official URL: https://doi.org/10.21437/Interspeech.2019-2295

Abstract

In this paper, we propose a speech rate estimation approach using a convolutional dense neural network (CDNN). The CDNN based approach uses acoustic and articulatory features for speech rate estimation. Mel frequency cepstral coefficients (MFCCs) are used as acoustic features, and articulograms, which represent the time-varying vocal tract profile, are used as articulatory features. The articulogram is computed from a real-time magnetic resonance imaging (rtMRI) video of the midsagittal plane of a subject while speaking. In practice, however, articulogram features are not directly available, unlike acoustic features from a speech recording. Thus, we use an acoustic-to-articulatory inversion (AAI) method based on a bidirectional long short-term memory (BLSTM) network, which estimates the articulogram features from the acoustics. The proposed CDNN based approach using estimated articulatory features requires both acoustic and articulatory features during training, but only acoustic data during testing. Experiments are conducted using rtMRI videos from four subjects, each speaking 460 sentences. The Pearson correlation coefficient is used to evaluate the speech rate estimation. The CDNN based approach is found to give a better correlation coefficient than the temporal correlation and selected sub-band correlation (TCSSBC) based baseline scheme, by 81.58% and 73.68% (relative) in seen and unseen subject conditions, respectively.
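The pipeline outlined in the abstract can be sketched in code as follows. This is a minimal, illustrative PyTorch sketch, not the authors' implementation: the layer sizes, frame counts, MFCC dimension, articulogram dimension, and pooling choices are placeholder assumptions, and the speech rate target is treated as a single scalar per utterance. It only shows the overall structure: a BLSTM network for acoustic-to-articulatory inversion, a CDNN that regresses speech rate from concatenated acoustic and (estimated) articulatory features, and a Pearson correlation for evaluation.

```python
import torch
import torch.nn as nn

class BLSTMInversion(nn.Module):
    """Acoustic-to-articulatory inversion: maps MFCC frames to articulogram frames (dims assumed)."""
    def __init__(self, n_mfcc=13, artic_dim=60, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(n_mfcc, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, artic_dim)

    def forward(self, mfcc):                      # (batch, frames, n_mfcc)
        h, _ = self.blstm(mfcc)
        return self.out(h)                        # (batch, frames, artic_dim)

class CDNNRateEstimator(nn.Module):
    """Convolutional layers over time followed by dense layers, ending in a scalar speech rate."""
    def __init__(self, feat_dim=13 + 60):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),              # pool over the time axis
        )
        self.dense = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),                     # one rate value per utterance
        )

    def forward(self, feats):                     # (batch, frames, feat_dim)
        x = feats.transpose(1, 2)                 # -> (batch, feat_dim, frames)
        return self.dense(self.conv(x).squeeze(-1)).squeeze(-1)

def pearson_r(pred, target):
    """Pearson correlation coefficient between estimated and ground-truth speech rates."""
    p, t = pred - pred.mean(), target - target.mean()
    return (p * t).sum() / (p.norm() * t.norm() + 1e-8)

# At test time only acoustics are available: estimate articulograms with the
# inversion network, concatenate with the MFCCs, and predict the speech rate.
mfcc = torch.randn(8, 200, 13)                    # toy batch: 8 utterances, 200 frames each
est_artic = BLSTMInversion()(mfcc)
rates = CDNNRateEstimator()(torch.cat([mfcc, est_artic], dim=-1))
print(rates.shape)                                # torch.Size([8])
```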

Item Type: Conference Paper
Publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association
Additional Information: The copyright for this article belongs to International Speech Communication Association.
Keywords: Acoustic-to-articulatory inversion; Articulogram; Bidirectional long-short-term memory; Convolutional dense neural network; Speech rate estimation
Department/Centre: Division of Electrical Sciences > Electrical Engineering
Date Deposited: 05 Dec 2022 10:01
Last Modified: 05 Dec 2022 10:01
URI: https://eprints.iisc.ac.in/id/eprint/78254
