
Speech rate task-specific representation learning from acoustic-articulatory data

Mannem, R and Hima Jyothi, R and Illa, A and Ghosh, PK (2020) Speech rate task-specific representation learning from acoustic-articulatory data. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 25-29 October 2020, Shanghai, China, pp. 2892-2896.

Official URL: https://dx.doi.org/10.21437/Interspeech.2020-2259

Abstract

In this work, speech rate is estimated using task-specific representations learned from acoustic-articulatory data, in contrast to generic representations, which may not be optimal for speech rate estimation. 1-D convolutional filters are used to learn speech-rate-specific acoustic representations from raw speech. A convolutional dense neural network (CDNN) is used to estimate the speech rate from the learned representations. In practice, articulatory data is not directly available; thus, we use Acoustic-to-Articulatory Inversion (AAI) to derive articulatory representations from acoustics. However, these pseudo-articulatory representations are also generic and not optimized for any task. To learn speech-rate-specific pseudo-articulatory representations, we propose joint training of a BLSTM-based AAI model and the CDNN using a weighted loss function that combines the losses for speech rate estimation and articulatory prediction. The proposed model improves speech rate estimation by ~18.5% in terms of Pearson correlation coefficient (CC) compared to the baseline CDNN model with generic articulatory representations as inputs. To utilize complementary information from articulatory features, we further perform experiments by concatenating task-specific acoustic and pseudo-articulatory representations, which yields an improvement in CC of ~2.5% compared to the baseline CDNN model. © 2020 ISCA
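
The weighted loss and the Pearson correlation coefficient mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so the convex combination with a weight `alpha`, the use of mean squared error for both terms, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
from math import sqrt


def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def weighted_joint_loss(rate_pred, rate_true, artic_pred, artic_true, alpha=0.5):
    """Hypothetical weighted loss: alpha weights the speech rate term and
    (1 - alpha) weights the articulatory prediction term (assumed form)."""
    rate_loss = mse(rate_pred, rate_true)
    artic_loss = mse(artic_pred, artic_true)
    return alpha * rate_loss + (1.0 - alpha) * artic_loss


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

With `alpha` close to 1 the sketch emphasizes speech rate estimation, and with `alpha` close to 0 it emphasizes articulatory prediction; how the paper balances the two terms is not stated in the abstract.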

Item Type: Conference Paper
Publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association
Additional Information: Cited by 0; Conference: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020; Conference Date: 25 October 2020 through 29 October 2020; Conference Code: 165507
Keywords: Convolution; Convolutional neural networks; Correlation methods; Speech; Articulatory data; Articulatory features; Articulatory inversion; Generic representation; Model yields; Pearson correlation coefficients; Speech rates; Weighted loss function; Speech communication
Department/Centre: Division of Electrical Sciences > Electrical Engineering
Date Deposited: 12 Jan 2021 05:41
Last Modified: 12 Jan 2021 05:41
URI: http://eprints.iisc.ac.in/id/eprint/67640
