
Estimating articulatory movements in speech production with transformer networks

Udupa, S and Roy, A and Singh, A and Illa, A and Ghosh, PK (2021) Estimating articulatory movements in speech production with transformer networks. In: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, 30 Aug - 03 Sep 2021, Brno, pp. 3156-3160.

Official URL: https://doi.org/10.21437/Interspeech.2021-1375

Abstract

We estimate articulatory movements in speech production from two different modalities: acoustics and phonemes. Acoustic-to-articulatory inversion (AAI) is a sequence-to-sequence task. Phoneme-to-articulatory (PTA) motion estimation, on the other hand, faces a key challenge in reliably aligning the text and the articulatory movements. To address this challenge, we explore the use of a transformer architecture, FastSpeech, with explicit duration modelling to learn hard alignments between the phonemes and articulatory movements. We also train a transformer model on AAI. We use correlation coefficient (CC) and root mean squared error (rMSE) to assess the estimation performance in comparison to existing methods on both tasks. We observe 154%, 11.8% and 4.8% relative improvement in CC with subject-dependent, pooled and fine-tuning strategies, respectively, for PTA estimation. Additionally, on the AAI task, we obtain 1.5%, 3% and 3.1% relative gain in CC on the same setups compared to the state-of-the-art baseline. We further present the computational benefits of using transformer architectures as representation blocks. Copyright © 2021 ISCA.
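The two ideas the abstract relies on, hard alignment through explicit durations and evaluation by CC and rMSE, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`length_regulate`, `pearson_cc`, `rmse`, `relative_improvement`) are hypothetical, and the sketch assumes that durations are given as integer frame counts per phoneme, that metrics are computed on a single articulatory trajectory, and that "relative improvement" means the percentage change over the baseline CC.

```python
import math


def length_regulate(phoneme_vectors, durations):
    """Expand phoneme-level vectors to frame level (hard alignment).

    Each phoneme vector is repeated durations[i] times, in the spirit of a
    FastSpeech-style length regulator. Durations are assumed to be
    non-negative integer frame counts.
    """
    if len(phoneme_vectors) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for vec, dur in zip(phoneme_vectors, durations):
        # Copy the vector for every frame so frames do not share storage.
        frames.extend(list(vec) for _ in range(dur))
    return frames


def rmse(pred, target):
    """Root mean squared error between two equal-length trajectories."""
    n = len(pred)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / n)


def pearson_cc(pred, target):
    """Pearson correlation coefficient between two equal-length trajectories."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def relative_improvement(new_cc, baseline_cc):
    """Percentage change of new_cc over baseline_cc (assumed definition)."""
    return 100.0 * (new_cc - baseline_cc) / baseline_cc
```

For example, `length_regulate([[1.0], [2.0]], [2, 1])` yields three frames, two copies of the first phoneme vector followed by one of the second, so the frame-level sequence has the same length as an articulatory trajectory of three frames.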

Item Type: Conference Paper
Publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association
Additional Information: The copyright for this article belongs to International Speech Communication Association
Keywords: Mean square error; Network architecture; Speech communication; Acoustics to articulatory inversion; Articulatory inversion; Correlation coefficient; Duration modelling; Electromagnetic articulograph; Electromagnetics; Explicit duration; Phoneme to articulatory estimation; Speech production; Transformer network; Motion estimation
Department/Centre: Division of Electrical Sciences > Electrical Engineering
Date Deposited: 03 Dec 2021 08:53
Last Modified: 03 Dec 2021 08:53
URI: http://eprints.iisc.ac.in/id/eprint/70646
