Unsupervised raw waveform representation learning for ASR

Agrawal, P and Ganapathy, S (2019) Unsupervised raw waveform representation learning for ASR. In: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019, 15 - 19 September 2019, Graz, pp. 3451-3455.

Official URL: https://doi.org/10.21437/Interspeech.2019-2652

Abstract

In this paper, we propose a deep representation learning approach using the raw speech waveform in an unsupervised learning paradigm. The first layer of the proposed deep model performs acoustic filtering while the subsequent layer performs modulation filtering. The acoustic filterbank is implemented using cosine-modulated Gaussian filters whose parameters are learned. The modulation filtering is performed on the log-transformed outputs of the first layer, and this is achieved using a skip-connection-based architecture. The outputs from this two-layer filtering are fed to a variational autoencoder (VAE) model. All the model parameters, including those of the filtering layers, are learned using the VAE cost function. We employ the learned representations (second-layer outputs) in a speech recognition task. Experiments are conducted on the Aurora-4 (additive noise with channel artifact) and CHiME-3 (additive noise with reverberation) databases. In these experiments, the learned representations from the proposed framework provide significant improvements in ASR results over the baseline filterbank features and other robust front-ends (average relative improvements of 16% and 6% in word error rate over baseline features on clean and multi-condition training, respectively, on the Aurora-4 dataset, and 21% over the baseline features on the CHiME-3 database). Copyright © 2019 ISCA
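
As an illustration of the first-layer idea described in the abstract, the following is a minimal NumPy sketch of a cosine-modulated Gaussian filterbank followed by a log transform. It is not the authors' implementation: the function names, the kernel length (401 taps), the 16 kHz sample rate, the time-domain width `sigma_s`, and the frame pooling are all assumptions made for the example, and the filter parameters are fixed numbers here rather than learned through the VAE cost function.

```python
import numpy as np


def cos_gauss_kernel(center_hz, sigma_s, length=401, sample_rate=16000):
    """Cosine-modulated Gaussian kernel: cos(2*pi*f_c*t) * exp(-t^2 / (2*sigma^2)).

    All default values are illustrative assumptions, not taken from the paper.
    """
    # Time axis in seconds, centred on zero so the kernel is symmetric.
    t = (np.arange(length) - (length - 1) / 2) / sample_rate
    return np.cos(2 * np.pi * center_hz * t) * np.exp(-t**2 / (2 * sigma_s**2))


def log_band_energies(waveform, kernels, frame_len=400, hop=160, eps=1e-8):
    """Filter a 1-D waveform with each kernel, pool power per frame, take the log."""
    # One filtered signal per band, shape (bands, samples).
    filtered = np.stack([np.convolve(waveform, k, mode="same") for k in kernels])
    power = filtered**2
    n_frames = 1 + (power.shape[1] - frame_len) // hop
    # Average power inside each frame, shape (bands, frames).
    pooled = np.stack(
        [power[:, i * hop : i * hop + frame_len].mean(axis=1) for i in range(n_frames)],
        axis=1,
    )
    return np.log(pooled + eps)


# Usage: three hypothetical bands applied to one second of a 1 kHz tone.
sample_rate = 16000
centers_hz = [500.0, 1000.0, 2000.0]
kernels = [cos_gauss_kernel(f, sigma_s=0.001, sample_rate=sample_rate) for f in centers_hz]
tone = np.sin(2 * np.pi * 1000.0 * np.arange(sample_rate) / sample_rate)
features = log_band_energies(tone, kernels)
```

In this sketch the band centred at 1 kHz responds most strongly to the tone. In the model described above, the centre frequencies would instead be trainable and updated jointly with the rest of the network.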

Item Type:	Conference Paper
Publication:	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher:	International Speech Communication Association
Additional Information:	The copyright for this article belongs to the International Speech Communication Association.
Keywords:	Convolutional variational autoencoder; Cosine-modulated Gaussian filterbank; Raw speech waveform; Speech recognition; Unsupervised representation learning
Department/Centre:	Division of Electrical Sciences > Electrical Engineering
Date Deposited:	05 Dec 2022 07:03
Last Modified:	05 Dec 2022 07:03
URI:	https://eprints.iisc.ac.in/id/eprint/78247