Speaker and Language Aware Training for End-To-End ASR

Bansal, S and Malhotra, K and Ganapathy, S (2019) Speaker and Language Aware Training for End-To-End ASR. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019, 15-18 December 2019, Singapore, pp. 494-501.

Official URL: https://dx.doi.org/10.1109/ASRU46091.2019.9004000

Abstract

The end-to-end (E2E) approach to automatic speech recognition (ASR) is a simplified and elegant approach in which a single deep neural network model directly converts the acoustic feature sequence to the text sequence. The current approach to end-to-end ASR uses the neural network model (trained with a sequence loss) along with an external character/word based language model (LM) in a decoding pass to output the text sequence. In this work, we propose a new objective function for end-to-end ASR training where the LM score is explicitly introduced in the attention model loss function without any additional training parameters. In this manner, the neural network is made LM aware, and this simplifies the model training process. We also propose to incorporate an attention based sequence summary feature in the ASR model, which allows the system to be speaker aware. With several E2E ASR experiments on the TED-LIUM, WSJ and Librispeech datasets, we show that the proposed speaker and LM aware training improves ASR performance significantly over state-of-the-art E2E approaches. We achieve the best published results for the WSJ dataset.

Item Type:	Conference Paper
Publication:	2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
Publisher:	Institute of Electrical and Electronics Engineers Inc.
Additional Information:	Cited by 0; Conference of 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019; Conference Date: 15 December 2019 through 18 December 2019; Conference Code: 157953
Keywords:	Arts computing; Character recognition; Computational linguistics; Deep neural networks; Modeling languages; Acoustic features; Automatic speech recognition; End to end; Language model; Neural network model; Objective functions; Speaker adaptation; Training parameters; Speech recognition
Department/Centre:	Division of Electrical Sciences > Electrical Engineering
Date Deposited:	18 Aug 2020 09:51
Last Modified:	18 Aug 2020 09:51
URI:	http://eprints.iisc.ac.in/id/eprint/65001
