Whisper activity detection using CNN-LSTM based attention pooling network trained for a speaker identification task

Naini, AR and Satyapriya, M and Ghosh, PK (2020) Whisper activity detection using CNN-LSTM based attention pooling network trained for a speaker identification task. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 25-29 October 2020, Shanghai, China, pp. 2922-2926.

Official URL: https://dx.doi.org/10.21437/Interspeech.2020-3217

Abstract

In this work, we propose a method for detecting whispered speech regions in a noisy audio file, a task called whisper activity detection (WAD). Because whispered speech lacks pitch and is noisy in nature, WAD is considerably more challenging than standard voice activity detection (VAD). We propose a long short-term memory (LSTM) based WAD algorithm. This LSTM network, however, is trained as an attention pooling layer of a convolutional neural network (CNN) that is trained for a speaker identification task. WAD experiments with 186 speakers, eight noise types, and seven signal-to-noise ratio (SNR) conditions show that the proposed method performs better than the best baseline scheme in most conditions. In particular, for unknown noises and environmental conditions, the proposed WAD method performs significantly better than the best baseline scheme. Another key advantage of the proposed method is that it requires only a small annotated portion of the training data to fine-tune the post-processing parameters, unlike the existing baseline schemes, which require the full training data to be annotated with whispered speech regions. Copyright © 2020 ISCA
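
The attention pooling idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the network details, so the per-frame scores (which in the paper would come from the LSTM) are simply passed in as an array, and the smoothing-and-threshold post-processing, together with the names `attention_pool`, `detect_active_frames`, `smooth_len`, and `rel_threshold`, are assumptions made for illustration only.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array of frame scores."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def attention_pool(frame_features, frame_scores):
    """Pool frame-level features into one utterance-level vector.

    frame_features: (T, D) array, e.g. CNN outputs for T frames.
    frame_scores:   (T,) array of unnormalized attention scores
                    (assumed here to be produced by an LSTM).
    Returns the pooled (D,) vector and the (T,) attention weights.
    """
    weights = softmax(frame_scores)
    pooled = weights @ frame_features  # weighted sum over frames
    return pooled, weights


def detect_active_frames(weights, smooth_len=5, rel_threshold=1.0):
    """Hypothetical post-processing: smooth the attention weights with a
    moving average and mark frames whose smoothed weight exceeds
    rel_threshold times the uniform weight 1/T."""
    kernel = np.ones(smooth_len) / smooth_len
    smoothed = np.convolve(weights, kernel, mode="same")
    return smoothed > rel_threshold / len(weights)
```

In a sketch like this, the pooled vector would feed the speaker identification classifier during training, while the attention weights, which need no frame-level annotation to learn, would be read out at test time as a per-frame activity indicator. Only the two post-processing parameters would have to be tuned on annotated data, which is in the spirit of the small-annotation advantage the abstract mentions.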

Item Type:	Conference Paper
Publication:	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher:	International Speech Communication Association
Additional Information:	cited By 0; Conference of 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 ; Conference Date: 25 October 2020 Through 29 October 2020; Conference Code:165507
Keywords:	Continuous speech recognition; Convolutional neural networks; Loudspeakers; Multilayer neural networks; Signal to noise ratio; Speech communication; Activity detection; Environmental conditions; Post processing; Speaker identification; Training data; Unknown noise; Voice activity detection; Whispered speech; Long short-term memory
Department/Centre:	Division of Electrical Sciences > Electrical Engineering
Date Deposited:	12 Jan 2021 06:47
Last Modified:	12 Jan 2021 06:47
URI:	http://eprints.iisc.ac.in/id/eprint/67646
