Naini, AR and Satyapriya, M and Ghosh, PK (2020) Whisper activity detection using CNN-LSTM based attention pooling network trained for a speaker identification task. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 25-29 October 2020, Shanghai, China, pp. 2922-2926.
Abstract
In this work, we propose a method, called whisper activity detection (WAD), to detect whispered speech regions in a noisy audio file. The lack of pitch and the noisy nature of whispered speech make WAD a far more challenging task than standard voice activity detection (VAD). We propose a long short-term memory (LSTM) based WAD algorithm in which the LSTM network is trained as an attention pooling layer on top of a convolutional neural network (CNN) that is itself trained for a speaker identification task. WAD experiments with 186 speakers, eight noise types, and seven signal-to-noise ratio (SNR) conditions show that the proposed method performs better than the best baseline scheme in most conditions. In particular, for unknown noises and environmental conditions, the proposed WAD performs significantly better than the best baseline scheme. Another key advantage of the proposed WAD method is that it requires only a small, annotated part of the training data to fine-tune the post-processing parameters, unlike the existing baseline schemes, which require the full training data to be annotated with the whispered speech regions. Copyright © 2020 ISCA
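The attention pooling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that per-frame attention scores are already available (in the paper they would come from the LSTM; here they are plain input numbers), and it assumes that whisper regions are obtained by thresholding the attention weights and discarding short runs. The names `attention_pool`, `weights_to_regions`, `threshold`, and `min_len` are hypothetical and stand in for the post-processing parameters that the abstract mentions without specifying.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of per-frame scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, scores):
    """Pool frame embeddings into one vector using softmax attention weights.

    frames: list of equal-length feature vectors (e.g. CNN outputs per frame).
    scores: one unnormalised attention score per frame (assumed to be given).
    """
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


def weights_to_regions(weights, threshold, min_len):
    """Assumed post-processing: keep runs of frames whose weight is at least
    `threshold` and whose length is at least `min_len` frames.

    Returns a list of (start, end) frame indices, with `end` exclusive.
    """
    regions, start = [], None
    # A trailing False closes any run that reaches the last frame.
    for i, active in enumerate([w >= threshold for w in weights] + [False]):
        if active and start is None:
            start = i
        elif not active and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


# Toy example: four 2-dimensional frame embeddings and made-up scores.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
scores = [0.0, 2.0, 2.0, 0.0]
pooled, weights = attention_pool(frames, scores)
regions = weights_to_regions(weights, threshold=0.25, min_len=2)
```

In a speaker identification setup, the pooled vector would feed the classifier, so the weights are learned without frame-level whisper labels; under the assumption above, only the two post-processing parameters would need annotated data to tune.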
Item Type: | Conference Paper |
---|---|
Publication: | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Publisher: | International Speech Communication Association |
Additional Information: | cited By 0; Conference of 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 ; Conference Date: 25 October 2020 Through 29 October 2020; Conference Code:165507 |
Keywords: | Continuous speech recognition; Convolutional neural networks; Loudspeakers; Multilayer neural networks; Signal to noise ratio; Speech communication, Activity detection; Environmental conditions; Post processing; Speaker identification; Training data; Unknown noise; Voice activity detection; Whispered speech, Long short-term memory |
Department/Centre: | Division of Electrical Sciences > Electrical Engineering |
Date Deposited: | 12 Jan 2021 06:47 |
Last Modified: | 12 Jan 2021 06:47 |
URI: | http://eprints.iisc.ac.in/id/eprint/67646 |