Madhavaraj, A and Ramakrishnan, AG (2019) Scattering transform inspired filterbank learning from raw speech for better acoustic modeling. In: 2019 IEEE Region 10 Conference: Technology, Knowledge, and Society, TENCON 2019, 17-20 October 2019, Hotel Grand Hyatt, Kerala, India, pp. 1154-1158.
Abstract
We propose a neural network architecture that operates on the raw speech signal. The first layer contains a series of 1D time-domain filters. Its output is fed to the second layer, a bank of 2D convolution filters that capture the spectro-temporal modulations in the speech signal. The outputs of these two layers are concatenated, normalized, and fed to a feed-forward neural network that predicts the senone posteriors used for ASR decoding. During training, we employ different strategies: the 1D and 2D filters are initialized with either (a) Gabor filters or (b) random values, and the filter coefficients are either (i) updated along with the other affine transform parameters of the network or (ii) kept fixed. ASR experiments are conducted on 160 hours of Tamil speech data. The proposed architecture gives absolute improvements in word error rate (WER) of 1.35 and 1.21 over neural network models trained on mel-frequency cepstral coefficients and log-filterbank energy features, respectively. We also compare the various strategies for filter initialization and training and report the resulting WERs. © 2019 IEEE.
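The pipeline described in the abstract (1D filters on the waveform, 2D filters on the resulting time-frequency map, concatenation, normalization) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the complex Gabor form, the log-magnitude nonlinearity, the filter lengths, centre frequencies, hop size, and all function names (`gabor_1d`, `gabor_2d`, `first_layer`, `second_layer`, `normalize`) are assumptions chosen for illustration, and the feed-forward classifier and training are omitted.

```python
import cmath
import math

def gabor_1d(length, freq_hz, sigma_s, sample_rate):
    """Unit-energy complex 1D Gabor filter: Gaussian envelope times a complex carrier."""
    mid = (length - 1) / 2.0
    taps = []
    for n in range(length):
        t = (n - mid) / sample_rate
        envelope = math.exp(-0.5 * (t / sigma_s) ** 2)
        taps.append(envelope * cmath.exp(2j * math.pi * freq_hz * t))
    norm = math.sqrt(sum(abs(c) ** 2 for c in taps))
    return [c / norm for c in taps]

def gabor_2d(rows, cols, row_freq, col_freq, sigma):
    """Real 2D Gabor patch over (filter index, frame index); frequencies in cycles per step."""
    r0, c0 = (rows - 1) / 2.0, (cols - 1) / 2.0
    return [[math.exp(-0.5 * ((r - r0) ** 2 + (c - c0) ** 2) / sigma ** 2)
             * math.cos(2.0 * math.pi * (row_freq * (r - r0) + col_freq * (c - c0)))
             for c in range(cols)] for r in range(rows)]

def first_layer(signal, bank, hop):
    """Slide each 1D filter over the waveform; keep the log-magnitude every `hop` samples."""
    length = len(bank[0])
    starts = range(0, len(signal) - length + 1, hop)
    return [[math.log(abs(sum(h * signal[s + k] for k, h in enumerate(taps))) + 1e-6)
             for s in starts] for taps in bank]

def second_layer(tf_map, kernel):
    """'Valid' 2D correlation of the time-frequency map with one 2D kernel."""
    kr, kc = len(kernel), len(kernel[0])
    return [[sum(kernel[a][b] * tf_map[i + a][j + b]
                 for a in range(kr) for b in range(kc))
             for j in range(len(tf_map[0]) - kc + 1)]
            for i in range(len(tf_map) - kr + 1)]

def normalize(vec):
    """Shift and scale a feature vector to zero mean and (near) unit variance."""
    mean = sum(vec) / len(vec)
    std = math.sqrt(sum((v - mean) ** 2 for v in vec) / len(vec)) + 1e-8
    return [(v - mean) / std for v in vec]

# Assumed toy setup: a 0.1 s, 440 Hz tone sampled at 16 kHz stands in for speech.
SAMPLE_RATE = 16000
signal = [math.sin(2.0 * math.pi * 440.0 * n / SAMPLE_RATE) for n in range(1600)]

# Four Gabor-initialized 1D filters (centre frequencies are arbitrary choices).
bank = [gabor_1d(401, f, 0.005, SAMPLE_RATE) for f in (200.0, 440.0, 1000.0, 2000.0)]

layer1 = first_layer(signal, bank, hop=160)                     # 4 filters x 8 frames
layer2 = second_layer(layer1, gabor_2d(3, 3, 0.25, 0.25, 1.0))  # 2 x 6 modulation map

# Concatenate both layers' outputs and normalize, as the abstract describes.
features = normalize([v for row in layer1 for v in row]
                     + [v for row in layer2 for v in row])
```

In a trainable version, the filter taps would be network parameters that are either updated by backpropagation or held fixed, matching the two training strategies compared in the abstract; `features` would then be the input to the feed-forward senone classifier.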
| Item Type: | Conference Paper |
|---|---|
| Publication: | IEEE Region 10 Annual International Conference, Proceedings/TENCON |
| Publisher: | Institute of Electrical and Electronics Engineers Inc. |
| Additional Information: | Copyright of this article belongs to IEEE |
| Keywords: | Affine transforms; Deep neural networks; Feedforward neural networks; Filter banks; Gabor filters; Modulation; Network architecture; Speech; Speech communication; Speech recognition; Filter coefficients; Mel frequency cepstral co-efficient; Neural network model; Proposed architectures; Scale filter; Scattering transforms; Spectro-temporal modulations; Word error rate; Multilayer neural networks |
| Department/Centre: | Division of Electrical Sciences > Electrical Engineering |
| Date Deposited: | 25 Feb 2020 10:29 |
| Last Modified: | 25 Sep 2022 08:42 |
| URI: | https://eprints.iisc.ac.in/id/eprint/64440 |