Scattering transform inspired filterbank learning from raw speech for better acoustic modeling

Madhavaraj, A and Ramakrishnan, AG (2019) Scattering transform inspired filterbank learning from raw speech for better acoustic modeling. In: 2019 IEEE Region 10 Conference: Technology, Knowledge, and Society, TENCON 2019, 17-20 October 2019, Hotel Grand Hyatt, Kerala, India, pp. 1154-1158.

Official URL: https://doi.org/10.1109/TENCON.2019.8929240

Abstract

We propose a neural network architecture that operates on the raw speech signal. Its first layer contains a series of 1D time-domain filters. The output of this layer is fed to the second layer, a bank of 2D convolution filters that capture the spectro-temporal modulations in the speech signal. The outputs of these two layers are concatenated, normalized, and fed to a feed-forward neural network that predicts the senone posteriors used for ASR decoding. We explore several training strategies: the 1D and 2D filters are initialized with either (a) Gabor filters or (b) random values, and the filter coefficients are either (i) updated along with the other affine transform parameters of the network or (ii) kept fixed during training. ASR experiments are conducted on 160 hours of Tamil speech data. The proposed architecture gives absolute improvements in word error rate (WER) of 1.35 and 1.21 over neural network models trained on mel-frequency cepstral coefficients and log-filterbank energy features, respectively. We also compare the WERs obtained with the various filter initialization and training strategies. © 2019 IEEE.
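
The two-layer front end described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify filter lengths, bandwidths, the nonlinearity, the pooling, or the type of normalization, so everything below is an assumption made for illustration: cosine-phase Gabor filters, rectification followed by frame averaging and a log after the first layer, 3x3 Gabor kernels in the second layer, and per-dimension mean and variance normalization. All function names are hypothetical. The sketch shows only the Gabor-initialized, fixed-filter case; the feed-forward senone classifier and the learnable-filter variants are omitted.

```python
import math

def gabor_1d(length, center_hz, bandwidth_hz, sample_rate):
    """Cosine-phase 1D Gabor filter: Gaussian envelope times a cosine carrier."""
    # Envelope width in samples; this bandwidth relation is an assumed convention.
    sigma = sample_rate / (2.0 * math.pi * bandwidth_hz)
    mid = (length - 1) / 2.0
    return [math.exp(-0.5 * ((n - mid) / sigma) ** 2)
            * math.cos(2.0 * math.pi * center_hz * (n - mid) / sample_rate)
            for n in range(length)]

def gabor_2d(n_t, n_f, omega_t, omega_f):
    """2D Gabor patch over (frame, channel) with modulation rates in radians per step."""
    ct, cf = (n_t - 1) / 2.0, (n_f - 1) / 2.0
    return [[math.exp(-0.5 * (((t - ct) / (n_t / 4.0)) ** 2
                              + ((f - cf) / (n_f / 4.0)) ** 2))
             * math.cos(omega_t * (t - ct) + omega_f * (f - cf))
             for f in range(n_f)] for t in range(n_t)]

def correlate_1d(x, h):
    """'Valid' sliding dot product, as computed by a 1D convolutional layer."""
    k = len(h)
    return [sum(x[i + j] * h[j] for j in range(k)) for i in range(len(x) - k + 1)]

def correlate_2d(img, ker):
    """'Valid' 2D sliding dot product over a [frame][channel] map."""
    kt, kf = len(ker), len(ker[0])
    return [[sum(img[t + a][f + b] * ker[a][b] for a in range(kt) for b in range(kf))
             for f in range(len(img[0]) - kf + 1)]
            for t in range(len(img) - kt + 1)]

def first_layer(signal, filters, frame_len, hop):
    """Filter, rectify, average over each frame, take the log. Returns [frame][filter]."""
    per_filter = []
    for h in filters:
        y = [abs(v) for v in correlate_1d(signal, h)]
        per_filter.append([math.log(1e-6 + sum(y[s:s + frame_len]) / frame_len)
                           for s in range(0, len(y) - frame_len + 1, hop)])
    return [list(frame) for frame in zip(*per_filter)]

def second_layer(tf_map, kernels):
    """Apply each 2D kernel to the first-layer map. Returns [frame][feature]."""
    maps = [correlate_2d(tf_map, k) for k in kernels]
    return [[v for m in maps for v in m[t]] for t in range(len(maps[0]))]

def normalize(features):
    """Per-dimension mean and variance normalization across frames."""
    n = len(features)
    cols = list(zip(*features))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in features]

def extract_features(signal, filters_1d, kernels_2d, frame_len, hop):
    """Concatenate both layers' outputs per frame, then normalize."""
    l1 = first_layer(signal, filters_1d, frame_len, hop)
    l2 = second_layer(l1, kernels_2d)
    offset = len(kernels_2d[0]) // 2  # centre-align layer 1 with the shorter layer-2 output
    joint = [l1[t + offset] + l2[t] for t in range(len(l2))]
    return normalize(joint)

# Toy input: an amplitude-modulated 600 Hz tone, 0.25 s at 8 kHz (illustrative values).
sr = 8000
signal = [(1.0 + 0.5 * math.sin(2 * math.pi * 4 * n / sr))
          * math.sin(2 * math.pi * 600 * n / sr) for n in range(2000)]
filters_1d = [gabor_1d(65, fc, 200.0, sr) for fc in (300.0, 600.0, 1200.0, 2400.0)]
kernels_2d = [gabor_2d(3, 3, wt, wf) for wt, wf in ((0.0, 1.0), (1.0, 0.0))]
feats = extract_features(signal, filters_1d, kernels_2d, frame_len=200, hop=80)
```

In this sketch `feats` holds one feature vector per frame, which would be the input to the feed-forward network. Replacing `gabor_1d` and `gabor_2d` with random initial values, or treating the filter coefficients as trainable parameters in a deep learning framework, would correspond to the other strategies the abstract mentions.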

Item Type:	Conference Paper
Publication:	IEEE Region 10 Annual International Conference, Proceedings/TENCON
Publisher:	Institute of Electrical and Electronics Engineers Inc.
Additional Information:	Copyright of this article belongs to IEEE
Keywords:	Affine transforms; Deep neural networks; Feedforward neural networks; Filter banks; Gabor filters; Modulation; Network architecture; Speech; Speech communication; Speech recognition; Filter coefficients; Mel frequency cepstral coefficient; Neural network model; Proposed architectures; Scale filter; Scattering transforms; Spectro-temporal modulations; Word error rate; Multilayer neural networks
Department/Centre:	Division of Electrical Sciences > Electrical Engineering
Date Deposited:	25 Feb 2020 10:29
Last Modified:	25 Sep 2022 08:42
URI:	https://eprints.iisc.ac.in/id/eprint/64440
