Dereverberation of Autoregressive Envelopes for Far-field Speech Recognition

The task of speech recognition in far-field environments is adversely affected by reverberant artifacts that manifest as a temporal smearing of the sub-band envelopes. In this paper, we develop a neural model for speech dereverberation using the long-term sub-band envelopes of speech. The sub-band envelopes are derived using frequency domain linear prediction (FDLP), which performs an autoregressive estimation of the Hilbert envelopes. The neural dereverberation model estimates an envelope gain which, when applied to the reverberant signal, suppresses the late reflection components in the far-field signal. The dereverberated envelopes are used for feature extraction in speech recognition. Further, the sequence of steps involved in envelope dereverberation, feature extraction and acoustic modeling for ASR can be implemented as a single neural processing pipeline, which allows the joint learning of the dereverberation network and the acoustic model. Several experiments are performed on the REVERB challenge, CHiME-3 and VOiCES datasets. In these experiments, the joint learning of envelope dereverberation and acoustic model yields significant performance improvements over the baseline ASR system based on log-mel spectrograms as well as over other past approaches for dereverberation (average relative improvements of 10-24% over the baseline system). A detailed analysis of the choice of hyper-parameters and the cost function involved in envelope dereverberation is also provided.


Introduction
Automatic speech recognition (ASR) is a challenging task in far-field conditions. This is particularly due to the fact that the captured speech signal is reverberant and noisy. Word error rates (WER) in ASR have seen a dramatic improvement over the past decade due to advancements in deep learning based techniques [1]. Still, the deterioration in performance in noisy and reverberant conditions persists [2]. A relative increase in WER of 75% is reported in [3,4] when the signal from a headset microphone is replaced with far-field array microphone signals in the ASR systems. This deterioration is due to the temporal smearing of the time domain envelopes caused by reverberation [5].
One common approach to suppress reverberation is to combine all channels by beamforming [6] before feeding the signal to the ASR system. Recently, an unsupervised neural mask estimator for generalized eigenvalue beamforming was proposed [7].
Traditional pre-processing also includes weighted prediction error (WPE) [8] based dereverberation along with beamforming in most state-of-the-art far-field ASR systems. Further, multi-condition training is usually used to alleviate the mismatch between training and testing [9]. Here, either simulated reverberant data or real far-field data can be added to the training data. However, even with these techniques, the beamformed signal shows a significant amount of temporal smearing in the sub-band envelopes. The temporal smearing is caused by the superposition of the direct path signal and the reflected signals, and it leads to ASR performance degradation [10].
In this paper, we analyze the effect of reverberation on sub-band Hilbert envelopes. We show that the effect of reverberation can be approximated as a convolution of the long-term sub-band envelopes of clean speech with the envelope of the room impulse response. In order to compensate for the late reverberation component in the envelope, we explore a Wiener filtering approach where the Wiener filter gain is computed using a deep neural network (DNN). The gain estimation network is implemented using a convolutional long short-term memory (CLSTM) model. The gain is multiplied with the sub-band envelopes to suppress reverberation artifacts. The sub-band envelopes are converted to spectrographic features through integration and used for deep neural network based ASR. The sub-band envelopes are derived using the autoregressive modeling framework of frequency domain linear prediction (FDLP) [11,12].
The steps involved in envelope dereverberation, feature extraction and acoustic modeling for ASR can all be implemented as neural network layers. Therefore, we also propose an approach for joint learning of the speech dereverberation model with the ASR acoustic modeling network as a single neural model. Various ASR experiments are performed on the REVERB challenge dataset [13] as well as the CHiME-3 dataset [14]. In these experiments, we show that the proposed approach improves over the state-of-the-art ASR systems based on log-mel features as well as other past approaches proposed for speech dereverberation and denoising based on deep learning. In addition, we also extend the approach to large vocabulary speech recognition on VOiCES dataset [15,16].
The rest of the paper is organized as follows. The related prior work is discussed in Section 2. This section also discusses the key contributions from the proposed work. Section 3 provides details regarding the reverberation artifacts and autoregressive envelope estimation using frequency domain linear prediction. In Section 4, we discuss the envelope dereverberation model, feature extraction as well as the joint approach to dereverberation with acoustic modeling for ASR. The ASR experiments and results are discussed in Section 5.
Various model parameter choices and additional analyses are reported in Section 6. This is followed by a summary of the work in Section 7.

Related prior work
Xu et al. [17] attempted to find a mapping function from noisy to clean signals using a supervised neural network, which is then used for enhancement in the testing stage. In a similar manner, the speech separation problem has been explored with ideal ratio mask based neural mapping [18]. Zhao et al. proposed an LSTM model for late reflection prediction in the spectrogram domain for reverberant speech [19]. A spectral mapping approach using log-magnitude inputs was attempted by Han et al. [20]. A mask based approach to dereverberation in the complex short-term Fourier transform domain was explored by Williamson et al. [21].
Speech enhancement for speech recognition based on neural networks has been explored in [22,23,24]. In Maas et al. [25], a recurrent neural network is used to map noise-corrupted input features to their corresponding clean versions.
A context aware recurrent neural network based convolutional encoder-decoder architecture was used in [26] to map the power spectral features of noisy speech to those of clean speech. In a recent work by Pandey et al. [27], speech enhancement is learned in the time domain itself, with a matrix multiplication converting the time domain signal into the frequency domain, where the training loss is computed. This approach uses the mean absolute error between the STFT frames of the clean and noisy speech for training.
The joint learning of a speech enhancement neural model and the acoustic model was attempted in [28]. Here, a DNN based speech separation model is coupled with a DNN based acoustic model and the weights are adjusted jointly. Bo Wu et al. [29] proposed to unify a speech enhancement neural model and an acoustic model trained separately, after which the joint model is further trained to improve the ASR performance. The power spectrum in the log domain was used as features in the enhancement stage. Bo Wu et al. [30] also explored an end-to-end deep learning approach, where knowledge of the reverberation time is incorporated in a DNN based dereverberation front end. This reverberation time aware DNN enhancement module and the ASR acoustic module are then trained jointly to optimize the ASR cost.
The key contributions from the current work can be summarized as follows:
• Deriving a signal model for reverberation effects on sub-band speech envelopes and posing the dereverberation problem as a gain estimation problem.
• Dereverberation of the autoregressive estimates of the sub-band envelope using a CLSTM model followed by feature extraction for ASR.
• Joint learning of the dereverberation model parameters and the acoustic model for ASR in a single neural pipeline.
• Illustrating the performance benefits of the proposed approach for multiple ASR tasks.
We use FDLP features [31] for far-field speech. This paper extends the prior work in [32] by proposing a joint neural dereverberation and acoustic modeling approach, which forms an elegant neural learning framework. Further, several ASR experiments with the joint modeling approach are also conducted in this work.

Sub-band Envelopes - Effect of Reverberation and Autoregressive Estimation
We present the signal model for reverberation and the autoregressive model for estimating the sub-band envelopes [10,33].

Signal model
When speech is recorded in a far-field reverberant environment, the data collected at the microphone can be modeled as,

$$r(t) = x(t) * h(t), \qquad (1)$$

where x(t), h(t) and r(t) denote the clean speech signal, the room impulse response and the reverberant speech respectively, and * denotes linear convolution. The room response function can be decomposed as,

$$h(t) = h_e(t) + h_l(t), \qquad (2)$$

where h_e(t) and h_l(t) represent the early and late reflection components.
Let x_q(n), h_q(n) and r_q(n) denote the decimated sub-band clean speech, room response and reverberant speech signals respectively. Here, q = 1, ..., Q denotes the sub-band index and n denotes the decimated time index (frame).
Assuming ideal band-pass filtering, we can write (using Eq. (1)),

$$r_q(n) = x_q(n) * h_q(n). \qquad (3)$$

In the proposed model, we explore the modeling of the sub-band temporal envelopes. In order to extract the envelopes, an analytic signal based demodulation is used. It can be shown [11,34] that, if two signals have modulating envelopes on the same sinusoidal carrier (i.e., each is a single AM-FM signal), the convolution of the two signals has an envelope which is the convolution of the two envelopes; that is, the envelope of the convolution of the two signals is the convolution of the envelopes of the signals. For sub-band speech signals, this envelope convolution model forms a good approximation if the sub-band signals are narrow-band.
Then, for band-pass filters with narrow bandwidth, we get the following approximation between the sub-band envelope (defined as the magnitude of the analytic signal) of the reverberant signal and those of the clean speech signal and room response,

$$m_{rq}(n) \approx m_{xq}(n) * m_{hq}(n), \qquad (4)$$

where m_rq(n), m_xq(n), m_hq(n) denote the sub-band envelopes of the reverberant speech, the clean speech and the room response respectively. We can further split the envelope into early and late reflection components,

$$m_{rq}(n) \approx m_{xq}(n) * m_{hqe}(n) + m_{xq}(n) * m_{hql}(n) = m_{rqe}(n) + m_{rql}(n), \qquad (5)$$

where m_hqe(n) and m_hql(n) denote the envelopes of the early and late parts of the sub-band room response. In the same way that time domain linear prediction estimates the spectral envelope of a signal, FDLP estimates the temporal envelope of the signal [35], i.e., the square of its Hilbert envelope [36]. The Hilbert envelope is given by the inverse Fourier transform of the auto-correlation function of the discrete cosine transform (DCT) sequence of the signal [37,38].
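As a quick numerical illustration of this envelope convolution property (our own sketch, not part of the original experiments), the following Python snippet builds two synthetic narrow-band signals on a shared carrier and compares the Hilbert envelope of their convolution against the convolution of their individual envelopes; the two agree up to a fixed scale factor when the envelopes vary slowly relative to the carrier.

```python
import numpy as np
from scipy.signal import hilbert

fs, f_c = 8000, 1000                      # sampling rate and shared carrier (Hz)
t = np.arange(2048) / fs

# Two narrow-band signals: slowly varying envelopes on the same carrier
m1 = 1.0 + 0.5 * np.sin(2 * np.pi * 3 * t)   # stand-in "speech" envelope
m2 = np.exp(-40 * t)                          # stand-in "room response" envelope
x1 = m1 * np.cos(2 * np.pi * f_c * t)
x2 = m2 * np.cos(2 * np.pi * f_c * t)

# Envelope of the convolution vs. convolution of the envelopes
env_of_conv = np.abs(hilbert(np.convolve(x1, x2)))
conv_of_env = np.convolve(m1, m2)

# After peak normalization the two curves coincide up to edge effects
a = env_of_conv / env_of_conv.max()
b = conv_of_env / conv_of_env.max()
print("max deviation:", np.abs(a - b).max())  # small for narrow-band signals
```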

Autoregressive modeling of sub-band envelopes
We use the auto-correlation of the DCT coefficients to model the temporal envelope of the signal. The autoregressive (AR) modeling property of linear prediction implies that the model preserves the peak locations of the signal (which tend to be more robust in the presence of noise and reverberation) [39]. In the FDLP model, the sub-band AR model tries to preserve the peaks of the temporal envelope [33].
Let x(t) denote an N-point discrete sequence, t = 0, 1, ..., N − 1. The type-I odd DCT [40], y[k] for k = 0, 1, ..., N − 1, is given by,

$$y[k] = \frac{2}{\sqrt{M}} \sum_{t=0}^{N-1} c_{t,k}\, x(t) \cos\left(\frac{2\pi t k}{M}\right), \qquad (6)$$

where c_{t,k} = 1 for t, k > 0, c_{t,k} = 1/2 for t = k = 0, c_{t,k} = 1/√2 when exactly one of the indices t, k is 0, and M = 2N − 1.
An even symmetric version of the input signal x(t) is the signal q(t) of length M = 2N − 1. The analytic signal of a discrete time sequence can be defined using the one-sided discrete Fourier transform (DFT) [37]. The analytic signal q_a(t) of the even-symmetric signal q(t) can be shown [33] to be the inverse DFT of the DCT sequence zero-padded up to M points, denoted ŷ[k]. Further, it can be shown [37] that the auto-correlation of the zero-padded DCT signal ŷ[k] and the squared magnitude of the analytic signal (the Hilbert envelope) of the even-symmetric signal, |q_a(t)|², are Fourier transform pairs [35]. Hence, the application of linear prediction on the zero-padded DCT signal yields an AR model of the Hilbert envelope of the signal.
Let the linear prediction coefficients obtained from the zero-padded DCT signal be denoted as {a_k}_{k=0}^{p} with a_0 = 1, where p is the order of the LP model. The FDLP model for the envelope is given by,

$$\hat{m}(n) = \frac{\sigma^2}{\left|\sum_{k=0}^{p} a_k\, e^{-j 2\pi k n / M}\right|^{2}}, \qquad (7)$$

where σ denotes the LP gain. The envelope estimated in Eq. (7) represents the autoregressive model of the temporal envelope. Note that, when the model is applied on sub-band DCT coefficients, the estimated envelope is the sub-band temporal envelope.
In this work, the sub-band envelopes of speech in mel-spaced bands are estimated using FDLP. Specifically, the discrete cosine transform (DCT) of the sub-band signal r_q(t) is computed and a linear prediction (LP) is applied on the DCT components. The LP envelope estimated using the prediction on the DCT components provides an all-pole model of the sub-band envelope m_rq(n).
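The recipe above can be condensed into a minimal single-band sketch (our simplification: scipy's orthonormal DCT-II stands in for the type-I odd DCT, and the zero padding and mel-band decomposition are omitted).

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def fdlp_envelope(x, p=40):
    """AR (FDLP-style) estimate of the temporal envelope of x: linear
    prediction applied to the DCT of the signal yields an all-pole
    model of its Hilbert envelope (cf. Eq. (7))."""
    y = dct(np.asarray(x, dtype=float), type=2, norm='ortho')
    # Autocorrelation of the DCT sequence
    r = np.correlate(y, y, mode='full')[len(y) - 1:] / len(y)
    # Normal equations R a = r, solved via the Toeplitz structure
    a_pred = solve_toeplitz((r[:p], r[:p]), r[1:p + 1])
    a = np.concatenate(([1.0], -a_pred))        # LP polynomial A(z)
    sigma2 = r[0] - np.dot(a_pred, r[1:p + 1])  # prediction error power
    # Envelope = sigma^2 / |A|^2, evaluated on a dense grid over time
    A = np.fft.rfft(a, 2 * len(x))
    return sigma2 / np.abs(A[:len(x)]) ** 2

# Example: the AR envelope tracks the peaks of the temporal envelope
env = fdlp_envelope(np.random.randn(400) * np.hanning(400), p=20)
```

Since the prediction operates on a frequency domain sequence, the resulting all-pole model describes the temporal envelope of the signal rather than its spectrum.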

Envelope Dereverberation and Joint Modeling
The proposed framework (Figure 1) consists of three modules: (i) envelope dereverberation, (ii) feature extraction and (iii) the ASR acoustic model.

Neural dereverberation network
As seen in Eq. (5), the FDLP envelope of reverberant speech can be expressed as the sum of a direct component (early reflections) and a late reflection component. In the envelope dereverberation model, our aim is to use the reverberant sub-band temporal envelope m_rq(n) to predict the late reflection component m_rql(n). Once this prediction is achieved, the late reflection component can be subtracted from the sub-band envelope to suppress the artifacts of reverberation. An analogy to this envelope subtraction approach is the spectral subtraction model, where the clean and noise power spectral densities (PSD) add to give the noisy speech PSD. If Gaussian assumptions are made for the PSD components [41], the Wiener filtering approach to noisy speech enhancement provides the minimum mean squared error estimate, where the noisy PSD is multiplied by the gain of the filter. In a similar manner, we pose the dereverberation problem as an envelope gain estimation problem.
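For reference, the classical Wiener filter for additive noise applies the gain

$$G(f) = \frac{S_x(f)}{S_x(f) + S_n(f)},$$

where S_x(f) and S_n(f) denote the clean speech and noise PSDs, and the enhanced PSD is the product of the noisy PSD with this gain. The envelope gain defined below plays the analogous role in the sub-band envelope domain.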
The envelope gain G_q(n) is defined as,

$$G_q(n) = \frac{m_{rqe}(n)}{m_{rq}(n)}, \qquad (8)$$

i.e., the ratio of the early reflection envelope to the reverberant envelope. The gain G_q(n) is estimated using the input sub-band envelope m_rq(n).
With the gain estimate, the dereverberated envelope can be computed as,

$$m_{rqe}(n) = G_q(n)\, m_{rq}(n). \qquad (9)$$

This product model of enhancement is inspired by Wiener filtering principles.
The gain estimation network is implemented as a CLSTM model that takes the set of sub-band envelopes {m_rq(n)}_{q=1}^{Q} as input, where Q is the number of sub-bands. The model is trained to predict the log-gains {log G_q(n)}_{q=1}^{Q}. The sub-band dereverberated envelope is,

$$\hat{m}_{rqe}(n) = \hat{G}_q(n)\, m_{rq}(n), \qquad (10)$$

where Ĝ_q(n) is the estimate of the gain from the model. In particular, let m̂_rqe(n) denote the dereverberated sub-band envelope obtained using Eq. (10). Further, let w(n) denote a Hamming window of size 10 (corresponding to 25 ms at the 400 Hz envelope sampling rate). Then, the features for ASR are extracted as,

$$F_q(m) = \log\left[\left(\hat{m}_{rqe} * w\right)(n)\right]\Big|_{n=4m}, \qquad (11)$$

where * is the convolution operation and F_q denotes the scalar feature of the q-th sub-band. Here, m denotes the feature frame index at 10 ms sampling (100 Hz).
The features for all the Q sub-bands are spliced to form the final feature vector for ASR model training.
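A compact PyTorch sketch of the gain estimation network described above is given below; the convolutional widths, the number of bands and the use of log envelopes at the input are illustrative assumptions rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class GainEstimator(nn.Module):
    """CLSTM-style network mapping log sub-band envelopes (batch, Q, time)
    to per-band log-gains of the same shape; sizes are assumptions."""
    def __init__(self, n_bands=36, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(n_bands, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_bands)

    def forward(self, log_env):
        z = self.conv(log_env.unsqueeze(1)).squeeze(1)  # (batch, Q, time)
        z, _ = self.lstm(z.transpose(1, 2))             # recurrence over time
        return self.proj(z).transpose(1, 2)             # log-gain (batch, Q, time)

# Dereverberated envelope, as in Eq. (10):
# m_hat = torch.exp(GainEstimator(n_bands=36)(log_env)) * m_rev
```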
The set of operations described above for short-term integration can be implemented as a 1-D CNN layer with a fixed Hamming shaped kernel of size 10 and a stride of 4. A log non-linearity is applied on the convolution output.
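As an illustration, this fixed integration layer can be written in PyTorch as follows (a sketch under the stated settings; the small epsilon inside the log is our numerical-safety addition).

```python
import torch
import torch.nn as nn

class EnvelopeIntegration(nn.Module):
    """Fixed 1-D convolution implementing the short-term integration of
    Eq. (11): Hamming kernel of size 10 (25 ms at 400 Hz) with stride 4,
    followed by a log non-linearity (100 Hz output frame rate)."""
    def __init__(self, kernel_size=10, stride=4):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, stride=stride, bias=False)
        with torch.no_grad():
            window = torch.hamming_window(kernel_size, periodic=False)
            self.conv.weight.copy_(window.view(1, 1, -1))
        self.conv.weight.requires_grad_(False)  # fixed, non-learnable kernel

    def forward(self, env):                     # env: (batch, Q, time)
        b, q, t = env.shape
        out = self.conv(env.reshape(b * q, 1, t))
        return torch.log(out + 1e-8).reshape(b, q, -1)
```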
The integrated envelopes are used as time-frequency representations for ASR training. A context of 21 frames, with 10 frames on the left and 10 frames on the right, is used in the acoustic model training.

Acoustic Model
The architecture of the acoustic model is based on convolutional long short-term memory (CLSTM) networks (Figure 1). The acoustic model corresponds to the 2-D CLSTM network described in [31], consisting of 4 CNN layers, an LSTM layer with 1024 units performing recurrence over frequency, and 3 fully connected layers with batch normalization.
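A sketch of this acoustic model is shown below; the CNN channel widths, the fully connected layer sizes and the senone inventory are illustrative assumptions, while the overall structure (4 CNN layers, a 1024-unit LSTM recurring over frequency and 3 batch-normalized fully connected layers) follows the description above.

```python
import torch
import torch.nn as nn

class CLSTMAcousticModel(nn.Module):
    """2-D CLSTM acoustic model sketch; channel widths and the number
    of senones are assumptions, not the configuration of [31]."""
    def __init__(self, context=21, n_senones=2000):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=context, hidden_size=1024,
                            batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, n_senones),
        )

    def forward(self, feats):       # feats: (batch, Q bands, 21-frame context)
        z = self.cnn(feats.unsqueeze(1)).squeeze(1)  # (batch, Q, context)
        _, (h, _) = self.lstm(z)    # recurrence over the Q frequency steps
        return self.fc(h[-1])       # senone logits for the center frame
```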

Joint learning
As shown in Figure 1, the three modules of (i) envelope dereverberation, (ii) feature extraction and context formation and (iii) ASR acoustic modeling can be combined into a single end-to-end neural framework. The intermediate envelope integration step is implemented as a single layer of 1-D convolutions with a Hamming shaped kernel and a log non-linearity. The context creation for the acoustic features in a given segment is also performed as a fixed 1-D convolution layer. In this manner, the entire processing pipeline can be trained using an elegant joint learning approach.

Joint loss function
The standalone dereverberation model is trained to minimize the mean square error loss E_MSE, the squared error between the dereverberated envelope and its clean counterpart. For joint training, we have two loss functions: the mean square error loss E_MSE for a mini-batch, and the cross-entropy loss E_CE between the senone targets and the corresponding posteriors for the same mini-batch. We use a combination of these two losses. Thus, the final joint loss E_Total is given by,

$$E_{Total} = E_{CE} + \alpha\, E_{MSE},$$

where α is a weighting parameter.
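In PyTorch, this combined objective can be sketched as follows; the weighting parameter alpha and its default value are assumptions for illustration.

```python
import torch.nn.functional as F

def joint_loss(senone_logits, senone_targets, derevb_env, clean_env, alpha=1.0):
    """E_Total = E_CE + alpha * E_MSE (sketch; alpha is an assumption)."""
    e_ce = F.cross_entropy(senone_logits, senone_targets)  # senone CE loss
    e_mse = F.mse_loss(derevb_env, clean_env)              # envelope MSE loss
    return e_ce + alpha * e_mse
```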

Experiments and results
The experiments are performed on the REVERB challenge [13] and CHiME-3 [14] datasets. For the baseline model, we use WPE enhancement [8] along with unsupervised GEV beamforming [7]. This beamformed signal is then processed with filter-bank feature extraction for the BF-FBANK baseline system.

ASR framework
We use the Kaldi toolkit [44] for deriving the senone alignments used in the PyTorch deep learning framework for acoustic modeling. A hidden Markov model - Gaussian mixture model (HMM-GMM) system is trained with MFCC (Mel Frequency Cepstral Coefficients) features [45] to generate the alignments for training the CLSTM acoustic model. A tri-gram language model [46] is used in ASR decoding, and the best language model weight obtained on the development set is used for the evaluation set.

REVERB Challenge ASR
The REVERB challenge dataset [47] for ASR consists of 8-channel recordings with real and simulated reverberation conditions. The simulated data comprises reverberant utterances generated from the WSJCAM0 corpus [48] by artificially convolving clean WSJCAM0 recordings with measured room impulse responses (RIRs) and adding noise at an SNR of 20 dB. The simulated data has six different reverberation conditions. The real data, comprising utterances from the MC-WSJ-AV corpus [49], consists of utterances spoken by human speakers in a noisy reverberant room. The training set consists of 7861 utterances generated from the clean WSJCAM0 training data by convolving with 24 measured RIRs. Average relative improvements of 10% on the development set and about 6% on the evaluation set are achieved over the BF-FBANK baseline.

Discussion
Table 1 reports the WER results on the REVERB challenge dataset. Among the prior approaches, a spectral mapping based dereverberation front-end (BF-FBANK + spectral mapping derevb. [43]) is trained to predict the clean utterance. The results for the end-to-end dereverberation network (joint learning) proposed in [30] are also compared with the proposed work in Table 1. The proposed joint model improves over these prior approaches, including on the real recordings. This suggests that, even though the jointly learned neural model is trained only with simulated reverberation, it generalizes well to unseen real data.

CHiME-3 ASR
The CHiME-3 dataset [14] for ASR contains recordings from a multi-microphone tablet device in four different environments, namely, public transport (BUS), cafe (CAF), street junction (STR) and pedestrian area (PED). For each of these environments, both real and simulated data are present. The real data consists of 6-channel recordings of sentences from the WSJ0 corpus, sampled at 16 kHz, spoken in the four environments. The simulated data was constructed by mixing clean utterances with environment noise. The training dataset consists of 1600 real noisy recordings and 7138 simulated noisy recordings from 83 speakers.

Discussion
The WER results for the experiments on the CHiME-3 dataset are shown in Table 2. The FDLP baseline, denoted as BF-FDLP, is better than the FBANK baseline (BF-FBANK). We observe average relative improvements of 8% on the development set and about 12% on the evaluation set when comparing the BF-FDLP and BF-FBANK baseline systems. It can also be seen from Table 2 that the proposed dereverberation method improves over the FBANK baseline system. The results based on our implementation of the works of Han et al. [43] and Santos et al. [26] degrade the word error rates compared to the BF-FBANK baseline.
In the CHiME-3 dataset, we observed that a significant cause of degradation in signal quality came from the additive noise sources. On further investigation, we found that the dereverberation model also resulted in a smoothing of the spectral variations in the FDLP spectrogram. In order to circumvent this issue, we regularized the MSE loss with a term that encourages the spectral channels to be uncorrelated. The regularization parameter was kept at 0.05. Using this regularized MSE loss, we further improved the BF-FDLP dereverberation system results over the dereverberation approach with the MSE loss alone. These experiments suggest that, even when the audio data does not have significant late reflection components (as in the CHiME-3 dataset), the proposed approach improves significantly over the baseline method (average relative improvements of 10.3% over the baseline BF-FBANK system on the real development condition and 23.5% on the real evaluation condition).
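One possible form of such a regularizer, penalizing the squared off-diagonal entries of the inter-band correlation matrix of the enhanced envelopes, is sketched below; this is our formulation of the idea and the exact term used in the experiments may differ.

```python
import torch

def spectral_correlation_penalty(env, eps=1e-8):
    """Penalty on correlation between sub-band channels of the enhanced
    envelope. env: (batch, Q, time). Added to the MSE loss with weight
    lambda (0.05 in the experiments reported here)."""
    env = env - env.mean(dim=-1, keepdim=True)        # zero-mean per band
    cov = env @ env.transpose(1, 2) / env.shape[-1]   # (batch, Q, Q)
    var = torch.diagonal(cov, dim1=1, dim2=2).clamp_min(eps)
    corr = cov / torch.sqrt(var.unsqueeze(2) * var.unsqueeze(1))
    diag = torch.diag_embed(torch.diagonal(corr, dim1=1, dim2=2))
    return (corr - diag).pow(2).mean()                # off-diagonal energy

# Regularized loss: E_MSE + 0.05 * spectral_correlation_penalty(env)
```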

VOiCES corpus ASR
Since the REVERB challenge and CHiME-3 datasets are relatively small, we wanted to establish the efficacy of the proposed dereverberation method on a larger dataset. Thus, we experimented with the VOiCES challenge dataset. The VOiCES corpus [15] was released as part of "The Voices from a Distance Challenge 2019" [16] at Interspeech 2019. For the ASR fixed condition track, the training set consists of an 80-hour subset of the LibriSpeech corpus [50]. The training set has close-talking microphone recordings from 427 different speakers in quiet environments. The development and evaluation sets consist of 19 hours and 20 hours of distant microphone recordings with varying room, environment and noise conditions. The significant difference between the training set and the development/evaluation sets makes the challenge even more difficult. We have used the same acoustic model configurations, and hence these results reflect a true acoustic mismatch condition in ASR.

Discussion
The WER results for the VOiCES corpus are given in Table 3. As seen, the FDLP baseline, denoted as BF-FDLP, provides a better WER compared to the FBANK baseline, denoted as BF-FBANK. This is further improved with the joint learning based dereverberation. The final WER shows improvements on both the development and evaluation sets. A relative WER improvement of 10% over the baseline FBANK system is observed on both sets in these experiments.

Analysis
In this section, we analyze the choice of hyper-parameters and the cost functions involved in the proposed dereverberation model.

Spectral Correlation Loss
As reported in Table 2 for the CHiME-3 dataset, an additional loss term that encourages the spectral bands to be uncorrelated improves the ASR performance when the data is corrupted by additive noise with minimal reverberation artifacts. Tables 5 and 6 show the effect of the regularization weight λ on the WER for the CHiME-3 and REVERB datasets respectively, for the spectral correlation loss used in model learning. The introduction of the spectral correlation loss improves the WER on the CHiME-3 dataset. The best results are obtained for a choice of λ = 0.05.
The introduction of the spectral correlation loss does not benefit the REVERB challenge dataset. We hypothesize that this may be due to the more dominant effect of temporal smearing in the REVERB challenge dataset. For the experiments on the VOiCES corpus, the spectral correlation loss is not used.

Discussion on Performance Gains
All the results reported in Table 1, Table 2 and Table 3 use a strong baseline system with GEV based beamforming and weighted prediction error (WPE) based enhancement. Hence, we note that all systems use the same pre-processing pipeline and the gains observed over the baseline system are in addition to these enhancement steps. In addition, we also ensure that the baseline FBANK based system, neural enhancement methods explored in the past and the proposed approach have the same sub-band decomposition, feature normalization, acoustic model and language model settings. In this way, the results reported highlight the effectiveness of the proposed work in suppressing reverberation distortions.
The methods proposed previously based on neural enhancement and dereverberation improve the performance of the baseline system on the REVERB challenge dataset. However, as seen in Table 2, in the presence of additive noise conditions on the CHiME-3 dataset, most of these prior works degrade the performance compared to the BF-FBANK baseline system. In this regard, the method proposed in this paper provides significant performance improvements on all three datasets. Further, the results consistently highlight the performance gains of using the joint neural learning framework.

Summary
In this paper, we have proposed a new neural model for dereverberation of temporal envelopes along with joint learning of the acoustic model to optimize the ASR cost. The joint learning framework combines the envelope dereverberation module, feature extraction and acoustic modeling into a single neural pipeline. This joint model yields significant performance improvements over the baseline ASR systems on the REVERB challenge, CHiME-3 and VOiCES datasets.