This letter investigates the impact of spectral compression on the vector Taylor series-based model adaptation algorithm. Unlike mel-frequency cepstral coefficients, which are obtained by logarithmic compression, the features used here are extracted with fractional power compression. Since the relationship between the acoustic models for clean and noisy speech depends on the nonlinearity of the spectrum, it is important to select an appropriate compressive operator for model adaptation. In this letter, the dependency of the speech recognition system on spectral nonlinearity is analyzed in various noisy environments. Experimental results confirm that replacing the compressive operator improves the performance of the model adaptation.
I. Introduction
In real-world scenarios such as noisy environments, the accuracy of a hidden Markov model-based automatic speech recognition (ASR) system degrades severely if the statistics of the test data do not match those of the training data.1–10 An effective way to overcome this problem is to introduce a model adaptation technique, of which the vector Taylor series (VTS) approximation-based adaptation technique is a typical example. In the VTS-based adaptation approach, the static and dynamic Gaussian parameters of the acoustic models are updated using the statistical characteristics of the noise components in the quefrency domain.1–3 Since the adjusted acoustic models have a distribution similar to that of noisy speech, the ASR performance in noisy environments improves significantly.
Recently, it was shown that the performance of the VTS model adaptation algorithm could be further increased by introducing generalized cepstral coefficients (GCCs) that use fractional power compression.5 Interestingly, when the fractional power value of GCCs is 1/3, the compressive operator corresponds almost exactly to the Fletcher loudness transformation.11 However, there is no clear analysis of how to determine the optimal fractional power value of GCCs for an ASR system in noisy environments. Since the statistical distribution of noisy speech varies with the nonlinearity of the spectral compressive operator, further analysis of the impact of GCCs in various noisy environments is required.
This letter investigates the impact of spectral compression on the VTS-based adaptation approach in noisy environments. The model adaptation equation for GCCs is formulated, and the distribution of noisy speech is analyzed as a function of the nonlinearity of the spectral compressive operator. In addition, the dependency of the ASR performance on the fractional power value of GCCs is analyzed in various noisy environments. Experimental results show that GCCs improve the performance of the VTS-based model adaptation compared to the conventional mel-frequency cepstral coefficients (MFCCs), even when a lower-order VTS approximation is used. It is also concluded that the recognition accuracy in noisy environments can be significantly improved by setting the spectral compressive operator of GCCs appropriately for the background noise characteristics.
II. VTS-based model adaptation for GCCs
This section proposes a VTS-based model adaptation algorithm for a GCC-based ASR system. GCCs are the general form of cepstral coefficients that use a generalized logarithmic function for spectral compression.
A. Generalized logarithmic function
The generalized logarithmic function is defined by12,13
where the fractional power value is a real number. As the fractional power value approaches 0, the corresponding curve approaches the logarithmic function.12 The slope of the generalized logarithmic function is determined by the fractional power value and the input value, which can be quantified by taking an approximate derivative with respect to the input:
In addition, the generalized logarithmic function has the following properties:
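As a minimal sketch of the generalized logarithmic function, the Python code below assumes the commonly used form (w^γ − 1)/γ, with log w as the limiting case at γ = 0. The symbol name `gamma` and the function names are chosen here for illustration only. The product identity checked in the usage note follows algebraically from this assumed form; whether it coincides with the properties listed in this section is an assumption.

```python
import math


def gen_log(w, gamma):
    """Generalized logarithm (assumed form): (w**gamma - 1) / gamma.

    gamma == 0 is treated as the limiting case log(w).
    """
    if w <= 0.0:
        raise ValueError("w must be positive")
    if gamma == 0.0:
        return math.log(w)
    return (w ** gamma - 1.0) / gamma


def gen_log_slope(w, gamma):
    """Exact derivative of gen_log with respect to w: w**(gamma - 1)."""
    if w <= 0.0:
        raise ValueError("w must be positive")
    return w ** (gamma - 1.0)


def product_identity_gap(a, b, gamma):
    """Difference between s(a*b) and s(a) + s(b) + gamma*s(a)*s(b)."""
    sa, sb = gen_log(a, gamma), gen_log(b, gamma)
    return gen_log(a * b, gamma) - (sa + sb + gamma * sa * sb)
```

For example, `gen_log(10.0, 1e-6)` is close to `math.log(10.0)`, illustrating the logarithmic limit, while `gen_log(5.0, 1.0)` is simply `5.0 - 1.0`, i.e., a linear mapping. At `gamma == 0` the identity reduces to the familiar log(ab) = log a + log b.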
B. GCCs in noisy environments
It is assumed that clean speech is corrupted by additive noise and channel noise. After taking the short-time Fourier analysis and applying mel-scale filters, the ith filterbank coefficients of the observed speech can be represented by
where , , and represent the ith filterbank coefficients of clean speech, channel noise, and additive noise, respectively. denotes the number of filterbanks. For simplicity, the frame index is omitted. In addition, it is assumed that the length of the channel noise is much shorter than that of the analysis window length.
After applying the generalized logarithmic function, , to Eq. (5) and some manipulation using Eqs. (3) and (4), the generalized logarithmic spectrum of observed speech is written as5
where is the SNR (signal-to-noise ratio)-related variable.
The GCCs of the observed signal are obtained by applying the discrete cosine transform (DCT) to the generalized logarithmic spectrum:
The th component of the DCT matrix is represented by where denotes the weighting factor for the DCT computation, which is given by
Given the GCCs of the clean speech , channel noise , and additive noise , Eq. (7) is represented by
where
The th component of is represented by . Equation (9) indicates that the GCCs of the observed signal are represented by those obtained from the clean speech and two additive distortion terms. The first distortion term related to the channel noise effect depends on the values of and , and the second one related to the additive noise effect is determined by and the SNR.
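The feature extraction chain described in this subsection can be sketched as follows. The sketch assumes that the observed filterbank coefficient is the clean coefficient multiplied by the channel coefficient plus the additive noise coefficient, that the compression is the generalized logarithm (w^γ − 1)/γ, and that the DCT is the orthonormal DCT-II. The DCT normalization, the symbol `gamma`, and all function names are assumptions made for illustration.

```python
import math


def gen_log(w, gamma):
    """Generalized logarithm (assumed form); log(w) when gamma == 0."""
    return math.log(w) if gamma == 0.0 else (w ** gamma - 1.0) / gamma


def observe(clean, channel, noise):
    """Assumed distortion model per filterbank channel: y = x * h + n."""
    return [x * h + n for x, h, n in zip(clean, channel, noise)]


def dct_matrix(num_ceps, num_banks):
    """Orthonormal DCT-II matrix (the normalization is an assumption)."""
    rows = []
    for k in range(num_ceps):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / num_banks)
        rows.append([scale * math.cos(math.pi * k * (i + 0.5) / num_banks)
                     for i in range(num_banks)])
    return rows


def gcc(filterbank, gamma, num_ceps):
    """Compress each filterbank coefficient, then apply the DCT."""
    compressed = [gen_log(e, gamma) for e in filterbank]
    return [sum(c * s for c, s in zip(row, compressed))
            for row in dct_matrix(num_ceps, len(filterbank))]
```

A call such as `gcc(observe(clean, channel, noise), 1.0 / 15.0, 13)` would give 13 static coefficients for one frame. Setting `gamma` to 0 in the same call yields conventional log-compressed cepstral coefficients.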
C. VTS based adaptation with GCCs
Assuming that the clean speech, channel noise, and additive noise are Gaussian random variable vectors with means , , and , and covariance matrices , , and , the cepstrum of the observed speech is approximated by a first order Taylor series expansion at :1,3
where
In addition, , , and denote the derivative of in terms of , , and at , respectively. The derivatives of are obtained by
where the th elements of are represented by . The th elements of the derivative of in terms of , , and are calculated by
where is the th element of the DCT matrix and is the th element of . Then, the mean and covariance matrices of the observed speech can be estimated by
Since the dynamic features are defined as the derivative of the static features along the time axis,14 the mean and covariance matrices of the dynamic features are estimated by
When the fractional power value is 0, the parameters for the model adaptation are represented by
where is a diagonal matrix whose elements are given by . Equations (24) and (25) are equivalent to the conventional VTS-based model adaptation with MFCCs.1,3 This indicates that the proposed model adaptation equations can be regarded as a general form of the conventional approach.
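For the conventional logarithmic case that the adaptation equations reduce to, the first-order VTS update can be sketched for a single Gaussian. The sketch works on one log-spectral channel without the DCT and treats the channel term as a deterministic offset; both are simplifying assumptions, and the function and variable names are hypothetical. It uses the widely known log-domain mismatch function y = x + h + log(1 + exp(n − x − h)), not the GCC equations of this section.

```python
import math


def vts_adapt_log(mu_x, var_x, mu_h, mu_n, var_n):
    """First-order VTS adaptation of one log-spectral Gaussian.

    mu_x, var_x: clean-speech mean and variance
    mu_h:        channel offset (treated as deterministic here)
    mu_n, var_n: additive-noise mean and variance
    """
    # Noise-to-(speech + channel) offset at the expansion point.
    d = mu_n - mu_x - mu_h
    # Mismatch function evaluated at the means gives the adapted mean.
    mu_y = mu_x + mu_h + math.log1p(math.exp(d))
    # Jacobian dy/dx at the expansion point; dy/dn equals 1 - g.
    g = 1.0 / (1.0 + math.exp(d))
    # Linearized variance propagation.
    var_y = g * g * var_x + (1.0 - g) * (1.0 - g) * var_n
    return mu_y, var_y
```

When the noise mean lies far below the speech mean, `g` approaches 1 and the adapted parameters stay close to the clean ones. At equal means (0 dB SNR in this domain), `g` equals 0.5 and the adapted mean rises by log 2. Very large positive values of `d` would overflow `math.exp`, so a production version would need a guarded formulation.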
III. Distribution of noisy speech based on spectral nonlinearity
This section investigates the estimated acoustic model of noisy speech as a function of the nonlinearity of the spectral compressive operator. Assuming that the clean speech and the additive noise each have a single Gaussian distribution in the nonlinear spectral domain determined by the fractional power value, Fig. 1 compares the distribution of the noisy signal with its Gaussian approximation for logarithmic compression and for fractional power compression in two types of noisy environment, N1 and N2. The mean values of N1 and N2 are both 5 dB (SNR: 0 dB), and their variances are 0.5 and 2 dB, respectively.
Note that the similarity between the approximated distribution and the real distribution depends on the fractional power value. With logarithmic compression (a fractional power value of 0), the additive noise causes nonlinear distortion. For this reason, the approximated distribution differs from the real distribution of noisy speech. The difference is especially pronounced when the variance of the noise is small (N1) rather than large (N2). On the other hand, this difference is reduced by using fractional power compression. If the fractional power value were 1, the approximated distribution would be exactly the same as the real distribution of noisy speech because the compression would be linear.
This indicates that, from the standpoint of the model adaptation algorithm, features extracted from the linear spectrum are more appropriate than those extracted from the log-spectrum. From the standpoint of the ASR system, however, features extracted from the linear spectrum are not suitable because speech has a Gaussian distribution in the log-spectral domain. Accordingly, it is concluded that the ASR performance of the model adaptation in noisy environments could be improved by choosing an appropriate spectral compressive operator.
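The trade-off described above can be illustrated with a small Monte Carlo experiment. The sketch assumes the generalized logarithm (w^γ − 1)/γ, draws clean speech and noise as Gaussians in the compressed domain, and compares the true mean of the compressed noisy signal with the value obtained by combining only the means (a 0th-order approximation). The means, standard deviation, sample count, and names are illustrative assumptions and are not the settings of Fig. 1.

```python
import math
import random


def gen_log(w, gamma):
    """Generalized logarithm (assumed form); log(w) when gamma == 0."""
    return math.log(w) if gamma == 0.0 else (w ** gamma - 1.0) / gamma


def gen_exp(v, gamma):
    """Inverse of gen_log; requires 1 + gamma * v > 0 when gamma != 0."""
    return math.exp(v) if gamma == 0.0 else (1.0 + gamma * v) ** (1.0 / gamma)


def mean_mismatch(gamma, mu_x=5.0, mu_n=5.0, sigma=0.7,
                  num_samples=20000, seed=0):
    """|true mean - 0th-order approximation| in the compressed domain."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Sample in the compressed domain, then map to the linear domain.
        x = gen_exp(rng.gauss(mu_x, sigma), gamma)
        n = gen_exp(rng.gauss(mu_n, sigma), gamma)
        # Noise is additive in the linear domain; compress the sum again.
        total += gen_log(x + n, gamma)
    true_mean = total / num_samples
    # Approximation that combines only the means of speech and noise.
    approx_mean = gen_log(gen_exp(mu_x, gamma) + gen_exp(mu_n, gamma), gamma)
    return abs(true_mean - approx_mean)
```

With these settings, `mean_mismatch(1.0)` is limited only by sampling noise because the mapping is linear, whereas `mean_mismatch(0.0)` shows a clear bias caused by the logarithmic nonlinearity. Since each value of `gamma` defines a differently scaled domain, the numbers are indicative rather than directly comparable.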
IV. Experimental results
This section describes the impact of the spectral compression on the performance of an ASR system with the VTS-based model adaptation framework. The recognition experiments are conducted with the Aurora 2 database that is commonly used for evaluating the performance of word-based ASR systems.15 Test data are composed of three different sets: Test set A, Test set B, and Test set C. Test sets A and B include eight different noise types. In Test set C, speech samples are filtered with a modified intermediate reference system characteristic. The SNRs in all test data range from −5 to 20 dB with a step of 5 dB. Each subset in the test data consists of 1001 utterances, and the training set includes 8440 utterances.
The recognizer uses a 39-dimensional feature vector composed of static, delta, and delta-delta features. The acoustic model of each word consists of 16 states and 3 Gaussian mixtures. The mean and covariance matrices of the noise and channel components are estimated using an iterative expectation maximization approach from noisy signals.4
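A 39-dimensional vector of this kind is commonly built from 13 static coefficients plus their delta and delta-delta coefficients. The sketch below assumes that split, the standard regression formula for dynamic features, a window of two frames on each side, and edge-frame replication; none of these settings is stated in the text, and the function names are hypothetical.

```python
def delta(frames, window=2):
    """Regression-based dynamic features over +/- `window` frames.

    frames: list of per-frame coefficient lists; edge frames are replicated.
    """
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    num_frames = len(frames)
    out = []
    for t in range(num_frames):
        row = []
        for d in range(len(frames[0])):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, num_frames - 1)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_features(static_frames):
    """Concatenate static, delta, and delta-delta coefficients per frame."""
    d1 = delta(static_frames)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static_frames, d1, d2)]
```

Because the delta operation is a fixed linear combination of neighboring static frames, the adapted statistics of the dynamic features follow from those of the static features, which is the property used in Sec. II C.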
A. Impact of spectral compression
Figure 2 shows the dependency of the word accuracy on the fractional power value in various noise environments. Figure 2(a) depicts the recognition accuracy when the 0th-order VTS approximation is used, i.e., when only the mean of the static features is updated. Figure 2(b) depicts the result of using the 1st-order VTS approximation. The word accuracy for each noise type is calculated by averaging the results obtained for SNR conditions from 20 to 0 dB.
Compared to the system using MFCCs (a fractional power value of 0), the system with GCCs has a much higher accuracy if an appropriate fractional power value is chosen. Note that the optimal fractional power value that maximizes word accuracy depends on the noise characteristics. It is well known that stationary noise has a relatively small variance around its estimated mean compared to non-stationary noise. In addition, Fig. 1 shows that the nonlinear effect is considerably stronger when the variance of the noise is small than when it is large. This indicates that fractional power compression is especially effective in stationary noise environments. For this reason, the optimum fractional power value for stationary noise conditions is closer to 1 than that for non-stationary noise conditions. The dependency on the SNR condition was also investigated (although the corresponding results are not included here), but the optimum fractional power value does not vary much with the SNR.
In addition, it is observed that the accuracy improvement with the 0th-order approximation is larger than that with the 1st-order approximation. The reason is that the VTS-approximated value approaches the true value as the order increases, so the advantage of using a different spectral compressive operator diminishes. This indicates that the VTS approximation-based adaptation approach with GCCs can estimate hidden Markov model (HMM) parameters more precisely than the approach with MFCCs, even when a lower-order VTS approximation is used.
B. Performance evaluation
The performance of the VTS-based adaptation with GCCs is compared with that of three conventional algorithms designed for robust ASR systems: the European Telecommunications Standards Institute advanced front-end (ETSI-AFE), the power-normalized cepstral coefficient (PNCC), and the conventional VTS-based adaptation approach with MFCCs. Note that the ETSI-AFE includes a two-stage mel-warped Wiener filter,9 and the PNCC also includes a noise reduction scheme.7,8 Therefore, a comparison with these approaches may not be entirely fair, but they are included to show the reference ASR performance of current technologies.
Figure 3 compares the average word accuracy obtained by averaging the results over all noise types and SNRs (from 0 to 20 dB). The fractional power value for the 1st-order approximation is set to 1/20. This value is obtained from the experimental results shown in Fig. 2. Figure 3 indicates that there is no difference in the recognition accuracy when only the type of spectral compressive operator is replaced. However, the impact of spectral compression becomes substantial when the VTS-based model adaptation technique is used. The experimental results in noisy environments confirm that the VTS-based adaptation with GCCs is superior to the conventional algorithms.
In addition, the results that take the stationarity of the noise into account are shown in Fig. 3. The fractional power value is set depending on whether the noise is stationary or non-stationary. For the stationary and non-stationary noise conditions, the fractional power values are set to 1/15 and 1/35, respectively. Note that a decision statistic for distinguishing stationary from non-stationary conditions can be easily designed by observing the spectral variation of the noise power spectral density.16 The experimental results show that the performance of the VTS-based model adaptation with GCCs is further improved by considering the stationarity of the background noise. The relative word error rate improvement compared to the conventional approach (a fractional power value of 0) is 2.45%.
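The selection rule above can be sketched in a few lines. The decision statistic used here, the temporal variance of the log noise power spectral density averaged over frequency bins, and the threshold of 0.1 are hypothetical choices made for illustration; the text only states that such a statistic can be designed from the spectral variation of the noise. A helper for the relative word error rate improvement is included, assuming that the word error rate equals 100% minus the word accuracy.

```python
import math


def stationarity_statistic(noise_psd_frames):
    """Mean over bins of the temporal variance of the log noise PSD."""
    num_frames = len(noise_psd_frames)
    num_bins = len(noise_psd_frames[0])
    total = 0.0
    for b in range(num_bins):
        logs = [math.log(frame[b]) for frame in noise_psd_frames]
        mean = sum(logs) / num_frames
        total += sum((v - mean) ** 2 for v in logs) / num_frames
    return total / num_bins


def select_fractional_power(noise_psd_frames, threshold=0.1):
    """Return 1/15 for stationary noise and 1/35 otherwise.

    The threshold is a hypothetical value, not taken from the source.
    """
    if stationarity_statistic(noise_psd_frames) < threshold:
        return 1.0 / 15.0
    return 1.0 / 35.0


def relative_wer_improvement(baseline_accuracy, new_accuracy):
    """Relative word error rate reduction in percent (accuracies in %)."""
    baseline_wer = 100.0 - baseline_accuracy
    new_wer = 100.0 - new_accuracy
    return 100.0 * (baseline_wer - new_wer) / baseline_wer
```

For instance, raising a hypothetical word accuracy from 90% to 91% lowers the word error rate from 10% to 9%, which `relative_wer_improvement(90.0, 91.0)` reports as a 10% relative improvement.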
V. Conclusion
In this letter, the impact of GCCs on the VTS-based model adaptation algorithm was investigated for various noisy environments. The optimal spectral compressive operator that maximizes the performance of the model adaptation with GCCs was found to vary with the stationarity of the background noise. In addition, it was shown that the model adaptation with GCCs was more effective than that with MFCCs for speech recognition in noisy environments, even when a lower-order VTS approximation was used. Experimental results also demonstrated the superiority of the VTS-based model adaptation with GCCs when the stationarity of the noise was considered.