It is well-known that the performance of acoustic-to-articulatory inversion improves by smoothing the articulatory trajectories estimated using Gaussian mixture model (GMM) mapping (denoted by GMM + Smoothing). GMM + Smoothing also performs similarly to GMM mapping using dynamic features, which integrates smoothing directly in the mapping criterion. Because smoothing and mapping are performed separately, it remains unclear what objective criterion GMM + Smoothing optimizes. In this work a new integrated smoothness criterion, the smoothed-GMM (SGMM), is proposed. GMM + Smoothing is shown, both analytically and experimentally, to be identical to the asymptotic solution of SGMM, suggesting that GMM + Smoothing is a near optimal solution of SGMM.
I. Introduction
The vocal articulators such as jaw, lips, tongue, and velum (VEL) move in a coordinated fashion when a person speaks. The articulators, however, move at a slower rate compared to the vocal tract resonance frequencies. It is sufficient to sample articulatory movements at 200 Hz to capture detailed dynamics of the critical speech articulators1 (compared to sampling the speech signal which requires a rate of 7.6 kHz even for telephone quality). The slow rate of articulatory movement causes the articulatory trajectories to be smooth and low-pass in nature; this is borne out in measurements obtained using techniques such as Electromagnetic Articulography (EMA)2 and UltraSound.3 This fact is exploited in speech modeling, including notably in acoustic-to-articulatory (AtoA) inversion that attempts to recover articulatory details from the speech signal. In particular, inversion performance has been demonstrated to improve by smoothing the estimated articulatory features either in a post-processing step4 or by incorporating smoothness directly in the estimation criterion.4,5
There are several AtoA inversion techniques, and Toutios and Margaritis6 provide a comprehensive summary of them. Among these techniques, we focus on the AtoA inversion based on Gaussian mixture model (GMM) mapping.4 In GMM based mapping, articulatory features are estimated separately in each analysis frame from acoustic features using the minimum mean squared error (MMSE) criterion. This causes the estimated articulatory feature trajectory to be rough and jagged in nature. To obtain a realistic articulatory trajectory from the GMM based estimate, the estimated trajectory is low-pass filtered, where the cutoff frequency of the low-pass filter is selected to achieve the best inversion performance.4 Toda et al.4 also proposed a GMM mapping using dynamic features under the maximum-likelihood criterion (thus integrating smoothing directly in the mapping process), which was found to yield performance similar to that of GMM mapping followed by low-pass filtering (denoted by GMM + Smoothing). It is important to note that in GMM + Smoothing, the smoothed articulatory features no longer remain optimal in the MMSE sense. This suggests that the estimates obtained using GMM + Smoothing could correspond to a criterion different from MMSE, one which in fact yields a better inversion performance than the MMSE criterion and a performance similar to that of dynamic feature based GMM mapping.
We propose a new smoothness criterion for inversion, called smoothed-GMM (SGMM), which combines smoothness with the information from GMM mapping within the same optimization framework rather than performing them separately. GMM mapping is shown to be a special case of SGMM. It is also analytically shown that GMM + Smoothing matches the solution of SGMM in the limit when the length of the test utterance becomes large. Experimental results on an articulatory database reveal that in practice this asymptotic limit is achieved even for an average utterance length of ∼2.75 s with a frame rate of 100 Hz. Thus, both theoretically and experimentally, GMM + Smoothing turns out to be a near optimal solution of SGMM.
II. The SGMM criterion
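Suppose the acoustic and articulatory feature vectors at the $n$th frame are denoted by $\mathbf{x}_n$ and $\mathbf{y}_n$, respectively: $\mathbf{x}_n = [x_n^1, \ldots, x_n^I]^T$ and $\mathbf{y}_n = [y_n^1, \ldots, y_n^J]^T$, where $x_n^i$ is the $i$th acoustic feature ($1 \le i \le I$) and $y_n^j$ is the $j$th articulatory feature ($1 \le j \le J$). $T$ denotes the transpose operator. In GMM based AtoA inversion, a GMM is used to model the joint probability given by $p(\mathbf{x}_n, \mathbf{y}_n \mid \Theta) = \sum_{i=1}^{M} w_i\, \mathcal{N}\big([\mathbf{x}_n^T\ \mathbf{y}_n^T]^T;\ \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i\big)$. $\Theta = \{w_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i;\ 1 \le i \le M\}$ are the GMM parameters, where $M$ is the number of mixture components. $w_i$ denotes the mixture weight for the $i$th mixture. $\boldsymbol{\mu}_i = [(\boldsymbol{\mu}_i^{x})^T\ (\boldsymbol{\mu}_i^{y})^T]^T$ is the mean vector of the $i$th mixture, and $\boldsymbol{\mu}_i^{x}$ and $\boldsymbol{\mu}_i^{y}$ denote the mean vectors of the $i$th mixture for $\mathbf{x}_n$ and $\mathbf{y}_n$, respectively. Similarly, $\boldsymbol{\Sigma}_i$ denotes the full covariance matrix of the $i$th mixture, which is given by

$$\boldsymbol{\Sigma}_i = \begin{bmatrix} \boldsymbol{\Sigma}_i^{xx} & \boldsymbol{\Sigma}_i^{xy} \\ \boldsymbol{\Sigma}_i^{yx} & \boldsymbol{\Sigma}_i^{yy} \end{bmatrix},$$

where $\boldsymbol{\Sigma}_i^{xx}$ and $\boldsymbol{\Sigma}_i^{yy}$ denote the covariance matrices of the $i$th mixture for $\mathbf{x}_n$ and $\mathbf{y}_n$, respectively, and $\boldsymbol{\Sigma}_i^{xy}$, $\boldsymbol{\Sigma}_i^{yx}$ represent the cross-covariance matrices of the $i$th mixture.
Given an acoustic feature vector sequence of length $N$ frames, $\mathbf{x}_n$, $1 \le n \le N$, the goal of AtoA inversion is to estimate the corresponding articulatory feature vector sequence, $\mathbf{y}_n$, $1 \le n \le N$. Under the MMSE criterion, the GMM based estimate of the $j$th articulatory feature in the $n$th frame is

$$\hat{y}_n^j = \sum_{i=1}^{M} p_{i,n}\, \mu_{i,n}^{j}, \tag{1}$$

where $p_{i,n} = w_i\, \mathcal{N}(\mathbf{x}_n; \boldsymbol{\mu}_i^{x}, \boldsymbol{\Sigma}_i^{xx}) \big/ \sum_{k=1}^{M} w_k\, \mathcal{N}(\mathbf{x}_n; \boldsymbol{\mu}_k^{x}, \boldsymbol{\Sigma}_k^{xx})$ is the posterior probability of the $i$th mixture given $\mathbf{x}_n$, and $\mu_{i,n}^{j}$ is the $j$th element of the conditional mean $\boldsymbol{\mu}_i^{y} + \boldsymbol{\Sigma}_i^{yx} (\boldsymbol{\Sigma}_i^{xx})^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_i^{x})$. Note that $\sum_{i=1}^{M} p_{i,n} = 1$.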
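The frame-wise GMM mapping described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: it assumes the standard joint-GMM form of the MMSE estimate (a posterior-weighted sum of per-mixture conditional means), and the function names `gaussian_pdf` and `gmm_mmse_estimate` as well as the toy parameters are hypothetical, not taken from the paper.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x; mean, cov) for a single vector x."""
    diff = x - mean
    dim = mean.shape[0]
    norm = np.sqrt((2.0 * np.pi) ** dim * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def gmm_mmse_estimate(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """MMSE estimate of the articulatory vector for one acoustic frame x.

    Shapes (M mixtures, I acoustic dims, J articulatory dims):
    weights (M,), mu_x (M, I), mu_y (M, J), cov_xx (M, I, I), cov_yx (M, J, I).
    """
    n_mix = weights.shape[0]
    # Posterior probability of each mixture given the acoustic frame.
    lik = np.array([weights[i] * gaussian_pdf(x, mu_x[i], cov_xx[i])
                    for i in range(n_mix)])
    post = lik / lik.sum()
    # Conditional mean of y given x under each mixture.
    cond_means = np.array([
        mu_y[i] + cov_yx[i] @ np.linalg.solve(cov_xx[i], x - mu_x[i])
        for i in range(n_mix)
    ])
    # Posterior-weighted sum of the conditional means.
    return post @ cond_means, post, cond_means

# Toy parameters (made up for illustration): M = 2, I = 2, J = 1.
weights = np.array([0.5, 0.5])
mu_x = np.array([[0.0, 0.0], [10.0, 10.0]])
mu_y = np.array([[1.0], [-1.0]])
cov_xx = np.stack([np.eye(2), np.eye(2)])
cov_yx = np.array([[[0.5, 0.0]], [[0.0, 0.5]]])

x_frame = np.array([0.0, 0.0])
y_hat, post, cond_means = gmm_mmse_estimate(
    x_frame, weights, mu_x, mu_y, cov_xx, cov_yx)
print("posteriors:", post, "estimate:", y_hat)
```

Because each frame is processed independently of its neighbours, nothing in this estimate ties consecutive frames together, which is why the resulting trajectory tends to be jagged.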
Toda et al.4 reported that smoothing by low-pass filtering makes the estimated articulatory trajectory more realistic and improves the inversion performance. Here we propose a new smoothness criterion, called SGMM, which constrains the estimated trajectory to be smooth to a required degree while estimating the trajectory from the GMM mapping information. Thus, instead of estimating articulatory features in each frame independently (as done in GMM-based inversion), the SGMM criterion estimates the articulatory trajectory for an entire utterance. The $j$th articulatory feature trajectory is estimated by solving the following optimization problem:

$$\hat{\mathbf{z}}^j = \arg\min_{\mathbf{z}^j} J, \qquad J = C_j \sum_{n=1}^{N} \sum_{i=1}^{M} p_{i,n} \big(z_n^j - \mu_{i,n}^{j}\big)^2 + (1 - C_j) \sum_{n} \Big( \sum_{k} h_j[k]\, z_{n-k}^j \Big)^2. \tag{2}$$

$J$ is the objective function comprised of a convex combination of two terms with the convex weight $C_j$, $0 \le C_j \le 1$. $\mathbf{z}^j = [z_1^j, \ldots, z_N^j]^T$ is the optimization variable, with $z_n^j$ taken to be zero outside $1 \le n \le N$. The second term in $J$ is the total energy of the output of a high-pass filter with impulse response $h_j[n]$ (corresponding to the $j$th articulator) with $z_n^j$ as input. By minimizing the output of a high-pass filter, SGMM constrains the solution to be low-pass or smoothly varying in nature. $h_j[n]$ could be designed based on the degree of required smoothness for the $j$th articulator trajectory.
The first term in $J$ is designed so that it utilizes the mapping between acoustic and articulatory spaces using the conditional means $\mu_{i,n}^{j}$ with their weights $p_{i,n}$ derived from the GMM. The choice of $C_j$ provides a trade-off between the GMM mapping and the smoothness factor. The optimization in Eq. (2) is solved for $j = 1, \ldots, J$ separately to obtain the estimates of all $J$ articulatory feature trajectories.
III. Solution of the SGMM criterion based optimization
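The objective function $J$ in Eq. (2) is a convex (and quadratic) function of the optimization variables $z_n^j$, $1 \le n \le N$. Thus a global minimum is guaranteed. We define the autocorrelation sequence of the high-pass filter as $r_j[k] = \sum_{n} h_j[n]\, h_j[n+k]$. For minimization, the partial derivatives of $J$ with respect to $z_n^j$ are set to zero at $\mathbf{z}^j = \hat{\mathbf{z}}^j$ to obtain a set of $N$ equations in the following matrix vector form:

$$C_j \big(\hat{\mathbf{z}}^j - \mathbf{d}_j\big) + (1 - C_j)\, \mathbf{R}_j^T \hat{\mathbf{z}}^j = \mathbf{0}, \tag{3}$$

where $\mathbf{d}_j = [d_1^j, \ldots, d_N^j]^T$ with $d_n^j = \sum_{i=1}^{M} p_{i,n}\, \mu_{i,n}^{j}$ (using $\sum_{i=1}^{M} p_{i,n} = 1$). We can further write the set of equations as $\big[C_j \mathbf{I} + (1 - C_j) \mathbf{R}_j\big] \hat{\mathbf{z}}^j = C_j \mathbf{d}_j$, where $\mathbf{R}_j^T = \mathbf{R}_j$ (since the autocorrelation matrix is symmetric), $\mathbf{I}$ is the identity matrix, and the $(m,n)$th entry of $\mathbf{R}_j$ is $r_j[m-n]$. $\mathbf{R}_j$ is an autocorrelation matrix and hence symmetric Toeplitz and positive semi-definite. Thus, $C_j \mathbf{I} + (1 - C_j) \mathbf{R}_j$ is invertible for any choice of $0 < C_j \le 1$. The estimate of the $j$th articulatory feature trajectory thus can be obtained as follows:

$$\hat{\mathbf{z}}^j = C_j \big[C_j \mathbf{I} + (1 - C_j) \mathbf{R}_j\big]^{-1} \mathbf{d}_j. \tag{4}$$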
When $C_j = 0$ in Eq. (4) (i.e., only the second term in $J$ is considered), the estimated trajectory is trivially zero. In other words, when no GMM mapping information is included in the objective function $J$, the maximally smooth solution is an all-zero trajectory. On the other hand, when $C_j = 1$ (i.e., no smoothness constraint is imposed on the estimated articulatory trajectory), $\hat{\mathbf{z}}^j = \mathbf{I}^{-1} \mathbf{d}_j$ or $\hat{\mathbf{z}}^j = \mathbf{d}_j$, which is identical to the GMM mapping based estimate [Eq. (1)]. Thus, GMM based inversion is a special case of the optimization using the proposed SGMM criterion. For $0 < C_j < 1$, the estimated trajectory lies between the extremes of the all-zero trajectory and the jagged trajectory obtained using GMM based inversion.
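The closed-form solution and its two limiting cases can be checked numerically. The sketch below assumes the solution has the form $\hat{\mathbf{z}} = C\,[C\mathbf{I} + (1-C)\mathbf{R}]^{-1}\mathbf{d}$, with $\mathbf{R}$ the symmetric Toeplitz autocorrelation matrix of the high-pass filter and $\mathbf{d}$ the frame-wise GMM estimate. For brevity it uses a first-difference filter `h = [1, -1]` as a stand-in high-pass filter (an assumption made here, not the Chebyshev filter used in the experiments), and a synthetic noisy sinusoid in place of a real GMM estimate; `autocorr_toeplitz` and `sgmm_solve` are hypothetical helper names.

```python
import numpy as np

def autocorr_toeplitz(h, N):
    """N x N symmetric Toeplitz matrix with entries r[|m - n|], where
    r[k] = sum_n h[n] h[n + k] is the autocorrelation sequence of h."""
    h = np.asarray(h, dtype=float)
    L = len(h)
    r_pos = np.correlate(h, h, mode="full")[L - 1:]   # lags 0 .. L-1
    r = np.zeros(N)
    r[:min(N, L)] = r_pos[:N]
    idx = np.abs(np.arange(N)[:, None] - np.arange(N)[None, :])
    return r[idx]

def sgmm_solve(d, h, C):
    """Solve [C I + (1 - C) R] z = C d for the smoothed trajectory z."""
    N = len(d)
    R = autocorr_toeplitz(h, N)
    A = C * np.eye(N) + (1.0 - C) * R
    return C * np.linalg.solve(A, d)

# Synthetic "jagged" frame-wise estimate: slow sinusoid plus noise.
rng = np.random.default_rng(0)
n = np.arange(200)
d = np.sin(2.0 * np.pi * n / 100.0) + 0.2 * rng.standard_normal(200)

h = [1.0, -1.0]                      # stand-in high-pass filter (assumption)
z_unsmoothed = sgmm_solve(d, h, C=1.0)   # no smoothness term: returns d
z_smooth = sgmm_solve(d, h, C=0.05)      # strong smoothness weight

# High-pass output energy z^T R z, i.e. the smoothness term of the objective.
R = autocorr_toeplitz(h, len(d))
energy_d = d @ R @ d
energy_smooth = z_smooth @ R @ z_smooth
print("high-pass energy before/after:", energy_d, energy_smooth)
```

A dense solve is used only for clarity; since the system matrix is symmetric Toeplitz, a structured solver could be substituted for long utterances.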
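$\mathbf{R}_j$ is, in general, an $N \times N$ positive semi-definite symmetric Toeplitz matrix with its entries coming from the autocorrelation sequence of $h_j[n]$ (i.e., $r_j[n]$) with the corresponding spectrum $|H_j(\omega)|^2$. $\mathbf{R}_j$ is also a convolution matrix with the corresponding impulse response $r_j[n]$. Let $\rho_j = (1 - C_j)/C_j$, so that Eq. (4) reads $\hat{\mathbf{z}}^j = (\mathbf{I} + \rho_j \mathbf{R}_j)^{-1} \mathbf{d}_j$. Hence, $\mathbf{I} + \rho_j \mathbf{R}_j$ is an $N \times N$ positive definite symmetric Toeplitz matrix with the related spectrum $1 + \rho_j |H_j(\omega)|^2$. Note that $1 + \rho_j |H_j(\omega)|^2 \ge 1 > 0$, $\forall \omega$; the addition of “1” acts as a regularization ensuring the invertibility of the spectrum (similar to $\mathbf{I}$ for the invertibility of $\mathbf{I} + \rho_j \mathbf{R}_j$). Using a result on the inverse of a Toeplitz matrix [Eq. (5.5) in Ref. 8], it is easy to show that $(\mathbf{I} + \rho_j \mathbf{R}_j)^{-1}$ is asymptotically (as $N \to \infty$) Toeplitz with the corresponding spectrum $1/\big(1 + \rho_j |H_j(\omega)|^2\big)$. Since $|H_j(\omega)|^2$ is a high-pass spectrum and $\rho_j \ge 0$, it is easy to see that $1/\big(1 + \rho_j |H_j(\omega)|^2\big)$ is a low-pass spectrum, where $\rho_j$ controls the stop band attenuation of the low-pass filter. Hence, in the limit $N \to \infty$, $(\mathbf{I} + \rho_j \mathbf{R}_j)^{-1}$ acts as a convolution matrix whose impulse response $Q_n^j$ is that of a low-pass filter with spectrum $1/\big(1 + \rho_j |H_j(\omega)|^2\big)$, where $Q_n^j$ is the inverse Fourier transform of this spectrum. Thus, asymptotically $\hat{\mathbf{z}}^j$ [Eq. (4)] is a low-passed or smoothed version of $\mathbf{d}_j$, the GMM based estimate. This proves that the solution of SGMM asymptotically matches GMM + Smoothing.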
For illustration, we consider a fifth order rational transfer function of a type II Chebyshev high-pass filter with a 40 dB attenuation at 10 Hz with sampling frequency 100 Hz as shown in Fig. 1. For finite $N$, we pick the $N/2$th ($N/2+1$th for even $N$) row of $(\mathbf{I} + \rho \mathbf{R})^{-1}$ as the representative impulse response for $(\mathbf{I} + \rho \mathbf{R})^{-1}$. We compute the mean squared error (MSE) between this row and $Q_n$ over the same support as shown in Fig. 1(d) for different values of $\rho$. It is clear that the MSE becomes effectively zero for $N = 150$ (corresponding to 1.5 s with 100 Hz frame rate) for $\rho = 999$. For $\rho = 99$ and 9, the MSE becomes effectively zero even for lower values of $N$, indicating the asymptotic equivalence between $(\mathbf{I} + \rho \mathbf{R})^{-1}$ and the low-pass convolution matrix with impulse response $Q_n$.
(Color online) Illustration of the asymptotic equivalence between $(\mathbf{I} + \rho \mathbf{R})^{-1}$ and a low-pass convolution matrix with spectrum $1/\big(1 + \rho |H(\omega)|^2\big)$ (index “j” is omitted for simplicity): (a) A high-pass spectrum $|H(\omega)|^2$, (b) $1 + \rho |H(\omega)|^2$, (c) $1/\big(1 + \rho |H(\omega)|^2\big)$, and (d) MSE between the impulse response representing $(\mathbf{I} + \rho \mathbf{R})^{-1}$ and $Q_n$, the inverse Fourier transform of $1/\big(1 + \rho |H(\omega)|^2\big)$.
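The asymptotic equivalence can also be reproduced qualitatively with a short numerical experiment. The sketch below compares the middle row of $(\mathbf{I} + \rho\mathbf{R})^{-1}$ with the inverse DFT of $1/(1 + \rho|H(\omega)|^2)$ on a dense frequency grid. It is an assumption-laden illustration: it again uses the first-difference filter `h = [1, -1]` rather than the fifth order Chebyshev filter of the figure, the grid size `K` is an arbitrary choice, and `middle_row_mse` is a hypothetical helper name, so the numbers will not match Fig. 1(d).

```python
import numpy as np

def middle_row_mse(h, rho, N, K=4096):
    """MSE between the middle row of inv(I + rho R) and the impulse
    response q[n] whose spectrum is 1 / (1 + rho |H(w)|^2).

    K is the DFT grid size used to approximate the inverse Fourier
    transform; it must be at least N."""
    h = np.asarray(h, dtype=float)
    L = len(h)
    # Symmetric Toeplitz autocorrelation matrix of h.
    r_pos = np.correlate(h, h, mode="full")[L - 1:]
    r = np.zeros(N)
    r[:min(N, L)] = r_pos[:N]
    idx = np.abs(np.arange(N)[:, None] - np.arange(N)[None, :])
    R = r[idx]
    # Representative impulse response: the middle row of the inverse.
    m = N // 2
    row = np.linalg.inv(np.eye(N) + rho * R)[m]
    # Low-pass spectrum and its inverse DFT (real and symmetric in lag).
    H = np.fft.fft(h, K)
    q = np.fft.ifft(1.0 / (1.0 + rho * np.abs(H) ** 2)).real
    lags = np.arange(N) - m            # lag of each column w.r.t. row m
    return float(np.mean((row - q[lags % K]) ** 2))

h = [1.0, -1.0]                        # stand-in high-pass filter (assumption)
mse_short = middle_row_mse(h, rho=99.0, N=20)
mse_long = middle_row_mse(h, rho=99.0, N=300)
print("MSE for N=20:", mse_short, " MSE for N=300:", mse_long)
```

For this filter the mismatch comes from boundary effects of the finite matrix, which decay as the middle row moves away from the edges with growing $N$.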
IV. Experimental evaluation
While we argue in Sec. III that the solution of SGMM asymptotically matches GMM + Smoothing, it is important to note that the low-pass filter in the limit has a frequency response of the form $1/\big(1 + \rho_j |H_j(\omega)|^2\big)$ [i.e., infinite impulse response (IIR) filter] in the case of SGMM, but in the case of GMM + Smoothing the low-pass filter can be either finite impulse response or IIR and its frequency response need not have a specific form. We conduct AtoA inversion experiments on an articulatory dataset comprising utterances of different lengths to examine the role that the particular form of a low-pass filter in SGMM may play on the inversion performance, specifically for different choices of $N$. The dataset and experimental details are described below.
A. Dataset and pre-processing
For the AtoA experiment, we use the Multichannel Articulatory (MOCHA) database9 that contains speech and the corresponding EMA data from one male and one female talker of British English. The EMA data consists of dynamic positions of the EMA sensors in the mid-sagittal plane of the talker. A total of seven sensors are placed on the upper lip (UL), lower lip (LL), lower incisor (JAW), tongue tip (TT), tongue body (TB), tongue dorsum (TD), and VEL. Following the preprocessing steps outlined by Ghosh and Narayanan,5 we obtain parallel acoustic and articulatory data at a frame rate of 100 observations/s. We use 14 dimensional raw EMA features for representing the articulatory space (i.e., X and Y co-ordinates of 7 EMA sensors), namely ULx, LLx, JAWx, TTx, TBx, TDx, VELx, ULy, LLy, JAWy, TTy, TBy, TDy, and VELy. Acoustic features are represented by 39 dimensional Mel-frequency cepstral coefficients (MFCCs) and are computed using 20 msec analysis frame length with 10 msec shift.
B. Experimental setup
AtoA inversion is performed separately on the male and female subjects of the MOCHA corpus using a fivefold cross-validation setup. Inversion performance is measured over all sentences of all folds through average root mean squared error (RMSE) and Pearson correlation coefficient (PCC)10 between the original and estimated articulatory trajectories.
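The two evaluation measures can be computed per trajectory as in the following minimal sketch. The function names `rmse` and `pcc` and the synthetic trajectories are illustrative assumptions; the averaging over sentences and folds described above is not shown.

```python
import numpy as np

def rmse(y_true, y_est):
    """Root mean squared error between two trajectories of equal length."""
    y_true = np.asarray(y_true, dtype=float)
    y_est = np.asarray(y_est, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_est) ** 2)))

def pcc(y_true, y_est):
    """Pearson correlation coefficient between two trajectories."""
    a = np.asarray(y_true, dtype=float)
    b = np.asarray(y_est, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Synthetic example: the "estimate" is the reference shifted by a constant.
t = np.arange(100) / 100.0                     # 1 s at 100 frames/s
reference = np.sin(2.0 * np.pi * 2.0 * t)
estimate = reference + 0.5

print("RMSE:", rmse(reference, estimate))
print("PCC:", pcc(reference, estimate))
```

The example highlights that the two measures are complementary: a constant offset leaves the correlation untouched while it is fully reflected in the RMSE.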
Following the finding by Toda et al.,4 64 mixture component GMMs are used to model the acoustic-articulatory map in the training data separately for each fold. In the case of SGMM, we use a fifth order type II Chebyshev high-pass filter with 40 dB stop band attenuation as $h_j$, with an articulator-specific cut-off frequency for the $j$th articulator. Different values of this cut-off frequency and of $C_j$ were experimented with. We report AtoA inversion performance corresponding to the combination of cut-off frequency and $C_j$ which gives the least average RMSE. In the case of GMM + Smoothing, a fifth order type II Chebyshev low-pass filter with 40 dB stop band attenuation is used for smoothing; its cut-off frequency is also varied over the same range as that for the SGMM high-pass filter, and the best performance among these is reported for each articulator.
C. Results and discussions
Figure 2 shows the AtoA inversion performance in terms of RMSE and PCC for each articulator of both subjects in the MOCHA corpus. It is evident that the inversion performances using SGMM and GMM + Smoothing are not significantly different. Thus, inversion experiments support our theoretical finding that GMM + Smoothing is a near optimal solution of the SGMM criterion. An advantage of using SGMM over GMM + Smoothing is that the solution of SGMM can be computed in a recursive manner.5 Optimal choices of $C_j$ in the case of the male and female subjects turn out to be in the range of 0.005 to 0.1 (equivalently, $\rho_j = (1 - C_j)/C_j$ between 9 and 199), suggesting that low-pass filters with high stop band attenuation are preferred in SGMM. If a sentence is too short to satisfy the asymptotic limit, it could be lengthened by padding it with silence, with the articulatory features then considered only in the segment of interest. It should also be noted that the functional forms of the generalized smoothness criterion (GSC)5 and SGMM appear to be similar, except that in GSC the training data is used in a non-parametric fashion while SGMM uses parameters of a GMM learned from the training data.
Comparison of SGMM and GMM + Smoothing - error bars indicate average inversion performance with ± one standard deviation.
Given a high-pass filter $H(\omega)$ in SGMM, one can always find a low-pass filter $A(\omega) = 1/\big(1 + \rho |H(\omega)|^2\big)$ and perform GMM + Smoothing with $A(\omega)$ to achieve an inversion performance similar to SGMM with $H(\omega)$. However, the opposite is not true in general. This is because not every low-pass filter $A(\omega)$ can be put in the form $1/\big(1 + \rho |H(\omega)|^2\big)$. For example, the type II Chebyshev low-pass filter used in the AtoA experiments is not in this particular form. In spite of that, the AtoA inversion performances using GMM + Smoothing and SGMM turn out to be similar. This suggests that although it could be difficult to find a high-pass filter in SGMM corresponding to an arbitrary low-pass filter $A(\omega)$ in GMM + Smoothing, SGMM with a high-pass filter $H(\omega)$ that does not exactly satisfy $A(\omega) = 1/\big(1 + \rho |H(\omega)|^2\big)$ could lead to an inversion performance similar to that of GMM + Smoothing with $A(\omega)$. For a given $A(\omega)$, one could also find a low-pass filter of the form $1/\big(1 + \rho |H(\omega)|^2\big)$ that best approximates $A(\omega)$; SGMM with the corresponding $H(\omega)$ as the high-pass filter will then lead to an inversion performance similar to that of GMM + Smoothing with $A(\omega)$.
V. Conclusions
We present a new unified criterion (SGMM) for estimation and smoothing for AtoA inversion; its solution is shown, both theoretically and experimentally, to be identical to the individually optimized GMM + Smoothing based solution in the limiting case. In practice, these results seem to hold for utterances just a few seconds long. Since in GMM + Smoothing based inversion the GMM mapping and smoothing are performed separately, this finding offers an additional insight as to what underlying criterion is being optimized in GMM + Smoothing.