It is well known that the performance of acoustic-to-articulatory inversion improves by smoothing the articulatory trajectories estimated using Gaussian mixture model (GMM) mapping (denoted by GMM + Smoothing). GMM + Smoothing also provides performance similar to that of GMM mapping using dynamic features, which integrates smoothing directly in the mapping criterion. Due to the separation between smoothing and mapping, what objective criterion GMM + Smoothing optimizes remains unclear. In this work a new integrated smoothness criterion, the smoothed-GMM (SGMM), is proposed. GMM + Smoothing is shown, both analytically and experimentally, to be identical to the asymptotic solution of SGMM, suggesting that GMM + Smoothing is a near optimal solution of SGMM.

## I. Introduction

The vocal articulators such as jaw, lips, tongue, and velum (VEL) move in a coordinated fashion when a person speaks. The articulators, however, move at a slower rate compared to the vocal tract resonance frequencies. It is sufficient to sample articulatory movements at 200 Hz to capture detailed dynamics of the critical speech articulators^{1} (compared to sampling the speech signal which requires a rate of 8 kHz even for telephone quality). The slow rate of articulatory movement causes the articulatory trajectories to be smooth and low-pass in nature; this is borne out in measurements obtained using techniques such as Electromagnetic Articulography (EMA)^{2} and ultrasound.^{3} This fact is exploited in speech modeling, notably in acoustic-to-articulatory (AtoA) inversion, which attempts to recover articulatory details from the speech signal. In particular, inversion performance has been demonstrated to improve by smoothing the estimated articulatory features either in a post-processing step^{4} or by incorporating smoothness directly in the estimation criterion.^{4,5}

There are several AtoA inversion techniques, and Toutios and Margaritis^{6} provide a comprehensive summary of them. Among these techniques, we focus on the AtoA inversion based on Gaussian mixture model (GMM) mapping.^{4} In GMM based mapping, articulatory features are estimated separately in each analysis frame from acoustic features using the minimum mean squared error (MMSE) criterion. This causes the estimated articulatory feature trajectory to be rough and jagged in nature. To obtain a realistic articulatory trajectory from the GMM based estimate, the estimated trajectory is low-pass filtered, where the cutoff frequency of the low-pass filter is selected to achieve the best inversion performance.^{4} Toda *et al.*^{4} also proposed a GMM mapping using dynamic features under the maximum-likelihood criterion (thus integrating smoothing directly in the mapping process), which was found to yield performance similar to that of GMM mapping followed by low-pass filtering (denoted by GMM + Smoothing). It is important to note that in GMM + Smoothing, the smoothed articulatory features no longer remain optimal in the MMSE sense. This suggests that a criterion different from MMSE could correspond to the estimates obtained using GMM + Smoothing, which in fact yields a better inversion performance than the MMSE criterion and a performance similar to that of dynamic feature based GMM mapping.

We propose a new smoothness criterion for inversion, called smoothed-GMM (SGMM), which combines smoothness with the information from GMM mapping within the same optimization framework rather than performing them separately. GMM mapping is shown to be a special case of SGMM. It is also analytically shown that GMM + Smoothing matches the solution of SGMM in the limit when the length of the test utterance becomes large. Experimental results on an articulatory database reveal that in practice this asymptotic limit is achieved even for an average utterance length of ∼2.75 s with a frame rate of 100 Hz. Thus, both theoretically and experimentally, GMM + Smoothing turns out to be a near optimal solution of SGMM.

## II. The SGMM criterion

Suppose the acoustic and articulatory feature vectors at the *n*th frame are denoted by $\mathbf{x}_n$ and $\mathbf{y}_n$, respectively, where $\mathbf{x}_n=[x_n^1\;\,x_n^2\;\,\cdots\;\,x_n^I]^T$ and $\mathbf{y}_n=[y_n^1\;\,y_n^2\;\,\cdots\;\,y_n^J]^T$; $x_n^i$ is the *i*th acoustic feature $(1\le i\le I)$ and $y_n^j$ is the *j*th articulatory feature $(1\le j\le J)$. $T$ denotes the transpose operator. In GMM based AtoA inversion, a GMM is used to model the joint probability $p(\mathbf{x}_n,\mathbf{y}_n|\Theta)$ given by $p(\mathbf{x}_n,\mathbf{y}_n|\Theta)=\sum_{i=1}^{M}w_i\,\mathcal{N}(\mathbf{x}_n,\mathbf{y}_n;\,\boldsymbol{\mu}_i,\,\boldsymbol{\Sigma}_i)$. $\Theta$ are the GMM parameters $\{w_i,\,\boldsymbol{\mu}_i,\,\boldsymbol{\Sigma}_i\}_{i=1}^{M}$, where *M* is the number of mixture components. $w_i$ denotes the mixture weight for the *i*th mixture. $\boldsymbol{\mu}_i=[\boldsymbol{\mu}_i^{(x)T}\;\boldsymbol{\mu}_i^{(y)T}]^T$ is the mean vector of the *i*th mixture, and $\boldsymbol{\mu}_i^{(x)}$ and $\boldsymbol{\mu}_i^{(y)}$ denote the mean vectors of the *i*th mixture for $\mathbf{x}_n$ and $\mathbf{y}_n$, respectively. Similarly, $\boldsymbol{\Sigma}_i$ denotes the full covariance matrix of the *i*th mixture, which is given by

$$\boldsymbol{\Sigma}_i=\begin{bmatrix}\boldsymbol{\Sigma}_i^{(xx)} & \boldsymbol{\Sigma}_i^{(xy)}\\ \boldsymbol{\Sigma}_i^{(yx)} & \boldsymbol{\Sigma}_i^{(yy)}\end{bmatrix},$$

where $\boldsymbol{\Sigma}_i^{(xx)}$ and $\boldsymbol{\Sigma}_i^{(yy)}$ denote the covariance matrices of the *i*th mixture for $\mathbf{x}_n$ and $\mathbf{y}_n$, respectively, and $\boldsymbol{\Sigma}_i^{(xy)}$, $\boldsymbol{\Sigma}_i^{(yx)}$ represent the cross-covariance matrices of the *i*th mixture.

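Given an acoustic feature vector sequence of length *N* frames, $\mathbf{x}_n$, $1\le n\le N$, the goal of AtoA inversion is to estimate the corresponding articulatory feature vector sequence, $\hat{\mathbf{y}}_n$, $1\le n\le N$. In GMM based inversion, the MMSE estimate in each frame is the conditional mean

$$\hat{\mathbf{y}}_n=E(\mathbf{y}_n|\mathbf{x}_n,\Theta)=\sum_{i=1}^{M}p(m_i|\mathbf{x}_n,\Theta)\,E(\mathbf{y}_n|\mathbf{x}_n,m_i,\Theta),\tag{1}$$

where $p(m_i|\mathbf{x}_n,\Theta)=w_i\,\mathcal{N}(\mathbf{x}_n;\boldsymbol{\mu}_i^{(x)},\boldsymbol{\Sigma}_i^{(xx)})\big/\sum_{j=1}^{M}w_j\,\mathcal{N}(\mathbf{x}_n;\boldsymbol{\mu}_j^{(x)},\boldsymbol{\Sigma}_j^{(xx)})$ and $E(\mathbf{y}_n|\mathbf{x}_n,m_i,\Theta)=\boldsymbol{\mu}_i^{(y)}+\boldsymbol{\Sigma}_i^{(yx)}\boldsymbol{\Sigma}_i^{(xx)-1}(\mathbf{x}_n-\boldsymbol{\mu}_i^{(x)})$. Note that $\sum_i p(m_i|\mathbf{x}_n,\Theta)=1$.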
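The frame-wise GMM mapping described above (posterior-weighted sum of per-mixture conditional means) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `gmm_mmse_estimate`, its argument layout, and the use of SciPy are assumptions made here for clarity.

```python
import numpy as np
from scipy.stats import multivariate_normal


def gmm_mmse_estimate(x, weights, means, covs, dim_x):
    """Posterior-weighted conditional mean of y given x under a joint GMM.

    x       : (I,) acoustic feature vector of one frame
    weights : (M,) mixture weights w_i
    means   : (M, I + J) joint mean vectors [mu_x, mu_y]
    covs    : (M, I + J, I + J) full joint covariance matrices
    dim_x   : I, the number of acoustic dimensions (hypothetical argument name)
    """
    M = len(weights)
    log_post = np.empty(M)
    cond_means = np.empty((M, means.shape[1] - dim_x))
    for i in range(M):
        mu_x, mu_y = means[i, :dim_x], means[i, dim_x:]
        S_xx = covs[i, :dim_x, :dim_x]
        S_yx = covs[i, dim_x:, :dim_x]
        # log of w_i * N(x; mu_x, S_xx), i.e. the unnormalised posterior
        log_post[i] = np.log(weights[i]) + multivariate_normal.logpdf(x, mu_x, S_xx)
        # E(y | x, m_i) = mu_y + S_yx S_xx^{-1} (x - mu_x)
        cond_means[i] = mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x)
    # Normalise in the log domain so that the posteriors sum to one.
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return post @ cond_means, post
```

Applying this function to each frame independently gives the jagged GMM trajectory; the per-frame output for the *j*th feature corresponds to $\Delta_n^j$ in Sec. III.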

Toda *et al.*^{4} reported that smoothing $\hat{\mathbf{y}}_n$ by low-pass filtering makes the estimated articulatory trajectory more realistic and improves the inversion performance. Here we propose a new smoothness criterion, called SGMM, which constrains the estimated trajectory to be smooth to a required degree while estimating the trajectory from the GMM mapping information. Thus, instead of estimating articulatory features in each frame independently (as done in GMM-based inversion), the SGMM criterion estimates the articulatory trajectory for an entire utterance. The *j*th articulatory feature trajectory is estimated by solving the following optimization problem:

$$\{\hat{y}_n^j;\,1\le n\le N\}=\arg\min_{\{z_n^j\}}J,\qquad J=C^j\sum_{n=1}^{N}\sum_{i=1}^{M}p(m_i|\mathbf{x}_n,\Theta)\left(z_n^j-E(y_n^j|\mathbf{x}_n,m_i,\Theta)\right)^2+(1-C^j)\sum_{n}\left(\sum_{k=1}^{N}z_k^j\,h_{n-k}^j\right)^2.\tag{2}$$

*J* is the objective function comprised of a convex combination of two terms with the convex weight $C^j$ $(0\le C^j\le 1)$. $z_n^j$ is the optimization variable. The second term in *J* is the total energy of the output of a high-pass filter with impulse response $h_n^j$ (corresponding to the *j*th articulator) with $z_n^j$ as input. By minimizing the output of a high-pass filter, SGMM constrains the solution to be low-pass or smoothly varying in nature. $h_n^j$ could be designed based on the degree of smoothness required for the *j*th articulator trajectory.

The first term in *J* is designed so that it utilizes the mapping between acoustic and articulatory spaces using the conditional means with their weights derived from the GMM. The choice of $C^j$ provides a trade-off between the GMM mapping and the smoothness factor. The optimization in Eq. (2) is solved for $j=1,\ldots,J$ separately to obtain the estimates of all *J* articulatory feature trajectories.

## III. Solution of the SGMM criterion based optimization

The objective function *J* in Eq. (2) is a convex (and quadratic) function of the optimization variables $\{z_n^j;\,1\le n\le N\}$. Thus a global minimum is guaranteed. We define the autocorrelation sequence of the high-pass filter $h_n^j$ as $R_{l-k}^j\triangleq\sum_n h_{n-k}^j h_{n-l}^j$. For minimization, the partial derivatives of *J* with respect to $z_n^j$ are set to zero at $z_n^j=\hat{y}_n^j$ to obtain the following set of *N* linear equations:

$$(1-C^j)\sum_{k=1}^{N}R_{l-k}^j\,\hat{y}_k^j+C^j\,\hat{y}_l^j=C^j\,\Delta_l^j,\qquad 1\le l\le N,\tag{3}$$

where $\Delta_l^j=\sum_i p(m_i|\mathbf{x}_l,\Theta)\,E(y_l^j|\mathbf{x}_l,m_i,\Theta)$. We can further write the set of equations as $((1-C^j)\mathbf{R}^j+C^j\mathbf{I})\,\hat{\mathbf{y}}^j=C^j\mathbf{d}^j$, where $\mathbf{R}^j=\{R_{kl}^j\}=\{R_{k-l}^j\}=\{R_{|k-l|}^j\}$ (since the autocorrelation sequence is symmetric), $\mathbf{I}$ is the $N\times N$ identity matrix, $\hat{\mathbf{y}}^j=[\hat{y}_1^j,\cdots,\hat{y}_N^j]^T$, and $\mathbf{d}^j=[\Delta_1^j,\cdots,\Delta_N^j]^T$. $\mathbf{R}^j$ is an autocorrelation matrix and hence symmetric Toeplitz and positive semi-definite. Thus, $((1-C^j)\mathbf{R}^j+C^j\mathbf{I})$ is positive definite, and therefore invertible, for any choice of $C^j$ $(0<C^j<1)$. The estimate of the *j*th articulatory feature trajectory can thus be obtained as follows:

$$\hat{\mathbf{y}}^j=C^j\left((1-C^j)\mathbf{R}^j+C^j\mathbf{I}\right)^{-1}\mathbf{d}^j.\tag{4}$$

When $C^j=0$ in Eq. (4) (i.e., only the second term in *J* is considered), the estimated trajectory $\hat{\mathbf{y}}^j$ is trivially zero. In other words, when no GMM mapping information is included in the objective function *J*, the maximally smooth solution is an all-zero trajectory. On the other hand, when $C^j=1$ (i.e., no smoothness constraint is imposed on the estimated articulatory trajectory), $\hat{\mathbf{y}}^j=\mathbf{d}^j$ or $\hat{y}_n^j=\Delta_n^j=\sum_i p(m_i|\mathbf{x}_n,\Theta)\,E(y_n^j|\mathbf{x}_n,m_i,\Theta)$, which is identical to the GMM mapping based estimate [Eq. (1)]. Thus, GMM based inversion is a special case of the optimization using the proposed SGMM criterion. For $0<C^j<1$, the estimated trajectory lies between the extremes of the all-zero trajectory and the jagged trajectory obtained using GMM based inversion.
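As a rough numerical illustration of the linear system $((1-C^j)\mathbf{R}^j+C^j\mathbf{I})\,\hat{\mathbf{y}}^j=C^j\mathbf{d}^j$, the sketch below builds the Toeplitz autocorrelation matrix from a filter impulse response and solves for the trajectory. It assumes a finite-length (FIR) high-pass impulse response so that the autocorrelation sequence can be computed exactly; the function name `sgmm_trajectory` and the first-difference filter in the usage lines are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.linalg import toeplitz, solve


def sgmm_trajectory(d, h, C):
    """Solve ((1 - C) R + C I) y = C d for one articulatory feature.

    d : (N,) frame-wise GMM estimates (the Delta_n values)
    h : (L,) impulse response of a high-pass FIR filter (illustrative assumption)
    C : convex weight, 0 < C <= 1
    """
    d = np.asarray(d, dtype=float)
    h = np.asarray(h, dtype=float)
    N = len(d)
    # Autocorrelation of h at lags 0..L-1; it is zero at larger lags.
    lags = np.correlate(h, h, mode="full")[len(h) - 1:]
    r = np.zeros(N)
    r[:min(N, len(lags))] = lags[:N]
    R = toeplitz(r)  # symmetric Toeplitz autocorrelation matrix
    A = (1.0 - C) * R + C * np.eye(N)
    # A is symmetric positive definite for C > 0, so a Cholesky-based solve applies.
    return solve(A, C * d, assume_a="pos")


# Usage with a jagged toy trajectory and a first-difference high-pass filter.
rng = np.random.default_rng(0)
d_toy = np.sin(np.linspace(0.0, 3.0, 200)) + 0.1 * rng.standard_normal(200)
y_smooth = sgmm_trajectory(d_toy, [1.0, -1.0], C=0.05)
```

Setting `C=1.0` returns `d` unchanged, matching the special case discussed above, while smaller values of `C` reduce the high-pass energy of the solution.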

$\mathbf{R}^j$ is, in general, an $N\times N$ positive semi-definite symmetric Toeplitz matrix with its entries coming from the autocorrelation sequence of $h_n^j$ (i.e., $R_n^j$) with the corresponding spectrum $|H^j(\omega)|^2$. $\mathbf{R}^j$ is also a convolution matrix with the corresponding impulse response $R_n^j$. Let $\rho^j=(1-C^j)/C^j$. Hence, $\mathbf{I}+\rho^j\mathbf{R}^j$ is an $N\times N$ positive definite symmetric Toeplitz matrix with the related spectrum $1+\rho^j|H^j(\omega)|^2$. Note that $1+\rho^j|H^j(\omega)|^2>0$, $\forall\omega$; the addition of "1" acts as a regularization ensuring the invertibility of the spectrum $1+\rho^j|H^j(\omega)|^2$ (similar to $\mathbf{I}$ for the invertibility of $\mathbf{I}+\rho^j\mathbf{R}^j$). Using a result on the inverse of a Toeplitz matrix [Eq. (5.5) in Ref. 8], it is easy to show that $(\mathbf{I}+\rho^j\mathbf{R}^j)^{-1}$ is asymptotically (as $N\to\infty$) Toeplitz with the corresponding spectrum $|G^j(\omega)|^2=1/(1+\rho^j|H^j(\omega)|^2)$. Since $|H^j(\omega)|^2$ is a high-pass spectrum and $\rho^j>0$, it is easy to see that $|G^j(\omega)|^2$ is a low-pass spectrum, where $\rho^j$ controls the stop band attenuation of the low-pass filter. Hence, in the limit $N\to\infty$, $(\mathbf{I}+\rho^j\mathbf{R}^j)^{-1}$ acts as a convolution matrix with a corresponding impulse response $Q_n^j$ of a low-pass filter with spectrum $|G^j(\omega)|^2$, where $Q_n^j$ is the inverse Fourier transform of $|G^j(\omega)|^2$. Thus, asymptotically $\hat{\mathbf{y}}^j$ [Eq. (4)] is a low-passed or smoothed version of $\mathbf{d}^j$, the GMM based estimate. Thus we prove that the solution of SGMM asymptotically matches GMM + Smoothing.

For illustration, we consider a fifth order rational transfer function $H(\omega)$ of a type II Chebyshev high-pass filter with a 40 dB attenuation at 10 Hz with sampling frequency 100 Hz as shown in Fig. 1. For finite *N*, we pick the $N/2$th [$(N/2+1)$th for even *N*] row of $(\mathbf{I}+\rho\mathbf{R})^{-1}$ as the representative impulse response $P_n$ for $(\mathbf{I}+\rho\mathbf{R})^{-1}$. We compute the mean squared error (MSE) $E_{P-Q}$ between $P_n$ and $Q_n$ over the same support as shown in Fig. 1(d) for different values of $\rho$. It is clear that $E_{P-Q}$ becomes zero for *N* = 150 (corresponding to 1.5 s with a 100 Hz frame rate) for $\rho=999$. For $\rho=99$ and 9, $E_{P-Q}$ becomes zero even for lower values of *N*, indicating the asymptotic equivalence between $(\mathbf{I}+\rho\mathbf{R})^{-1}$ and $|G(\omega)|^2$.
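The comparison between a centre row of $(\mathbf{I}+\rho\mathbf{R})^{-1}$ and the limiting low-pass impulse response can be reproduced in spirit with a few lines of NumPy. The sketch below is an assumption-laden stand-in: instead of the fifth order type II Chebyshev filter used above, it uses the first-difference filter $h=[1,-1]$, whose autocorrelation is known in closed form, so its numbers are not those of Fig. 1. The function name `middle_row_vs_limit` and the FFT grid size are hypothetical.

```python
import numpy as np
from scipy.linalg import toeplitz


def middle_row_vs_limit(N, rho, n_fft=4096):
    """MSE between the centre row of (I + rho R)^-1 and the limiting response.

    Stand-in high-pass filter: h = [1, -1], for which R has 2 on the diagonal,
    -1 on the first off-diagonals, and |H(w)|^2 = 2 - 2 cos(w).
    Requires 2 <= N < n_fft.
    """
    r = np.zeros(N)
    r[0], r[1] = 2.0, -1.0
    A = np.eye(N) + rho * toeplitz(r)
    mid = N // 2                       # 0-based index of the centre row
    P = np.linalg.inv(A)[mid]          # representative impulse response P_n

    # Limiting response Q_n: inverse DFT of 1 / (1 + rho |H(w)|^2) on a dense grid.
    w = 2.0 * np.pi * np.arange(n_fft) / n_fft
    G = 1.0 / (1.0 + rho * (2.0 - 2.0 * np.cos(w)))
    q = np.fft.ifft(G).real            # q[k] ~ Q_k, q[-k] ~ Q_{-k}

    # Column c of the centre row corresponds to lag c - mid.
    lags = np.arange(N) - mid
    Q = q[lags % n_fft]
    return float(np.mean((P - Q) ** 2))


# The error shrinks as N grows, and larger rho needs larger N.
errors = {rho: [middle_row_vs_limit(N, rho) for N in (20, 50, 150, 400)]
          for rho in (9.0, 99.0, 999.0)}
```

For this stand-in filter the limiting response decays geometrically with lag, and more slowly for larger $\rho$, which is consistent with the observation above that larger $\rho$ needs a longer utterance before the two responses agree.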

## IV. Experimental evaluation

While we argue in Sec. III that the solution of SGMM asymptotically matches GMM + Smoothing, it is important to note that the low-pass filter in the limit has a frequency response of the form $(1+\rho|H(\omega)|^2)^{-1}$ [i.e., an infinite impulse response (IIR) filter] in the case of SGMM, but in the case of GMM + Smoothing the low-pass filter can be either finite impulse response or IIR and its frequency response need not have a specific form. We conduct AtoA inversion experiments on an articulatory dataset comprising utterances of different lengths to examine the role that the particular form of the low-pass filter in SGMM may play in the inversion performance, specifically for different choices of *N*. The dataset and experimental details are described below.

### A. Dataset and pre-processing

For the AtoA experiment, we use the Multichannel Articulatory (MOCHA) database^{9} that contains speech and the corresponding EMA data from one male and one female talker of British English. The EMA data consists of dynamic positions of the EMA sensors in the mid-sagittal plane of the talker. A total of seven sensors are placed on the upper lip (UL), lower lip (LL), lower incisor (JAW), tongue tip (TT), tongue body (TB), tongue dorsum (TD), and VEL. Following the preprocessing steps outlined by Ghosh and Narayanan,^{5} we obtain parallel acoustic and articulatory data at a frame rate of 100 observations/s. We use 14 dimensional raw EMA features for representing the articulatory space (i.e., *X* and *Y* co-ordinates of 7 EMA sensors), namely ULx, LLx, JAWx, TTx, TBx, TDx, VELx, ULy, LLy, JAWy, TTy, TBy, TDy, and VELy. Acoustic features are represented by 39 dimensional Mel-frequency cepstral coefficients (MFCCs) and are computed using 20 msec analysis frame length with 10 msec shift.

### B. Experimental setup

AtoA inversion is performed separately on the male and female subjects of the MOCHA corpus using a fivefold cross-validation setup. Inversion performance is measured over all sentences of all folds through average root mean squared error (RMSE) and Pearson correlation coefficient (PCC)^{10} between the original and estimated articulatory trajectories.
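The two evaluation measures can be computed per trajectory as in the following sketch. How the per-trajectory values are pooled over sentences and folds is not spelled out above, so the helper names `rmse` and `pearson_cc` and any averaging built on top of them are assumptions for illustration only.

```python
import numpy as np


def rmse(y_true, y_est):
    """Root mean squared error between an original and an estimated trajectory."""
    y_true = np.asarray(y_true, dtype=float)
    y_est = np.asarray(y_est, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_est) ** 2)))


def pearson_cc(y_true, y_est):
    """Pearson correlation coefficient between two trajectories."""
    return float(np.corrcoef(y_true, y_est)[0, 1])


# Usage on a toy pair of trajectories (values are arbitrary).
orig = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
est = np.array([0.1, 0.9, 1.8, 1.2, 0.1])
scores = (rmse(orig, est), pearson_cc(orig, est))
```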

Following the finding by Toda *et al.*,^{4} 64 mixture component GMMs are used to model the acoustic-articulatory map in the training data separately for each fold. In the case of SGMM, we use a fifth order type II Chebyshev high-pass filter with 40 dB stop band attenuation as $h_n^j$ with cut-off frequency $f_c^j$ for the *j*th articulator. Different values of $f_c^j$ and $C^j$ were experimented with: $f_c^j\in\{3+0.5\,(k-1)\ \text{Hz},\ k=1,\cdots,45\}$ and $C^j\in\{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 0.9, 0.99, 0.999\}$. We report AtoA inversion performance corresponding to the $f_c^j$ and $C^j$ combination which gives the least average RMSE. In the case of GMM + Smoothing, a fifth order type II Chebyshev low-pass filter with 40 dB stop band attenuation is used for smoothing; its cut-off frequency is varied over the same range as that for $h_n^j$, and the best performance among these is reported for each articulator.
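The search grid and the filter family described above can be written down compactly. The sketch below uses `scipy.signal.cheby2`, whose critical frequency is the stop band edge; treating the paper's cut-off frequency as that edge is an assumption, as are the names `fc_grid`, `C_grid`, and `design_highpass`.

```python
import numpy as np
from scipy.signal import cheby2

FRAME_RATE_HZ = 100.0  # articulatory frame rate used in the experiments

# Candidate cut-off frequencies: 3 + 0.5 (k - 1) Hz for k = 1, ..., 45.
fc_grid = 3.0 + 0.5 * np.arange(45)
# Candidate convex weights C^j and the corresponding rho^j = (1 - C^j) / C^j.
C_grid = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 0.9, 0.99, 0.999]
rho_grid = [(1.0 - C) / C for C in C_grid]


def design_highpass(fc_hz, order=5, stop_atten_db=40.0, fs=FRAME_RATE_HZ):
    """Type II Chebyshev high-pass filter as (b, a) transfer-function coefficients.

    fc_hz is interpreted as the stop band edge (an assumption made here).
    """
    return cheby2(order, stop_atten_db, fc_hz / (fs / 2.0), btype="highpass")


# One (b, a) pair per candidate cut-off; each could then be paired with every C.
filters = {float(fc): design_highpass(fc) for fc in fc_grid}
```

The grid spans 3 Hz to 25 Hz in 0.5 Hz steps, and the values $C^j=0.001$, 0.01, and 0.1 correspond to $\rho=999$, 99, and 9, the values used in the illustration of Sec. III.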

### C. Results and discussions

Figure 2 shows the AtoA inversion performance in terms of RMSE and PCC for each articulator of both subjects in the MOCHA corpus. It is evident that the inversion performances using SGMM and GMM + Smoothing are not significantly different. Thus, inversion experiments support our theoretical finding that GMM + Smoothing is a near optimal solution of the SGMM criterion. An advantage of using SGMM over GMM + Smoothing is that the solution of SGMM can be computed in a recursive manner.^{5} Optimal choices of $C^j$ in the case of the male and female subjects turn out to be in the range of 0.005 to 0.1, suggesting that low-pass filters with high stop band attenuation are preferred in SGMM. If the length of a sentence is too short to satisfy the asymptotic limit, the length of the sentence could be increased by appending silence to it and then considering articulatory features only in the segment of interest. It should also be noted that the functional forms of the generalized smoothness criterion (GSC)^{5} and SGMM appear to be similar, except that in GSC the training data is used in a non-parametric fashion while SGMM uses parameters of a GMM learned from the training data.

Given a high-pass filter $|H(\omega)|^2$ in SGMM, one can always find a low-pass filter $|G(\omega)|^2=(1+\rho|H(\omega)|^2)^{-1}$ and perform GMM + Smoothing with $|G(\omega)|^2$ to achieve an inversion performance similar to SGMM with $|H(\omega)|^2$. However, the opposite is not true in general. This is because not every low-pass filter $A(\omega)$ can be put in the form $(1+\rho|H(\omega)|^2)^{-1}$. For example, the type II Chebyshev low-pass filter used in the AtoA experiments is not in this particular form. In spite of that, the AtoA inversion performances using GMM + Smoothing and SGMM turn out to be similar. This suggests that although it could be difficult to find a high-pass filter $H(\omega)$ in SGMM corresponding to an arbitrary low-pass filter $A(\omega)$ in GMM + Smoothing, SGMM with a high-pass filter different from $H(\omega)$ could lead to an inversion performance similar to that of GMM + Smoothing with $A(\omega)$. For a given $A(\omega)$, one could also find a low-pass filter of the form $(1+\rho|H(\omega)|^2)^{-1}$ that best approximates $A(\omega)$; SGMM with the corresponding $H(\omega)$ as the high-pass filter will then lead to an inversion performance similar to that of GMM + Smoothing with $A(\omega)$.

## V. Conclusions

We present a new unified criterion (SGMM) for estimation and smoothing for AtoA inversion; its solution is shown, both theoretically and experimentally, to be identical to the individually optimized GMM + Smoothing based solution in the limiting case. In practice, these results seem to hold for utterances just a few seconds long. Since in GMM + Smoothing based inversion the GMM mapping and smoothing are performed separately, this finding offers an additional insight as to what underlying criterion is being optimized in GMM + Smoothing.