It is well known that the performance of acoustic-to-articulatory inversion improves when the articulatory trajectories estimated using Gaussian mixture model (GMM) mapping are smoothed (denoted by GMM + Smoothing). GMM + Smoothing also performs similarly to GMM mapping using dynamic features, which integrates smoothing directly into the mapping criterion. Because smoothing and mapping are performed separately, it remains unclear what objective criterion GMM + Smoothing optimizes. In this work, a new integrated smoothness criterion, the smoothed-GMM (SGMM), is proposed. GMM + Smoothing is shown, both analytically and experimentally, to be identical to the asymptotic solution of SGMM, suggesting that GMM + Smoothing is a near-optimal solution of SGMM.

The vocal articulators such as the jaw, lips, tongue, and velum (VEL) move in a coordinated fashion when a person speaks. The articulators, however, move at a slower rate compared to the vocal tract resonance frequencies. It is sufficient to sample articulatory movements at 200 Hz to capture detailed dynamics of the critical speech articulators1 (compared to sampling the speech signal, which requires a rate of 8 kHz even for telephone quality). The slow rate of articulatory movement causes the articulatory trajectories to be smooth and low-pass in nature; this is borne out in measurements obtained using techniques such as Electromagnetic Articulography (EMA)2 and ultrasound.3 This fact is exploited in speech modeling, notably in acoustic-to-articulatory (AtoA) inversion, which attempts to recover articulatory details from the speech signal. In particular, inversion performance has been demonstrated to improve by smoothing the estimated articulatory features either in a post-processing step4 or by incorporating smoothness directly in the estimation criterion.4,5

There are several AtoA inversion techniques, and Toutios and Margaritis6 provide a comprehensive summary of them. Among these techniques, we focus on AtoA inversion based on Gaussian mixture model (GMM) mapping.4 In GMM based mapping, articulatory features are estimated separately in each analysis frame from acoustic features using the minimum mean squared error (MMSE) criterion. This causes the estimated articulatory feature trajectory to be rough and jagged in nature. To obtain a realistic articulatory trajectory from the GMM based estimate, the estimated trajectory is low-pass filtered, where the cutoff frequency of the low-pass filter is selected to achieve the best inversion performance.4 Toda et al.4 also proposed a GMM mapping using dynamic features under the maximum-likelihood criterion (thus integrating smoothing directly in the mapping process), which was found to yield performance similar to that of GMM mapping followed by low-pass filtering (denoted by GMM + Smoothing). It is important to note that in GMM + Smoothing, the smoothed articulatory features are no longer optimal in the MMSE sense. This suggests that a criterion different from MMSE could correspond to the estimates obtained using GMM + Smoothing, which in fact yields better inversion performance than the MMSE criterion and performance similar to that of dynamic feature based GMM mapping.

We propose a new smoothness criterion for inversion, called smoothed-GMM (SGMM), which combines smoothness with the information from GMM mapping within the same optimization framework rather than performing them separately. GMM mapping is shown to be a special case of SGMM. It is also analytically shown that GMM + Smoothing matches the solution of SGMM in the limit when the length of the test utterance becomes large. Experimental results on an articulatory database reveal that in practice this asymptotic limit is achieved even for an average utterance length of ∼2.75 s with a frame rate of 100 Hz. Thus, both theoretically and experimentally, GMM + Smoothing turns out to be a near optimal solution of SGMM.

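Suppose the acoustic and articulatory feature vectors at the $n$th frame are denoted by $\mathbf{x}_n$ and $\mathbf{y}_n$, respectively, with $\mathbf{x}_n=[x_n^1\; x_n^2 \cdots x_n^I]^T$ and $\mathbf{y}_n=[y_n^1\; y_n^2 \cdots y_n^J]^T$, where $x_n^i$ is the $i$th acoustic feature ($1\le i\le I$), $y_n^j$ is the $j$th articulatory feature ($1\le j\le J$), and $T$ denotes the transpose operator. In GMM based AtoA inversion, a GMM is used to model the joint probability $p(\mathbf{x}_n,\mathbf{y}_n|\Theta)$ given by $p(\mathbf{x}_n,\mathbf{y}_n|\Theta)=\sum_{i=1}^{M} w_i\,\mathcal{N}(\mathbf{x}_n,\mathbf{y}_n;\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)$. $\Theta$ denotes the GMM parameters $\{w_i,\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i\}_{i=1}^{M}$, where $M$ is the number of mixture components and $w_i$ denotes the mixture weight for the $i$th mixture. $\boldsymbol{\mu}_i=[\boldsymbol{\mu}_i^{(x)T}\; \boldsymbol{\mu}_i^{(y)T}]^T$ is the mean vector of the $i$th mixture, and $\boldsymbol{\mu}_i^{(x)}$ and $\boldsymbol{\mu}_i^{(y)}$ denote the mean vectors of the $i$th mixture for $\mathbf{x}_n$ and $\mathbf{y}_n$, respectively. Similarly, $\boldsymbol{\Sigma}_i$ denotes the full covariance matrix of the $i$th mixture, which is given by

$$\boldsymbol{\Sigma}_i=\begin{bmatrix}\boldsymbol{\Sigma}_i^{(xx)} & \boldsymbol{\Sigma}_i^{(xy)}\\ \boldsymbol{\Sigma}_i^{(yx)} & \boldsymbol{\Sigma}_i^{(yy)}\end{bmatrix},$$

where $\boldsymbol{\Sigma}_i^{(xx)}$ and $\boldsymbol{\Sigma}_i^{(yy)}$ denote the covariance matrices of the $i$th mixture for $\mathbf{x}_n$ and $\mathbf{y}_n$, respectively, and $\boldsymbol{\Sigma}_i^{(xy)}$, $\boldsymbol{\Sigma}_i^{(yx)}$ represent the cross-covariance matrices of the $i$th mixture.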

Given an acoustic feature vector sequence of length $N$ frames, $\mathbf{x}_n$, $1\le n\le N$, the goal of AtoA inversion is to estimate the corresponding articulatory feature vector sequence, $\hat{\mathbf{y}}_n$, $1\le n\le N$.

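In GMM-based inversion,4 $\hat{\mathbf{y}}_n$ is defined as the MMSE estimate given $\mathbf{x}_n$:7

$$\hat{\mathbf{y}}_n \triangleq E(\mathbf{y}_n|\mathbf{x}_n)=\sum_{i=1}^{M}p(m_i|\mathbf{x}_n,\Theta)\,E(\mathbf{y}_n|\mathbf{x}_n,m_i,\Theta), \tag{1}$$

where $p(m_i|\mathbf{x}_n,\Theta)=w_i\,\mathcal{N}(\mathbf{x}_n;\boldsymbol{\mu}_i^{(x)},\boldsymbol{\Sigma}_i^{(xx)})\big/\sum_{j=1}^{M}w_j\,\mathcal{N}(\mathbf{x}_n;\boldsymbol{\mu}_j^{(x)},\boldsymbol{\Sigma}_j^{(xx)})$ and $E(\mathbf{y}_n|\mathbf{x}_n,m_i,\Theta)=\boldsymbol{\mu}_i^{(y)}+\boldsymbol{\Sigma}_i^{(yx)}\big(\boldsymbol{\Sigma}_i^{(xx)}\big)^{-1}(\mathbf{x}_n-\boldsymbol{\mu}_i^{(x)})$. Note that $\sum_i p(m_i|\mathbf{x}_n,\Theta)=1$.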
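The frame-wise estimate of Eq. (1) can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, assuming the joint GMM parameters are already available as arrays; the function name `gmm_mmse_estimate` and its argument layout are hypothetical and not taken from the original work.

```python
import numpy as np
from scipy.stats import multivariate_normal


def gmm_mmse_estimate(X, weights, means, covs, dim_x):
    """Frame-wise MMSE estimate E[y_n | x_n] under a joint GMM (Eq. 1).

    X: (N, I) acoustic frames; weights: (M,); means: (M, I+J);
    covs: (M, I+J, I+J); dim_x: I, the acoustic dimension.
    Returns an (N, J) array of articulatory estimates.
    """
    N, M = X.shape[0], len(weights)
    J = means.shape[1] - dim_x
    log_post = np.empty((N, M))
    cond_means = np.empty((N, M, J))
    for i in range(M):
        mu_x, mu_y = means[i, :dim_x], means[i, dim_x:]
        S_xx = covs[i, :dim_x, :dim_x]
        S_yx = covs[i, dim_x:, :dim_x]
        # log of w_i * N(x_n; mu_i^(x), Sigma_i^(xx))
        log_post[:, i] = np.log(weights[i]) + multivariate_normal.logpdf(
            X, mean=mu_x, cov=S_xx)
        # E[y | x, m_i] = mu_y + S_yx S_xx^{-1} (x - mu_x)
        cond_means[:, i, :] = mu_y + np.linalg.solve(S_xx, (X - mu_x).T).T @ S_yx.T
    # Normalize in the log domain so the posteriors sum to one per frame.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    return np.einsum("nm,nmj->nj", post, cond_means)
```

Because each frame is handled independently, nothing in this estimate ties neighboring frames together, which is why the resulting trajectories tend to be jagged.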

Toda et al.4 reported that smoothing $\hat{\mathbf{y}}_n$ by low-pass filtering makes the estimated articulatory trajectory more realistic and improves the inversion performance. Here we propose a new smoothness criterion, called SGMM, which constrains the estimated trajectory to be smooth to a required degree while estimating the trajectory from the GMM mapping information. Thus, instead of estimating articulatory features in each frame independently (as done in GMM-based inversion), the SGMM criterion estimates the articulatory trajectory for an entire utterance. The $j$th articulatory feature trajectory is estimated by solving the following optimization problem:

$$\{\hat{y}_n^j;\,1\le n\le N\}=\arg\min_{\{z_n^j\}} J(\{z_n^j;\,1\le n\le N\})=\arg\min_{\{z_n^j\}}\; C_j\sum_{n}\sum_{i}p(m_i|\mathbf{x}_n,\Theta)\big(z_n^j-E(y_n^j|\mathbf{x}_n,m_i,\Theta)\big)^2+(1-C_j)\sum_{n}\Big(\sum_{k}z_k^j\,h_{n-k}^j\Big)^2. \tag{2}$$

$J$ is the objective function, a convex combination of two terms with the convex weight $C_j$ ($0\le C_j\le 1$). $z_n^j$ is the optimization variable. The second term in $J$ is the total energy of the output of a high-pass filter with impulse response $h_n^j$ (corresponding to the $j$th articulator) with $z_n^j$ as input. By minimizing the output of a high-pass filter, SGMM constrains the solution to be low-pass or smoothly varying in nature. $h_n^j$ could be designed based on the degree of smoothness required for the $j$th articulator trajectory.

The first term in $J$ is designed so that it utilizes the mapping between acoustic and articulatory spaces using the conditional means with their weights derived from the GMM. The choice of $C_j$ provides a trade-off between the GMM mapping and the smoothness factor. The optimization in Eq. (2) is solved for $j=1,\ldots,J$ separately to obtain the estimates of all $J$ articulatory feature trajectories.
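To make the two terms of Eq. (2) concrete, the following minimal sketch evaluates the objective for one candidate trajectory of a single articulator. It assumes a finite-length (FIR) high-pass impulse response `h`, whereas the text leaves the filter design open, and it treats the trajectory as zero outside the utterance; the function and argument names are illustrative only.

```python
import numpy as np


def sgmm_objective(z, post, cond_means, h, C):
    """Evaluate J in Eq. (2) for one articulator.

    z: (N,) candidate trajectory; post: (N, M) posteriors p(m_i | x_n);
    cond_means: (N, M) conditional means E[y_n^j | x_n, m_i];
    h: FIR high-pass impulse response (an assumption of this sketch);
    C: convex weight in [0, 1].
    """
    z = np.asarray(z, dtype=float)
    # First term: posterior-weighted squared distance to the conditional means.
    fit = np.sum(post * (z[:, None] - cond_means) ** 2)
    # Second term: energy of the high-pass filter output over all output samples.
    roughness = np.sum(np.convolve(z, h, mode="full") ** 2)
    return C * fit + (1.0 - C) * roughness
```

For example, a constant trajectory that matches every conditional mean has zero fit cost, and with the first-difference filter `h = [1, -1]` its only roughness comes from the two utterance boundaries.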

The objective function $J$ in Eq. (2) is a convex (and quadratic) function of the optimization variables $\{z_n^j;\,1\le n\le N\}$. Thus a global minimum is guaranteed. We define the autocorrelation sequence of the high-pass filter $h_n^j$ as $R_{l-k}^j\triangleq\sum_n h_{n-k}^j h_{n-l}^j$. For minimization, the partial derivatives of $J$ with respect to $z_n^j$ are set to zero at $z_n^j=\hat{y}_n^j$ to obtain a set of $N$ equations in the following matrix vector form:

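$$\begin{bmatrix}(1-C_j)R_0^j+C_j & (1-C_j)R_1^j & \cdots & (1-C_j)R_{N-1}^j\\ (1-C_j)R_{-1}^j & (1-C_j)R_0^j+C_j & \cdots & (1-C_j)R_{N-2}^j\\ \vdots & \vdots & \ddots & \vdots\\ (1-C_j)R_{-(N-1)}^j & (1-C_j)R_{-(N-2)}^j & \cdots & (1-C_j)R_0^j+C_j\end{bmatrix}\begin{bmatrix}\hat{y}_1^j\\ \hat{y}_2^j\\ \vdots\\ \hat{y}_N^j\end{bmatrix}=\begin{bmatrix}C_j\Delta_1^j\\ C_j\Delta_2^j\\ \vdots\\ C_j\Delta_N^j\end{bmatrix}, \tag{3}$$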

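where $\Delta_l^j=\sum_i p(m_i|\mathbf{x}_l,\Theta)\,E(y_l^j|\mathbf{x}_l,m_i,\Theta)$. We can further write the set of equations as $((1-C_j)\mathbf{R}^j+C_j\mathbf{I})\hat{\mathbf{y}}^j=C_j\mathbf{d}^j$, where $\mathbf{R}^j=\{R_{k-l}^j\}=\{R_{l-k}^j\}=\{R_{|k-l|}^j\}$ (since the autocorrelation sequence is symmetric), $\mathbf{I}$ is the $N\times N$ identity matrix, $\hat{\mathbf{y}}^j=[\hat{y}_1^j,\ldots,\hat{y}_N^j]^T$, and $\mathbf{d}^j=[\Delta_1^j,\ldots,\Delta_N^j]^T$. $\mathbf{R}^j$ is an autocorrelation matrix and hence symmetric, positive semi-definite, and Toeplitz. Adding $C_j\mathbf{I}$ with $C_j>0$ therefore makes $((1-C_j)\mathbf{R}^j+C_j\mathbf{I})$ positive definite, and hence invertible, for any choice of $C_j$ ($0<C_j<1$). The estimate of the $j$th articulatory feature trajectory can thus be obtained as follows:

$$\hat{\mathbf{y}}^j=C_j\big((1-C_j)\mathbf{R}^j+C_j\mathbf{I}\big)^{-1}\mathbf{d}^j=\Big(\mathbf{I}+\frac{1-C_j}{C_j}\mathbf{R}^j\Big)^{-1}\mathbf{d}^j. \tag{4}$$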
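For readers who want the intermediate step, the $l$th row of Eq. (3) follows from differentiating Eq. (2) with respect to $z_l^j$. The derivation below is a sketch that uses only the definitions given above, together with $\sum_i p(m_i|\mathbf{x}_l,\Theta)=1$ and the symmetry $R_{l-k}^j=R_{k-l}^j$:

```latex
\begin{aligned}
\frac{\partial J}{\partial z_l^j}
 &= 2C_j\sum_i p(m_i|\mathbf{x}_l,\Theta)\,
    \bigl(z_l^j - E(y_l^j|\mathbf{x}_l,m_i,\Theta)\bigr)
  + 2(1-C_j)\sum_n\Bigl(\sum_k z_k^j\,h_{n-k}^j\Bigr)h_{n-l}^j \\
 &= 2C_j\bigl(z_l^j-\Delta_l^j\bigr)
  + 2(1-C_j)\sum_k R_{l-k}^j\,z_k^j .
\end{aligned}
```

Setting this to zero at $z_n^j=\hat{y}_n^j$ gives $C_j\hat{y}_l^j+(1-C_j)\sum_k R_{l-k}^j\,\hat{y}_k^j=C_j\Delta_l^j$ for $l=1,\ldots,N$. Note that the individual conditional means enter only through their posterior-weighted average $\Delta_l^j$, which is exactly the frame-wise GMM estimate of Eq. (1).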

When $C_j=0$ in Eq. (4) (i.e., only the second term in $J$ is considered), the estimated trajectory $\hat{\mathbf{y}}^j$ is trivially zero. In other words, when no GMM mapping information is included in the objective function $J$, the maximally smooth solution is an all-zero trajectory. On the other hand, when $C_j=1$ (i.e., no smoothness constraint is imposed on the estimated articulatory trajectory), $\hat{\mathbf{y}}^j=\mathbf{d}^j$ or $\hat{y}_n^j=\Delta_n^j=\sum_i p(m_i|\mathbf{x}_n,\Theta)\,E(y_n^j|\mathbf{x}_n,m_i,\Theta)$, which is identical to the GMM mapping based estimate [Eq. (1)]. Thus, GMM based inversion is a special case of the optimization using the proposed SGMM criterion. For $0<C_j<1$, the estimated trajectory lies between the extremes of the all-zero trajectory and the jagged trajectory obtained using GMM based inversion.
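The closed-form solution of Eq. (4) can be sketched as a single linear solve. This is a minimal illustration under the assumption that the high-pass filter is given as a finite-length (FIR) impulse response, so its autocorrelation is zero beyond the filter length; an IIR filter would need a truncated or spectrally computed autocorrelation instead. The name `sgmm_smooth` is hypothetical.

```python
import numpy as np
from scipy.linalg import toeplitz


def sgmm_smooth(d, h, C):
    """Solve ((1 - C) R + C I) y = C d for one articulator (Eq. 4).

    d: (N,) frame-wise GMM estimates (the Delta values);
    h: FIR high-pass impulse response (an assumption of this sketch);
    C: convex weight with 0 < C <= 1.
    """
    d = np.asarray(d, dtype=float)
    h = np.asarray(h, dtype=float)
    N, L = len(d), len(h)
    # Autocorrelation of h; index L - 1 of the full correlation is lag 0.
    full = np.correlate(h, h, mode="full")
    r = np.zeros(N)
    lags = min(L, N)
    r[:lags] = full[L - 1:L - 1 + lags]
    R = toeplitz(r)  # symmetric Toeplitz autocorrelation matrix
    A = (1.0 - C) * R + C * np.eye(N)
    return np.linalg.solve(A, C * d)
```

With `C = 1` the function returns `d` unchanged, matching the special case discussed above, while smaller values of `C` pull the solution toward a smoother trajectory. A dense solve is used here for clarity rather than efficiency.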

$\mathbf{R}^j$ is, in general, an $N\times N$ positive semi-definite symmetric Toeplitz matrix with its entries coming from the autocorrelation sequence of $h_n^j$ (i.e., $R_n^j$) with the corresponding spectrum $|H^j(\omega)|^2$. $\mathbf{R}^j$ is also a convolution matrix with the corresponding impulse response $R_n^j$. Let $\rho_j=(1-C_j)/C_j$. Hence, $\mathbf{I}+\rho_j\mathbf{R}^j$ is an $N\times N$ positive definite symmetric Toeplitz matrix with the related spectrum $1+\rho_j|H^j(\omega)|^2$. Note that $1+\rho_j|H^j(\omega)|^2>0$, $\forall\omega$; the addition of “1” acts as a regularization ensuring the invertibility of the spectrum $1+\rho_j|H^j(\omega)|^2$ (similar to $\mathbf{I}$ for the invertibility of $\mathbf{I}+\rho_j\mathbf{R}^j$). Using a result on the inverse of a Toeplitz matrix [Eq. (5.5) in Ref. 8], it is easy to show that $(\mathbf{I}+\rho_j\mathbf{R}^j)^{-1}$ is asymptotically (as $N\to\infty$) Toeplitz with the corresponding spectrum $|G^j(\omega)|^2=1/(1+\rho_j|H^j(\omega)|^2)$. Since $|H^j(\omega)|^2$ is a high-pass spectrum and $\rho_j>0$, it is easy to see that $|G^j(\omega)|^2$ is a low-pass spectrum, where $\rho_j$ controls the stop band attenuation of the low-pass filter. Hence, in the limit $N\to\infty$, $(\mathbf{I}+\rho_j\mathbf{R}^j)^{-1}$ acts as a convolution matrix with a corresponding impulse response $Q_n^j$ of a low-pass filter with spectrum $|G^j(\omega)|^2$, where $Q_n^j$ is the inverse Fourier transform of $|G^j(\omega)|^2$. Thus, asymptotically $\hat{\mathbf{y}}^j$ [Eq. (4)] is a low-passed or smoothed version of $\mathbf{d}^j$, the GMM based estimate. This shows that the solution of SGMM asymptotically matches GMM + Smoothing.
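The role of $\rho_j$ can be seen numerically by computing the implied low-pass spectrum $1/(1+\rho|H(\omega)|^2)$ for a concrete high-pass filter. The sketch below assumes SciPy's `cheby2` design (where the critical frequency is the stop band edge) as a stand-in for the type II Chebyshev filter described in this work; the default parameter values and the function name are illustrative assumptions.

```python
import numpy as np
from scipy.signal import cheby2, sosfreqz


def implied_lowpass_spectrum(C, fc=10.0, fs=100.0, order=5, rs=40.0, n_freq=512):
    """Return frequencies (Hz) and |G|^2 = 1 / (1 + rho |H|^2), rho = (1 - C) / C."""
    rho = (1.0 - C) / C
    # High-pass type II Chebyshev filter as second-order sections.
    sos = cheby2(order, rs, fc, btype="highpass", fs=fs, output="sos")
    freqs, H = sosfreqz(sos, worN=n_freq, fs=fs)
    G2 = 1.0 / (1.0 + rho * np.abs(H) ** 2)
    return freqs, G2
```

Where the high-pass filter blocks the signal, $|G|^2$ stays close to 1; where it passes the signal with unit gain, $|G|^2$ drops to about $1/(1+\rho)$, so a smaller $C_j$ (larger $\rho_j$) gives a deeper stop band.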

For illustration, we consider a fifth order rational transfer function $H(\omega)$ of a type II Chebyshev high-pass filter with 40 dB attenuation at 10 Hz and a sampling frequency of 100 Hz, as shown in Fig. 1. For finite $N$, we pick the $\lceil N/2\rceil$th [$(N/2+1)$th for even $N$] row of $(\mathbf{I}+\rho\mathbf{R})^{-1}$ as the representative impulse response $P_n$ for $(\mathbf{I}+\rho\mathbf{R})^{-1}$. We compute the mean squared error (MSE) $E_{PQ}$ between $P_n$ and $Q_n$ over the same support, as shown in Fig. 1(d), for different values of $\rho$. It is clear that $E_{PQ}$ becomes zero for $N=150$ (corresponding to 1.5 s at a 100 Hz frame rate) for $\rho=999$. For $\rho=99$ and $9$, $E_{PQ}$ becomes zero even for lower values of $N$, indicating the asymptotic equivalence between $(\mathbf{I}+\rho\mathbf{R})^{-1}$ and $|G(\omega)|^2$.
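The comparison between $P_n$ and $Q_n$ can be reproduced in outline with the sketch below. It takes the high-pass power spectrum sampled on a uniform frequency grid, builds the finite Toeplitz matrix from its inverse DFT, and compares the middle row of the inverse with the inverse DFT of $1/(1+\rho|H(\omega)|^2)$. The function name and the grid-based approximation of the inverse Fourier transform are assumptions of this sketch; the usage example swaps in a simple first-difference filter rather than the Chebyshev filter of Fig. 1.

```python
import numpy as np
from scipy.linalg import toeplitz


def finite_vs_asymptotic_mse(H2, N, rho):
    """MSE between the middle row of (I + rho R)^{-1} and the asymptotic response Q.

    H2: |H(w)|^2 sampled at w = 2*pi*k/K for k = 0..K-1, with K >= 2*N.
    """
    r = np.fft.ifft(H2).real                       # autocorrelation R_l
    q = np.fft.ifft(1.0 / (1.0 + rho * H2)).real   # asymptotic response Q_l
    A = np.eye(N) + rho * toeplitz(r[:N])
    mid = N // 2                                   # 0-based index of the middle row
    e = np.zeros(N)
    e[mid] = 1.0
    p = np.linalg.solve(A, e)                      # A is symmetric: row mid of A^{-1}
    q_support = q[np.abs(np.arange(N) - mid)]      # Q_l on the same support (Q is even)
    return float(np.mean((p - q_support) ** 2))


# Usage with a first-difference high-pass filter, |H(w)|^2 = 2 - 2 cos(w).
K = 1024
w = 2.0 * np.pi * np.arange(K) / K
H2_diff = 2.0 - 2.0 * np.cos(w)
mse_long = finite_vs_asymptotic_mse(H2_diff, 151, 9.0)
mse_short = finite_vs_asymptotic_mse(H2_diff, 5, 9.0)
```

For this simple filter the mismatch is dominated by boundary effects, so it shrinks rapidly as $N$ grows, in line with the trend reported for Fig. 1(d).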

FIG. 1.

(Color online) Illustration of the asymptotic equivalence between $(\mathbf{I}+\rho\mathbf{R})^{-1}$ and a low-pass convolution matrix with spectrum $|G(\omega)|^2=(1+\rho|H(\omega)|^2)^{-1}$ (index “j” is omitted for simplicity): (a) a high-pass spectrum $H(\omega)$, (b) $1+\rho|H(\omega)|^2$, (c) $|G(\omega)|^2$, and (d) MSE between the impulse response $P_n$ representing $(\mathbf{I}+\rho\mathbf{R})^{-1}$ and $Q_n$, the inverse Fourier transform of $|G(\omega)|^2$.

While we argue in Sec. III that the solution of SGMM asymptotically matches GMM + Smoothing, it is important to note that the low-pass filter in the limit has a frequency response of the form $(1+\rho|H(\omega)|^2)^{-1}$ [i.e., an infinite impulse response (IIR) filter] in the case of SGMM, but in the case of GMM + Smoothing the low-pass filter can be either finite impulse response or IIR and its frequency response need not have a specific form. We conduct AtoA inversion experiments on an articulatory dataset comprising utterances of different lengths to examine the role that the particular form of the low-pass filter in SGMM may play in the inversion performance, specifically for different choices of $N$. The dataset and experimental details are described below.

For the AtoA experiment, we use the Multichannel Articulatory (MOCHA) database9 that contains speech and the corresponding EMA data from one male and one female talker of British English. The EMA data consist of dynamic positions of the EMA sensors in the mid-sagittal plane of the talker. A total of seven sensors are placed on the upper lip (UL), lower lip (LL), lower incisor (JAW), tongue tip (TT), tongue body (TB), tongue dorsum (TD), and VEL. Following the preprocessing steps outlined by Ghosh and Narayanan,5 we obtain parallel acoustic and articulatory data at a frame rate of 100 observations/s. We use 14 dimensional raw EMA features for representing the articulatory space (i.e., X and Y coordinates of 7 EMA sensors), namely ULx, LLx, JAWx, TTx, TBx, TDx, VELx, ULy, LLy, JAWy, TTy, TBy, TDy, and VELy. Acoustic features are represented by 39 dimensional Mel-frequency cepstral coefficients (MFCCs) and are computed using a 20 ms analysis frame length with a 10 ms shift.

AtoA inversion is performed separately on the male and female subjects of the MOCHA corpus using a fivefold cross-validation setup. Inversion performance is measured over all sentences of all folds through average root mean squared error (RMSE) and Pearson correlation coefficient (PCC)10 between the original and estimated articulatory trajectories.
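The two evaluation measures can be computed per trajectory as in the minimal sketch below. The helper names `rmse` and `pcc` are illustrative, and the averaging over sentences and folds described above is left to the caller.

```python
import numpy as np


def rmse(y_true, y_est):
    """Root mean squared error between two trajectories of equal length."""
    y_true = np.asarray(y_true, dtype=float)
    y_est = np.asarray(y_est, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_est) ** 2)))


def pcc(y_true, y_est):
    """Pearson correlation coefficient between two trajectories."""
    a = np.asarray(y_true, dtype=float) - np.mean(y_true)
    b = np.asarray(y_est, dtype=float) - np.mean(y_est)
    return float(np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b)))
```

RMSE is sensitive to offsets and scale, whereas PCC only measures how well the shapes of the two trajectories agree, so the two measures are complementary.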

Following the finding by Toda et al.,4 64 mixture component GMMs are used to model the acoustic-articulatory map in the training data separately for each fold. In the case of SGMM, we use a fifth order type II Chebyshev high-pass filter with 40 dB stop band attenuation as $h_n^j$ with cut-off frequency $f_c^j$ for the $j$th articulator. Different values of $f_c^j$ and $C_j$ were tried: $f_c^j\in\{3+0.5(k-1)\ \text{Hz},\ k=1,\ldots,45\}$ and $C_j\in\{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 0.9, 0.99, 0.999\}$. We report the AtoA inversion performance corresponding to the $f_c^j$ and $C_j$ combination that gives the least average RMSE. In the case of GMM + Smoothing, a fifth order type II Chebyshev low-pass filter with 40 dB stop band attenuation is used for smoothing; its cut-off frequency is varied over the same range as that for $h_n^j$, and the best performance among these is reported for each articulator.
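The GMM + Smoothing baseline and its cut-off search can be sketched as follows. Two assumptions are made that the text does not specify: the filter is applied forward and backward (zero phase) via `sosfiltfilt`, and SciPy's `cheby2` convention is used, in which the critical frequency is the stop band edge. The function names are hypothetical.

```python
import numpy as np
from scipy.signal import cheby2, sosfiltfilt


def lowpass_smooth(d, fc, fs=100.0, order=5, rs=40.0):
    """Smooth a trajectory with a type II Chebyshev low-pass filter (zero phase)."""
    sos = cheby2(order, rs, fc, btype="lowpass", fs=fs, output="sos")
    return sosfiltfilt(sos, np.asarray(d, dtype=float))


def best_cutoff(d, y_ref, cutoffs, fs=100.0):
    """Return the cut-off (Hz) giving the least RMSE against y_ref, and that RMSE."""
    errs = [np.sqrt(np.mean((lowpass_smooth(d, fc, fs) - y_ref) ** 2))
            for fc in cutoffs]
    k = int(np.argmin(errs))
    return float(cutoffs[k]), float(errs[k])


# Cut-off grid from the text: 3 + 0.5 (k - 1) Hz for k = 1, ..., 45.
cutoff_grid = 3.0 + 0.5 * np.arange(45)
```

In this sketch the search is run on a single trajectory for brevity; selecting the cut-off by average RMSE over many utterances, as described above, would wrap `best_cutoff` in an outer loop.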

Figure 2 shows the AtoA inversion performance in terms of RMSE and PCC for each articulator of both subjects in the MOCHA corpus. It is evident that the inversion performances using SGMM and GMM + Smoothing are not significantly different. Thus, inversion experiments support our theoretical finding that GMM + Smoothing is a near optimal solution of the SGMM criterion. An advantage of using SGMM over GMM + Smoothing is that the solution of SGMM can be computed in a recursive manner.5 Optimal choices of Cj in the case of the male and female subjects turn out to be in the range of 0.005 to 0.1 suggesting low-pass filters with high stop band attenuation to be preferred in SGMM. If the length of a sentence is too short to satisfy the asymptotic limit, the length of the sentence could be increased by appending it with silence and then considering articulatory features only in the segment of interest. It should also be noted that the functional forms of generalized smoothness criterion (GSC)5 and SGMM appear to be similar except that in GSC the training data is used in a non-parametric fashion while SGMM uses parameters of a GMM learned from the training data.

FIG. 2.

Comparison of SGMM and GMM + Smoothing - error bars indicate average inversion performance with ± one standard deviation.

Given a high-pass filter $|H(\omega)|^2$ in SGMM, one can always find a low-pass filter $|G(\omega)|^2=(1+\rho|H(\omega)|^2)^{-1}$ and perform GMM + Smoothing with $|G(\omega)|^2$ to achieve an inversion performance similar to SGMM with $|H(\omega)|^2$. However, the opposite is not true in general, because not every low-pass filter $A(\omega)$ can be put in the form $(1+\rho|H(\omega)|^2)^{-1}$. For example, the type II Chebyshev low-pass filter used in the AtoA experiments is not in this particular form. In spite of that, the AtoA inversion performances using GMM + Smoothing and SGMM turn out to be similar. This suggests that although it could be difficult to find a high-pass filter $H(\omega)$ in SGMM corresponding to an arbitrary low-pass filter $A(\omega)$ in GMM + Smoothing, SGMM with a high-pass filter different from $H(\omega)$ could lead to an inversion performance similar to that of GMM + Smoothing with $A(\omega)$. For a given $A(\omega)$, one could also find a low-pass filter of the form $(1+\rho|H(\omega)|^2)^{-1}$ that best approximates $A(\omega)$; SGMM with the corresponding $H(\omega)$ as the high-pass filter would then lead to an inversion performance similar to that of GMM + Smoothing with $A(\omega)$.

We present a new unified criterion (SGMM) for estimation and smoothing for AtoA inversion; its solution is shown, both theoretically and experimentally, to be identical to the individually optimized GMM + Smoothing based solution in the limiting case. In practice, these results seem to hold for utterances just a few seconds long. Since in GMM + Smoothing based inversion the GMM mapping and smoothing are performed separately, this finding offers an additional insight as to what underlying criterion is being optimized in GMM + Smoothing.

1. S. Ouni and Y. Laprie, “Studying pharyngealization using an articulograph,” International Workshop on Pharyngeals and Pharyngealisation (2009).
2. S. J. Perkell, M. Cohen, M. Svirsky, M. Matthies, I. Garabieta, and M. Jackson, “Electromagnetic mid-sagittal articulometer systems for transducing speech articulatory movements,” J. Acoust. Soc. Am. 92, 3078–3096 (1992).
3. T. Shawker, M. Stone, and B. Sonies, “Tongue pellet tracking by ultrasound: Development of a reverberation pellet,” J. Phonetics 13, 134–146 (1985).
4. T. Toda, A. Black, and K. Tokuda, “Acoustic-to-articulatory inversion mapping with Gaussian mixture model,” in Proceedings of the ICSLP, Jeju Island, Korea (2004), pp. 1129–1132.
5. P. K. Ghosh and S. S. Narayanan, “A generalized smoothness criterion for acoustic-to-articulatory inversion,” J. Acoust. Soc. Am. 128(4), 2162–2172 (2010).
6. A. Toutios and K. Margaritis, “Acoustic-to-articulatory inversion of speech: A review,” in Proceedings of the International 12th TAINN (2003).
7. F. Faubel, J. McDonough, and D. Klakow, “Bounded conditional mean imputation with Gaussian mixture models: A reconstruction approach to partly occluded features,” IEEE Trans. Acoust., Speech, Signal Process. 1, 3869–3872 (2009).
8. R. M. Gray, “Toeplitz and circulant matrices: A review,” Found. Trends Commun. Inf. Theory 2(3), 155–329 (2005) (available at http://ee.stanford.edu/~gray/toeplitz.pdf).
9. A. A. Wrench and H. J. William, “A multichannel articulatory database and its application for automatic speech recognition,” in 5th Seminar on Speech Production: Models and Data, Bavaria (2000), pp. 305–308.
10. D. R. Cox and D. V. Hinkley, Theoretical Statistics (Chapman and Hall, London, 1974), Appendix 3.