A model synthesizing averaged frequency components from selected sentences in an electromagnetic articulography database was constructed. It revealed a dual role of the tongue in the articulatory domain: the dorsum acts as a carrier wave and the tip as a modulation signal. The model illuminates the subtleties of anticipatory coarticulation during speech planning and undergoes a rigorous two-stage optimization, statistical estimation followed by refinement, to depict carryover and anticipatory effects. Its lower layer, rooted in physiological insights, captures carryover effects, while its upper layer captures anticipation. Optimization pinpointed a unique phonetic target for each phoneme, providing insight into virtual target formation during speech planning. The simulations align closely with empirical data, with an average error of 0.18 cm, and extensive listening tests attest to the model's accuracy and improved speech synthesis quality.

The intricate ballet of speech unfolds on the stage of the vocal tract, with the tongue playing the prima ballerina. Its nimble movements weave the tapestry of sound, but rarely in isolation. Each sound dances with its neighbors, their motions subtly affecting each other in a complex choreography known as coarticulation. This intricate interplay between phonemes, where the articulation of one segment bleeds into the next, imbues continuous speech with remarkable variability in its acoustic and articulatory landscape (Perkell and Klatt, 1986). Like actors adapting their performance to the scene, phonetic segments readily shed their isolated character, adopting nuanced modifications in response to their immediate linguistic context (Kühnert and Nolan, 1999). These modifications, aptly named coarticulation (Flemming, 2011; Kleber et al., 2012; Kühnert and Nolan, 1999; Magen, 1997), arise from the intricate overlap of articulatory gestures during speech production. Yet, despite its pervasive presence, coarticulation remains a captivating riddle, demanding further exploration to unveil its secrets.

Coarticulation has long been an area of interest in speech production research and phonetics (Beddor et al., 2013; Cychosz et al., 2021; DiCanio, 2012; Recasens and Pallarès, 2001; Zharkova and Hewlett, 2009). Coarticulation models have been developed to bridge the gap between invariant, discrete representations of articulation and acoustics and to accurately predict the intricacies of this process. Numerous studies (Hardcastle and Tjaden, 2008; Recasens et al., 1997; Öhman, 1967) have employed diverse approaches to elucidate the origins, nature, and functions of coarticulation.

From a physiological perspective, variations in phonemes manifest in their acoustic realization during speech production, even if phonetic targets remain invariant at the brain level (Beddor, 2009; Punišić et al., 2021; Redford, 2019). This is because articulator movements change throughout an utterance. Some aspects of coarticulation are planned in advance (anticipatory coarticulation; Katz, 2000; Repp, 1986), while others arise from the physical properties of the articulators, such as inertia and the finite acceleration/deceleration of vocal tract tissues (carryover coarticulation; Farnetani and Recasens, 2010; Mok, 2010; Salverda et al., 2014).

Henke (1966) proposed a look-ahead model to account for anticipatory coarticulation, while Öhman (1966) developed the perturbation model based on spectrographic and x-ray studies of English and Swedish subjects. Henke suggested that vocalic utterances can be viewed as continuous movements with consonant gestures superimposed. The look-ahead model emphasizes the temporal order in anticipation, whereas the perturbation model focuses on the principal-subordinate relationship between vowels and consonants. The carrier model, proposed by Dang and Honda (2004) and Dang et al. (2004), combines the strengths of these two established frameworks to capture coarticulatory effects during the planning stage.

Our study develops a coarticulation model, capturing the planning stage of speech production. We leverage insights from articulatory movements and simulations using a preexisting physiological model, comprehensively depicting the process from targets and muscle activation to synthetic sounds.

To systematically understand coarticulation, we rigorously investigate the principal-subordinate structure of articulation as a fundamental assumption. By constructing a text-independent, generalized articulatory movement, we verify this structure and formulate anticipatory coarticulation. Drawing on statistical insights from the electromagnetic midsagittal articulography (EMMA) data (Kaburagi and Honda, 2002), we integrate our method with a physiological model.

Our analysis reveals anticipatory and carryover coarticulation. Separating these remains an open issue. We propose a two-layer learning procedure with an upper layer for anticipatory and a lower layer for carryover coarticulation. This reduces complexity and allows tailored optimization.

We rigorously assess optimization through objective and subjective evaluations. Objective evaluation uses spatial metrics to quantify correspondence between observed and simulated movements. Subjective evaluation involves listening tests to capture perceptual aspects.

This paper is organized as follows. Section II briefly describes speech production and coarticulation, verifies the principal-subordinate structure, and elucidates the formulation of this model. Section III proposes a simulation-based learning framework and process. Section IV depicts the details and results of the learning experiments and resulting evaluations. Finally, Sec. V provides the conclusions.

The essential goal of modeling articulators and their movements is to explain, understand, and mimic the coherent rationale and mechanisms underlying human speech production. This section, first, briefly introduces the principles of speech production and coarticulation.

A brief flow chart of the procedure used in human speech production is displayed in Fig. 1. We suppose that there is a unique spatial target, referred to as a phonetic target, corresponding to each phonetic unit of speech. In the flow chart, the articulatory targets of a phonetic unit are generated from the single phonetic target of this phoneme in the planning stage by the anticipation mechanism, according to contextual variations. These articulatory targets then drive the articulators to produce articulatory movements.

FIG. 1. (Color online) The simulation framework proposed for estimating phonetic targets in coarticulation.

During natural speech, two forms of coarticulation overlap can be observed: left-to-right (LR, carryover) and right-to-left (RL, anticipatory). The carryover effect manifests through the physiological and kinematical movements of the articulators. In contrast, anticipatory coarticulation generally indicates sophisticated phonological-phonetic processing, wherein the speaker anticipates incoming sounds. To describe anticipatory coarticulation, Henke introduced a phonemic-segment model. Each utterance is described as a matrix of articulatory targets with distinctive features, where certain features change abruptly as the target shifts.

In this study, articulatory targets stem from the phonetic targets of each phoneme through a carrier-modulation framework. The carrier model aims to provide a computational framework to account for coarticulation in planning. Fundamentally, an utterance comprises consonant and vowel streams. Vowels exert global, sustained effects while consonants exhibit local, transient effects. Treating vowels as a carrier wave and consonants as a modulation signal conceptualizes coarticulation as modulation.

The carrier-modulation hypothesis of the carrier model is similar to the principal-subordinate structure used in Öhman (1967), which was proposed based on spectrogram analyses in the acoustic domain. This section attempts to verify whether this carrier-modulation structure also exists in the articulatory domain. However, it is difficult to verify the structure using a specific phoneme sequence because articulatory movements are context dependent and cannot be comprehensively covered by a short sentence. Therefore, we analyze the movement components of the speech organs in the frequency domain and reconstruct a generalized articulatory movement by averaging their frequency components. This generalized movement should reflect the inherent properties of the speech organs in a general contextual environment.

The articulatory data used in this study were collected using the EMMA system. Four receiver coils (named T1–T4) were placed on the tongue surface in the midsagittal plane, with additional coils on the upper lip, lower lip, maxilla incisor, mandible incisor, and velum. The sampling rate was 250 Hz for the articulatory channels and 16 kHz for the acoustic channel. The coordinate origin was located at the maxilla incisor, 0.5 cm above its tip. The speech materials comprised 360 Japanese sentences read by 3 adult male speakers at a normal speech rate. The acoustic signal and articulatory data were recorded simultaneously, although only the four tongue points (T1–T4), out of the eight points observed in the EMMA database, were used here.

In this analysis, 352 sentences from the EMMA database were selected to generate text-independent articulatory movements. From each sentence, 2-s segments of speech were extracted. A short-term discrete Fourier transform (DFT) with 256 samples (about 1 s) was applied to the extracted segments, windowed by a Hanning window with a frame shift of about 64 samples. Complex spectra for T1–T3 were obtained by averaging all DFT frames. Figure 2 shows the average amplitude spectra of T1 and T3 in the vertical (Y) dimension, limited to frequencies below 40 Hz. The results indicate distinguishable generalized movements of the tongue tip versus the tongue dorsum in this frequency region. To construct a generalized articulatory movement, the average complex spectra are represented in the form of a Fourier series.
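For concreteness, the following minimal NumPy sketch reproduces this averaging procedure on a single articulatory channel; the function and variable names are illustrative and do not come from the original analysis code.

```python
import numpy as np

def average_spectrum(signal, fs=250, n_fft=256, hop=64, max_freq=40.0):
    """Average complex spectrum of one articulatory channel (e.g., T1y or T3y).

    signal : 1-D array of sensor positions sampled at fs Hz (a 2-s segment here).
    Returns the frequencies (up to max_freq) and the averaged complex spectrum.
    """
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.fft.rfft(frame))          # one-sided complex spectrum
    avg = np.mean(frames, axis=0)                  # average over all DFT frames
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    keep = freqs <= max_freq                       # region of interest (below 40 Hz)
    return freqs[keep], avg[keep]

# Toy usage with synthetic data standing in for one EMMA channel:
t = np.arange(0, 2.0, 1.0 / 250.0)
dummy_t3y = 0.5 * np.sin(2 * np.pi * 4 * t)        # slow "vocalic" movement
freqs, spec = average_spectrum(dummy_t3y)
print(freqs.shape, np.abs(spec).shape)
```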

FIG. 2. (Color online) Average spectra of articulatory movements at tongue tip and dorsum.

In general, vowel production involves global, sustained tongue movement, whereas consonantal movement is more local and transient relative to vowels. Because apical consonant constrictions are shaped by the tongue tip, T1 roughly represents consonants (C), while T3 represents vowels (V). In Japanese, CV is the basic syllable unit; therefore, we can reasonably suppose that the constructed articulatory movement corresponds to a CVCVCVCV phoneme sequence for this 1-s generalization. Our analysis shows that the tongue dorsum (T3) mainly reflects the vocalic V_V_V_V stream, excluding consonants, while the tongue tip (T1) corresponds to the full CVCVCVCV sequence. If this hypothesis holds, the T1 movement should have about twice as many maximum (or minimum) peaks as T3 in the same period. To test this, we calculated the velocities at T1 and T3, depicted in Fig. 3. At phoneme centers, the articulators are in a steady state with zero velocity. Figure 3 shows 14 velocity zeros for T1 and 8 for T3; T1 has about twice as many, as expected.
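As an illustration, the minimal sketch below reconstructs a generalized 1-s movement from an averaged spectrum (interpreted as Fourier-series coefficients) and counts velocity zero crossings; the function names and the synthetic test spectrum are illustrative only and are not part of the original analysis code.

```python
import numpy as np

def count_velocity_zeros(avg_spectrum, fs=250, n_fft=256):
    """Reconstruct a generalized movement from an averaged one-sided complex
    spectrum and count the zero crossings of its velocity."""
    movement = np.fft.irfft(avg_spectrum, n=n_fft)   # generalized waveform (about 1 s)
    velocity = np.gradient(movement, 1.0 / fs)       # numerical time derivative
    signs = np.sign(velocity)
    # a zero crossing occurs wherever the velocity changes sign
    return int(np.sum(signs[:-1] * signs[1:] < 0))

# Toy check with a synthetic spectrum whose dominant component is near 4 Hz:
n_fft = 256
toy_spec = np.zeros(n_fft // 2 + 1, dtype=complex)
toy_spec[4] = 1.0                                    # bin 4 is about 3.9 Hz at fs = 250
print(count_velocity_zeros(toy_spec))                # roughly 2 * 4 zeros over 1 s
```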

FIG. 3. (Color online) Velocity of reconstructed waveforms at T1 and T3.

Based on the above analysis, an utterance can generally be considered as comprising vowel and consonant streams in a principal-subordinate structure. A look-ahead mechanism is applied to realize the interaction between adjacent phonemes within and across these components.

To realize the modulation processing, the first step is to construct the carrier wave, in which articulatory movement is considered a continuous transition between vowels. To construct this wave, if the first and/or last phonemes are not vowels, a target vector for a neutral vowel is added before the first phoneme and/or after the last phoneme. Recasens investigated lingual region sensitivity to different places, manners, and mechanical restrictions, quantifying it with a degree of articulatory constraint (DAC; Recasens, 2002; Recasens and Pallarès, 1999; Recasens et al., 1997). We borrow this concept to describe a phoneme's effect on its neighbors, such that the impact of a vowel V on adjacent vowels depends on its DAC, denoted dv. Because the resultant consonant target Ci involves a tug-of-war between the adjacent vowel targets at position i, a virtual target Gi is obtained using
(1)
where i and j are indices for the position of consonants and vowels, respectively, and α and β are weighting coefficients related to the tug-of-war in the look-ahead process.
The second process is constructing the resultant consonantal target based on the phonetic target and virtual target using formula (2). Note that in this step, only the crucial feature is reconstructed, such as the tongue tip target for /t/. Indecisive features remain unchanged as they depend on coarticulation from adjacent vowels,
(2)
where rci is the coefficient of articulatory resistance for the crucial feature of Ci. This coefficient reflects the consonant's capability to resist effects from adjacent vowels, whereas DAC describes a phoneme's capability to affect its neighbors. The larger rci is, the weaker the effects accepted from neighboring vowels. In other words, the magnitude of the virtual target's effect on the consonantal target is inversely proportional to rci.
The effects of consonants on vowels are incorporated through an anticipation mechanism as follows:
(3)
where i and j are indices for the position of consonants and vowels, respectively, as in Eq. (1), and dci is the DAC of consonant Ci. Finally, the planned target sequence is obtained by combining the principal and subordinate components {{Vj}, {Ci}}.

To complete model formulation, the coefficients in formulas (1)–(3) must be quantified. Initially, these values are determined through statistical analysis of articulatory data. Then, they are refined through a subsequent learning process.

1. Estimation of the DAC

To estimate the DAC for vowels, we extracted VpCViCVj phoneme sequences from the articulatory data, where Vp and Vj denote the preceding and following vowels, respectively, Vi is the central vowel, and C represents any possible consonant. The distribution of the average location of the central vowel was investigated across various contexts. The deviation of the central vowel from its average position measures the look-ahead action of the following vowel. To suppress carryover effects, the Japanese neutral vowel /e/ was chosen as the preceding vowel in all combinations. The DAC of each vowel, dvj, was estimated using
(4)
where dvj(θ,ρ) is defined for the tongue tip and tongue dorsum in the x- and y-dimensions; θ represents either the tongue tip or the tongue dorsum, and ρ stands for the x- or y-dimension. v̄i(θ,ρ) is the average location of vowel Vi with preceding vowel /e/, and vi is the location of Vi in contexts with preceding vowel /e/ and following vowel Vj.
Similarly, CpVjCi sequences are used to estimate consonant DAC values via Eq. (5), where Cp is chosen as /k/ for all sequences,
(5)
where the symbols follow the same definitions as in Eq. (4). The larger d is, the stronger the consonant's DAC, as shown in Eq. (5). Tables I and II present the estimated DAC values for the five Japanese vowels and eight consonants.
TABLE I. DAC values of five Japanese vowels.

Position    /a/     /i/     /u/     /e/     /o/
T1x         2.92    2.13    2.00    5.20    1.84
T1y         1.52    1.00    2.30    5.06    2.25
T3x         3.01    2.64    1.66    4.80    1.65
T3y         2.60    3.23    3.08    1.50    1.19
TABLE II. DAC values of eight consonants.

Position    /d/     /g/     /k/     /n/     /r/     /s/     /t/     /w/
T1x         6.49    4.79    3.64    5.62    2.84    7.09    13.75   9.62
T1y         3.23    4.68    7.16    4.75    9.63    1.72    5.04    5.91
T3x         5.00    3.35    4.07    3.32    4.16    4.73    7.75    5.09
T3y         2.79    9.72    8.49    2.11    1.09    1.37    2.21    1.34

2. Weighting coefficient for the planning mechanism

In Eq. (1), α and β are coefficients related to the planning mechanism. To quantify these values, we calculated the displacement of the central consonant caused by the preceding and following vowels in VpCVf sequences using formulas
(6)
(7)
where N is the number of consonants, and mi and ki are the numbers of vowel combinations preceding and following consonant i, respectively. c̄i(θ,ρ) is the average location of consonant Ci, ci,p(θ,ρ,vp) is the location of Ci with the preceding vowel Vp, and ci,f(θ,ρ,vf) is the location of Ci with the following vowel Vf. The weighting coefficient for the look-ahead mechanism is calculated as
(8a)
(8b)
Applying Eqs. (8a) and (8b) to the x- and y-dimensions yields α = 0.33 and β = 0.67 for the x-dimension, and α = 0.46 and β = 0.54 for the y-dimension. In both dimensions, the following vowel has a stronger effect than the preceding vowel.
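Because the displacement formulas (6) and (7) are not reproduced here, the short sketch below illustrates only one plausible reading of Eqs. (8a) and (8b): the weights are the average consonant displacements caused by the preceding and following vowels, normalized to sum to one. This form is an assumption made for illustration, not the authors' exact definition.

```python
import numpy as np

def lookahead_weights(disp_preceding, disp_following):
    """Plausible reading of Eqs. (8a)/(8b): normalize the average displacements
    caused by preceding (D_p) and following (D_f) vowels so alpha + beta = 1.
    Assumed form, since the printed equations are not reproduced here."""
    d_p = np.mean(disp_preceding)     # average displacement from preceding vowels
    d_f = np.mean(disp_following)     # average displacement from following vowels
    alpha = d_p / (d_p + d_f)         # weight of the preceding vowel
    beta = d_f / (d_p + d_f)          # weight of the following vowel
    return alpha, beta

# Values that reproduce the reported x-dimension result (alpha ≈ 0.33, beta ≈ 0.67):
print(lookahead_weights([0.33], [0.67]))
```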

3. Coarticulation resistance coefficient

The coarticulation resistance coefficient reflects a consonant's resistance to vowel effects. VjCiVj sequences with identical bilateral vowels were used for this estimation. Based on the dataset, the distance between the average vowel and consonant positions was calculated, along with the consonant displacement induced by the bilateral vowels. A consonant's resistance reflects the ratio of its distance to a vowel versus the displacement caused by that vowel. Equation (9) defines the coefficient
(9)
where M is the number of vowels paired with consonant Ci. Table III shows the estimated values for eight consonants, using the crucial control points (tongue tip for apical consonants and tongue dorsum for dorsal consonants).
TABLE III. Resistance coefficients for eight consonants.

Dimension      /d/     /n/     /s/     /t/     /r/     /g/     /k/     /w/
x-dimension    9.5     2.1     21.4    11.3    3.1     0.2     0.3     3.9
y-dimension    15.5    10.9    11      13.5    5.3     2.6     3.8

As mentioned earlier, phonetic targets in articulatory space are assumed to be constant for each phoneme during planning. This model transforms phonetic targets into articulatory targets using contextual information. However, relying solely on observation-based parameters may not yield optimal results, especially because phonetic targets cannot be directly observed. This raises the question of how to obtain optimal parameters and estimate phonetic targets. To tackle this challenge, we employ a model-based learning approach to estimate coefficients and phonetic targets.

Currently, EMMA data provide articulatory movements, but neither phonetic targets nor articulatory targets are observable (Katz et al., 2017; Kim et al., 2014; Kroos, 2012). However, if a physiological articulatory model reproduced human functions at the physiological and kinematic levels, we could obtain reliable phonetic and/or articulatory targets by tuning the model inputs to match observations. Based on this, we propose a model-based learning process to estimate phonetic targets and optimize the model's parameters, combining a physiological articulatory model with the carrier coarticulation model. In contrast, traditional learning obtains parameters by minimizing an objective function to best explain a dataset in a maximum likelihood or minimum error sense. However, most learned parameters are highly data dependent and rarely reflect the true physical mechanisms involved. To obtain inherent knowledge from observations, physical models must be combined with learning rather than simply fitting a black box model. Compared to traditional learning, model-based learning provides physically meaningful parameters.

The physiological articulatory model adopted is a partial three-dimensional (3D) model constructed from volumetric magnetic resonance imaging (MRI) data of a male Japanese speaker. It comprises the tongue, jaw, hyoid bone, and vocal tract walls. Its muscular structure was designed based on MRI measurements and anatomical knowledge. At the physiological level, this model is highly consistent with human speech production mechanisms. Two force types drive the model: target-dependent forces derived from the EP-Map, which estimates targets, and dynamic forces that minimize the distance to target positions. This plausible control method aligns with human speech production, and the model naturally provides the functionality to emulate anticipatory coarticulation in speech planning. Based on this analogy, we propose an optimization framework using the physiological model, in which the learning process mirrors the human production procedure, giving the learned parameters clear physical meaning.

This study focuses on modeling anticipatory coarticulation. However, observed articulatory data include both carryover and anticipatory effects, which are inseparable in the raw data. To isolate anticipation, we split the framework into lower and upper layers. The upper layer contains the anticipatory effects, while the lower layer, built on the physiological articulatory model, inherently models carryover effects. The articulatory targets connect the two layers, separating anticipation from carryover in the simulation.

Analogous to human speech production, differences between observations and model simulations mainly stem from differences between the articulatory targets in the brain and those given to the model, because the physiological model emulates human production. Thus, articulatory targets can be learned by minimizing the observation-simulation differences in the lower layer. Furthermore, in the upper layer, the phonetic targets and the model's coefficients are learned based on the articulatory targets obtained from the lower layer.

The lower layer focuses on acquiring phonemic articulatory targets by minimizing the distance between observed and simulated articulatory positions. This learning process approximates model simulations to production data. The articulatory target acquisition is formulated as
(10)
(11)
(12)
where Pv and Pc denote the planned targets for the preceding vowel and the central consonant in VCV sequences, i.e., six-dimensional (6D) vectors comprising the x- and y-coordinates of the jaw, tongue tip, and tongue dorsum. Note that the target of the following vowel is not involved in learning. Ov and Oc are the observed EMMA movements, whereas Sv and Sc are the simulated vowel and consonant positions, respectively. Because consonants are more sensitive to place of articulation than vowels, a weighting coefficient λ is introduced to emphasize consonant locations; it is calculated as
(13)
where N is the number of consonants, which is eight in this study; M is the number of vowels, which is five; std denotes the averaged standard deviation over x- and y-dimensions of the distribution of steady-state positions for each phoneme, and λ was set to 0.3.

The physiological articulatory model used to calculate the objective function involves nonanalytical, nonlinear processes from motor commands to articulatory movements. This makes it difficult to relate articulatory targets and outputs analytically. Given this complexity, traditional optimization methods using gradients or higher-order derivatives cannot be readily applied. To address this, we adopt the mesh adaptive direct search (MADS) algorithm (Audet and Dennis, 2006; Audet et al., 2022), designed for derivative-free optimization.

The MADS algorithm attempts to locate the minimizer of the function Ψ over the domain Ω by using the barrier function in Eq. (14), where Ω is the rational region around each phoneme's place of articulation.
(14)
where Tpv and Tpc are the same as those defined in Eq. (12), and the dimensionality of the search space is 12, corresponding to the combined dimensions of Tpv and Tpc. MADS is applied to the above barrier function such that
(15)
where the minimizers of ΨΩ are the optimal articulatory targets. MADS is an iterative algorithm that searches a set of points, called a mesh, at each iteration. At iteration k, the current mesh Mk is defined as the union,
(16)
where Mk is the mesh at iteration k, Sk is the set of points evaluated by iteration k, z is an integer direction vector with two components corresponding to Tpv and Tpc, and Δk is the mesh size parameter. MADS aims to find a trial mesh point in each iteration with a lower objective value than the current incumbent ΨΩ(Tpv,k, Tpc,k). If the search succeeds in identifying an improved point, the new point replaces the incumbent and the mesh size is expanded (in this case, doubled); otherwise, the incumbent is kept and the mesh size is refined (in this case, halved). The details are described in Audet et al. (2022).
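As a schematic illustration of this poll-and-refine behavior, the following simplified pattern search shrinks the mesh on a failed poll and expands it on a successful one. It is a stand-in for MADS, not the algorithm of Audet and Dennis (2006) itself, and all names and parameter values are illustrative.

```python
import numpy as np

def pattern_search(psi, x0, step=0.5, tol=1e-4, max_iter=500):
    """Schematic derivative-free pattern search (a simplified stand-in for MADS).

    psi : objective function, e.g., the barrier function over articulatory targets.
    x0  : initial point (here a 12-D vector stacking T_pv and T_pc).
    A successful poll moves the incumbent and doubles the mesh size;
    a failed poll keeps the incumbent and halves the mesh size.
    """
    x = np.asarray(x0, dtype=float)
    fx = psi(x)
    n = x.size
    directions = np.vstack([np.eye(n), -np.eye(n)])    # poll along +/- each axis
    for _ in range(max_iter):
        improved = False
        for d in directions:
            trial = x + step * d
            f_trial = psi(trial)
            if f_trial < fx:                           # improved mesh point found
                x, fx, improved = trial, f_trial, True
                break
        step = step * 2.0 if improved else step * 0.5  # expand or refine the mesh
        if step < tol:
            break
    return x, fx

# Toy usage: minimize a quadratic "distance to target" objective in 12-D.
goal = np.linspace(0.0, 1.1, 12)
x_opt, f_opt = pattern_search(lambda x: float(np.sum((x - goal) ** 2)),
                              x0=np.zeros(12))
print(round(f_opt, 6))
```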

The task of the upper layer learning is to optimize the model's coefficients and estimate the phonetic targets for each phoneme.

1. Model reformulation

The correlated coefficients in the upper layer carry clear physical meanings. To decorrelate them, we reformulated the model as follows. The vowel-vowel interaction kj,j+1 represents the vowel DACs in a 5×5 matrix, with rows corresponding to the five preceding vowels and columns to the five following vowels; four such matrices are constructed for the tongue tip and dorsum in the x- and y-dimensions. gi describes the reciprocal of the articulatory resistance rci, forming a 2×8 matrix over the eight consonants' crucial points in the x- and y-dimensions. mi,j depicts the vowel-consonant interaction between the paired dvj and dci, comprising four 5×8 matrices for the tongue tip and dorsum in the x- and y-dimensions, which are formulated as
(17a)
(17b)
for the consonant,
(18a)
(18b)
for the vowel,
(19a)
(19b)
The variables in Eqs. (17)–(19) follow the same definitions as introduced in Eq. (4).

2. Objective function

The upper layer objective function is the difference between the articulatory targets learned in the lower layer and those calculated from the phonetic targets. This is formulated in Eqs. (20) and (21), where Ci and Vj are the consonant and vowel targets learned in the lower layer, respectively, while C′i and V′j are the corresponding outputs of the upper layer. K is the number of VCV combinations. Minimizing this objective function yields the upper layer parameters and the underlying phonetic targets,
(20)
(21)
The optimization process is expressed as
(22)
where γ is the weighting coefficient balancing l(·) and f(·). The value of γ was determined experimentally to provide accurate results for both consonants and vowels: γ was varied from 0 to 1 in steps of 0.05, and the optimal values were γ = 0.05 for the apical consonants and γ = 0.95 for the vocalic targets.

3. Bi-level optimization

As stated above, a single learning process determines two correlated parameter sets. To reduce this correlation, we adopt a bi-level optimization method in the upper layer, which is well suited to such problems. Bi-level programming stems from Stackelberg games (Li and Sethi, 2017), in which a leader makes decisions knowing that a follower will respond but cannot control the response (Dempe and Dutta, 2012; Kleinert et al., 2021; Lodi et al., 2014). At each level, decision makers optimize their own variables but may be affected by the other level's variables. Bi-level programming is often used for decomposition (Dempe and Zemkoho, 2013).

We separate the two parameter sets into the phonetic target and the upper layer model's coefficient levels. This decomposes the problem into two subproblems using bi-level optimization, shown in Eq. (23). The first part focuses on optimizing phonetic targets, and the second focuses on optimizing coefficients. Each affects the other through shared variables.

Because the follower level is non-convex, traditional bi-level optimization can fail. We therefore use the MADS method at both levels. In Eq. (23), the vectors Vj and Ci are optimized at the leader level, and the vectors gi, kj,j+1, and mi,j are treated at the follower level,
(23)
where gi,kj,j+1, and mi,j solve
(24)
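For illustration, the sketch below shows one way such an alternating leader-follower decomposition can be iterated, reusing the pattern_search stand-in defined earlier; the actual study applies MADS at both levels, and the function names here are illustrative.

```python
def bilevel_optimize(targets0, coeffs0, leader_obj, follower_obj, n_loops=20):
    """Alternate leader (phonetic targets) and follower (model coefficients) steps.

    leader_obj(targets, coeffs) and follower_obj(coeffs, targets) return scalar
    errors; each block of variables is optimized while the other is held fixed,
    here with the simplified pattern_search stand-in defined above.
    """
    targets, coeffs = targets0, coeffs0
    for _ in range(n_loops):
        # Leader level: update phonetic targets given the current coefficients.
        targets, _ = pattern_search(lambda t: leader_obj(t, coeffs), targets)
        # Follower level: update model coefficients given the current targets.
        coeffs, _ = pattern_search(lambda c: follower_obj(c, targets), coeffs)
    return targets, coeffs
```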

The initial values of x lie within the rational region empirically determined from the physiological articulatory model. Each xi and yi is further limited to a rational region within a 1.5 cm radius of its initial position.

To obtain the phonetic targets and optimize the model's coefficients, we conducted learning experiments using the above algorithms. The Nippon Telegraph and Telephone EMMA database provided the observed articulatory data. We extracted 153 VCV combinations with five Japanese vowels (/a/, /i/, /u/, /e/, /o/) and eight consonants (/d/, /g/, /k/, /n/, /r/, /s/, /t/, /w/). Each phoneme was represented by a 6D vector of jaw, tongue tip, and tongue dorsum positions during its steady state.

We calculate the mean X̄ and standard deviation s of the positions of each vowel and consonant across the n simulation-observation pairs. Using a Z value of 1.960 from the standard normal table, we compute the 95% confidence interval (CI) with the corresponding equation. We then count all points within this interval and use Matplotlib's Ellipse patch (Hunter, 2007) to draw an ellipse around them,
(25)
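For illustration, the sketch below computes such a coverage region and draws it with matplotlib.patches.Ellipse (Hunter, 2007); because Eq. (25) is not reproduced here, the interval form (mean ± 1.96 s per dimension) is an assumption, and all names and data are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def plot_95ci_ellipse(xy, ax, z=1.960, **style):
    """Draw an axis-aligned ellipse spanning the interval mean +/- z*s of a
    cloud of articulatory positions xy (n x 2, x and y in cm), and return the
    number of points falling inside that interval.  The interval form is an
    assumption, since Eq. (25) is not reproduced in this text."""
    mean = xy.mean(axis=0)
    s = xy.std(axis=0, ddof=1)                 # per-dimension standard deviation
    inside = np.all(np.abs(xy - mean) <= z * s, axis=1)
    ax.add_patch(Ellipse(xy=mean, width=2 * z * s[0], height=2 * z * s[1],
                         fill=False, **style))
    return int(inside.sum())

# Toy usage with random stand-in data for one phoneme:
rng = np.random.default_rng(0)
pts = rng.normal(loc=[1.0, -0.5], scale=[0.1, 0.2], size=(30, 2))
fig, ax = plt.subplots()
ax.scatter(pts[:, 0], pts[:, 1], marker='x')
n_in = plot_95ci_ellipse(pts, ax, linestyle='--')
ax.set_xlabel('x (cm)')
ax.set_ylabel('y (cm)')
print(n_in)
```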

1. Results on the lower layer

On the lower layer, we learned articulatory targets from the VCV combinations through 90 optimization iterations. Figure 5 shows the error decreasing monotonically over iterations, reaching an average of 0.065 cm, which is close to the EMMA measurement accuracy.

Figure 4 shows the distributions of the learned articulatory targets for vowels and consonants obtained from the lower layer optimization. In Fig. 4, crosses indicate observed articulator positions from the EMMA database, while diamonds show the corresponding simulated positions after optimization.

FIG. 4. (Color online) Distributions of observed and simulated articulatory movements of five vowels (a) and eight consonants (b) in the lower layer. The diamonds denote the simulations; cross marks show the observations. The ellipses indicate the 95% confidence intervals covering the articulatory targets.
FIG. 5. Error curves in the learning process. Curve (a) is the error for the upper layer learning process, and curve (b) is the error for the lower layer.

For the vowel targets in Fig. 4(a), the dashed ellipses represent 95% CIs calculated to cover the target distributions. It can be clearly observed that the optimized model produces simulated vowel articulator positions that closely match the actual observed positions from EMMA data. The two distributions align very closely, demonstrating that the optimization successfully converges on vowel targets consistent with the real articulatory measurements.

Similarly, Fig. 4(b) displays the learned articulatory target distributions for consonants. A key observation is that most of the crucial point targets for consonants involving closures, such as /d/, /g/, and /k/, lie beyond the region of the hard palate. This suggests that, to achieve complete closure, the articulatory targets for these consonants must extend beyond the hard palate itself. This phenomenon highlights how the optimization process is able to extract meaningful physical knowledge about speech articulation biomechanics. Overall, the matched distributions in Fig. 4 indicate the successful learning of articulatory targets that closely agree with real speech production.

2. Results on the upper layer

In the upper layer, we used the bi-level decomposition optimization method with 100 MADS iterations at each leader and follower level per loop. Figure 5 also shows the averaged error over the x- and y-dimensions for the tongue tip and dorsum. The optimization error converged after more than 2000 total iterations. The final average error between the targets learned in the lower layer and those calculated in the upper layer was 0.176 cm, a reasonable value. The non-smooth points early in the error curve arise from switching between levels in the bi-level method.

1. Evaluating optimization performance through simulations

Figure 6 shows the distribution of observed and simulated articulatory movements for vowels [Fig. 6(a)] and consonants [Fig. 6(b)] obtained through the proposed modeling framework. The observed trajectories were recorded during speech production using EMMA with sensors on the tongue, lips, and jaw. These real EMMA data are plotted as red crosses, providing ground truth configurations for different phonemes. The blue diamonds represent corresponding simulated trajectories from the model.

FIG. 6. (Color online) Distribution of observed and simulated articulatory movements of five vowels (a) and eight consonants (b) obtained via the whole framework. The blue diamonds denote the simulations; red cross marks show the observations. The last panel on the bottom right depicts the learned phonetic targets for five vowels.

Qualitatively, the simulations closely match the real observations, indicating that the model accurately maps phonetic targets to articulatory movements. Quantitative evaluation shows that the average error between simulated and observed trajectories is only 1.2 mm, demonstrating excellent performance. The bottom right panel shows that the model has learned the underlying phonetic targets, the planned configurations for producing each phoneme, as evidenced by the clustered vowel distributions.

The final panels in Figs. 6(a) and 6(b) plot the learned targets for comparison. The targets of the consonants produced with tongue-palate closure, /d/, /t/, /n/, /r/, /k/, and /g/, lie beyond the hard palate boundary, whereas the targets of the fricative /s/ and the semivowel /w/ lie inside the vocal tract. This implies that closure targets are virtual targets located beyond the palate, consistent with prior hypotheses (Löfqvist and Gracco, 2002; Fuchs et al., 2001).

Overall, the proposed framework successfully acquires a robust model that simulates realistic articulatory movements from planned phonetic targets, enabling diverse applications.

2. Subjective evaluation

To evaluate the optimized articulatory targets and model's parameters subjectively, we conducted listening tests by synthesizing speech using an articulatory model-based synthesizer. Three distinct conditions were used for synthesis: condition 1 employed the original observed articulatory targets from the EMMA database without applying the upper layer model. Condition 2 used the articulatory targets learned through optimization in the lower layer but still without the upper layer model. Finally, condition 3 synthesized speech using the fully optimized targets and model from the upper layer learning.

The speech materials consisted of the 153 VCV combinations extracted earlier, of which 40 combinations were randomly selected for the listening test. We adopted a paired A-B comparison methodology, in which 18 volunteer subjects listened to the samples through binaural headphones at a comfortable loudness level in a soundproof auditory room. In the A-B test, speech samples from two of the conditions were presented as a pair for each VCV combination, and the subject was asked to choose the sample that sounded more natural. Subjects could replay each pair of samples multiple times before finalizing their selection; however, reselecting after a choice was marked was not permitted.

In the comparison between conditions 1 and 2, the samples synthesized from the learned articulatory targets (76%) were preferred over those from the original observed targets (11%) by a large margin. The remaining 13% were rated as neutral, i.e., the subject could not reliably choose between the two. In the test between conditions 2 and 3, the samples synthesized with the optimized upper model (68%) were strongly favored over those without it (15%), with 16% neutral ratings. The full results are presented numerically in Tables IV and V. These subjective listening tests clearly demonstrate the improvement in perceived naturalness and speech quality obtained by optimizing and applying the upper model, beyond just using the observed articulatory data.

TABLE IV. Average choice rates for the test comparing condition 1 and condition 2.

                                        Condition 1    Condition 2    Neutral
Average percentage (%)                  10.97          76.11          12.92
Standard deviation of percentage (%)    2.99           4.64           3.12
TABLE V. Average choice rates for the test comparing condition 2 and condition 3.

                                        Condition 2    Condition 3    Neutral
Average percentage (%)                  15.56          68.61          15.83
Standard deviation of percentage (%)    2.79           3.45           2.27

In this study, we validated the carrier-modulation structure, also known as the principal-subordinate structure, of articulatory movements in the articulatory space through the construction of a generalized tongue movement. Building on this observation, we proposed a computational formulation to elucidate the underlying mechanism of anticipatory coarticulation between the tongue tip and tongue dorsum.

To obtain the phonetic targets of the planning stage and refine the model's parameters, we introduced a novel optimization framework based on a physiological articulatory model. By integrating two physical models derived from human speech production mechanisms, this framework ensures that the learned parameters possess explicit physical meanings, going beyond mere fitting of observations with a black box model. Remarkably, our findings provide explicit evidence for the commonly held hypothesis that consonants with closure typically have virtual targets that overshoot the hard palate.

The distributions of simulated articulatory movements exhibit strong agreement with the EMMA-based observations, with an average error of 0.18 cm. Furthermore, results from the A-B comparison listening tests demonstrated that the sound quality obtained with the learned phonetic targets surpassed that obtained with the estimated targets. These findings underscore the capability of our approach to model coarticulation in the planning stage, enabling the physiological articulatory model to generate articulatory movements and speech sounds that mimic human speech production.

A significant limitation of the present study is the use of isolated syllables, which fails to fully capture the complexity of natural, continuous speech production. In real speech, intricate interactions occur between sounds as coarticulation takes place in anticipatory and carryover manners across phoneme sequences. To address this limitation and better decode the intricate dancing of the tongue, future research will expand experiments to incorporate more diverse and naturalistic speech sequences. This will involve modeling a wider range of phonetic contexts and analyzing more complex speech patterns to gain a deeper understanding of the dynamics of anticipatory and carryover coarticulation during connected speech.

We gratefully acknowledge Nippon Telegraph and Telephone Communication Laboratories for providing the articulatory data used in this study.

The authors have no conflicts to disclose.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Figure 7 shows the VCV combinations /atati/ (top) and /ukati/ (bottom) produced by one speaker. We present samples of the EMMA data and one frame of the MRI images (Fig. 8).

FIG. 7. (Color online) Examples of the EMMA data. (a) shows the acoustic signal and tongue positions for /atati/, and (b) shows those for /ukati/.
FIG. 8. (Color online) One frame of the MRI data for building the planned phonetic target model showing the midsagittal view of the vocal tract.
1. Audet, C., and Dennis, J. E. (2006). "Mesh adaptive direct search algorithms for constrained optimization," SIAM J. Optim. 17(1), 188–217.
2. Audet, C., Le Digabel, S., Montplaisir, V. R., and Tribes, C. (2022). "Algorithm 1027: NOMAD version 4: Nonlinear optimization with the MADS algorithm," ACM Trans. Math. Softw. 48(3), 35.
3. Beddor, P. S. (2009). "A coarticulatory path to sound change," Language 85(4), 785–821.
4. Beddor, P. S., McGowan, K. B., Boland, J. E., Coetzee, A. W., and Brasher, A. (2013). "The time course of perception of coarticulation," J. Acoust. Soc. Am. 133(4), 2350–2366.
5. Cychosz, M., Munson, B., and Edwards, J. R. (2021). "Practice and experience predict coarticulation in child speech," Lang. Learn. Devel. 17(4), 366–396.
6. Dang, J., and Honda, K. (2004). "Construction and control of a physiological articulatory model," J. Acoust. Soc. Am. 115(2), 853–870.
7. Dang, J., Honda, K., and Perrier, P. (2004). "Implement of coarticulation in physiological articulatory model," in International Congress of Acoustics, April 4–9, Kyoto, Japan, Vol. 2, pp. 1335–1336.
8. Dempe, S., and Dutta, J. (2012). "Is bilevel programming a special case of a mathematical program with complementarity constraints?," Math. Program. 131, 37–48.
9. Dempe, S., and Zemkoho, A. B. (2013). "The bilevel programming problem: Reformulations, constraint qualifications and optimality conditions," Math. Program. 138, 447–473.
10. DiCanio, C. T. (2012). "Coarticulation between tone and glottal consonants in Itunyoso Trique," J. Phon. 40(1), 162–176.
11. Farnetani, E., and Recasens, D. (2010). "Coarticulation and connected speech processes," in The Handbook of Phonetic Sciences, 2nd ed., edited by W. J. Hardcastle, J. Laver, and F. E. Gibbon (Wiley, Hoboken, NJ), pp. 317–342.
12. Flemming, E. (2011). "The grammar of coarticulation," in La Coarticulation: Indices, Direction et Representation (Coarticulation: Cues, Direction and Representation), edited by M. Embarki and C. Dodane (L'Harmattan, Montpellier, France), pp. 23–26.
13. Fuchs, S., Perrier, P., and Mooshammer, C. (2001). "The role of the palate in tongue kinematics: An experimental assessment in VC sequences from EPG and EMMA data," in Proceedings of Eurospeech, September 3–7, Aalborg, Denmark, pp. 1487–1490.
14. Hardcastle, B., and Tjaden, K. (2008). "Coarticulation and speech impairment," in The Handbook of Clinical Linguistics, edited by J. B. Martin, M. Nicole, and S. Elizabeth (Wiley, Hoboken, NJ), pp. 506–524.
15. Henke, W. L. (1966). "Dynamic articulatory model of speech production using computer simulation," Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA.
16. Hunter, J. D. (2007). "Matplotlib: A 2D graphics environment," Comput. Sci. Eng. 9(3), 90–95.
17. Kaburagi, T., and Honda, M. (2002). "Electromagnetic articulograph based on a nonparametric representation of the magnetic field," J. Acoust. Soc. Am. 111(3), 1414–1421.
18. Katz, W. F. (2000). "Anticipatory coarticulation and aphasia: Implications for phonetic theories," J. Phon. 28(3), 313–334.
19. Katz, W. F., Mehta, S., Wood, M., and Wang, J. (2017). "Using electromagnetic articulography with a tongue lateral sensor to discriminate manner of articulation," J. Acoust. Soc. Am. 141(1), EL57–EL63.
20. Kim, J., Lammert, A. C., Kumar Ghosh, P., and Narayanan, S. S. (2014). "Co-registration of speech production datasets from electromagnetic articulography and real-time magnetic resonance imaging," J. Acoust. Soc. Am. 135(2), EL115–EL121.
21. Kleber, F., Harrington, J., and Reubold, U. (2012). "The relationship between the perception and production of coarticulation during a sound change in progress," Lang. Speech 55(3), 383–405.
22. Kleinert, T., Labbé, M., Ljubić, I., and Schmidt, M. (2021). "A survey on mixed-integer programming techniques in bilevel optimization," EURO J. Comput. Optim. 9, 100007.
23. Kroos, C. (2012). "Evaluation of the measurement precision in three-dimensional Electromagnetic Articulography (Carstens AG500)," J. Phon. 40(3), 453–465.
24. Kühnert, B., and Nolan, F. (1999). "The origin of coarticulation," in Coarticulation: Theory, Data and Techniques, edited by W. J. Hardcastle and N. Hewlett (Cambridge University Press, Cambridge, UK), pp. 7–30.
25. Li, T., and Sethi, S. P. (2017). "A review of dynamic Stackelberg game models," Discrete Contin. Dyn. Syst.-B 22(1), 125–159.
26. Lodi, A., Ralphs, T. K., and Woeginger, G. J. (2014). "Bilevel programming and the separation problem," Math. Program. 146(1–2), 437–458.
27. Löfqvist, A., and Gracco, V. L. (2002). "Control of oral closure in lingual stop consonant production," J. Acoust. Soc. Am. 111(6), 2811–2827.
28. Magen, H. S. (1997). "The extent of vowel-to-vowel coarticulation in English," J. Phon. 25(2), 187–205.
29. Mok, P. K. P. (2010). "Language-specific realizations of syllable structure and vowel-to-vowel coarticulation," J. Acoust. Soc. Am. 128(3), 1346–1356.
30. Öhman, S. E. G. (1966). "Coarticulation in VCV utterances: Spectrographic measurements," J. Acoust. Soc. Am. 39(1), 151–168.
31. Öhman, S. E. G. (1967). "Numerical model of coarticulation," J. Acoust. Soc. Am. 41(2), 310–320.
32. Perkell, J. S., and Klatt, D. H. (1986). Invariance and Variability in Speech Processes (Psychology Press, Hove, UK).
33. Punišić, S., Jovičić, S. T., Subotić, M., and Stojanović, J. (2021). "Psychoacoustic evaluation of acoustic features distortion in fricative consonant of speech," Appl. Acoust. 171, 107564.
34. Recasens, D. (2002). "An EMA study of VCV coarticulatory direction," J. Acoust. Soc. Am. 111(6), 2828–2841.
35. Recasens, D., and Pallarès, M. D. (1999). "A study of /J/ and /r/ in the light of the 'DAC' coarticulation model," J. Phon. 27(2), 143–169.
36. Recasens, D., and Pallarès, M. D. (2001). "Coarticulation, assimilation and blending in Catalan consonant clusters," J. Phon. 29(3), 273–301.
37. Recasens, D., Pallarès, M. D., and Fontdevila, J. (1997). "A model of lingual coarticulation based on articulatory constraints," J. Acoust. Soc. Am. 102(1), 544–561.
38. Redford, M. A. (2019). "Speech production from a developmental perspective," J. Speech Lang. Hear. Res. 62(8S), 2946–2962.
39. Repp, B. H. (1986). "Some observations on the development of anticipatory coarticulation," J. Acoust. Soc. Am. 79(5), 1616–1619.
40. Salverda, A. P., Kleinschmidt, D., and Tanenhaus, M. K. (2014). "Immediate effects of anticipatory coarticulation in spoken-word recognition," J. Mem. Lang. 71(1), 145–163.
41. Zharkova, N., and Hewlett, N. (2009). "Measuring lingual coarticulation from midsagittal tongue contours: Description and example calculations using English /t/ and /a/," J. Phon. 37(2), 248–256.