A model that synthesizes average frequency components from selected sentences in an electromagnetic articulography database has been constructed. It reveals the dual roles of the tongue in the articulatory domain: the dorsum acts as a carrier wave, and the tip acts as a modulation signal. The model illuminates the subtleties of anticipatory coarticulation during speech planning. It undergoes a rigorous two-stage optimization, statistical estimation followed by refinement, to depict carryover and anticipatory effects. The model's lower layer, rooted in physiological insights, deciphers carryover targets, while its upper layer captures anticipation. Optimization pinpointed unique phonetic targets for each phoneme, providing deep insights into virtual target formation during speech planning. The simulations align closely with empirical data, with an average error of only 0.18 cm, and extensive listening tests attest to the model's accuracy and the enhanced quality of the synthesized speech.
I. INTRODUCTION
The intricate ballet of speech unfolds on the stage of the vocal tract with the tongue playing the prima ballerina. Its nimble movements weave the tapestry of sound but rarely in isolation. Each sound dances with its neighbors, their motions subtly affecting each other in a complex choreography known as coarticulation. This intricate interplay between phonemes, where the articulation of one segment bleeds into the next, imbues continuous speech with remarkable variability in its acoustic and articulatory landscape (Perkell and Klatt, 1986). Like actors adapting their performance to the scene, phonetic segments readily shed their isolated character, adopting nuanced modifications in response to their immediate linguistic context (Kühnert and Nolan, 1999). These modifications, aptly named coarticulation (Flemming, 2011; Kleber, 2012; Kühnert and Nolan, 1999; Magen, 1997), arise from the intricate overlap of articulatory gestures during speech production. Yet, despite its pervasive presence, coarticulation remains a captivating riddle, demanding further exploration to unveil its secrets.
Coarticulation has long been an area of interest in speech production research and phonetics (Beddor, 2013; Cychosz, 2021; DiCanio, 2012; Recasens and Pallarès, 2001; Zharkova and Hewlett, 2009). Coarticulation models have been developed to bridge the gap between invariant, discrete representations of articulation and acoustics, and to accurately predict the intricacies of this process. Numerous studies (Hardcastle and Tjaden, 2008; Recasens, 1997; Öhman, 1967) have employed diverse approaches to elucidate the origins, nature, and functions of coarticulation.
From a physiological perspective, variations in phonemes manifest in their acoustic realization during speech production, even if phonetic targets remain invariant at the brain level (Beddor, 2009; Punišić, 2021; Redford, 2019). This is because articulator movements change throughout utterances. Some aspects of coarticulation are planned in advance, known as anticipatory coarticulation (Katz, 2000; Repp, 1986), while others, known as carryover coarticulation, arise from physical properties of the articulators, such as inertia and the finite acceleration/deceleration of vocal tract tissues (Farnetani and Recasens, 2010; Mok, 2010; Salverda, 2014).
Henke (1966) proposed a look-ahead model to account for anticipatory coarticulation, while Öhman (1966) developed the perturbation model based on spectrographic and x-ray studies of English and Swedish subjects. Öhman suggested that vocalic utterances can be viewed as continuous movements with consonant gestures superimposed. The look-ahead model emphasizes the temporal order in anticipation, whereas the perturbation model focuses on the principal-subordinate relationship between vowels and consonants. The carrier model proposed by Dang and Honda (2004) and Dang (2004) combines the strengths of these two frameworks, both established ideas in speech production, to capture coarticulatory effects during the planning stage.
Our study develops a coarticulation model, capturing the planning stage of speech production. We leverage insights from articulatory movements and simulations using a preexisting physiological model, comprehensively depicting the process from targets and muscle activation to synthetic sounds.
To systematically understand coarticulation, we rigorously investigate the principal-subordinate structure of articulation as a fundamental assumption. By constructing a text-independent, generalized articulatory movement, we verify this structure and formulate anticipatory coarticulation. Drawing on statistical insights from the electromagnetic midsagittal articulography (EMMA) data (Kaburagi and Honda, 2002), we integrate our method with a physiological model.
Our analysis reveals anticipatory and carryover coarticulation. Separating these remains an open issue. We propose a two-layer learning procedure with an upper layer for anticipatory and a lower layer for carryover coarticulation. This reduces complexity and allows tailored optimization.
We rigorously assess optimization through objective and subjective evaluations. Objective evaluation uses spatial metrics to quantify correspondence between observed and simulated movements. Subjective evaluation involves listening tests to capture perceptual aspects.
This paper is organized as follows. Section II briefly describes speech production and coarticulation, verifies the principal-subordinate structure, and elucidates the formulation of this model. Section III proposes a simulation-based learning framework and process. Section IV depicts the details and results of the learning experiments and resulting evaluations. Finally, Sec. V provides the conclusions.
II. MODELING COARTICULATION
The essential goal of modeling articulators and their movements is to explain, understand, and mimic the coherent rationale and mechanisms underlying human speech production. This section first briefly introduces the principles of speech production and coarticulation.
A. Speech production and coarticulation
A brief flow chart of the procedure used in human speech production is displayed in Fig. 1. We suppose that there is a unique spatial target, referred to as a phonetic target, corresponding to each phonetic unit of speech. In the flow chart, the articulatory targets of a phonetic unit are generated from the single phonetic target of this phoneme in the planning stage by the anticipation mechanism, according to contextual variations. These articulatory targets then drive the articulators to produce articulatory movements.
(Color online) The simulation framework proposed for estimating phonetic targets in coarticulation.
During natural speech, two forms of coarticulation overlap can be observed: left-to-right (LR, carryover) and right-to-left (RL, anticipatory). The carryover effect manifests through the physiological and kinematical movements of the articulators. In contrast, anticipatory coarticulation generally indicates sophisticated phonological-phonetic processing, wherein the speaker anticipates incoming sounds. To describe anticipatory coarticulation, Henke introduced a phonemic-segment model. Each utterance is described as a matrix of articulatory targets with distinctive features, where certain features change abruptly as the target shifts.
In this study, articulatory targets stem from the phonetic targets of each phoneme through a carrier-modulation framework. The carrier model aims to provide a computational framework to account for coarticulation in planning. Fundamentally, an utterance comprises consonant and vowel streams. Vowels exert global, sustained effects while consonants exhibit local, transient effects. Treating vowels as a carrier wave and consonants as a modulation signal conceptualizes coarticulation as modulation.
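To make the carrier-modulation view concrete, the following minimal Python sketch composes a one-dimensional tongue trajectory as a slow vowel carrier plus brief consonantal pulses. All shapes, amplitudes, and time constants here are illustrative assumptions, not values estimated in this study:

```python
import numpy as np

def carrier_modulation(t, vowel_targets, vowel_times, cons_times,
                       cons_amp=0.5, cons_width=0.04):
    """Toy composition of a 1-D tongue trajectory: a slow vowel
    carrier plus transient consonantal pulses (illustrative only)."""
    # Vowel carrier: smooth interpolation through the vowel targets.
    carrier = np.interp(t, vowel_times, vowel_targets)
    # Consonant modulation: short Gaussian pulses at consonant centers.
    modulation = sum(cons_amp * np.exp(-0.5 * ((t - tc) / cons_width) ** 2)
                     for tc in cons_times)
    return carrier + modulation

t = np.linspace(0.0, 1.0, 250)          # 1 s at an EMMA-like 250 Hz rate
traj = carrier_modulation(t,
                          vowel_targets=[0.0, 1.0, -0.5, 0.8],
                          vowel_times=[0.125, 0.375, 0.625, 0.875],
                          cons_times=[0.0, 0.25, 0.5, 0.75])
```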
B. Verification of principal-subordinate movement structure
The carrier-modulation hypothesis of the carrier model is similar to the principal-subordinate structure used by Öhman (1967), which was proposed based on spectrogram analyses in the acoustic domain. This section attempts to verify whether this carrier-modulation structure also exists in the articulatory domain. However, it is difficult to verify the structure using a specific phoneme sequence because articulatory movements are context dependent and cannot be comprehensively covered by a short sentence. Therefore, we analyze the movement components of the speech organs in the frequency domain and reconstruct a generalized articulatory movement by averaging their frequency components. This generalized movement should reflect the inherent properties of the speech organs in a general contextual environment.
The articulatory data used in this study were collected using the EMMA system. Four receiver coils (named T1–T4) were placed on the tongue surface in the midsagittal plane, as well as on the upper lip, lower lip, maxilla incisor, mandible incisor, and velum. The sampling rate was 250 Hz for the articulatory channels and 16 kHz for the acoustic channel. The coordinate origin was located at the maxilla incisor, 0.5 cm above its tip. The speech materials comprised 360 Japanese sentences read by 3 adult male speakers at a normal speech rate. The acoustic signal and articulatory data were recorded simultaneously, although only four tongue points (T1–T4) were used, here, out of the eight observed in the EMMA database.
In this analysis, 352 sentences from the EMMA database were selected to generate text-independent articulatory movements. From each sentence, 2-s segments of speech were extracted. A short-term discrete Fourier transform (DFT) with 256 samples (about 1 s) was applied to the extracted segments, windowed by a Hanning window with a frame shift of about 64 samples. Complex spectra for T1–T3 were obtained by averaging all DFT frames. Figure 2 shows the average amplitude spectra of T1 and T3 in the vertical (Y) dimension, limited to frequencies below 40 Hz. The results indicate distinguishable generalized movements of the tongue tip versus the tongue dorsum in this frequency region. To construct a generalized articulatory movement, the average complex spectra are represented in the form of a Fourier series.
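The averaging procedure can be sketched as follows; the function names are ours, and the random placeholder segments stand in for actual EMMA channels (e.g., the T1 or T3 vertical traces):

```python
import numpy as np

FS = 250            # articulatory sampling rate (Hz)
NFFT = 256          # DFT length, about 1 s of data
SHIFT = 64          # frame shift in samples

def average_spectrum(segments):
    """Average the complex short-term DFT spectra over many 2-s
    segments of one articulatory channel."""
    window = np.hanning(NFFT)
    frames = []
    for seg in segments:
        for start in range(0, len(seg) - NFFT + 1, SHIFT):
            frames.append(np.fft.rfft(seg[start:start + NFFT] * window))
    return np.mean(frames, axis=0)

def generalized_movement(avg_spec, n_samples=NFFT):
    """Resynthesize a generalized 1-s movement from the averaged
    complex spectrum, i.e., treat it as a truncated Fourier series."""
    return np.fft.irfft(avg_spec, n=n_samples)

# Placeholder data: random 2-s segments standing in for EMMA channels.
rng = np.random.default_rng(0)
demo_segments = [rng.standard_normal(2 * FS) for _ in range(10)]
movement = generalized_movement(average_spectrum(demo_segments))
```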
(Color online) Average spectra of articulatory movements at tongue tip and dorsum.
In general, vowel production involves global, sustained tongue movement, whereas consonantal movement is more local and transient. Because apical consonant constrictions are shaped by the tongue tip, T1 roughly represents consonants (C), while T3 represents vowels (V). In Japanese, CV is the basic syllable unit; therefore, we can reasonably suppose that the constructed 1-s generalized articulatory movement corresponds to a CVCVCVCV phoneme sequence. Our analysis shows that the tongue dorsum (T3) mainly reflects the vocalic V_V_V_V stream, excluding consonants, while the tongue tip (T1) traces the full CVCVCVCV sequence. If this speculation holds, the T1 movement should have about twice as many maximum (or minimum) peaks as T3 in the same period. To test this, we calculated the velocities at T1 and T3, depicted in Fig. 3. At phoneme centers, the articulators are in a steady state with zero velocity. Figure 3 shows 14 velocity zeros for T1 and 8 for T3; T1 has about twice as many, supporting the principal-subordinate structure.
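A minimal sketch of this zero-counting test is given below; `t1_y` and `t3_y` are hypothetical arrays holding the generalized vertical movements of T1 and T3:

```python
import numpy as np

def count_velocity_zeros(position, fs=250):
    """Count zero crossings of the velocity signal, i.e., the number
    of local maxima/minima of the position trajectory."""
    velocity = np.gradient(position) * fs
    signs = np.sign(velocity)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return int(np.sum(signs[:-1] != signs[1:]))

# With the generalized movements constructed above, the expectation is
# roughly: count_velocity_zeros(t1_y) ≈ 2 * count_velocity_zeros(t3_y)
```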
C. Modeling anticipatory coarticulation
Based on the above analysis, an utterance can generally be considered as comprising vowel and consonant streams in a principal-subordinate structure. A look-ahead mechanism is applied to realize the interaction between adjacent phonemes within and across these components.
D. Parameterization of model coefficients
1. Estimation of the DAC
DAC values of five Japanese vowels.
| Position | /a/ | /i/ | /u/ | /e/ | /o/ |
| --- | --- | --- | --- | --- | --- |
| T1x | 2.92 | 2.13 | 2.00 | 5.20 | 1.84 |
| T1y | 1.52 | 1.00 | 2.30 | 5.06 | 2.25 |
| T3x | 3.01 | 2.64 | 1.66 | 4.80 | 1.65 |
| T3y | 2.60 | 3.23 | 3.08 | 1.50 | 1.19 |
DAC values of eight consonants.
| Position | /d/ | /g/ | /k/ | /n/ | /r/ | /s/ | /t/ | /w/ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T1x | 6.49 | 4.79 | 3.64 | 5.62 | 2.84 | 7.09 | 13.75 | 9.62 |
| T1y | 3.23 | 4.68 | 7.16 | 4.75 | 9.63 | 1.72 | 5.04 | 5.91 |
| T3x | 5.00 | 3.35 | 4.07 | 3.32 | 4.16 | 4.73 | 7.75 | 5.09 |
| T3y | 2.79 | 9.72 | 8.49 | 2.11 | 1.09 | 1.37 | 2.21 | 1.34 |
2. Weighting coefficient for the planning mechanism
3. Coarticulation resistance coefficient
Resistance coefficients for eight consonants.
| Dimension | /d/ | /n/ | /s/ | /t/ | /r/ | /g/ | /k/ | /w/ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| x-dimension | 9.5 | 2.1 | 21.4 | 11.3 | 3.1 | 0.2 | 0.3 | 3.9 |
| y-dimension | 15.5 | 10.9 | 11.0 | 13.5 | 5.3 | 2.0 | 2.6 | 3.8 |
III. FRAMEWORK OF THE MODEL-BASED LEARNING PROCESS
As mentioned earlier, phonetic targets in articulatory space are assumed to be constant for each phoneme during planning. This model transforms phonetic targets into articulatory targets using contextual information. However, relying solely on observation-based parameters may not yield optimal results, especially because phonetic targets cannot be directly observed. This raises the question of how to obtain optimal parameters and estimate phonetic targets. To tackle this challenge, we employ a model-based learning approach to estimate coefficients and phonetic targets.
A. Optimization framework
Currently, EMMA data provide articulatory movements, but neither phonetic targets nor articulatory targets are observable (Katz, 2017; Kim, 2014; Kroos, 2012). However, if a physiological articulatory model existed with functions identical to those of a human at the physiological and kinematic levels, we could obtain reliable phonetic and/or articulatory targets by tuning the model inputs to match the observations. Based on this idea, we propose a model-based learning process to estimate phonetic targets and optimize the model's parameters, combining a physiological articulatory model with the carrier coarticulation model. In contrast, traditional learning obtains parameters by minimizing an objective function to best explain a dataset in a maximum likelihood or minimum error sense; most such learned parameters are highly data dependent and rarely reflect the true physical mechanisms involved. To obtain inherent knowledge from observations, physical models must be combined with learning rather than simply fitting a black-box model. Compared to traditional learning, model-based learning provides physically meaningful parameters.
The physiological articulatory model adopted here is a partial three-dimensional (3D) model constructed from volumetric magnetic resonance imaging (MRI) data of a male Japanese speaker. It comprises the tongue, jaw, hyoid bone, and vocal tract walls. Its muscular structure was designed based on MRI measurements and anatomical knowledge. At the physiological level, this model is highly consistent with human speech production mechanisms. Two types of force drive the model: target-dependent forces, estimated from the EP-Map for given targets, and dynamic forces that minimize the distance to the target positions. This plausible control method aligns with human speech production, and the model naturally provides the functionality to emulate anticipatory coarticulation in speech planning. Based on this analogy, we propose an optimization framework built around the physiological model, in which the learning process mirrors the human production procedure, giving the learned parameters clear physical meaning.
This study focuses on modeling anticipatory coarticulation. However, the observed articulatory data include both carryover and anticipatory effects, which are inseparable in the raw data. To isolate anticipation, we split the framework into lower and upper layers. The upper layer contains the anticipatory effects, while the lower layer, built on the physiological articulatory model, inherently models the carryover effects. The articulatory targets connect the two layers, separating anticipation from carryover in simulation.
Analogous to human speech production, differences between the observations and the model simulations mainly stem from differences between the brain's articulatory targets and the model's articulatory targets, since the physiological model emulates human production. Thus, articulatory targets can be learned by minimizing observation-simulation differences in the lower layer. In the upper layer, the phonetic targets and the model's coefficients are then learned based on the articulatory targets obtained from the layer below.
B. Learning articulatory targets on the lower layer
The physiological articulatory model used to calculate the objective function involves nonanalytical, nonlinear processes from motor commands to articulatory movements. This makes it difficult to relate articulatory targets and outputs analytically. Given this complexity, traditional optimization methods using gradients or higher-order derivatives cannot be readily applied. To address this, we adopt the mesh adaptive direct search (MADS) algorithm (Audet and Dennis, 2006; Audet, 2022), designed for derivative-free optimization.
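The following simplified sketch conveys the flavor of such derivative-free search; it is a poll-only pattern search in the spirit of MADS, not the full algorithm of Audet and Dennis (2006), and all parameter values are illustrative:

```python
import numpy as np

def direct_search(objective, x0, mesh=0.1, mesh_min=1e-4, max_iter=90):
    """Simplified poll-only direct search (MADS/GPS flavor): evaluate
    the black-box objective at mesh points around the incumbent; move
    on success, refine the mesh on failure."""
    x = np.asarray(x0, dtype=float)
    fx = objective(x)
    for _ in range(max_iter):
        improved = False
        # Poll along positive and negative coordinate directions.
        for i in range(len(x)):
            for step in (mesh, -mesh):
                trial = x.copy()
                trial[i] += step
                f_trial = objective(trial)
                if f_trial < fx:
                    x, fx, improved = trial, f_trial, True
        if improved:
            mesh *= 2.0            # coarsen after a successful poll
        else:
            mesh *= 0.5            # refine the mesh on failure
            if mesh < mesh_min:
                break
    return x, fx

# Usage sketch: x0 could be an initial articulatory target vector, and
# objective(x) the distance between simulated and observed movements.
```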
C. Optimizing model and estimating phonetic targets
The task of the upper layer learning is to optimize the model's coefficients and estimate the phonetic targets for each phoneme.
1. Model reformulation
2. Objective function
3. Bi-level optimization
As stated above, a single learning process determines two correlated parameter sets. To reduce this correlation, we adopt a bi-level optimization method in the upper layer, which is suited to such problems. Bi-level programming stems from Stackelberg games (Li and Sethi, 2017), in which a leader makes decisions knowing that a follower will respond but cannot control that response (Dempe and Dutta, 2012; Kleinert, 2021; Lodi, 2014). At each level, the decision makers optimize their own variables but may be affected by the other's variables. Bi-level programming is often used for decomposition (Dempe and Zemkoho, 2013).
We separate the two parameter sets into a phonetic-target level and an upper-layer coefficient level. This decomposes the problem into the two subproblems of the bi-level optimization shown in Eq. (23): the first focuses on optimizing the phonetic targets, and the second on optimizing the coefficients. Each affects the other through shared variables.
The initial values of x lie within a plausible region, determined empirically from the physiological articulatory model. Each xi and yi is likewise restricted to a plausible region bounded by a 1.5 cm radius around its initial position.
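A minimal sketch of this alternating leader-follower scheme is given below, using SciPy's derivative-free Nelder-Mead routine as a stand-in for the MADS polls used in the study; the objective signature and loop counts are assumptions, and the 1.5 cm bound could be enforced by clipping each update to the allowed region:

```python
import numpy as np
from scipy.optimize import minimize

def bilevel_optimize(objective, targets0, coeffs0, n_loops=10):
    """Alternating leader-follower sketch. Nelder-Mead stands in for
    MADS; objective(targets, coeffs) is the upper-layer error between
    lower-layer learned targets and model-predicted targets."""
    targets = np.asarray(targets0, dtype=float)
    coeffs = np.asarray(coeffs0, dtype=float)
    for _ in range(n_loops):
        # Leader level: optimize phonetic targets, coefficients fixed.
        res = minimize(lambda t: objective(t, coeffs), targets,
                       method="Nelder-Mead", options={"maxiter": 100})
        targets = res.x
        # Follower level: optimize coefficients, targets fixed.
        res = minimize(lambda c: objective(targets, c), coeffs,
                       method="Nelder-Mead", options={"maxiter": 100})
        coeffs = res.x
    return targets, coeffs
```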
IV. RESULTS AND EVALUATIONS
A. Experiments for learning phonetic targets and optimizing the model
To obtain the phonetic targets and optimize the model's coefficients, we conducted learning experiments using the algorithms above. The Nippon Telegraph and Telephone EMMA database provided the observed articulatory data. We extracted 153 VCV combinations of the five Japanese vowels (/a/, /i/, /u/, /e/, /o/) and eight consonants (/d/, /g/, /k/, /n/, /r/, /s/, /t/, /w/). Each phoneme was represented by a six-dimensional (6D) vector of the jaw, tongue tip, and tongue dorsum positions during its steady state.
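As a sketch of how such a 6D representation might be assembled (the steady-state criterion and the array layout are our assumptions, not specified in the text):

```python
import numpy as np

def steadiest_frame(traj):
    """Index of the frame with the smallest articulator speed, taken
    as the steady-state sample of the phoneme."""
    speed = np.linalg.norm(np.gradient(traj, axis=0), axis=1)
    return int(np.argmin(speed))

def phoneme_vector(jaw, tip, dorsum):
    """Build the 6D articulatory vector for one phoneme from jaw,
    tongue-tip (T1), and tongue-dorsum (T3) midsagittal trajectories,
    each an array of shape (n_frames, 2)."""
    idx = steadiest_frame(tip)
    return np.concatenate([jaw[idx], tip[idx], dorsum[idx]])
```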
1. Results on the lower layer
On the lower layer, we learned articulatory targets from the VCV combinations through 90 optimization iterations. Figure 5 shows the error monotonically decreasing over iterations, reaching an average of 0.065 cm—close to EMMA measurement accuracy.
Figure 4 shows the distributions of the learned articulatory targets for vowels and consonants obtained from the lower layer optimization process. In Fig. 4, crosses indicate the observed articulator positions from the EMMA database, while diamonds show the corresponding simulated positions after optimization.
(Color online) Distributions of observed and simulated articulatory movements of five vowels (a) and eight consonants (b) in the lower layer. Diamonds denote the simulations; cross marks show the observations. The dashed ellipses denote the 95% confidence intervals covering the articulatory targets.
Error curves of the learning process: (a) the error curve of the upper layer learning process and (b) that of the lower layer.
For the vowel targets in Fig. 4(a), the dashed ellipses represent 95% CIs calculated to cover the target distributions. It can be clearly observed that the optimized model produces simulated vowel articulator positions that closely match the actual observed positions from EMMA data. The two distributions align very closely, demonstrating that the optimization successfully converges on vowel targets consistent with the real articulatory measurements.
Similarly, Fig. 4(b) displays the learned articulatory target distributions for the consonants. A key observation is that most of the crucial point targets for consonants involving closures, such as /d/, /g/, and /k/, lie beyond the hard palate. This suggests that to achieve complete closure, the articulatory targets for these consonants must extend beyond the palate itself. This phenomenon highlights how the optimization process extracts meaningful physical knowledge about speech articulation biomechanics. Overall, the matched distributions in Fig. 4 indicate successful learning of articulatory targets that closely agree with real speech production.
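A standard way to compute such covariance-based 95% confidence ellipses is sketched below, assuming 2-D sample points for one phoneme; this reflects common practice rather than the exact procedure of the study:

```python
import numpy as np
from scipy.stats import chi2

def confidence_ellipse(points, conf=0.95):
    """Center, semi-axis lengths, and major-axis angle (radians) of
    the covariance-based confidence ellipse for 2-D samples."""
    pts = np.asarray(points, dtype=float)     # shape (n_samples, 2)
    center = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues
    scale = chi2.ppf(conf, df=2)              # about 5.99 for 95%
    semi_axes = np.sqrt(scale * eigvals)
    angle = np.arctan2(eigvecs[1, 1], eigvecs[0, 1])  # major axis
    return center, semi_axes, angle
```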
2. Results on the upper layer
In the upper layer, we used the bi-level decomposition optimization method with 100 MADS iterations at each leader and follower level per loop. Figure 5 also shows the error averaged over the x- and y-dimensions for the tongue tip and dorsum. The optimization error converged after more than 2000 total iterations. The final average error between the lower layer learned targets and the upper layer calculated targets was 0.176 cm, a reasonable value. The unsmooth points early in the error curve arise from level switching in the bi-level method.
B. Evaluations
1. Evaluating optimization performance through simulations
Figure 6 shows the distribution of observed and simulated articulatory movements for vowels [Fig. 6(a)] and consonants [Fig. 6(b)] obtained through the proposed modeling framework. The observed trajectories were recorded during speech production using EMMA with sensors on the tongue, lips, and jaw. These real EMMA data are plotted as red crosses, providing ground truth configurations for different phonemes. The blue diamonds represent corresponding simulated trajectories from the model.
(Color online) Distribution of observed and simulated articulatory movements of five vowels (a) and eight consonants (b) obtained via the whole framework. The blue diamonds denote the simulations; red cross marks show the observations. The last panel on the bottom right depicts the learned phonetic targets for five vowels.
Qualitatively, the simulations closely match the real observations, indicating that the model accurately maps phonetic targets to articulatory movements. Quantitative evaluation shows that the average error between the simulated and observed trajectories is only 1.2 mm, demonstrating excellent performance. The bottom right panel shows that the model learned the underlying phonetic targets, the desired configurations for producing clear phonemes, as evidenced by the clustered vowel distributions.
The final panels in Figs. 6(a) and 6(b) plot the learned targets for comparison. The targets for /d/, /t/, /n/, /r/, /k/, and /g/, which involve tongue-palate closure, lie beyond the hard palate boundary, whereas the targets for the fricative /s/ and the semivowel /w/ lie inside the vocal tract. This implies that the targets must be virtual, lying beyond the palate, to achieve closure, which is consistent with prior hypotheses (Löfqvist and Gracco, 2002; Fuchs, 2001).
Overall, the proposed framework successfully acquires a robust model that simulates realistic articulatory movements, enabling diverse applications.
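A plausible form of the spatial metric used here, the mean Euclidean distance between corresponding observed and simulated sensor positions, is sketched below; the exact metric used in the evaluation may differ:

```python
import numpy as np

def mean_position_error(observed, simulated):
    """Mean Euclidean distance between corresponding observed and
    simulated sensor positions; arrays of shape (n_points, 2), in cm."""
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    return float(np.mean(np.linalg.norm(obs - sim, axis=-1)))
```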
2. Subjective evaluation
To subjectively evaluate the optimized articulatory targets and the model's parameters, we conducted listening tests, synthesizing speech with an articulatory model-based synthesizer under three distinct conditions: condition 1 employed the original observed articulatory targets from the EMMA database without applying the upper layer model; condition 2 used the articulatory targets learned through optimization in the lower layer, still without the upper layer model; and condition 3 synthesized speech using the fully optimized targets and model from the upper layer learning.
The speech materials consisted of the 153 VCV combinations extracted earlier, of which 40 were randomly selected for the listening test. We adopted a paired A-B comparison methodology, in which 18 volunteer subjects listened to the samples through binaural headphones at a comfortable loudness level in a soundproof auditory room. In each A-B trial, speech samples from two of the conditions were presented as a pair for a given VCV combination, and the subject was asked to choose the sample that sounded more natural. Subjects could replay a pair multiple times before finalizing their selection; however, changing a selection once marked was not permitted.
In the comparison between conditions 1 and 2, the samples synthesized from the learned articulatory targets (76%) were preferred over those from the original observed targets (11%) by a large margin. The remaining 13% were rated as neutral, i.e., the subject could not reliably choose between the two. In the test between conditions 2 and 3, the samples with the optimized upper model (68%) were strongly favored over those without it (15%), with 16% neutral ratings. The full results are presented numerically in Tables IV and V. These subjective listening tests clearly demonstrate the improvement in perceived naturalness and speech quality obtained by optimizing and applying the upper model, beyond merely using the observed articulatory data.
Average choice rates in the A-B test between condition 1 (trial 1) and condition 2 (trial 2).

| | Trial 1 | Trial 2 | Neutral |
| --- | --- | --- | --- |
| Average percentage (%) | 10.97 | 76.11 | 12.92 |
| Standard deviation (%) | 2.99 | 4.64 | 3.12 |
Average choice rates in the A-B test between condition 2 (trial 2) and condition 3 (trial 3).

| | Trial 2 | Trial 3 | Neutral |
| --- | --- | --- | --- |
| Average percentage (%) | 15.56 | 68.61 | 15.83 |
| Standard deviation (%) | 2.79 | 3.45 | 2.27 |
V. CONCLUSION
In this study, we validated the carrier-modulation structure, also known as the principal-subordinate structure, of articulatory movements in the articulatory space through the construction of a generalized tongue movement. Building on this observation, we proposed a computational formulation to elucidate the underlying mechanism of anticipatory coarticulation between the tongue tip and tongue dorsum.
To obtain the phonetic targets of the phonetic planning stage and refine the model's parameters, we introduced a novel optimization framework based on a physiological articulatory model. By integrating two physical models derived from human speech production mechanisms, this framework ensures that the learned parameters possess explicit physical meanings, going beyond the mere fitting of observations with a black-box model. Remarkably, our findings provide explicit evidence for the commonly held hypothesis that consonants with closure typically have targets that overshoot the hard palate.
The distributions of the simulated articulatory movements exhibit strong agreement with the EMMA-based observations, with an average error of 0.18 cm. Furthermore, the A-B comparison listening tests demonstrated that the sound quality obtained with the learned phonetic targets surpassed that obtained with the directly observed targets. These findings underscore the capability of our model to effectively capture coarticulation in the planning stage, enabling the physiological articulatory model to generate articulatory movements and speech sounds that mimic human speech production.
A significant limitation of the present study is the use of isolated syllables, which fails to fully capture the complexity of natural, continuous speech production. In real speech, intricate interactions occur between sounds as coarticulation takes place in anticipatory and carryover manners across phoneme sequences. To address this limitation and better decode the intricate dance of the tongue, future research will expand the experiments to incorporate more diverse and naturalistic speech sequences. This will involve modeling a wider range of phonetic contexts and analyzing more complex speech patterns to gain a deeper understanding of the dynamics of anticipatory and carryover coarticulation in connected speech.
ACKNOWLEDGMENTS
We gratefully acknowledge Nippon Telegraph and Telephone Communication Laboratories for providing the articulatory data used in this study.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.
APPENDIX
Figure 7 shows the utterances /atati/ (top) and /ukati/ (bottom) produced by one speaker. We list samples of the EMMA data and one frame (Fig. 8) of the MRI images.
(Color online) Examples of the EMMA data. (a) shows the acoustic signal and tongue positions for the utterance /atati/, and (b) shows those for /ukati/.
(Color online) One frame of the MRI data for building the planned phonetic target model showing the midsagittal view of the vocal tract.