Identifying and quantifying speech variations in velar production across phonological environments has long been of interest in speech motor control studies. Dynamic magnetic resonance imaging has become a favorable tool for visualizing articulatory deformations and providing quantitative insights into speech activities over time. Based on this modality, we propose a workflow of image analysis techniques to uncover potential deformation variations in the human tongue caused by changes in phonological environment, effected by altering the placement of velar consonants in utterances. The speech deformations of four human subjects in three different consonant positions were estimated from magnetic resonance images using a spatiotemporal tracking method and then warped via image registration into a common space, a dynamic atlas constructed using four-dimensional alignment, for normalized quantitative comparison. Statistical tests and principal component analyses were conducted on the magnitude of deformations, consonant-specific deformations, and internal muscle strains. The results revealed an overall decrease in deformation intensity following the initial consonant production, indicating potential muscle adaptation behavior at a later temporal position within a speech utterance.

Ongoing advancements in modern medical imaging technologies, such as ultrasound, computed tomography (CT), and magnetic resonance imaging (MRI), have provided speech researchers with a variety of tools for visualization and analysis (Al-Hammuri et al., 2022; Koike et al., 2017; Lim et al., 2019). MRI, in particular, has proven to be a reliable, noninvasive tool for capturing articulatory deformations, enabling quantitative assessments of speech imaging data in motor control studies (Janssen and Mendieta, 2020; Xing et al., 2016). Recently, faster and more robust dynamic MRI methods have been proposed, allowing for real-time visualization of speech motion (Fu et al., 2017; Jin et al., 2023; Lim et al., 2019; Ramanarayanan et al., 2018). Additionally, spatiotemporal atlas-based methods now enable four-dimensional (4D) assessments of articulatory motion within study populations (Woo et al., 2018; Woo et al., 2019; Xing et al., 2021). A primary advantage of an atlas approach is its ability to facilitate subject normalization and quantify variability caused by differences in tongue shape, physical space, acquisition parameters, and speech tempo (Duchateau et al., 2011; Roland et al., 1994). Moreover, machine learning methods, such as dimensionality reduction and matrix factorization, can be applied to the aligned data to identify the unique characteristics of each individual within the atlas space (Xing et al., 2019). Notably, another benefit of a dynamic atlas lies in the spatiotemporal alignment process in 4D (three dimensions plus time), which minimizes anatomical tongue variations, standardizing them to a common shape, and temporally aligns the speech rhythms of different subjects. Consequently, all subjects' critical speech waveform positions are synchronized in the common temporal space, enabling individualized statistical comparisons (Woo et al., 2019; Xing et al., 2023).

These technologies have generated interest in characterizing subject-specific phonetic differences during speech. In phonology, the phonetic environment of a speech sound refers to the influence of adjacent and surrounding sounds on its articulation (Hayes, 2011). Variations in syllable placement and phonetic environments impact speech production, a central focus for speech-language pathologists and motor control researchers (Birkholz et al., 2011; Coker, 1976; Ingram and Babatsouli, 2024; Johnson et al., 1993). Identifying and quantifying these variations is essential for advancing clinical procedures in diagnosing and treating motor speech disorders (Lowit et al., 2014; Niimi, 2000). There has been much effort in studying the biomechanical patterns of tongue movement in velar consonant production (Iskarous, 2005; Kent and Moll, 1972; Perkell, 1969; Perrier et al., 2003). However, to date, no study from an imaging perspective has quantitatively characterized the subtle subject-specific variations as they manifest in temporal muscular deformations.

Even with the availability of all imaging and alignment tools, distinguishing subtle motion variations in different phonetic environments remains challenging. In this work, we propose a method to address this issue and demonstrate its effectiveness through a comprehensive analysis of tongue articulations during velar consonant production. Following a previously proposed 4D atlas construction protocol, we estimate the tissue deformations of four human subjects, each producing the consonant /k/ in three different utterance positions within a word. These deformations are represented in the form of three-dimensional (3D) deformation fields derived from a sequence of dynamic MR images. All fields are subsequently warped into a common atlas space constructed using a corresponding sequence of cine dynamic MR images, where one subject is selected as the common reference. We conduct statistical tests on the magnitude of the deformation fields as well as the inferior-superior (inf-sup) component of the deformation fields—indicating the intensity of consonant /k/-specific motions—to observe potential deformation differences subject to phonetic environment changes. Furthermore, internal tissue strain is computed to measure local compression and expansion along the inf-sup direction, followed by a principal component analysis (PCA) performed on the strains of all subjects. Results from the motion fields and strains reveal a smaller amount of overall deformation in the post-syllabic /k/ than in its initial pre-syllabic production, allowing it to be classified into a separate group in the PCA space.

The specific aims of this study were (1) to demonstrate the advantage of dynamic MRI in capturing subtle deformation differences in speech; (2) to identify and quantify these features in an atlas-based framework; and (3) to characterize tongue deformation differences during velar consonants in three different phonetic contexts. All of these will help reveal potential muscle adaptation or relaxation processes as consonants are produced over time. This approach also shows the capability to extract complex and concealed tissue motion features and uncover unique syllable characteristics from a notably high-dimensional imaging dataset.

Four healthy subjects participated in the study in accordance with the Institutional Review Board (IRB) protocol HCR-HP-00042060-16 of the University of Maryland, Baltimore. All MRI sessions were conducted on a Siemens Prisma 3T system (Siemens Medical Solutions, Erlangen, Germany) using a 64-channel head and neck coil. Each scanning session included two 1-s speech tasks pronouncing the utterances “akouk” (/ə ku:k/) and “asouk” (/ə su:k/), performed in a series of controlled repeating motion cycles. Each 1-s speech cycle was followed by a 1-s breathing cycle. Multiple repetitions were employed to ensure full coverage of the k-space for each specific image slice at each time frame. Meanwhile, a specially designed audio signal was played through headphones to the subject inside the scanner: a rhythmic metronome prompt with a 1-s repetition cycle that played “beeps” at the desired phonetic locations. Each subject was pretrained to speak along with this background cue by carefully timing their vowel and consonant positions, ensuring consistency across all repetitions. Eventually, a dynamic collection of parallel sagittal slices and another collection of parallel axial slices covering the tongue region were acquired to capture tissue motion in all three directions (anterior-posterior and inf-sup components in sagittal slices and the left-right component in axial slices). Cine and tagged MR modalities were collected, leading to a typical ∼75-min session, including resting time, for each subject. Image parameters were set as follows: field of view, 240 × 240 mm; image resolution, 1.9 × 1.9 mm; slice thickness, 6.0 mm; and frame rate, 26 frames/s. Note that for one speech task, multiple repetition cycles collectively yielded one 26-frame sequence representing an averaged motion cycle, as each single repetition could effectively scan only a small part of the k-space.
Thus, unlike high-speed cine MR modalities that capture a real-time video of the entire speech, tagged dynamic MRI was not affected by high-speed sampling artifacts such as aliasing or stroboscopic effects.

The two speech utterances were specifically designed to contain three pronunciations of the consonant /k/, as was planned for this study. For “akouk,” the anticipated tongue positioning throughout the utterance included transitions from initialization to /ə/—going to the neutral schwa position, from /ə/ to /k/—rising to reach the palate, from /k/ to /u:/—lowering back, and from /u:/ to /k/—rising again to reach the palate. The whole utterance included two upward thrusts for /k/ along the inf-sup direction (+z motion component by the scanner coordinates), where the second one occurred shortly after the first in less than 1 s. Similarly, for “asouk,” the anticipated tongue positioning included transitions from initialization to the schwa, from /ə/ to /s/—moving forward with a stretched tongue tip, from /s/ to /u:/—retracting backward, and from /u:/ to /k/—rising to reach the palate. Notably, the difference for this /k/ consonant was that it followed a forward motion instead of repeating a leading upward motion. We seek to identify any variation among the productions of these three /k/ consonants in the tissue deformations.

As previously mentioned, the tagged MR slice sequences were acquired in two directions—sagittal and axial—to include all motion components from the three cardinal directions. An example of the tagged data in the mid-sagittal slice for all four subjects at a selected time frame is illustrated in Fig. 1(a). Throughout speech, the tags underwent distortion over time in tandem with the tongue's deformation, storing local motion components within the deformed tag patterns, a fundamental characteristic of tagged MRI (Parthasarathy et al., 2007). For each subject s in speech task 1 “akouk,” we apply the phase vector incompressible registration algorithm to estimate a 3D deformation field of the tongue at each time frame (Xing et al., 2017), denoted as us1(X,t), representing a volume of 3D vectors sampled at image coordinates X and time frame t. Its associated transformation function μs1(X,t) is described as
μs1(X,t) = X + us1(X,t). (1)
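To make the relationship concrete, a transformation of the form μ(X) = X + u(X) can be applied to an image volume by resampling at the deformed coordinates. The sketch below is a minimal NumPy/SciPy illustration of that warping step, not the phase vector incompressible registration algorithm itself; the function name and array shapes are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def apply_deformation(image, u):
    """Warp `image` by the transformation mu(X) = X + u(X).

    image : (nx, ny, nz) array, one dynamic MR frame
    u     : (nx, ny, nz, 3) array, displacement field in voxel units
    """
    grid = np.stack(np.meshgrid(*[np.arange(n) for n in image.shape],
                                indexing="ij"), axis=-1)
    mu = grid + u  # transformation = identity + displacement
    coords = [mu[..., k] for k in range(3)]
    # Trilinear resampling of the image at the deformed positions.
    return map_coordinates(image, coords, order=1, mode="nearest")
```

A zero displacement field reproduces the input frame, while a unit displacement along the first axis shifts the sampled content by one voxel.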
FIG. 1.

(Color online) (a) MR image slices of four subjects in the mid-sagittal plane and the combined cine atlas during the schwa and (b) calculated 3D deformation fields in the tongue region at the first /k/ during “akouk” of the four subjects are shown. The fields are displayed with multiple layer depths. Darker vectors represent more inf-sup motion while lighter vectors represent more anterior-posterior motion.

Note that the image coordinates X are samples from the subject's original physical space where the MR data are acquired, as opposed to a common atlas space. Consequently, Fig. 1(b) illustrates the corresponding deformation fields for all subjects, each displaying a distinct tongue shape, as they are currently independent and unnormalized. Similarly, for speech task 2 “asouk,” us2(X,t) and μs2(X,t) can be computed in the same manner.

Figure 1(b) presents a specific time frame during which the first /k/ in “akouk” is pronounced. An upward motion is evident in all subjects, indicating the tongue's tip reaching toward the palate. Aside from the fact that all four tongues exhibit varying shapes and coordinates, it is also noticeable that there are variations in the intensity and distribution of the deformation fields. For instance, subject 1 displays predominantly smaller motion except at the top of the tongue, whereas subject 2 exhibits overall greater motion. This clearly illustrates that conducting quantitative comparisons is infeasible without intersubject spatiotemporal normalization.

For the study population, we follow the core principles of previously proposed 4D atlas construction methods (Woo et al., 2019; Xing et al., 2021), which encompass two main steps: temporal alignment and spatial alignment. The data acquisition process also yields a paired sequence of untagged cine MR images that capture anatomical details. A speech-language expert manually identified the time instants of all critical syllables—namely, the schwa, the pre- and post-syllabic /k/, and the /s/—for each subject. Images at these critical time frames serve as anchor points within a designated reference temporal space (using subject 1 as the reference in this study), whereas all other subjects' cine images in between the anchor points are interpolated based on their original cine data. Further mathematical details can be found in Xing et al. (2021).

For spatial alignment, at each aligned time frame, all subjects' cine images are fused together to construct one atlas frame. The fusion process involves multiple image registration-based warping steps, employing the symmetric image normalization method (Avants et al., 2008). We choose any subject as the initial template (in this case, subject 1). We establish a warp from the initial template to each subject. The mean of all these warp transformations is then applied to the initial template, creating a more refined template. We iteratively repeat this process to find updated warps and updated templates until convergence. All final warps are averaged to create the final template—the atlas space, which provides an unbiased, centered, and averaged spatial position for all subjects to align. Repeating the spatial alignment process for each time frame yields a 4D dynamic atlas sequence. Additional details can be found in Woo et al. (2019). An exemplar of such an atlas at the schwa time frame is shown in Fig. 1(a).
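The iterative template-building loop can be sketched in miniature. In this toy version, a center-of-mass translation stands in for the symmetric diffeomorphic (SyN) registration actually used, purely to show the register, average, and refine iteration; all names and the 2D test images are illustrative.

```python
import numpy as np
from scipy.ndimage import shift, center_of_mass

def estimate_translation(template, subject):
    # Toy stand-in for registration: the translation that aligns centers
    # of mass. The actual pipeline uses diffeomorphic SyN warps (ANTs).
    return np.array(center_of_mass(subject)) - np.array(center_of_mass(template))

def build_template(images, n_iter=5):
    template = images[0].copy()  # any subject serves as the initial template
    for _ in range(n_iter):
        # Register the current template to each subject, then apply the
        # mean of the warps to move the template toward an unbiased center.
        warps = [estimate_translation(template, img) for img in images]
        template = shift(template, np.mean(warps, axis=0), order=1)
    return template
```

With full deformation fields in place of translations, repeating this procedure at every temporally aligned frame yields the 4D atlas sequence.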

Once the atlas is established, all tongues' deformation fields can be warped into this common space. A warping transformation is found by applying the standard 3D diffeomorphic image registration between each subject's cine image and the atlas using the ANTs registration tool (Avants et al., 2009), yielding the warp function ϕs1(·) and its symmetric inverse warp ϕs_inv1(·). It was mathematically established (Ehrhardt et al., 2011) that the deformation fields in the atlas space can be obtained through the composition
νs1(Xa,t) = ϕs1 ∘ μs1 ∘ ϕs_inv1(Xa,t), (2)
where νs1(Xa,t) = Xa + vs1(Xa,t) is the deformation field defined on the grid location Xa in the atlas space. Traversing all subjects s ∈ {1, 2, 3, 4} warps all deformation fields into the atlas space. Figure 2 shows such fields, demonstrating uniformity in tongue shape and spatial coordinates across all subjects. Note that vs2(Xb,t), the deformation fields of speech task 2 aligned on the grid location Xb in the atlas space for speech task 2, are found using similar procedures.
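The transport step above amounts to composing transformations stored as discrete displacement fields, where sampling the outer displacement at the points moved by the inner one requires interpolation. A minimal sketch, assuming voxel-unit fields of shape (nx, ny, nz, 3); the function name is an illustrative assumption.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def compose(disp_outer, disp_inner):
    """Displacement field of f∘g, where f(X) = X + disp_outer(X) and
    g(X) = X + disp_inner(X); both fields are (nx, ny, nz, 3) arrays."""
    shape = disp_inner.shape[:-1]
    grid = np.stack(np.meshgrid(*[np.arange(n) for n in shape],
                                indexing="ij"), axis=-1)
    g = grid + disp_inner                      # where g sends each voxel
    coords = [g[..., k] for k in range(3)]
    # Sample the outer displacement at the moved points (trilinear).
    sampled = np.stack([map_coordinates(disp_outer[..., k], coords,
                                        order=1, mode="nearest")
                        for k in range(3)], axis=-1)
    return disp_inner + sampled

# The three-way composition is then two nested calls, e.g.
# disp_nu = compose(disp_phi, compose(disp_mu, disp_phi_inv))
```

For constant fields the composition reduces to vector addition, which gives a quick sanity check.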
FIG. 2.

(Color online) Deformation fields of the tongue for four subjects when articulating /k/ in three distinct syllable positions. All four tongues are normalized in the atlas space, sharing the same shape and coordinates. The fields are plotted in three dimensions with multiple layers of depths. Darker vectors represent more inf-sup motion and lighter vectors represent more anterior-posterior motion.

One remaining alignment issue pertains to accommodating variations observed across different speech tasks. Even for the same subject, because a brief break was taken between the scanning sessions of the two speech tasks, changes in tongue position could introduce further discrepancies. This necessitates the independent construction of one atlas for each speech task. As before, ANTs registration was used to find the warp function ϕ(·) and its symmetric inverse warp ϕ_inv(·) between the two atlases, using the atlas of speech task 1 as the reference. Deformation fields for speech task 2 in the reference atlas space are computed using the composition
ωs2(Xa,t) = ϕ ∘ νs2 ∘ ϕ_inv(Xa,t), (3)
where ωs2(Xa,t) = Xa + ws2(Xa,t) is the deformation field defined on the grid location Xa in the reference atlas space. By traversing all s, all fields in speech task 2 are subsequently warped into the same space as those of task 1.

Ultimately, with all deformation fields warped and aligned (see Fig. 2), quantitative analyses, such as statistical tests, can be conducted on the mathematical properties of these fields. This study particularly focuses on the magnitude of deformation vectors and the vertical inf-sup deformation component that signifies the intensity of the primary upward motion for /k/.

Strain in tissue deformation analysis is a commonly used measure that quantifies local shape changes caused by applied forces (Schlenker et al., 1978). It serves as a reliable indicator of potential internal muscle activation. Strain can be calculated from the deformation fields in tensor form, expressing the ratio of local length change by projecting the tensors along specific directions of interest. In practice, to simplify the description, we collectively denote all of the aligned deformation fields vs1(Xa,t) and ws2(Xa,t) as u(Xa). The Lagrangian strain tensor in the common atlas space Xa is derived as
E = (1/2)(∇u + ∇uᵀ + ∇uᵀ∇u), (4)
and the Lagrangian strain tensors can then be projected onto any direction of interest d(Xa), such as the inf-sup direction, using the quadratic form
ed(Xa) = d(Xa)ᵀ E(Xa) d(Xa). (5)

In addition to projection, the three principal strains serve as indicators of the extent of the largest expansion and compression in the local tissue deformation. They can be computed by identifying the three principal values of the Lagrangian strain tensor.
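Given a discrete displacement field, the strain tensor, its directional projection, and the principal strains can all be computed with finite differences; the following is a sketch assuming voxel-unit fields of shape (nx, ny, nz, 3), with function names chosen for illustration.

```python
import numpy as np

def lagrangian_strain(u, spacing=(1.0, 1.0, 1.0)):
    """Per-voxel Lagrangian strain E = 1/2 (grad u + grad u^T + grad u^T grad u).

    u: (nx, ny, nz, 3) displacement field; returns (nx, ny, nz, 3, 3).
    """
    # grad[..., i, j] = d u_i / d X_j via central differences
    grad = np.stack([np.stack(np.gradient(u[..., i], *spacing), axis=-1)
                     for i in range(3)], axis=-2)
    gT = np.swapaxes(grad, -1, -2)
    return 0.5 * (grad + gT + np.einsum('...ki,...kj->...ij', grad, grad))

def project_strain(E, d):
    """Quadratic-form projection of E along a direction d (e.g., inf-sup)."""
    d = np.asarray(d, float) / np.linalg.norm(d)
    return np.einsum('i,...ij,j->...', d, E, d)

# Principal strains are the eigenvalues of the symmetric tensor per voxel:
# principal = np.linalg.eigvalsh(E)   # shape (nx, ny, nz, 3)
```

For example, a uniform 10% stretch along z yields a projected strain of 0.1 + 0.1²/2 = 0.105 along the inf-sup direction at every voxel.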

Unlike deformation fields, strain is more of a “hidden” feature than a directly observable deformation. The value of strain is a ratio of local length change; thus, it is not directly comparable across subjects in absolute terms. Instead, we employ PCA (Wold et al., 1987) to simplify the strain data and reduce its dimensions, finding a major principal component (PC) space. PCA is a statistical technique that transforms data from a high-dimensional space into a lower-dimensional space while preserving as much variability (key features carrying information) as possible. As an unsupervised machine learning algorithm, it excels at identifying and revealing hidden features wrapped within high-dimensional data, which is precisely the character of our estimated deformation field dataset. At time frame t, we stack the strain values ets(Xa) of subject s into a column vector ets. PCA is applied to the matrix A = [et1, et2, et3, et4], yielding a new 3D PC space characterized by new PC directions [k1, k2, k3]. Typically, the first one or two PCs encapsulate most of the PC energy. By re-projecting all strain values onto the PC directions, latent features hidden within the projection values can become evident.
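The stacking-and-PCA step can be sketched with an SVD; each subject's flattened strain map forms one column of A, and with four subjects the centered data span at most three PC directions. The function name and shapes below are illustrative assumptions.

```python
import numpy as np

def pca_project(A, n_pc=2):
    """PCA of A, whose columns are the subjects' flattened strain maps.

    Returns per-subject coordinates along the first n_pc PC directions
    and the fraction of total variance each of those PCs explains.
    """
    Ac = A - A.mean(axis=1, keepdims=True)   # center each feature (voxel)
    U, S, _ = np.linalg.svd(Ac, full_matrices=False)
    scores = (U[:, :n_pc].T @ Ac).T          # (n_subjects, n_pc)
    var_frac = S**2 / np.sum(S**2)
    return scores, var_frac[:n_pc]
```

For rank-one data (all subjects proportional to one strain pattern), the first PC captures essentially all of the variance, which serves as a sanity check.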

To simplify our description, we categorize the three locations of the consonant /k/ into three groups: group 1, the pre-syllabic first /k/ in “akouk”; group 2, the post-syllabic second /k/ in “akouk”; and group 3, the single /k/ in “asouk,” which is also post-syllabic. Visual examination of the results of the spatiotemporal alignment can be directly conducted using Fig. 2. Similar to Fig. 1(b), the fields are 3D-rendered and shown from the left side of the tongue with multiple vector layers in depth. The tongue is oriented to the left in the setup, showing evident upward motion associated with producing consonant /k/. Qualitative assessment of Fig. 2 reveals that despite normalizing all subjects into the same atlas space with identical shapes, their range of motion varies depending on the subject. However, each subject tends to maintain a consistent deformation pattern. For example, subject 1 predominantly uses the top part of the tongue while subject 2 employs more of their tongue body, including larger tongue top movements. Implications related to coarticulation will be discussed in Sec. III B. It can be observed that a pre-syllabic /k/, as in group 1, typically demands larger deformations compared to a post-syllabic /k/ as in groups 2 and 3 for all subjects. This is particularly evident in subjects 1, 2, and 3, whose fields in the first row of Fig. 2 exhibit higher intensity. However, the intensity difference in their second and third rows is not as pronounced, suggesting similarities in producing a post-syllabic /k/ following the /u:/ vowel.

To quantitatively compare the deformation fields from different consonant /k/ positions, we extracted the magnitude of all deformation vectors and plotted them in Fig. 3, grouping them by consonant and separating each subject. Note that individual-level data in Fig. 3 are from all of the image voxels, which are used to explore the high heterogeneity of deformation magnitudes with statistics aggregated at each time point in the entire tongue region. Across all subjects, group 1 consistently exhibits larger magnitudes than groups 2 and 3. However, when comparing groups 2 and 3, the magnitude difference is less clear, especially from the mean (represented by the cross symbols) and median (middle bars of each box) values. Specifically, group 2 shows larger median values than group 3 for subjects 2 and 3 but lower or similar medians for subjects 1 and 4. The mean, median, and standard deviation of the deformation magnitude are listed in Table I.

FIG. 3.

(Color online) Boxplot of all image voxels' deformation field magnitude of four subjects when articulating /k/ in three phonetic environments. Each subject is shown with group 1 (pre-syllabic /k/ of “akouk”), group 2 (post-syllabic /k/ of “akouk”), and group 3 (/k/ of “asouk”).

TABLE I.

The mean, standard deviation (st.d.), and median of the deformation magnitude of all three /k/ consonants in four subjects. The unit of deformation is mm. G1, pre-syllabic /k/ in “akouk”; G2, post-syllabic /k/ in “akouk”; and G3, /k/ in “asouk.” Bold entries indicate the highest value among the three groups.

Subject 1 Subject 2 Subject 3 Subject 4
G1 G2 G3 G1 G2 G3 G1 G2 G3 G1 G2 G3
Mag (mean)  2.7  2.5  2.5  4.0  3.7  3.3  4.5  3.5  3.0  2.8  2.6  2.4 
Mag (st.d.)  1.5  1.5  1.7  2.2  2.0  1.9  2.3  1.7  1.8  1.4  1.4  1.5 
Mag (median)  2.7  2.3  2.6  4.4  4.0  3.7  4.3  3.7  3.3  2.8  2.4  2.4 

We conducted hypothesis tests to assess whether any of the /k/ sample groups were drawn from distinct distributions. Specifically, we tested the hypothesis that group 1 (pre-syllabic /k/) requires a significantly more intense deformation magnitude than group 2 (post-syllabic /k/). The corresponding null hypothesis was that there is no statistical difference between the groups. Because every subject had been aligned in the atlas space, we used the deformation magnitude at every voxel position as samples. Similarly, we conducted the other combinations of t-tests, comparing groups 1 and 3 (pre-syllabic /k/ of “akouk” and /k/ of “asouk”) and groups 2 and 3 (post-syllabic /k/ of “akouk” and /k/ of “asouk”), with the hypothesis that the former requires a significantly more intense deformation magnitude than the latter within each pair. In addition, analysis of variance (ANOVA) was performed across all three /k/ groups to determine whether at least one group differed significantly from the others. We considered a resulting p-value below a 0.05 threshold to indicate significance. To enhance statistical power, we employed one-sided t-tests, assuming that the first group in each pair had larger overall values than the second. The results from all test combinations are listed in Table II.
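These tests map directly onto SciPy routines. The sketch below uses synthetic per-voxel magnitudes (means and spreads loosely modeled on Table I) as stand-ins, not the study data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-ins for one subject's per-voxel deformation
# magnitudes (mm); real samples come from the aligned atlas-space fields.
g1 = rng.normal(4.5, 2.3, 5000)  # group 1: pre-syllabic /k/ in "akouk"
g2 = rng.normal(3.5, 1.7, 5000)  # group 2: post-syllabic /k/ in "akouk"
g3 = rng.normal(3.0, 1.8, 5000)  # group 3: /k/ in "asouk"

# One-sided two-sample t-test: H1 is that group 1 exceeds group 2.
t12, p12 = stats.ttest_ind(g1, g2, alternative="greater")

# ANOVA across all three groups: does at least one group differ?
f, p_anova = stats.f_oneway(g1, g2, g3)
```

With thousands of voxel samples per group, even modest mean differences produce very small p-values, which is consistent with the magnitudes reported in Table II.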

TABLE II.

p-values and effect sizes (EF) from statistical tests comparing deformation magnitude of three /k/ consonants in four subjects. ANOVA is used for three-group comparison and t-test is used for pairwise comparison between each two-group pair. Group 1, the first /k/ in “akouk”; group 2, the second /k/ in “akouk”; group 3, the /k/ in “asouk.”

ANOVA groups 1, 2, and 3 Two-sample t-test one-sided
Groups 1 and 2 Groups 1 and 3 Groups 2 and 3
Sub 1 p  4.0 × 10−59  5.6 × 10−36  6.8 × 10−54  9.5 × 10−5 
EF  0.01 (η²)  12.5  15.4  3.7 
Sub 2 p  1.3 × 10−253  7.9 × 10−33  1.8 × 10−247  7.8 × 10−110 
EF  0.02 (η²)  11.9  33.8  22.3 
Sub 3 p  <4.9 × 10−324  <4.9 × 10−324  <4.9 × 10−324  1.8 × 10−185 
EF  0.09 (η²)  43.8  67.0  29.2 
Sub 4 p  2.4 × 10−136  6.0 × 10−33  6.9 × 10−134  2.3 × 10−40 
EF  0.01 (η²)  11.9  24.7  13.3 

In addition to analyzing the magnitude of deformation fields, we also examined the vertical inf-sup component along the z axis, representing the elevating motion for producing /k/. The inf-sup deformation component is a prominent characteristic of consonant /k/, and the result is plotted in Fig. 4, categorized by consonant groups and subjects. Consistently, group 1 exhibits larger vertical deformations compared to groups 2 and 3. However, distinguishing between groups 2 and 3 based on their mean (crosses) and median (middle bars) values remains inconclusive. The mean, median, and standard deviation of the inf-sup deformation are listed in Table III. Additionally, the same statistical tests were applied to the inf-sup deformation components: ANOVA and one-sided two-sample t-tests were performed, and their corresponding results are detailed in Table IV.

FIG. 4.

(Color online) Boxplot of all image voxels' vertical inf-sup deformation displacement of four subjects when articulating /k/ in three distinct syllable positions. Each subject is categorized with group 1 (pre-syllabic /k/ in “akouk”), group 2 (post-syllabic /k/ in “akouk”), and group 3 (/k/ in “asouk”).

TABLE III.

The mean, standard deviation (st.d.), and median of the inf-sup deformation of all three /k/ consonants in four subjects. The unit of deformation is mm. G1, pre-syllabic /k/ in “akouk”; G2, post-syllabic /k/ in “akouk”; G3, /k/ in “asouk.” Bold entries indicate the highest value among the three groups.

Subject 1 Subject 2 Subject 3 Subject 4
G1 G2 G3 G1 G2 G3 G1 G2 G3 G1 G2 G3
Inf-sup (mean)  2.3  1.8  1.8  3.4  2.6  2.8  3.8  2.8  2.3  2.5  2.2  2.2 
Inf-sup (st.d.)  1.5  1.4  1.5  1.9  1.6  1.7  2.5  1.6  1.5  1.4  1.4  1.5 
Inf-sup (median)  2.1  1.6  1.6  3.8  2.9  3.2  3.5  2.7  2.6  2.5  1.9  2.2 
TABLE IV.

p-values and effect sizes (EF) from statistical tests comparing the inf-sup deformation of three /k/ consonants in four subjects. ANOVA is used for three-group comparison and t-test is used for pairwise comparison between each two-group pair. Group 1, the first /k/ in “akouk”; group 2, the second /k/ in “akouk”; group 3, the /k/ in “asouk.”

ANOVA groups 1, 2, and 3 Two-sample t-test one-sided
Groups 1 and 2 Groups 1 and 3 Groups 2 and 3
Sub 1 p  2.4 × 10−90  1.1 × 10−223  1.2 × 10−214  0.35 
EF  0.02 (η²)  32.1  31.5  0.4 
Sub 2 p  <4.9 × 10−324  <4.9 × 10−324  9.1 × 10−204  1.00 
EF  0.04 (η²)  43.3  30.6  −12.9 
Sub 3 p  <4.9 × 10−324  <4.9 × 10−324  <4.9 × 10−324  7.6 × 10−161 
EF  0.10 (η²)  48.2  70.2  27.1 
Sub 4 p  5.2 × 10−140  2.1 × 10−123  1.4 × 10−93  0.99 
EF  0.01 (η²)  23.7  20.5  −2.5 

Finally, we computed the Lagrangian strain tensors and examined their projected strain values from two perspectives: the principal strains derived from singular value decomposition and the strain projected along the vertical inf-sup direction. In parallel with the deformation field analyses, subjecting the strain values to the same statistical tests did not reveal any significant differences between the consonant groups. As stated in Sec. II, this outcome was anticipated due to the intrinsic nature of strain, which represents ratios rather than absolute quantities. Therefore, we employed PCA to explore potential latent features within the strain data.

Regarding the principal strains, the newly established PC space spanned three dimensions with weights of 1104.2 for PC1, 427.4 for PC2, and 62.3 for PC3. These weights represent the amount of data variance each PC direction accounted for, where PC1 and PC2 together encompassed 96.1% of the total variance. We projected the principal strains of all three consonant groups onto this new space. Figure 5 illustrates that group 1 exhibits a larger dispersion of data points, whereas groups 2 and 3 are more compactly clustered. We computed the variance within each group of four data points, measuring the extent of data point scattering from their mean center value. Group 1 returned a variance of 1195.4, the largest among the three, contrasting with those of group 2 (512.6) and group 3 (63.2). Additionally, we conducted PCA on the strain along the inf-sup direction, which measures the localized tissue change ratio along the z axis. Figure 5 depicts the results, and Table V lists the PC space weights and data point variances.

FIG. 5.

(Color online) PCA results of deformation strains performed on two scenarios: using all three principal strains and using strains projected along the inf-sup direction. Group 1 as dots, pre-syllabic /k/ in “akouk”; group 2 as squares, post-syllabic /k/ in “akouk”; group 3 as diamonds, the /k/ in “asouk.” Each group has four markers from the four subjects projected onto the first two PCs.

TABLE V.

Weight of the first two PCs and data point variance of the three /k/ consonants in four subjects. PCA is conducted for four data types: principal strain in three dimensions, strain projected along the inf-sup direction, deformation magnitude, and deformation along the inf-sup direction. G1, pre-syllabic /k/ in “akouk”; G2, post-syllabic /k/ in “akouk”; G3, the /k/ in “asouk.”

                     Strain: all three principals   Strain: along inf-sup   Deformation magnitude   Inf-sup deformation
PCs 1 and 2 weight   96.1%                          96.2%                   96.4%                   95.7%
G1 variance          1195.4                         517.4                   3.0 × 10⁴ mm²           2.7 × 10⁴ mm²
G2 variance          512.6                          219.0                   7.5 × 10³ mm²           8.4 × 10³ mm²
G3 variance          63.2                           38.0                    1.3 × 10³ mm²           1.9 × 10³ mm²

PCA successfully uncovered latent data patterns, prompting us to reanalyze the deformation field data by performing two separate PCAs: one on the deformation magnitude and another on the vertical deformation component. The results are illustrated in Fig. 6, and Table V lists their respective PC space weights and data point variances as measures of dispersion. Similar to the strain analysis, we observed that group 1 exhibited larger dispersion and variances compared to groups 2 and 3.

FIG. 6.

(Color online) PCA results of deformation fields performed on two scenarios: using the deformation magnitude and using the inf-sup deformation component. Group 1 as circles, pre-syllabic /k/ in “akouk”; group 2 as squares, post-syllabic /k/ in “akouk”; group 3 as diamonds, the /k/ in “asouk.” Each group has four markers from the four subjects projected onto PC1 and PC2.


The key strength of the proposed method lies in its ability to facilitate direct numerical calculation and quantitative comparison among subjects acquired at different times, in different spaces, and across different speech tasks. As depicted in Fig. 1(b), despite varying tongue shapes among the four subjects, all exhibit a general upward motion during production of the consonant /k/. This functional similarity forms the basis for potential statistical analyses when we treat the subjects as varying samples from a general distribution. The primary variation observed lies in the range and distribution of the deformation fields, highlighting subject-specific behaviors; for instance, subject 1 predominantly uses the top of the tongue, while subject 2 engages the entire tongue. Once normalized, as Fig. 2 demonstrates, our atlas-based method aligns all tongues into the same space with a uniform shape while preserving subject-specific behaviors. A direct similarity between the deformation fields in Figs. 1 and 2 persists even after warping into the atlas space, demonstrating that the atlas maintains consistent deformation patterns across the change of space, which is crucial for subsequent analyses.

The leading factor that induces subject-specific speech behaviors during the same speech task remains uncertain. Dialect, compensation, speaking habits, muscle anatomy, etc. could all contribute to variations in the deformation fields. By closely observing the dataset over time, we can identify the impact of speaking habits in certain cases, such as with subject 2. Their tongue tends to retract more intensely to reach the schwa position and prepare for the subsequent utterance. Because temporal alignment matches all schwa pronunciations with time frame 1 as the reference, a more retracted schwa position mathematically results in larger deformations across the entire tongue in all subsequent time frames. It is important to note that the volume-preserving properties of the motion estimation algorithm ensure that the intensity range of the deformation fields remains unchanged after being warped into the atlas space. This preserves their original length in millimeters, enabling consistent intersubject comparisons.

It is worth noting that although our study is limited to the tongue region due to the tag placement in MRI, Fig. 2 potentially reveals information beyond the tongue itself. Velar coarticulation occurs when velar stop consonants and neighboring vowels are produced differently depending on their context (McClean, 1973). Coarticulation between vowels and velar stop consonants can interact with syllable stress, and the consonant /k/ is a typical example (Al-Bamerni and Bladon, 1982; Frisch and Wodzinski, 2016). The coarticulatory patterns suggest that each velar-vowel combination results in a distinct closure location for the velum. Figure 2 shows largely distinctive patterns among all four subjects when pronouncing the same consonant, which could indicate the impact of individual behaviors. Group 1 is largely different from the other two. In English, a voiceless stop consonant, e.g., /p/, /t/, or /k/, is expressed differently when it is in prestress position, i.e., when it begins a stressed syllable. A one-syllable word in isolation, such as our utterance, is always a stressed syllable. The group 1 /k/ is aspirated and strong due to its position in the syllable, producing a puff of air following the release. This occurs when the tongue is held against the palate more tightly and the tongue body is more compressed; Table III indeed suggests greater compression. For groups 2 and 3, the /k/'s are both in post-syllabic position, yielding a weaker consonant than in group 1: a shorter, unaspirated release. The tongue body is more anterior in the prestress /s/ than in the prestress /k/, which could affect the final /k/ tongue position. However, the intervening /u:/ may overcome these differences with its own requirements.

Having justified our experimental setup, our primary observation is that a pre-syllabic consonant /k/ (group 1) requires more effort and exhibits larger deformations compared to a post-syllabic /k/ (group 2), which is produced later, following another mid-utterance vowel. Even when the post-syllabic /k/ (group 3) follows a different preceding consonant /s/, its deformation intensity is reduced. This observation is visually confirmed by Fig. 2, which shows larger fields in row 1 than in rows 2 and 3, and quantitatively supported by Figs. 3 and 4, where group 1 exhibits higher values compared to groups 2 and 3. It is worth noting that while the fields of subject 4 may not be as visually prominent as the others, its boxplot confirms the same observation. Statistically, according to Tables II and III, all ANOVA tests confirm significant differences among the three /k/ deformations, and all one-tailed t-tests confirm that the deformation intensity of group 1 is significantly higher than that of groups 2 and 3, respectively.
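The statistical procedure (one-way ANOVA followed by one-tailed t-tests) can be reproduced in outline with SciPy; the samples below are synthetic placeholders standing in for the per-voxel deformation magnitudes of each group:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic deformation magnitudes (mm) for the three /k/ groups;
# the real analysis used the atlas-space deformation fields.
g1 = rng.normal(3.0, 1.0, 200)  # group 1: pre-syllabic /k/ in "akouk"
g2 = rng.normal(2.2, 1.0, 200)  # group 2: post-syllabic /k/ in "akouk"
g3 = rng.normal(2.1, 1.0, 200)  # group 3: the /k/ in "asouk"

# One-way ANOVA: is there any significant difference among the three groups?
f_stat, p_anova = stats.f_oneway(g1, g2, g3)

# One-tailed t-test: does group 1 deform significantly more than group 2?
t_stat, p_one_tailed = stats.ttest_ind(g1, g2, alternative="greater")
```

With these synthetic means, both tests reject the null hypothesis, mirroring the pattern reported in Tables II and III.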

However, the comparison between groups 2 and 3 is less consistent. In terms of deformation magnitude, group 2 consistently shows higher values than group 3 with statistical significance. However, regarding the vertical z-component signifying the /k/ motion, the conclusion varies across subjects. Group 2 exhibits larger inf-sup components than group 3 only for subject 3 but lower inf-sup components for subjects 2 and 4. The p-values for these comparisons are notably high (greater than 0.95), suggesting insignificant differences between the groups. The p-value of 0.35 for subject 1 further complicates comparison between the two groups. In summary, when comparing the two post-syllabic /k/'s, regardless of whether they follow a leading /k/ or a leading /s/, no significant difference in deformation can be conclusively drawn between groups 2 and 3.

On the other hand, the PCA results provide intriguing insights. For both strains and deformation fields, PCA categorizes the three groups distinctly. Table V shows that group 1 consistently exhibits the largest data point variance (scattering) while group 3 displays the smallest. PC1 and PC2 collectively represent approximately 96% of the dataset's variance across all four configurations, emphasizing their significance in Figs. 5 and 6, where the data are projected onto their two-dimensional plane. Visually, groups 2 and 3 are more tightly clustered within the data point range of group 1. Examining the PC spaces further reveals that groups 2 and 3 form distinctly grouped data points, suggesting potentially unique characteristics within each group. These specific characteristics remain unidentified in this study but may be uncovered through deeper data analysis. Note that PCA extracts a key feature space that identifies the most significant patterns in the dataset. The fact that these significant patterns can distinguish the three groups further strengthens the statistical results of the previous tests. In short, PCA proves capable of uncovering hidden features that extend beyond what direct magnitude comparisons and statistical tests can reveal.

The observation regarding changes in strategies when pronouncing the same consonant is open to discussion. One plausible explanation is that the later positioning of the consonant is a direct influencing factor. After completing an utterance, the tongue naturally relaxes or adjusts in anticipation of returning to a neutral position. As a result, the final consonant requires less effort, leading to smaller deformations. This phenomenon appears independent of the initial consonant; whether /k/ or /s/ was previously produced does not affect the reduced effort of the final consonant. However, the vowel directly connecting to /k/ may be another reason for this phenomenon. The more intense /k/ is connected to the schwa, whereas the less intense /k/ is connected to the /u:/ vowel. The tongue positions itself differently for these two vowels (one lower and one higher), forming two different starting configurations. We emphasize that all deformation fields depicted in this work are compared to a common neutral position, not to the vowel before /k/. Thus, the reduced intensity is an absolute observation rather than a relative one, confirming a direct impact of phonetic environment changes on speech production. There could be other contributors to the articulation differences as well. The subjects might have different palate heights, jaw sizes, and other anatomical differences in the oral cavity that could impact the motion; the palate, in particular, has been shown to play an important role in speech sound production (Stone et al., 2019). Finally, factors such as the phonetic context surrounding the /k/, the length of the utterance, and whether the /k/ leads into another vowel may all contribute to these observations. Further exploration with a larger dataset beyond the current four subjects could provide deeper insights into this process. Moreover, refining the design of speech tasks to include more vowels and consonants would help elucidate the effects of additional factors.
This study was also unable to show the coarticulatory effects of the prestress consonant, as those effects are more likely to lie in the anterior-posterior direction. Our goal is to highlight the potential of atlas-based imaging methods to advance research in speech-language studies with greater detail. In the future, with the collection of more complex speech samples from more subjects, the atlas itself may evolve to encompass various utterances (an atlas of atlases), enhancing computational capabilities for studies involving more complex speech tasks and diverse utterance designs.

We presented a workflow of image analysis methods designed to facilitate direct quantification of deformation variations in human speech sound production. Deformation fields of consonant /k/ in three different positions were estimated from dynamic MRI before constructing a spatiotemporal atlas. Statistical tests and PCAs were conducted on the deformation fields and internal muscle strains, revealing an overall decrease in deformation intensity over time. We demonstrated the strength of using dynamic MRI in capturing, identifying, and quantifying subtle deformation changes in speech. The proposed analysis method holds promise for examining complex high-dimensional data structures, offering a valuable tool for studying human articulatory motor control.

Research reported in this publication was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under Award No. R01DC018511. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

The authors have no conflicts to disclose.

All speech scanning sessions for participating subjects were performed in accordance with research proposals approved by the IRB. Informed consent was obtained from all participants in this study.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

1. Al-Bamerni, A., and Bladon, A. (1982). "One-stage and two-stage temporal patterns of velar coarticulation," J. Acoust. Soc. Am. 72(S1), S104.
2. Al-Hammuri, K., Gebali, F., Thirumarai Chelvan, I., and Kanan, A. (2022). "Tongue contour tracking and segmentation in lingual ultrasound for speech recognition: A review," Diagnostics 12(11), 2811.
3. Avants, B. B., Epstein, C. L., Grossman, M., and Gee, J. C. (2008). "Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain," Med. Image Anal. 12(1), 26–41.
4. Avants, B. B., Tustison, N., and Song, G. (2009). "Advanced normalization tools (ANTS)," Insight J. 2(365), 1–35.
5. Birkholz, P., Kroger, B. J., and Neuschaefer-Rube, C. (2011). "Model-based reproduction of articulatory trajectories for consonant–vowel sequences," IEEE Trans. Audio Speech Lang. Process. 19(5), 1422–1433.
6. Coker, C. H. (1976). "A model of articulatory dynamics and control," Proc. IEEE 64(4), 452–460.
7. Duchateau, N., De Craene, M., Piella, G., Silva, E., Doltra, A., Sitges, M., Bijnens, B.-H., and Frangi, A. F. (2011). "A spatiotemporal statistical atlas of motion for the quantification of abnormal myocardial tissue velocities," Med. Image Anal. 15(3), 316–328.
8. Ehrhardt, J., Werner, R., Schmidt-Richberg, A., and Handels, H. (2011). "Statistical modeling of 4D respiratory lung motion using diffeomorphic image registration," IEEE Trans. Med. Imaging 30(2), 251–265.
9. Frisch, S. A., and Wodzinski, S. M. (2016). "Velar–vowel coarticulation in a virtual target model of stop production," J. Phonet. 56, 52–65.
10. Fu, M., Barlaz, M. S., Holtrop, J. L., Perry, J. L., Kuehn, D. P., Shosted, R. K., Liang, Z.-P., and Sutton, B. P. (2017). "High-frame-rate full-vocal-tract 3D dynamic speech imaging," Magn. Reson. Med. 77(4), 1619–1629.
11. Hayes, B. (2011). Introductory Phonology (Wiley, New York).
12. Ingram, D., and Babatsouli, E. (2024). "Cross-linguistic phonological acquisition," in The Handbook of Clinical Linguistics, 2nd ed. (John Wiley & Sons, New York), pp. 407–419.
13. Iskarous, K. (2005). "Patterns of tongue movement," J. Phonet. 33(4), 363–381.
14. Janssen, N., and Mendieta, C. C. R. (2020). "The dynamics of speech motor control revealed with time-resolved fMRI," Cerebral Cortex 30(1), 241–255.
15. Jin, R., Shosted, R. K., Xing, F., Gilbert, I. R., Perry, J. L., Woo, J., Liang, Z.-P., and Sutton, B. P. (2023). "Enhancing linguistic research through 2-mm isotropic 3D dynamic speech MRI optimized by sparse temporal sampling and low-rank reconstruction," Magn. Reson. Med. 89(2), 652–664.
16. Johnson, K., Ladefoged, P., and Lindau, M. (1993). "Individual differences in vowel production," J. Acoust. Soc. Am. 94(2), 701–714.
17. Kent, R. D., and Moll, K. L. (1972). "Cinefluorographic analyses of selected lingual consonants," J. Speech Hear. Res. 15(3), 453–473.
18. Koike, N., Ii, S., Yoshinaga, T., Nozaki, K., and Wada, S. (2017). "Model-based inverse estimation for active contraction stresses of tongue muscles using 3D surface shape in speech production," J. Biomech. 64, 69–76.
19. Lim, Y., Zhu, Y., Lingala, S. G., Byrd, D., Narayanan, S., and Nayak, K. S. (2019). "3D dynamic MRI of the vocal tract during natural speech," Magn. Reson. Med. 81(3), 1511–1520.
20. Lowit, A., Miller, N., and Kuschmann, A. (2014). Motor Speech Disorder: A Cross-Language Perspective (Multilingual Matters, Bristol, UK), Chap. 3.
21. McClean, M. (1973). "Forward coarticulation of velar movement at marked junctural boundaries," J. Speech Hear. Res. 16(2), 286–296.
22. Niimi, M. N. S. (2000). "Changes over time in dysarthric patients with amyotrophic lateral sclerosis (ALS): A study of changes in speaking rate and maximum repetition rate (MRR)," Clin. Linguist. Phonet. 14(7), 485–497.
23. Parthasarathy, V., Prince, J. L., Stone, M., Murano, E. Z., and NessAiver, M. (2007). "Measuring tongue motion from tagged cine-MRI using harmonic phase (HARP) processing," J. Acoust. Soc. Am. 121(1), 491–504.
24. Perkell, J. S. (1969). "Physiology of speech production: Results and implications of a quantitative cineradiographic study," MIT Research Monograph No. 53.
25. Perrier, P., Payan, Y., Zandipour, M., and Perkell, J. (2003). "Influences of tongue biomechanics on speech movements during the production of velar stop consonants: A modeling study," J. Acoust. Soc. Am. 114(3), 1582–1599.
26. Ramanarayanan, V., Tilsen, S., Proctor, M., Töger, J., Goldstein, L., Nayak, K. S., and Narayanan, S. (2018). "Analysis of speech production real-time MRI," Comput. Speech Lang. 52, 1–22.
27. Roland, P. E., Graufelds, C. J., Wǎhlin, J., Ingelman, L., Andersson, M., Ledberg, A., Pedersen, J., Åkerman, S., Dabringhaus, A., and Zilles, K. (1994). "Human brain atlas: For high-resolution functional and anatomical mapping," Human Brain Mapp. 1(3), 173–184.
28. Schlenker, J. L., Gibbs, G. V., and Boisen, M. B. (1978). "Strain-tensor components expressed in terms of lattice parameters," Acta Cryst. A 34(1), 52–54.
29. Stone, M., Gomez, A. D., Zhuo, J., Tchouaga, A. L., and Prince, J. L. (2019). "Quantifying tongue tip shape in apical and laminal /s/: Contributions of palate shape," J. Speech Lang. Hear. Res. 62(9), 3149–3159.
30. Wold, S., Esbensen, K., and Geladi, P. (1987). "Principal component analysis," Chemom. Intell. Lab. Syst. 2(1-3), 37–52.
31. Woo, J., Xing, F., Lee, J., Stone, M., and Prince, J. L. (2018). "A spatio-temporal atlas and statistical model of the tongue during speech from cine-MRI," Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 6(5), 520–531.
32. Woo, J., Xing, F., Stone, M., Green, J., Reese, T. G., Brady, T. J., Wedeen, V. J., Prince, J. L., and El Fakhri, G. (2019). "Speech map: A statistical multimodal atlas of 4D tongue motion during speech from tagged and cine MR images," Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 7(4), 361–373.
33. Xing, F., Jin, R., Gilbert, I., El Fakhri, G., Perry, J., Sutton, B., and Woo, J. (2023). "Quantifying velopharyngeal motion variation in speech sound production using an audio-informed dynamic MRI atlas," in Medical Imaging 2023: Image Processing, edited by O. Colliot and I. Išgum (Curran Associates, Inc., Red Hook, NY), Vol. 12464, pp. 606–611.
34. Xing, F., Jin, R., Gilbert, I. R., Perry, J. L., Sutton, B. P., Liu, X., El Fakhri, G., Shosted, R. K., and Woo, J. (2021). "4D magnetic resonance imaging atlas construction using temporally aligned audio waveforms in speech," J. Acoust. Soc. Am. 150(5), 3500–3508.
35. Xing, F., Stone, M., Goldsmith, T., Prince, J. L., El Fakhri, G., and Woo, J. (2019). "Atlas-based tongue muscle correlation analysis from tagged and high-resolution magnetic resonance imaging," J. Speech Lang. Hear. Res. 62(7), 2258–2269.
36. Xing, F., Woo, J., Gomez, A. D., Pham, D. L., Bayly, P. V., Stone, M., and Prince, J. L. (2017). "Phase vector incompressible registration algorithm for motion estimation from tagged magnetic resonance images," IEEE Trans. Med. Imaging 36(10), 2116–2128.
37. Xing, F., Woo, J., Lee, J., Murano, E. Z., Stone, M., and Prince, J. L. (2016). "Analysis of 3-D tongue motion from tagged and cine magnetic resonance images," J. Speech Lang. Hear. Res. 59(3), 468–479.