Identification and quantification of speech variations in velar production across various phonological environments have always been an interesting topic in speech motor control studies. Dynamic magnetic resonance imaging has become a favorable tool for visualizing articulatory deformations and providing quantitative insights into speech activities over time. Based on this modality, it is proposed to employ a workflow of image analysis techniques to uncover potential deformation variations in the human tongue caused by changes in phonological environments by altering the placement of velar consonants in utterances. The speech deformations of four human subjects in three different consonant positions were estimated from magnetic resonance images using a spatiotemporal tracking method before being warped via image registration into a common space—a dynamic atlas space constructed using four-dimensional alignments—for normalized quantitative comparisons. Statistical tests and principal component analyses were conducted on the magnitude of deformations, consonant-specific deformations, and internal muscle strains. The results revealed an overall decrease in deformation intensity following the initial consonant production, indicating potential muscle adaptation behaviors at a later temporal position in one speech utterance.
I. INTRODUCTION
Ongoing advancements in modern medical imaging technologies, such as ultrasound, computed tomography (CT), and magnetic resonance imaging (MRI), have provided speech researchers with a variety of tools for visualization and analysis (Al-Hammuri , 2022; Koike , 2017; Lim , 2019). MRI, in particular, has proven to be a reliable, noninvasive tool for capturing articulatory deformations, enabling quantitative assessments of speech imaging data in motor control studies (Janssen and Mendieta, 2020; Xing , 2016). Recently, faster and more robust dynamic MRI methods have been proposed, allowing for real-time visualization of speech motion (Fu , 2017; Jin , 2023; Lim , 2019; Ramanarayanan , 2018). Additionally, spatiotemporal atlas-based methods now enable four-dimensional (4D) assessments of articulatory motion within study populations (Woo , 2018; Woo , 2019; Xing , 2021). A primary advantage of an atlas approach is its ability to facilitate subject normalization and quantify variability caused by differences in tongue shape, physical space, acquisition parameters, and speech tempo (Duchateau , 2011; Roland , 1994). Moreover, machine learning methods, such as dimensionality reduction and matrix factorization, can be applied to the aligned data to identify the unique characteristics of each individual within the atlas space (Xing , 2019). Notably, another benefit of a dynamic atlas lies in the spatiotemporal alignment process in 4D (three dimensions plus time), which minimizes anatomical tongue variations, standardizing them to a common shape, and temporally aligns the speech rhythms of different subjects. Consequently, all subjects' critical speech waveform positions are synchronized in the common temporal space, enabling individualized statistical comparisons (Woo , 2019; Xing , 2023).
These technologies have generated interest in characterizing subject-specific phonetic differences during speech. In phonology, the phonetic environment of a speech sound refers to the influence of adjacent and surrounding sounds on its articulation (Hayes, 2011). Variations in syllable placement and phonetic environments impact speech production, a central focus for speech-language pathologists and motor control researchers (Birkholz , 2011; Coker, 1976; Ingram and Babatsouli, 2024; Johnson , 1993). Identifying and quantifying these variations is essential for advancing clinical procedures in diagnosing and treating motor speech disorders (Lowit , 2014; Niimi, 2000). There has been much effort in studying the biomechanic patterns of the tongue movement in velar consonant production (Iskarous, 2005; Kent and Moll, 1972; Perkell, 1969; Perrier , 2003). However, to date, no study from an imaging perspective has quantitatively characterized the subtle subject-specific variations as they manifest in temporal muscular deformations.
Even with the availability of all imaging and alignment tools, distinguishing subtle motion variations in different phonetic environments remains challenging. In this work, we propose a method to address this issue and demonstrate its effectiveness through a comprehensive analysis of tongue articulations during velar consonant production. Following a previously proposed 4D atlas construction protocol, we estimate the tissue deformations of four human subjects, each producing the consonant /k/ in three different utterance positions within a word. These deformations are represented in the form of three-dimensional (3D) deformation fields derived from a sequence of dynamic MR images. All fields are subsequently warped into a common atlas space constructed using a corresponding sequence of cine dynamic MR images, where one subject is selected as the common reference. We conduct statistical tests on the magnitude of the deformation fields as well as the inferior-superior (inf-sup) component of the deformation fields—indicating the intensity of consonant /k/-specific motions—to observe potential deformation differences subject to phonetic environment changes. Furthermore, internal tissue strain is computed to measure local compression and expansion along the inf-sup direction, followed by a principal component analysis (PCA) performed on the strains of all subjects. Results from the motion fields and strains reveal a smaller amount of overall deformation in the post-syllabic /k/ than its initial pre-syllabic production, allowing it to be classified into a separate group in the PCA space.
The specific aims of this study were (1) to demonstrate the advantage of dynamic MRI in capturing subtle deformation differences in speech; (2) to identify and quantify these features in an atlas-based framework; and (3) to characterize tongue deformation differences during velar consonants in three different phonetic contexts. All of these will help reveal potential muscle adaptation or relaxation processes as consonants are produced over time. This approach also shows the capability to extract complex and concealed tissue motion features and uncover unique syllable characteristics from a notably high-dimensional imaging dataset.
II. METHODS
A. Data acquisition and speech task
Four healthy subjects participated in the study in accordance with the Institutional Review Board (IRB) protocol HCR-HP-00042060-16 of University of Maryland, Baltimore. All MRI sessions were conducted on a Siemens Prisma 3T system (Siemens Medical Solutions, Erlangen, Germany) using a 64-channel head and neck coil. Each scanning session includes two 1-s speech tasks pronouncing the utterances “akouk” (/ə ku:k/) and “asouk” (/ə su:k/), performed in a series of controlled repeating motion cycles. Each 1-s speech cycle was followed by a 1-s breathing cycle. Multiple repetitions were employed to ensure full coverage of the k-space for each specific image slice at each time frame. Meanwhile, a specifically designed audio signal was played through the headphone to the subject inside the scanner. It was a rhythmic metronome prompt that had a 1-s repetition cycle which played “beeps” at desired phonetic locations. The subject was pretrained to speak following this background cue by carefully timing their vowel and consonant positions, ensuring consistency among all repetitions. Eventually, a dynamic collection of parallel sagittal slices and another collection of parallel axial slices covering the tongue region were acquired to capture tissue motion in all three directions (anterior-posterior and inf-sup components in sagittal slices and left-right component in axial slices). Cine and tagged MR modalities were collected, leading to a typical ∼75-min session, including resting time for each subject. Image parameters were set as follows: field of view, 240 × 240 mm; image resolution, 1.9 × 1.9 mm; slice thickness, 6.0 mm; and frame rate, 26 frames/s. Note that for one speech task, multiple repetition cycles collectively yielded one 26-frame sequence, representing an averaged motion cycle from all of them, as each single repetition was able to effectively scan only a small part of the k-space. Thus, unlike high-speech cine MR modalities that capture a real-time video of the entire speech, tagged dynamic MRI was not affected by high-speed sampling artifacts such as aliasing or stroboscopic effects.
The two speech utterances were specifically designed to contain three pronunciations of the consonant /k/, as was planned for this study. For “akouk,” the anticipated tongue positioning throughout the utterance included transitions from initialization to /ə/—going to the neutral schwa position, from /ə/ to /k/—rising to reach the palate, from /k/ to /u:/—lowering back, and from /u:/ to /k/—rising again to reach the palate. The whole utterance included two upward thrusts for /k/ along the inf-sup direction (+z motion component by the scanner coordinates), where the second one occurred shortly after the first in less than 1 s. Similarly, for “asouk,” the anticipated tongue positioning included transitions from initialization to the schwa, from /ə/ to /s/—moving forward with a stretched tongue tip, from /s/ to /u:/—retracting backward, and from /u:/ to /k/—rising to reach the palate. Notably, the difference for this /k/ consonant was that it followed a forward motion instead of repeating a leading upward motion. We seek to identify any variation among the productions of these three /k/ consonants in the tissue deformations.
B. Deformation estimation from MR images
Note that the image coordinates are samples from the subject's original physical space where the MR data are acquired as opposed to a common atlas space. Consequently, Fig. 1(b) illustrates the corresponding deformation fields for all subjects, each displaying a distinct tongue shape as they are currently independent and unnormalized. Similarly, for speech task 2 “asouk,” and can be computed in a similar manner.
Figure 1(b) presents a specific time frame during which the first /k/ in “akouk” is pronounced. An upward motion is evident in all subjects, indicating the tongue's tip reaching toward the palate. Aside from the fact that all four tongues exhibit varying shapes and coordinates, it is also noticeable that there are variations in the intensity and distribution of the deformation fields. For instance, subject 1 displays predominantly smaller motion except at the top of the tongue, whereas subject 2 exhibits overall greater motion. This clearly illustrates that conducting quantitative comparisons is infeasible without intersubject spatiotemporal normalization.
C. Atlas-based deformation alignment
For the study population, we follow the core principles of previously proposed 4D atlas construction methods (Woo , 2019; Xing , 2021), which encompass two main steps: temporal alignment and spatial alignment. The data acquisition process also yields a paired sequence of untagged cine MR images that capture anatomical details. A speech-language expert manually identified the time instants of all critical syllables—namely, the schwa, the pre- and post-syllabic /k/, and the /s/—for each subject. Images at these critical time frames serve as anchor points within a designated reference temporal space (using subject 1 as the reference in this study), whereas all other subjects' cine images in between the anchor points are interpolated based on their original cine data. Further mathematical details can be found in Xing (2021).
For spatial alignment, at each aligned time frame, all subjects' cine images are fused together to construct one atlas frame. The fusion process involves multiple image registration-based warping steps, employing the symmetric image normalization method (Avants , 2008). We choose any subject as the initial template (in this case, subject 1). We establish a warp from the initial template to each subject. The mean of all these warp transformations is then applied to the initial template, creating a more refined template. We iteratively repeat this process to find updated warps and updated templates until convergence. All final warps are averaged to create the final template—the atlas space, which provides an unbiased, centered, and averaged spatial position for all subjects to align. Repeating the spatial alignment process for each time frame yields a 4D dynamic atlas sequence. Additional details can be found in Woo (2019). An exemplar of such an atlas at the schwa time frame is shown in Fig. 1(a).
Ultimately, with all deformation fields warped and aligned (see Fig. 2), quantitative analyses, such as statistical tests, can be conducted on the mathematical properties of these fields. This study particularly focuses on the magnitude of deformation vectors and the vertical inf-sup deformation component that signifies the intensity of the primary upward motion for /k/.
D. Strain computation
In addition to projection, the three principal strains serve as indicators of the extent of the largest expansion and compression in the local tissue deformation. They can be computed by identifying the three principal values of the Lagrangian strain tensor.
E. PCA
Unlike deformation fields, strain embodies more of a “hidden” feature than the directly observable deformations. The value of strain is a ratio of local length change; thus, it is not directly comparable across different subjects for determining larger or smaller values in absolute terms. Instead, we employ PCA (Wold , 1987) to simplify strain data and reduce its dimensions to find a major principal component (PC) space. PCA is a statistical technique that transforms data from a high-dimensional space into a lower-dimensional space while preserving as much variability (key features with information) as possible. As an unsupervised machine learning algorithm, it excels at identifying and revealing hidden features innate and wrapped within largely high-dimensional data, which is a precise characteristic of our estimated deformation motion field dataset. At time frame , we stack strain values for subject as . PCA is applied to the matrix , yielding a new 3D PC space characterized by new PC directions . Typically, the first one or two PCs encapsulate most PC energy. By re-projecting all strain values into the PC directions, latent features hidden within the projection values can become evident.
III. RESULTS
A. Deformation field visualization
To simplify our description, we categorize the three locations of the consonant /k/ into three groups: group 1, the pre-syllabic first /k/ in “akouk”; group 2, the post-syllabic second /k/ in “akouk”; and group 3, the single /k/ in “asouk,” which is also post-syllabic. Visual examination of the results of the spatiotemporal alignment can be directly conducted using Fig. 2. Similar to Fig. 1(b), the fields are 3D-rendered and shown from the left side of the tongue with multiple vector layers in depth. The tongue is oriented to the left in the setup, showing evident upward motion associated with producing consonant /k/. Qualitative assessment of Fig. 2 reveals that despite normalizing all subjects into the same atlas space with identical shapes, their range of motion varies depending on the subject. However, each subject tends to maintain a consistent deformation pattern. For example, subject 1 predominantly uses the top part of the tongue while subject 2 employs more of their tongue body, including larger tongue top movements. Implications related to coarticulation will be discussed in Sec. III B. It can be observed that a pre-syllabic /k/, as in group 1, typically demands larger deformations compared to a post-syllabic /k/ as in groups 2 and 3 for all subjects. This is particularly evident in subjects 1, 2, and 3, whose fields in the first row of Fig. 2 exhibit higher intensity. However, the intensity difference in their second and third rows is not as pronounced, suggesting similarities in producing a post-syllabic /k/ following the /u:/ vowel.
B. Deformation magnitude analysis
To quantitatively compare the deformation fields from different consonant /k/ positions, we extracted the magnitude of all deformation vectors and plotted them in Fig. 3, grouping them by consonant and separating each subject. Note that individual-level data in Fig. 3 are from all of the image voxels, which are used to explore the high heterogeneity of deformation magnitudes with statistics aggregated at each time point in the entire tongue region. Across all subjects, group 1 consistently exhibits larger magnitudes than groups 2 and 3. However, when comparing groups 2 and 3, the magnitude difference is less clear, especially from the mean (represented by the cross symbols) and median (middle bars of each box) values. Specifically, group 2 shows larger median values than group 3 for subjects 2 and 3 but lower or similar medians for subjects 1 and 4. The mean, median, and standard deviation of the deformation magnitude are listed in Table I.
. | Subject 1 . | Subject 2 . | Subject 3 . | Subject 4 . | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
G1 . | G2 . | G3 . | G1 . | G2 . | G3 . | G1 . | G2 . | G3 . | G1 . | G2 . | G3 . | |
Mag (mean) | 2.7 | 2.5 | 2.5 | 4.0 | 3.7 | 3.3 | 4.5 | 3.5 | 3.0 | 2.8 | 2.6 | 2.4 |
Mag (st.d.) | 1.5 | 1.5 | 1.7 | 2.2 | 2.0 | 1.9 | 2.3 | 1.7 | 1.8 | 1.4 | 1.4 | 1.5 |
Mag (median) | 2.7 | 2.3 | 2.6 | 4.4 | 4.0 | 3.7 | 4.3 | 3.7 | 3.3 | 2.8 | 2.4 | 2.4 |
. | Subject 1 . | Subject 2 . | Subject 3 . | Subject 4 . | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
G1 . | G2 . | G3 . | G1 . | G2 . | G3 . | G1 . | G2 . | G3 . | G1 . | G2 . | G3 . | |
Mag (mean) | 2.7 | 2.5 | 2.5 | 4.0 | 3.7 | 3.3 | 4.5 | 3.5 | 3.0 | 2.8 | 2.6 | 2.4 |
Mag (st.d.) | 1.5 | 1.5 | 1.7 | 2.2 | 2.0 | 1.9 | 2.3 | 1.7 | 1.8 | 1.4 | 1.4 | 1.5 |
Mag (median) | 2.7 | 2.3 | 2.6 | 4.4 | 4.0 | 3.7 | 4.3 | 3.7 | 3.3 | 2.8 | 2.4 | 2.4 |
We conducted hypothesis tests to assess whether any of the /k/ sample groups were drawn from distinct distributions. Specifically, we tested the hypothesis that group 1 (pre-syllabic /k/) requires a significantly more intense deformation magnitude in pronunciation compared to group 2 (post-syllabic /k/). The null hypothesis states that there is no statistical difference among the groups based on this hypothesis. Because every subject had been aligned in the atlas space, we used every voxel position's deformation magnitude as samples. Similarly, we conducted other combinations of t-tests comparing groups 1 and 3 (pre-syllabic /k/ of “akouk” and /k/ of “asouk”) and groups 2 and 3 (post-syllabic /k/ of “akouk” and /k/ of “asouk”). The hypotheses were that the former requires a significantly more intense deformation magnitude than the latter within each pair. In addition, analysis of variance (ANOVA) was performed across all three /k/ groups to determine if at least one group significantly differed from the others. We considered a resulting p-value of less than a 0.05 threshold to indicate significance. To enhance statistical power, we employed one-sided t-tests, assuming that the first group being tested had larger overall values compared to the second group. The results from all test combinations are also listed in Table II.
. | ANOVA groups 1, 2, and 3 . | Two-sample t-test one-sided . | ||
---|---|---|---|---|
Groups 1 and 2 . | Groups 1 and 3 . | Groups 2 and 3 . | ||
Sub 1 p | 4.0 × 10−59 | 5.6 × 10−36 | 6.8 × 10−54 | 9.5 × 10−5 |
EF | 0.01 (η²) | 12.5 | 15.4 | 3.7 |
Sub 2 p | 1.3 × 10−253 | 7.9 × 10−33 | 1.8 × 10−247 | 7.8 × 10−110 |
EF | 0.02 (η²) | 11.9 | 33.8 | 22.3 |
Sub 3 p | <4.9 × 10−324 | <4.9 × 10−324 | <4.9 × 10−324 | 1.8 × 10−185 |
EF | 0.09 (η²) | 43.8 | 67.0 | 29.2 |
Sub 4 p | 2.4 × 10−136 | 6.0 × 10−33 | 6.9 × 10−134 | 2.3 × 10−40 |
EF | 0.01 (η²) | 11.9 | 24.7 | 13.3 |
. | ANOVA groups 1, 2, and 3 . | Two-sample t-test one-sided . | ||
---|---|---|---|---|
Groups 1 and 2 . | Groups 1 and 3 . | Groups 2 and 3 . | ||
Sub 1 p | 4.0 × 10−59 | 5.6 × 10−36 | 6.8 × 10−54 | 9.5 × 10−5 |
EF | 0.01 (η²) | 12.5 | 15.4 | 3.7 |
Sub 2 p | 1.3 × 10−253 | 7.9 × 10−33 | 1.8 × 10−247 | 7.8 × 10−110 |
EF | 0.02 (η²) | 11.9 | 33.8 | 22.3 |
Sub 3 p | <4.9 × 10−324 | <4.9 × 10−324 | <4.9 × 10−324 | 1.8 × 10−185 |
EF | 0.09 (η²) | 43.8 | 67.0 | 29.2 |
Sub 4 p | 2.4 × 10−136 | 6.0 × 10−33 | 6.9 × 10−134 | 2.3 × 10−40 |
EF | 0.01 (η²) | 11.9 | 24.7 | 13.3 |
In addition to analyzing the magnitude of deformation fields, we also examined the vertical inf-sup component along the z axis, representing the elevating motion for producing /k/. The inf-sup deformation component is a prominent characteristic for consonant /k/, and the result is plotted in Fig. 4, categorized by consonant groups and subjects. Consistently, group 1 exhibits larger vertical deformations compared to groups 2 and 3. However, distinguishing between groups 2 and 3 based on their mean (crosses) and median (middle bars) values remains inconclusive. The mean, median, and standard deviation of the inf-sup deformation are listed in Table III. Additionally, the same statistical tests were applied to the inf-sup deformation components. ANOVA and paired t-tests were performed, and their corresponding results are detailed in Table IV.
. | Subject 1 . | Subject 2 . | Subject 3 . | Subject 4 . | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
G1 . | G2 . | G3 . | G1 . | G2 . | G3 . | G1 . | G2 . | G3 . | G1 . | G2 . | G3 . | |
Inf-sup (mean) | 2.3 | 1.8 | 1.8 | 3.4 | 2.6 | 2.8 | 3.8 | 2.8 | 2.3 | 2.5 | 2.2 | 2.2 |
Inf-sup (st.d.) | 1.5 | 1.4 | 1.5 | 1.9 | 1.6 | 1.7 | 2.5 | 1.6 | 1.5 | 1.4 | 1.4 | 1.5 |
Inf-sup (median) | 2.1 | 1.6 | 1.6 | 3.8 | 2.9 | 3.2 | 3.5 | 2.7 | 2.6 | 2.5 | 1.9 | 2.2 |
. | Subject 1 . | Subject 2 . | Subject 3 . | Subject 4 . | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
G1 . | G2 . | G3 . | G1 . | G2 . | G3 . | G1 . | G2 . | G3 . | G1 . | G2 . | G3 . | |
Inf-sup (mean) | 2.3 | 1.8 | 1.8 | 3.4 | 2.6 | 2.8 | 3.8 | 2.8 | 2.3 | 2.5 | 2.2 | 2.2 |
Inf-sup (st.d.) | 1.5 | 1.4 | 1.5 | 1.9 | 1.6 | 1.7 | 2.5 | 1.6 | 1.5 | 1.4 | 1.4 | 1.5 |
Inf-sup (median) | 2.1 | 1.6 | 1.6 | 3.8 | 2.9 | 3.2 | 3.5 | 2.7 | 2.6 | 2.5 | 1.9 | 2.2 |
. | ANOVA groups 1, 2, and 3 . | Two-sample t-test one-sided . | ||
---|---|---|---|---|
Groups 1 and 2 . | Groups 1 and 3 . | Groups 2 and 3 . | ||
Sub 1 p | 2.4 × 10−90 | 1.1 × 10−223 | 1.2 × 10−214 | 0.35 |
EF | 0.02 (η²) | 32.1 | 31.5 | 0.4 |
Sub 2 p | <4.9 × 10−324 | <4.9 × 10−324 | 9.1 × 10−204 | 1.00 |
EF | 0.04 (η²) | 43.3 | 30.6 | −12.9 |
Sub 3 p | <4.9 × 10−324 | <4.9 × 10−324 | <4.9 × 10−324 | 7.6 × 10−161 |
EF | 0.10 (η²) | 48.2 | 70.2 | 27.1 |
Sub 4 p | 5.2 × 10−140 | 2.1 × 10−123 | 1.4 × 10−93 | 0.99 |
EF | 0.01 (η²) | 23.7 | 20.5 | −2.5 |
. | ANOVA groups 1, 2, and 3 . | Two-sample t-test one-sided . | ||
---|---|---|---|---|
Groups 1 and 2 . | Groups 1 and 3 . | Groups 2 and 3 . | ||
Sub 1 p | 2.4 × 10−90 | 1.1 × 10−223 | 1.2 × 10−214 | 0.35 |
EF | 0.02 (η²) | 32.1 | 31.5 | 0.4 |
Sub 2 p | <4.9 × 10−324 | <4.9 × 10−324 | 9.1 × 10−204 | 1.00 |
EF | 0.04 (η²) | 43.3 | 30.6 | −12.9 |
Sub 3 p | <4.9 × 10−324 | <4.9 × 10−324 | <4.9 × 10−324 | 7.6 × 10−161 |
EF | 0.10 (η²) | 48.2 | 70.2 | 27.1 |
Sub 4 p | 5.2 × 10−140 | 2.1 × 10−123 | 1.4 × 10−93 | 0.99 |
EF | 0.01 (η²) | 23.7 | 20.5 | −2.5 |
C. Strain computation and PCA results
Finally, we computed the Lagrangian strain tensors and examined their projected strain values from two perspectives: the principal strains derived from singular value decomposition and the strain projected along the vertical inf-sup direction. In parallel with the deformation field analyses, subjecting the strain values to the same statistical tests did not reveal any significant differences between the consonant groups. As stated in Sec. II, this outcome was anticipated due to the intrinsic nature of strain, which represents ratios rather than absolute quantities. Therefore, we employed PCA to explore potential latent features within the strain data.
Regarding the principal strains, the newly established PC space spanned three dimensions with weights of 1104.2 for PC1, 427.4 for PC2, and 62.3 for PC3. These weights represent the amount of data variance each PC direction accounted for, where PC1 and PC2 together encompassed 96.1% of the total variance. We projected the principal strains of all three consonant groups onto this new space. Figure 5 illustrates that group 1 exhibits a larger dispersion of data points, whereas groups 2 and 3 are more compactly clustered. We computed the variance within each group of four data points, measuring the extent of data point scattering from their mean center value. Group 1 returned a variance of 1195.4, the largest among the three, contrasting with those of group 2 (512.6) and group 3 (63.2). Additionally, we conducted PCA on the strain along the inf-sup direction, which measures the localized tissue change ratio along the z axis. Figure 5 depicts the results, and Table V lists the PC space weights and data point variances.
. | Strain: all three principals . | Strain: along inf-sup . | Deformation magnitude . | Inf-sup deformation . |
---|---|---|---|---|
PCs 1 and 2 weight | 96.1% | 96.2% | 96.4% | 95.7% |
G1 variance | 1195.4 | 517.4 | 3.0 × 104 mm2 | 2.7 × 104 mm2 |
G2 variance | 512.6 | 219.0 | 7.5 × 103 mm2 | 8.4 × 103 mm2 |
G3 variance | 63.2 | 38.0 | 1.3 × 103 mm2 | 1.9 × 103 mm2 |
. | Strain: all three principals . | Strain: along inf-sup . | Deformation magnitude . | Inf-sup deformation . |
---|---|---|---|---|
PCs 1 and 2 weight | 96.1% | 96.2% | 96.4% | 95.7% |
G1 variance | 1195.4 | 517.4 | 3.0 × 104 mm2 | 2.7 × 104 mm2 |
G2 variance | 512.6 | 219.0 | 7.5 × 103 mm2 | 8.4 × 103 mm2 |
G3 variance | 63.2 | 38.0 | 1.3 × 103 mm2 | 1.9 × 103 mm2 |
PCA successfully uncovered latent data patterns, prompting us to reanalyze the deformation field data by performing two separate PCAs: one on the deformation magnitude and another on the vertical deformation component. The results are illustrated in Fig. 6, and Table V lists their respective PC space weights and data point variances as measures of dispersion. Similar to the strain analysis, we observed that group 1 exhibited larger dispersion and variances compared to groups 2 and 3.
IV. DISCUSSION
The key strength of the proposed method lies in its ability to facilitate direct numerical calculation and quantitative comparison among subjects acquired at different times, spaces, and across different speech tasks. As depicted in Fig. 1(b), despite varying tongue shapes among the four subjects, all exhibit a general upward motion during the production of consonant /k/. This functional similarity forms the basis for potential statistical analyses when we treat the subjects as varying samples from a general distribution. The primary variation observed lies in the range and distribution of the deformation fields, highlighting subject-specific behaviors—for instance, subject 1 predominantly uses the top of the tongue while subject 2 engages the entire tongue, and so forth. Once normalized, Fig. 2 demonstrates that our atlas-based method aligns all tongues into the same space with uniform shape while preserving subject-specific behaviors. A direct similarity in the deformation fields between Figs. 1 and 2 persists even after deforming them into the atlas space, demonstrating the atlas' capability of maintaining the deformation pattern's consistency after space alteration, which is crucial for subsequent analyses.
The leading factor that induces subject-specific speech behaviors during the same speech task remains uncertain. Dialect, compensation, speaking habits, muscle anatomy, etc. could all contribute to variations in the deformation fields. By closely observing the dataset over time, we can identify the impact of speaking habits in certain cases, such as with subject 2. Their tongue tends to retract more intensely to reach the schwa position and prepare for the subsequent utterance. Because temporal alignment matches all schwa pronunciations with time frame 1 as the reference, a more retracted schwa position mathematically results in larger deformations across the entire tongue in all subsequent time frames. It is important to note that the volume-preserving properties of the motion estimation algorithm ensure that the intensity range of the deformation fields remains unchanged after being warped into the atlas space. This preserves their original length in millimeters, enabling consistent intersubject comparisons.
It is worth noting that although our study is limited to the tongue region due to tag replacement in MRI, Fig. 2 potentially reveals more information beyond the tongue itself. Velar coarticulation occurs when velar stop consonants and vowels are spoken in different ways depending on the context (McClean, 1973). Coarticulation between vowels and velar stop consonants can result in stresses in syllables, and the consonant /k/ is a typical example (Al-Bamerni and Bladon, 1982; Frisch and Wodzinski, 2016). The coarticulatory patterns suggest that each velar-vowel combination results in a distinct closure location for the velum. Figure 2 shows largely distinctive patterns among all four subjects when pronouncing the same consonant, which could indicate the impact of individual behaviors. Group 1 is largely different from the other two. In English, a voiceless stop consonant, e.g., /p/, /t/, or /k/, is expressed differently when it is in prestress position, i.e., when it begins a stressed syllable. A one syllable word in isolation, such as our utterance, is always a stressed syllable. Group 1 /k/ is aspirated and strong due to its position in the syllable, which gives a puff of air following the release. This occurs when the tongue is held against the palate more tightly and the tongue body is more compressed. Table III suggests more compression. For groups 2 and 3, the /k/'s are both in post-syllabic position, yielding a weaker consonant than group 1, which is a less strong, shorter, unaspirated release. The tongue body is more anterior in the prestress /s/ than the prestress /k/, which could affect the final /k/ tongue position. However, the intervening /u:/ may overcome the differences with its own requirements.
Having justified our experimental setup, our primary observation is that a pre-syllabic consonant /k/ (group 1) requires more effort and exhibits larger deformations compared to a post-syllabic /k/ (group 2), which is produced later following another mid-utterance vowel. Even when the post-syllabic /k/ (group 3) follows a different pre-consonant /s/ (group 3), its deformation intensity is reduced. This observation is visually confirmed by Fig. 2, which shows larger fields in row 1 than in rows 2 and 3, and quantitatively supported by Figs. 3 and 4, where group 1 exhibits higher values compared to groups 2 and 3. It is worth noting that while the fields of subject 4 may not be as visually prominent as the others, its boxplot confirms the same observation. Statistically, according to Tables II and III, all ANOVA tests confirm significant differences among the three /k/ deformations, and all one-tailed t-tests confirm that the deformation intensity of group 1 is significantly higher than that of groups 2 and 3, respectively.
However, the comparison between groups 2 and 3 is less consistent. In terms of deformation magnitude, group 2 consistently shows higher values than group 3 with statistical significance. However, regarding the vertical z-component signifying the /k/ motion, the conclusion varies across subjects. Group 2 exhibits larger inf-sup components than group 3 only for subject 3 but lower inf-sup components for subjects 2 and 4. The p-values for these comparisons are notably high (greater than 0.95), suggesting insignificant differences between the groups. The p-value of 0.35 for subject 1 further complicates comparison between the two groups. In summary, when comparing the two post-syllabic /k/'s, regardless of whether they follow a leading /k/ or a leading /s/, no significant difference in deformation can be conclusively drawn between groups 2 and 3.
On the other hand, the PCA results provide intriguing insights. For strains and deformation fields, PCA categorizes the three groups distinctly. Table IV illustrates that group 1 consistently exhibits the largest data point variance (scattering) while group 3 displays the smallest variance. PC 1 and PC 2 collectively represent approximately 95% of the dataset's variance across all four configurations, emphasizing their significance in Figs. 5 and 6, where data are projected onto their two-dimensional plane. Visually, groups 2 and 3 are more tightly clustered within the data point range of group 1. Examining the PC spaces further reveals that groups 2 and 3 have distinctly grouped data points, suggesting potentially unique characteristics within each group. These specific characteristics remain unidentified in this study but may be uncovered through deeper data analysis. Note that PCA extracts a key feature space that identifies the most significant patterns in the dataset. The fact that these significant patterns can distinguish the three groups further strengthens the statistical results in the previous tests. In short, PCA proves capable of uncovering hidden features that extend beyond what direct magnitude comparisons and statistical tests can reveal.
The observation regarding changes in strategies when pronouncing the same consonant is open to discussion. One plausible explanation is that the later positioning of the consonant is a direct influencing factor. After completing an utterance, the tongue naturally relaxes or adjusts in anticipation of returning to a neutral position. As a result, the final consonant requires less effort, leading to smaller deformations. This phenomenon appears independent of the initial consonant; whether /k/ or /s/ was previously produced does not affect the reduced effort of the final consonant. However, the vowel directly connecting to /k/ may be another reason for this phenomenon. The more intense /k/ is connected to the schwa, whereas the less intense /k/ is connected to the /u:/ vowel. The tongue positions itself differently for these two vowels (one lower and one higher), forming two different initial set-off standings. We need to emphasize that all deformation fields depicted in this work are being compared to a common neutral position, not the vowel before /k/. Thus, the reduced intensity is an absolute observation instead of a relative one. This confirms a direct impact of phonetic environment changes on speech production. There could be other contributors to the articulation differences as well. The subjects might have different palate heights, jaw sizes, and other anatomical differences in the oral cavity that could impact the motion, and the palate was demonstrated to play an important role in speech sound production (Stone , 2019). Finally, factors such as the phonetics surrounding the /k/, the length of the utterance, and whether the /k/ leads into another vowel may all contribute to these observations. Further exploration with a larger dataset beyond the current four subjects could provide deeper insights into this process. Moreover, refining the design of speech tasks to include more vowels and consonants would help elucidate the effects of additional factors. This study was also unable to show the coarticulatory effects of the prestress consonant as that is more likely to be in the anterior-posterior direction. Our goal is to highlight the potential of atlas-based imaging methods to advance research in speech-language studies with greater detail. In the future, with the collection of more complex speech samples from more subjects, the atlas itself may evolve to encompass various utterances—an atlas of atlases—enhancing computational capabilities for studies involving more complex speech tasks and diverse utterance designs.
V. CONCLUSION
We presented a workflow of image analysis methods designed to facilitate direct quantification of deformation variations in human speech sound production. Deformation fields of consonant /k/ in three different positions were estimated from dynamic MRI before constructing a spatiotemporal atlas. Statistical tests and PCAs were conducted on the deformation fields and internal muscle strains, revealing an overall decrease in deformation intensity over time. We demonstrated the strength of using dynamic MRI in capturing, identifying, and quantifying subtle deformation changes in speech. The proposed analysis method holds promise for examining complex high-dimensional data structures, offering a valuable tool for studying human articulatory motor control.
ACKNOWLEDGMENTS
Research reported in this publication was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under Award No. R01DC018511. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Ethics Approval
All speech scanning sessions for participating subjects were performed in accordance with research proposals approved by the IRB. Informed consent was obtained from all participants in this study.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.