The current study explores whether Mandarin initial and medial voiceless unaspirated and voiceless aspirated stops differ in their tongue positions and post-vocalic voicing during closure. Ultrasound tongue imaging and acoustic data from five Mandarin speakers revealed (1) no consistent pattern for tongue positions among speakers, and (2) no difference in degree of voicing during closure between the two stop series. These findings suggest that tongue position is not a reliable articulatory correlate for Mandarin laryngeal contrasts. This further suggests that aspiration is not correlated with tongue position differences, unlike the reported correlation between voicing and tongue root advancement.
1. Introduction
Consonants of different laryngeal categories are articulated with different laryngeal configurations and oral–laryngeal timing relations (e.g., Löfqvist, 1980; Löfqvist and Yoshioka, 1981). For example, voiceless unaspirated stops are produced with relatively narrow glottal openings, with a shortened duration of the glottal opening before the onset of the following vowel. On the other hand, voiceless aspirated stops have relatively wide glottal openings, and it takes longer for phonation to restart in the following vowel. Previous research indicates that, in addition to laryngeal configuration, stops of different laryngeal categories may differ in aspects of their oral configuration, such as tongue position. One well-studied example is tongue root advancement in voiced stops (e.g., Westbury, 1983) in order to circumvent the articulatory voicing constraint (AVC) (Ohala, 1983, 2011). In short, a greater cavity volume facilitates the maintenance of a sufficient transglottal pressure gradient, as required for voicing during stop closure, and some articulatory adjustments, such as tongue root advancement, are used to enlarge the cavity volume during closure (Ahn, 2018a; Westbury, 1983).
While the lingual configurations of voiced stops, compared to their voiceless counterparts, are relatively well studied, the relation between aspiration and tongue configuration is less explored. Velar stops are known to have a longer Voice Onset Time (VOT) than labial stops cross-linguistically (Cho and Ladefoged, 1999; Chodroff and Wilson, 2017; Fischer-Jørgensen, 1954; Peterson and Lehiste, 1960). Cho and Ladefoged (1999, pp. 209–210) suggest that the smaller volume of the cavity behind the stop constriction for velars (compared to, e.g., alveolars or bilabials) may be responsible for this increase in the duration of aspiration: other factors held equal, a smaller cavity will exhibit higher pressure, which takes longer to dissipate before voicing can be re-initiated. This raises the possibility that speakers could actively retract their tongue roots to reduce the cavity volume behind a constriction at any place to enhance the production of aspiration.
In addition to tongue root advancement in languages with stops that are actually voiced during closure (e.g., Brazilian Portuguese in Ahn, 2018b; Saudi Arabic in Alamri, 2022), advanced tongue root is also found in American English (henceforth, English) in stops that do not exhibit voicing during closure (Ahn, 2018b). In utterance-initial position, English is known to implement the phonological voiced–voiceless contrast as short-lag and long-lag VOT, respectively (Keating, 1984). The ultrasound study in Ahn (2018a) reveals that English speakers exhibit more tongue root advancement when producing these phonologically voiced stops, even when they lack phonetic voicing. This can be interpreted in two ways: (1) more retracted tongue root facilitates the production of a longer VOT (i.e., aspiration) for voiceless stops, or (2) the speakers simply maintain the tongue root advancement, which is useful elsewhere for enhancement of voicing, but unnecessary in utterance-initial position.
Languages with laryngeal contrasts involving both voicing and aspiration might be expected to shed light on which feature is actively enhanced in the English case (and in similar languages), but case studies of these languages have yielded contradictory results. In Kalasha, voiceless aspirated stops exhibit more advanced tongue root than voiceless unaspirated stops, while voiced stops show more advanced tongue position than either voiceless stop type (Hussain and Mielke, 2020). Thai and Hindi aspirated stops do not exhibit a consistent difference in tongue position compared to their voiceless unaspirated counterparts (Ahn, 2018b). Korean lax, tense, and aspirated stops also do not show any difference in their tongue position (Kwon and Ahn, 2020). However, Yemba appears to exhibit tongue root retraction to enhance aspiration, based on one speaker's ultrasound data (Weller et al., 2023). Therefore, any relationship between tongue position and aspiration remains to be determined with more empirical data from more languages.
To this end, we investigate tongue position during Mandarin voiceless aspirated and unaspirated stops using ultrasound tongue imaging. Mandarin may allow testing of the lingual articulatory contribution to contrastive aspiration, and additionally whether any significant association of tongue root position to laryngeal contrast actually exists in the absence of phonetic voicing. We also aim to confirm that there is no difference in intervocalic voicing between the two stop types, since this is critical to interpreting any observed differences: for instance, English voiced stops are still considered phonologically voiced due to their tendency to voice intervocalically. Mandarin is claimed to exhibit considerably less intervocalic voicing (Deterding and Nolan, 2007), a finding that we wish to replicate in the data under consideration. In sum, there are two research questions: (1) Do Mandarin aspirated stops show tongue root retraction relative to unaspirated stops? (2) Do Mandarin stop closures in initial and medial positions differ in the degree of carryover intervocalic voicing? By answering these questions, we broadly aim to determine whether tongue root position is an articulatory correlate of aspiration, in addition to its known relationship with voicing.
2. Methods
2.1 Participants
Participants were five (4 F, 1 M) self-identified first-language speakers of Standard Mandarin (i.e., 普通话 Pǔtōnghuà) living in Los Angeles, CA [age range = 19–24 yrs; mean = 21.8; standard deviation (SD) = 1.92]. All participants spoke another dialect of Chinese to some degree. Subject 3 (S3) reported low speaking ability and moderate receptive ability in Hangzhou Wu, which has a three-way laryngeal contrast among voiceless aspirated, voiceless unaspirated, and voiced stops (Cao and Maddieson, 1992; Simmons, 1992). Subject 1 (S1), Subject 2 (S2), and Subject 4 (S4) reported speaking and listening ability in non-standard (Zhongyuan or Jiaoliao) Mandarin; Subject 5 (S5) reported similar experience with Jin Chinese. None of these varieties are reported to have a stop voicing contrast comparable to that of Hangzhou Wu (Ding, 1998; Kurpaska, 2010, pp. 192–195). All participants reported that they acquired Standard Mandarin and their regional dialect as co-first languages, began learning English in primary school at ages 5–8, and immigrated to the United States from China at ages 15–19 yrs. Each participant was compensated with US $20.
2.2 Materials
Target stimuli were 18 disyllabic Mandarin words beginning with six target consonants, the unaspirated and aspirated bilabial, alveolar, and velar stops, i.e., /p t k pʰ tʰ kʰ/. For each consonant, three words were selected to present a range of tonal contexts so that each target onset co-occurred with a high pitch target (i.e., high level or high-low falling tones) and a low pitch target (i.e., rising or low-dipping tones). The target consonant was always followed by /aʊ/ and a front lingual segment (i.e., high front vowel, /tɕ/, various /Cj/). In addition to these 18 target words, 30 distractors, i.e., disyllabic Mandarin words that do not begin with a stop + /aʊ/ sequence, were included (see supplementary material for a complete list of stimuli). Each stimulus item was embedded in two frame sentences, placing the target words in either an utterance-initial position (X 变了。 X biàn le; “X changed”) or an utterance-medial position (他把 X 变了。 Tā bǎ X biàn le; “He changed X”). This yielded 96 sentences: two frame sentences × [18 targets (six stops × three tonal contexts) + 30 distractors].
2.3 Recording
Ultrasound video and audio were recorded simultaneously in a soundproof booth in the UCLA Phonetics Laboratory. Ultrasound video was recorded with a Telemed MicrUs and Telemed MC4-2R20S-3 convex probe (Vilnius, Lithuania) at a frame rate of 82 Hz. The probe was positioned to capture midsagittal images of the tongue from blade to root and was stabilized using an UltraFit headset (Articulate Instruments Ltd., Musselburgh, East Lothian, UK) (Spreafico et al., 2018). While probe angle was adjusted to each participant's comfort, we ensured that the tongue root was visible for all speakers, as gauged by visibility of the hyoid shadow in all frames of reference. Audio (44.1 kHz sampling rate) was recorded using a Røde smartLav+ omnidirectional lavalier microphone clipped to the headset's hook-and-loop fastener tape strap, approximately 5 cm to the right of the speaker's mouth. Ultrasound video and audio recordings were synchronized by means of a synchronization signal emitted by the MicrUs device, which was fed through a Focusrite Scarlett 2i2 audio interface (Focusrite Audio Engineering Ltd., High Wycombe, Bucks, UK) and added to the audio as a time-aligned second channel.
The stimuli were presented in simplified Chinese characters using OpenSesame (Mathôt et al., 2012), embedded in the frame sentences and displayed on a computer screen one at a time. The full set of 96 sentences was repeated in six blocks, resulting in 576 utterances (of which 216 contained the target consonants) per speaker. Within each block, the stimuli were presented in pseudorandom order. A brief break was included between blocks.
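The token counts reported above follow directly from the design. The short Python sketch below re-derives them as a sanity check; the variable names are illustrative and not part of the original study materials.

```python
# Design constants taken from the text; names are illustrative only.
N_STOPS = 6            # three places x two laryngeal categories
N_TONAL_CONTEXTS = 3   # words selected per stop
N_DISTRACTORS = 30
N_FRAMES = 2           # utterance-initial and utterance-medial frames
N_BLOCKS = 6

n_targets = N_STOPS * N_TONAL_CONTEXTS                         # target words
sentences_per_block = N_FRAMES * (n_targets + N_DISTRACTORS)   # full sentence set
utterances_per_speaker = sentences_per_block * N_BLOCKS        # all recorded utterances
target_utterances = N_FRAMES * n_targets * N_BLOCKS            # utterances with a target stop

print(n_targets, sentences_per_block, utterances_per_speaker, target_utterances)
# prints: 18 96 576 216
```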
All testing was conducted in English or Standard Mandarin by the third author (M.F.) per participants' preference. Participants were explicitly instructed to read the sentences in Standard Mandarin. Prior to recording, participants read through the full list of stimuli to confirm they recognized all characters. If participants gave an unexpected or non-Standard reading for a character, the experimenter pointed out the presence of a deviation without pronouncing the word until the participants arrived at the intended pronunciation. The production task was self-paced: participants pressed a key to move from one item to the next. The recording session lasted approximately 40 min per speaker.
2.4 Analyses
Audio recordings were annotated with the Montreal Forced Aligner (McAuliffe et al., 2017) using the pre-trained Mandarin model and manually corrected in Praat (Boersma and Weenink, 2021) as necessary. Using the resulting annotations, two different analyses were conducted: articulatory and acoustic.
The articulatory analysis aimed to measure the position of the tongue root immediately prior to the release of the stop closure, comparing aspirated and unaspirated stops in different positions in the utterance (initial vs medial). The last frame of each stop closure was extracted from the ultrasound video. Then, for each extracted frame, the tongue contours were semi-automatically tracked and manually corrected by S.A. and H.K. using EdgeTrak (Li et al., 2005). To ensure consistency of analysis, a subset of each speaker's recordings was cross-checked by both annotators, and the cross-checked data revealed no meaningful difference in results between annotators.
The resulting tongue contours were compared using smoothing spline analysis of variance (SSANOVA), which calculates whether two groups of contours significantly differ from each other along their lengths (Davidson, 2006; Gu, 2002). Each spline represented the average contour of a subset of the tokens. Bayesian confidence intervals of 95% were plotted around each curve, and two curves were determined to be significantly different at certain points in their length if their confidence intervals did not overlap at those points. In calculating smoothing spline comparisons, we used polar coordinates, which reflect the fan shape of the ultrasound imaging area more accurately than Cartesian coordinates, especially for the tongue root and tip areas (Mielke, 2015).
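As a rough illustration of the coordinate change described above, the following Python sketch converts tracked contour points from Cartesian to polar coordinates. The function names and the `origin` argument (standing in for the virtual origin of the ultrasound fan) are assumptions made for this example; the actual origin used in the analysis is not given in the text.

```python
import math

def to_polar(x, y, origin=(0.0, 0.0)):
    """Return (angle in radians, radius) of one point relative to `origin`.

    `origin` is a hypothetical stand-in for the probe's virtual origin.
    """
    dx, dy = x - origin[0], y - origin[1]
    return math.atan2(dy, dx), math.hypot(dx, dy)

def contour_to_polar(points, origin=(0.0, 0.0)):
    """Convert a list of (x, y) contour points to (angle, radius) pairs."""
    return [to_polar(x, y, origin) for x, y in points]
```

After conversion, each contour can be treated as radius as a function of angle, so that contours are compared along lines radiating from the probe rather than along vertical lines.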
The region of interest for the current study was the tongue root area, which was determined for each speaker as the rearmost third of the entire tongue contour. Accordingly, two splines were determined to have a different tongue root position when their confidence intervals did not overlap in some portion of the rearmost third of their length (e.g., S3's velar stops in Fig. 1). Because the rearmost portion of splines tends to be less reliable due to greater variability, two splines whose confidence intervals overlapped at the very end were still considered different, as long as they were well separated in the same direction in the rest of the root area (e.g., S4's velar stops in Fig. 1). On the other hand, two splines whose confidence intervals overlapped through the whole area (e.g., S5's labial stops in Fig. 1), or whose splines crossed (e.g., S1's labial stops in Fig. 1), showing relative advancement in one part of the region and retraction in another, were noted as not differing in tongue root position.
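The decision rule above can be approximated in code. The Python sketch below is a deliberately simplified version, not the procedure actually used: it assumes both splines are sampled at the same positions, that index 0 is the rearmost point, and that crossing can be detected as confidence bands separating in opposite directions. The function name and return convention are hypothetical, and the case-by-case visual judgment described in the text is not captured.

```python
def root_difference(lo_a, hi_a, lo_b, hi_b):
    """Compare two splines' confidence bands in the rearmost third.

    Each argument is a list of band limits sampled at matched positions,
    ordered from the rearmost point (index 0) forward (an assumption).
    Returns +1 if A's band lies entirely above B's somewhere in the root
    region, -1 if entirely below, and 0 if the bands overlap throughout
    or separate in both directions (treated here as crossing).
    """
    n_root = max(1, len(lo_a) // 3)   # rearmost third of the contour
    directions = set()
    for i in range(n_root):
        if lo_a[i] > hi_b[i]:
            directions.add(1)
        elif lo_b[i] > hi_a[i]:
            directions.add(-1)
    if len(directions) != 1:          # no separation, or both directions
        return 0
    return directions.pop()
```

Whether a positive value corresponds to advancement or retraction depends on the coordinate system and probe orientation, so the sign is left uninterpreted here.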
Fig. 1. SSANOVA plots for aspirated (green dashed lines) and unaspirated (blue solid lines) stops in utterance-initial position.
For the acoustic analysis, we measured strength of excitation (SoE) (see Mittal et al., 2014; Murty and Yegnanarayana, 2008) using VoiceSauce (Shue et al., 2011), in order to determine whether the two stop series differ in their degree of voicing during closure. SoE measures the amplitude of the peak excitation strength of the harmonic component of the signal; higher SoE values thus represent stronger voicing. To reveal whether the stops varying in their laryngeal category and position in the utterance exhibit different extents of voicing during closure, the average SoE was calculated over the closure interval of each stop, z-scored by speaker, and statistically analyzed using a series of linear mixed effects models built with the lme4 package in R (Bates et al., 2014).
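The by-speaker normalization step can be sketched as follows. This is a minimal Python illustration, not the authors' script: the function name and input layout are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from collections import defaultdict
from statistics import mean, stdev

def zscore_by_speaker(tokens):
    """Z-score mean closure SoE within each speaker.

    `tokens` is a list of (speaker, soe) pairs, one per stop closure.
    Returns z-scores in the same order as the input. Each speaker needs
    at least two tokens with non-identical values, because the sample
    standard deviation (an assumption) is used as the divisor.
    """
    by_speaker = defaultdict(list)
    for speaker, soe in tokens:
        by_speaker[speaker].append(soe)
    stats = {s: (mean(v), stdev(v)) for s, v in by_speaker.items()}
    return [(soe - stats[s][0]) / stats[s][1] for s, soe in tokens]
```

Normalizing within speaker removes overall differences in SoE level and range between speakers, so that the fixed effects in the models reflect relative differences between stop categories and positions.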
3. Results
3.1 Articulation: Tongue root configuration
The SSANOVA plots comparing aspirated and unaspirated stops in utterance-initial and utterance-medial positions are presented in Figs. 1 and 2. First, the comparison of utterance-initial stops did not reveal a difference between aspirated and unaspirated stops that was consistent across speakers or places of articulation.
Fig. 2. SSANOVA plots for aspirated (green dashed lines) and unaspirated (blue solid lines) stops in utterance-medial position.
The data revealed considerable between-speaker variation as well as within-speaker variation across places of articulation. S1 did not show any differences in tongue root position between aspirated and unaspirated stops. S2 showed a more retracted tongue root in aspirated labial and alveolar stops than in unaspirated stops, but showed a reversal of this pattern in velar stops. S3 and S5 showed a more retracted tongue root when producing aspirated stops than unaspirated stops for alveolar and velar stops. Note, however, that the results for S5 are less clear-cut: the splines for S5's alveolar stops /t/ and /tʰ/ overlap along part of their length, and the velar stops show the difference only toward the very back of the tongue, which is not always reliable. Both S3's and S5's unaspirated labial stops showed tongue body lowering, which can also be understood as a strategy to enlarge the cavity (Ahn, 2018a; Svirsky et al., 1997). Finally, S4 showed no distinction in tongue root position for bilabial stops and had distinct tongue root positions for /t k/ and /tʰ kʰ/, but in the opposite of the expected direction: the unaspirated stops had the more retracted tongue root.
Utterance-medial stops also showed no consistent pattern across speakers, with each speaker showing a different pattern of tongue root position differences, as demonstrated in the SSANOVA plots in Fig. 2. In most cases, aspirated and unaspirated stops did not differ in their tongue root configurations. However, there were some cases in which aspirated stops showed tongue root retraction relative to unaspirated stops, such as S1's alveolar stops, S3's alveolar and velar stops, and S5's alveolar stops. As with the utterance-initial data, there were occasional reversals of this pattern: S2's aspirated alveolar stops showed tongue root advancement compared to their unaspirated counterparts. S3's labial stops exhibited lower tongue body position in the aspirated stops compared to unaspirated stops.
The articulatory findings are summarized in Table 1. S3, who is bilingual in Hangzhou Wu, shows a consistent difference in tongue position between the unaspirated and aspirated stops: in both positions, the tongue body is raised (for /pʰ/) or the root is retracted (for /tʰ kʰ/) in aspirated stops compared to unaspirated stops. S5 shows the same pattern utterance-initially, but only S5's alveolar stops show a comparable difference in medial position. S1 and S2 show no consistency in their lingual adjustment, and S4 exhibits reversals of the tongue retraction patterns. To summarize, the tongue position differences between Mandarin aspirated and unaspirated stops show considerable inter-speaker variation, and no consistent relation to the production of upcoming aspiration.
Table 1. Summary of articulatory results. Tongue root advancement of x relative to y is indicated with x > y; lowering of the tongue body with ; root retraction with x < y; and apparent absence of a tongue position difference with x = y. Question marks in parentheses indicate less clear-cut cases.
Position | S1 | S2 | S3 | S4 | S5 |
---|---|---|---|---|---|
Utterance initial | p pʰ | p > pʰ | p pʰ | p = pʰ | p pʰ |
Utterance initial | t tʰ | t > tʰ | t > tʰ | t < tʰ | t > tʰ (?) |
Utterance initial | k kʰ | k kʰ | k > kʰ | k < kʰ | k > kʰ (?) |
Utterance medial | p pʰ | p pʰ | p pʰ | p pʰ | p pʰ |
Utterance medial | t > tʰ | t < tʰ | t > tʰ | t tʰ | t > tʰ |
Utterance medial | k kʰ | k kʰ | k > kʰ | k < kʰ | k kʰ |
3.2 Acoustics: Voicing during closure
We examined the effects of the stop's laryngeal category, place of articulation, tonal context, and position in utterance on its SoE using a series of linear mixed effects models. With the z-scored SoE as its dependent variable, the full model included the following factors as fixed effects: ASPIRATION (aspirated, unaspirated), TONE (high, low), POSITION (initial, medial), PLACE of articulation (labial, coronal, velar), and their interactions (all dummy-coded; boldface indicates reference level for intercepts). The random effects included by-WORD and by-SUBJECT random intercepts as well as by-SUBJECT random slopes for ASPIRATION, TONE, POSITION, and PLACE. A likelihood ratio test comparing this model with a model without the four-way interaction of ASPIRATION, TONE, POSITION, and PLACE revealed that the interaction was significant [p < 0.001]. To further investigate this interaction, post hoc pairwise comparisons were carried out using Tukey's honestly significant difference (HSD) tests in the emmeans package (Lenth et al., 2020). The mixed effects model output and the pairwise comparison results are included in the supplementary material.
Here, we focus on the pairwise comparisons involving ASPIRATION and POSITION that are directly related to our second research question. The first noteworthy pattern is that the aspirated and unaspirated stops did not differ in their SoE (p > 0.05), regardless of their position, place of articulation, and tone (see Fig. 3). In contrast, all stops were more voiced (indicated by greater SoE values) in utterance-medial positions than in utterance-initial positions (p < 0.05). This pattern held across aspirated and unaspirated stops of all places of articulation in all tonal contexts, suggesting that stops exhibited similar degrees of carryover voicing in utterance-medial intervocalic position, regardless of their laryngeal category, place of articulation, and tonal context. This finding is consistent with Deterding and Nolan (2007) in that Mandarin aspirated and unaspirated medial stops do not differ in the amount of carryover voicing.
Fig. 3. Average SoE during closure by aspiration, tone context, position, and consonant place.
4. Discussion
In this study, an ultrasound analysis revealed no consistent difference in tongue root advancement or retraction when comparing the Mandarin unaspirated and aspirated stops. An acoustic analysis found no difference between unaspirated and aspirated stops in terms of their voicing during closure in either initial or medial position. Taken together, tongue position does not appear to be consistently associated with the production of aspiration in aspirated stops, or its absence in unaspirated stops.
The acoustic data (see Fig. 3) show basically the same carryover voicing pattern as in Deterding and Nolan (2007), who found 10%–20% voicing in Mandarin intervocalic stops, regardless of aspiration. The absence of acoustic voicing differences between the aspirated and unaspirated stops may be related to the lack of tongue root differences in the intervocalic position. In addition, the differences between aspirated and unaspirated stops in intervocalic and utterance-initial positions are not consistent across speakers despite both stop series exhibiting some carryover voicing during closure in intervocalic positions. That is, some speakers show greater tongue root retraction in unaspirated stops relative to aspirated stops in medial positions than in initial positions while others show the opposite or null effects of the position (see Table 1). This suggests that there is no active tongue root advancement gesture supporting the carryover voicing during closure; the voicing exhibited during stop closure may not be actively intended by speakers and thus not enhanced through any lingual articulation.
Cross-linguistically, the VOT of posterior stops, such as velars, tends to be longer than that of anterior stops, such as bilabials (Cho and Ladefoged, 1999). In line with this cross-linguistic finding, Weller et al. (2023) report a lingual retraction associated with aspiration in Yemba, a language with four different types of stops fully specified for both voicing during closure and voiceless aspiration after release (i.e., the voiced aspirated stops exhibit a sequence of voicing during closure and voiceless aspiration). Weller et al. (2023) propose two distinct forces at play, based on acoustic correlates of tongue root position: tongue root advancement to reinforce stop closure voicing and tongue root retraction to maintain voicelessness during aspiration. An association of root retraction and aspiration (but not root advancement and voicing) was observed in the ultrasound data from the speaker analyzed in Weller et al. (2023), suggesting that this is the more robust of the two proposed patterns. This may be due to the extreme demands involved in producing Yemba's very long and consistently voiceless aspiration (Faytak and Steffman, 2023).
However, the current articulatory data (see Figs. 1 and 2) do not show a consistent tongue retraction associated with aspiration, except for one speaker, S3, who consistently retracts or lowers the tongue body during unaspirated stops. S3 has early linguistic experience with Hangzhou Wu, which has a three-way contrast between voiced, voiceless unaspirated, and voiceless aspirated stops. The voiced stops in Hangzhou Wu are devoiced in most contexts (Simmons, 1992), but instrumental investigations of closely related Wu languages have demonstrated that intervocalic voicing is common (Chen, 2010; Gao and Hallé, 2017; Shi, 2020). This suggests that S3's productions in the current study may be influenced by Hangzhou Wu; the same cannot be said for the other four speakers' co-dominant dialects, which lack a stop voicing contrast (Ding, 1998; Kurpaska, 2010, pp. 192–195). While this possible transfer among homologous stop categories (e.g., Flege et al., 2003; MacLeod and Stoel-Gammon, 2009; Sancier and Fowler, 1997; Sundara et al., 2006) presents an interesting avenue for future research, it suggests, for the purpose of the present study, that the Mandarin aspiration contrast does not have a consistent relation with the tongue root configuration. In fact, Mandarin is not the only language that lacks this relation; no consistent tongue position difference was found between voiceless unaspirated and aspirated stops in Kalasha (Hussain and Mielke, 2020), Thai (Ahn, 2018b), Hindi (Ahn, 2018b), and Seoul Korean (Kwon and Ahn, 2020).
On the other hand, English “voiced” (short-lag VOT) stops have a more advanced tongue root than “voiceless” (long-lag VOT) stops (e.g., Ahn, 2018b), which could arguably be understood either as root advancement associated with phonological (but not phonetic) voicing in /b d g/ or as retraction due to aspiration in /p t k/. Our data suggest that Mandarin stops do not exhibit tongue root retraction due to aspiration, adding to existing evidence that the patterns observed in English may be advancement due to phonological voicing rather than retraction due to aspiration. The lack of difference in acoustic voicing during Mandarin stops also distinguishes Mandarin from English, further suggesting that the tongue root difference in English may be due to underlying voicing, even though the voiced stops are typically devoiced in initial positions.
Taking all these previous findings together, the relation between lingual articulation and aspiration remains ambiguous. The current study, aiming to clarify this ambiguity, provides an additional piece of data from Mandarin, a language that has not yet been explored regarding the relation between lingual articulation and laryngeal contrast. Our findings suggest that Mandarin aspirated and unaspirated stops do not differ in their tongue root configuration and degree of acoustic voicing. More articulatory studies on languages with laryngeal contrasts not defined by voicing distinction are needed to further elucidate the role of lingual articulation in reinforcing laryngeal contrast.
Supplementary Material
See the supplementary material for a complete list of stimuli, the mixed effects model output, the pairwise comparison results, and spreadsheets containing articulatory and acoustic data.
Acknowledgments
The data were collected when S.A. and M.F. were affiliated with the Department of Linguistics at the University of California, Los Angeles. We are grateful to Pat Keating, the members of the UCLA Phonetics Lab, and the UCLA Linguistics department for their support. We also thank the anonymous reviewers and the editor for their contributions. This work has benefited from invaluable comments from the audience at the 179th Meeting of the Acoustical Society of America and Ultrafest IX. This research was supported by funding from the University of Ottawa to S.A.
Author Declarations
Conflict of Interest
The authors have no conflicts to declare.
Ethics Approval
All recruitment and experimental procedures for this study were reviewed and approved by the North Campus institutional review board at the University of California, Los Angeles (IRB #19–000221).
Data Availability
The data that support the findings of this study are available within its supplementary material.