Adding to limited research on clear speech in tone languages, productions of Mandarin lexical tones were examined in pentasyllabic sentences. Fourteen participants read sentences imagining a hard-of-hearing addressee or a friend in a casual social setting. Tones produced in clear speech had longer duration, higher intensity, and larger F0 values. This style effect was rarely modulated by tone, preceding tonal context, or syllable position, consistent with an overall signal enhancement strategy. Possible evidence for tone enhancement was observed only in one set of analyses for F0 minimum and F0 range, contrasting tones with low targets and tones with high targets.

To facilitate speech intelligibility in adverse communicative contexts including a noisy background, insufficient language proficiency of a listener, or hearing loss, adult speakers may switch to a clear speech style, which involves adjustments such as slower speaking rate, increased segmental duration, expanded vowel space, and F0 modifications.1–5 Prior research has generally focused on non-tone languages where F0 patterns are not lexically contrastive but serve to indicate stressed or pitch-accented syllables as related to utterance-level prosody, rhythm and intonation. In these languages, F0 modifications were found to be used in clear speech with substantial individual variation: for example, some speakers had a larger F0 range in their clear-speech style than in conversational style, whereas others made no such difference.1,2,4 In Croatian, a pitch-accent language, F0 high targets were raised while F0 low targets remained stable, leading to the unidirectional expansion of F0 range by 15–21 Hz or 1.26–1.48 semitones, depending on speaker's gender.6 Whether speakers of tone languages modify F0 range or other F0-related characteristics in clear-speech style is not well understood.

This study contributes to a small body of research on clear speech adjustments in tone languages where F0 contours are lexically contrastive at the syllable level (e.g., Mandarin mī “meow,” mí “a fan,” mǐ “rice,” mì “honey”), in addition to the utterance-level pragmatic functions of F0 common in all types of languages (e.g., focus, emotion expression). The effect of clear speech on F0 contours was investigated in productions of four Mandarin lexical tones: a high-level tone (T1), a mid-rising tone (T2), a low-dipping tone (T3), and a high-falling tone (T4). The lexical tone of a syllable is the most important determining factor for F0 contours; however, the contours are also influenced by the tones of adjacent syllables, particularly the preceding one.7 Most of the previous research on acoustic correlates of clear speech, however, was focused on monosyllables. For example, Tupper and colleagues8 elicited clear speech in a simulated ASR context and compared acoustic characteristics of tones in clear- and plain-style productions of monosyllabic, isolated words (“casual” will be used hereinafter to refer to speech contrasted with “clear”). The strongest effects of clear speech were an increase in duration and intensity across four tones and speakers. As for the F0-related effects, they varied by tone. Mean F0 increased in clear-speech productions of Tones 2 and 3 only. F0 range largely remained unchanged, showing a lengthening of tonal contours without a substantial change in slope (with the exception of Tone 4, which showed a slope decrease). Cooper et al.9 compared clear- and casual-style productions of monosyllabic words embedded in sentences in another tone language, Cantonese. The strongest effects of clear speech were an increase in vowel durations and dispersion. F0, F0 range and tonal dispersion did not differ between styles, with only one F0 adjustment being observed in rising contour tones (namely, an increase in F0 offset of clear-style productions).

In research on child-directed speech, which can be construed as a type of clear speech, Wong10 investigated productions of Mandarin tones in mono- and disyllabic environments by mothers participating in a picture-naming task with or without their children. F0-related acoustic measures (e.g., F0 mean, maximum, and minimum; F0 range and slope) were found to be larger in child- than adult-directed speech across four tones and syllable positions, but duration was not different. The acoustic measures were larger in monosyllabic than in disyllabic words, and in the first than in the second syllable of disyllabic words, suggesting that syllable context should be controlled in study comparisons.

One direction in research on speakers' strategies for production of clear speech has been centered on the question of whether speakers are trying to enhance overall speech intelligibility for listeners by enhancing some general signal characteristics (e.g., rms intensity, articulation rate, F0 mean, frequent pausing, lengthening at the prosodic boundaries) or by enhancing phonological contrasts (e.g., vowel or tone dispersion; F0 range, VOT, or creakiness modifications to further contrast segments or tones).6,11 Signal enhancement would look similar across typologically different languages; contrast enhancement would be language-specific. In their overview article, Smiljanić and Bradlow5 concluded that “clear speech is in part guided by a contrast enhancement principle that maximizes the distance between language-specific contrasting sound categories and makes prosodic structure more salient”; however, they also noted that studies on languages other than English were sparse. The few prior studies of clear-speech effects on F0 contours of lexical tones suggested that speakers’ strategy is overall signal enhancement rather than tone contrast enhancement.6,8,9 It could also be that segmental contrast enhancements are prioritized over tonal contrast enhancement in the clear speech style.9

In this study, acoustic characteristics of Mandarin tone words embedded in multisyllabic sentences were examined in clear and casual speech styles, arguably more naturalistic data than isolated word productions.8,10 Monosyllabic and disyllabic words were examined to account for syllable position and contextual tone effects.7 The main research question was whether adult Mandarin speakers adjust clear speech productions of the four lexical tones consistent with signal enhancement (a similar increase in F0 measures, duration, and mean intensity across all tones) rather than contrast enhancement (changes in some acoustic cues exaggerating differences between tones or asymmetrical across tones and tonal environments). Signal enhancement may be inferred from a significant main effect of speech style on acoustic measures in the absence of interactions between style, tones, and environments. Tonal contrast enhancement may be inferred from style × tone or style × tone × syllable position interactions; for example, a lower F0 minimum in tones T3 and T4 involving F0 fall than in T2 involving no low target, or an asymmetrical F0 maximum difference between tones following a high-level T1 rather than a dynamic tone, or an increase in F0 range in some but not other tones.

Fourteen speakers born and raised in China, ten females and four males (31 years old, on average), were recruited as volunteers to take part in the experiment by word of mouth and through online advertisement on Itaki, a language learner's forum site. All participants reported speaking Mandarin natively; in addition, six of them spoke other Chinese languages: Sichuanese, Cantonese, Fuzhounese, and Southern Min. Seven of the participants resided in an English-speaking country at the time of the study (on average, the length of residence was 11 years and the age of arrival was 25 years). The other half of the participants had never lived in an English-speaking country.

Three-word combinations in Table 1 (e.g., māo mī ná mǎ dāo “a meowing cat takes a horse knife”) provided the stimulus sentences, for a total of 24 possible combinations. The full list of sentences is provided in Appendix A (see supplementary material or an Open Science Framework repository).11 These materials were adopted after Xu7 to elicit lexical tones in real words with controlled tonal contexts. While all stimulus sentences were grammatical, not all were semantically meaningful (see Appendix A for details). The first and the last syllable of each sentence always carried Tone 1. Syllable 2 carried all four tones (6 sentences × 4 tones); syllable 3 did not carry Tone 3, to avoid tone sandhi (8 sentences × 3 tones); syllable 4 carried only Tone 1 or Tone 3 (12 sentences × 2 tones).

Table 1.

Three-word combinations resulting in 24 pentasyllabic stimulus sentences12 (1 = high-level tone T1, 2 = mid-rising tone T2, 3 =  low-dipping tone T3, 4 = high-falling tone T4).

| Syllable 1 (Word 1) | Syllable 2 (Word 1) | Syllable 3 (Word 2) | Syllable 4 (Word 3) | Syllable 5 (Word 3) |
|---|---|---|---|---|
| mao1 “cat” | mi1 “meow” | mo1 “touches” | mao1 “cat” | mi1 “meow” |
| mao1 “cat” | mi2 “fan” | na2 “takes” | ma3 “horse” | dao1 “knife” |
| mao1 “cat” | mi3 “rice” | mai4 “sells” | | |
| mao1 “cat” | mi4 “honey” | | | |

All syllables had a CV structure with a nasal consonant in the initial position, except for one word (dāo “knife”). Sentence-initial noun phrases had the same segmental structure (mao mi); sentence-medial monosyllabic verbs and sentence-final noun phrases varied in their vowels, which was not expected to obscure any style effect in comparisons between clear and casual speech.
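As an illustration of the design in Table 1, the following minimal Python sketch enumerates the Word 1 × Word 2 × Word 3 combinations and counts how many sentences carry each tone in each syllable position. The function and variable names (`build_sentences`, `tone_counts`, `WORD1`, and so on) are hypothetical and are not taken from the study's materials.

```python
from collections import Counter
from itertools import product

# Words from Table 1, written as (syllable, tone) pairs.
WORD1 = [(("mao", 1), ("mi", tone)) for tone in (1, 2, 3, 4)]  # disyllabic noun phrases
WORD2 = [(("mo", 1),), (("na", 2),), (("mai", 4),)]            # monosyllabic verbs
WORD3 = [(("mao", 1), ("mi", 1)), (("ma", 3), ("dao", 1))]     # disyllabic noun phrases


def build_sentences():
    """Return every Word 1 x Word 2 x Word 3 combination as five (syllable, tone) pairs."""
    return [list(w1 + w2 + w3) for w1, w2, w3 in product(WORD1, WORD2, WORD3)]


def tone_counts(sentences, position):
    """Count the sentences carrying each tone at a 1-based syllable position."""
    return Counter(sentence[position - 1][1] for sentence in sentences)


sentences = build_sentences()
print(len(sentences))
for position in range(1, 6):
    print(position, dict(tone_counts(sentences, position)))
```

The counts recover the distribution stated in the text: six sentences per tone in syllable 2, eight per tone in syllable 3, and twelve per tone in syllable 4.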

Data were collected using an online experimental platform Gorilla.13 Sentences were presented twice, in two counterbalanced blocks: casual speech and clear speech. In the casual block, participants were instructed to read sentences as if they were talking to a friend (images of an outdoor restaurant setting, people laughing and sitting outside accompanied each trial). In the clear speech block, participants were instructed to read sentences as if they were talking to somebody with hearing loss (an image of a hand placed on an ear and an image of people speaking into the ear of an elderly woman accompanied each trial). Three repetitions were elicited for each sentence (N = 2016, 48 trials × 3 repetitions × 14 speakers). Trials were randomized within each block.

Data were downloaded and screened for unnatural pauses and other task performance issues. A ProsodyPro script for Praat14 was used to segment syllables and take the following measurements: duration, mean intensity, mean F0, F0 maximum, F0 minimum, F0 range, and F0 at ten equidistant intervals in each syllable. In addition, F0 slope (Hz/s) was calculated in the second part of each syllable (timepoints 5–10, or 44.4%–100% of the syllable duration) following a similar method in Wong10 and displayed in Eq. (1),
$$\mathrm{F0\ slope}_{44\text{--}100\%} = \frac{\mathrm{F0\ range}_{44\text{--}100\%}}{\mathrm{F0\ max\ time}_{44\text{--}100\%} - \mathrm{F0\ min\ time}_{44\text{--}100\%}} \tag{1}$$
This method was used to avoid contextual effect in tone slope calculations, because initial parts of F0 contours of sentence-medial syllables were expected to be influenced by preceding tones.7 A positive value of the slope indicated a rising F0 (i.e., F0 max occurred later than F0 min); a negative value indicated a falling F0 (i.e., F0 max occurred earlier than F0 min). A larger value indicated a steeper change in F0.
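The slope measure in Eq. (1) can be sketched in Python as follows. This is a minimal illustration, not the ProsodyPro implementation: the function name `f0_slope_late` is hypothetical, and the code assumes that the ten F0 samples are spaced evenly from the start (0%) to the end (100%) of the syllable, which is what places timepoint 5 at 44.4% of the duration.

```python
def f0_slope_late(f0_hz, duration_s, first_point=5):
    """F0 slope (Hz/s) over time points first_point..len(f0_hz), following Eq. (1).

    f0_hz      -- F0 samples (Hz) at equidistant points spanning the syllable
    duration_s -- syllable duration in seconds
    """
    n = len(f0_hz)
    # Assumption: point k (1-based) sits at (k - 1) / (n - 1) of the syllable duration.
    times = [duration_s * k / (n - 1) for k in range(n)]
    window = list(zip(times, f0_hz))[first_point - 1:]
    t_max, f_max = max(window, key=lambda point: point[1])
    t_min, f_min = min(window, key=lambda point: point[1])
    if t_max == t_min:
        return 0.0  # completely flat window: no F0 change to measure
    # Positive when the maximum follows the minimum (rise), negative otherwise (fall).
    return (f_max - f_min) / (t_max - t_min)


rising = [200, 200, 200, 200, 200, 210, 220, 230, 240, 250]
falling = list(reversed(rising))
print(f0_slope_late(rising, 0.27), f0_slope_late(falling, 0.27))
```

Because the F0 range in the numerator is never negative, the sign of the result comes entirely from the order in which the maximum and minimum occur, matching the interpretation given in the text.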

For all measures, outliers were removed using the algorithm in IBM SPSS Statistics (version 29), where any data value outside of the following bounds is considered to be an outlier: above [3rd quartile + 1.5 × interquartile range] or below [1st quartile − 1.5 × interquartile range]. The remaining measurements were averaged across three repetitions for each syllable, sentence, and speaker, and used in analyses. Duration, intensity, and slope values were not normalized; F0 normalization is described below.
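The conventional interquartile-range rule for outliers (values below the 1st quartile minus 1.5 × IQR or above the 3rd quartile plus 1.5 × IQR) can be sketched as below. This is an illustration only: the function names are hypothetical, and the quartile method (`statistics.quantiles` with `method="inclusive"`) is an assumption that may differ from the quartile definition SPSS applies, so boundary cases could be classified differently.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) outlier bounds: Q1 - k * IQR and Q3 + k * IQR."""
    # Assumption: inclusive quartiles; SPSS may compute quartiles differently.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values, k=1.5):
    """Keep only the values that fall within the fences."""
    lower, upper = tukey_fences(values, k)
    return [value for value in values if lower <= value <= upper]


durations_ms = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # made-up values for illustration
print(tukey_fences(durations_ms))
print(drop_outliers(durations_ms))
```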

To plot and visually compare F0 contours, F0 values at ten equidistant points in a syllable were normalized by using the log transformation formula (2), where x represents the observed F0, max and min are F0 maximum and F0 minimum of a speaker across all their productions.8,15 This transformation accounts for F0 range differences among speakers because “the scale of normalized F0 values is determined by the more extreme productions,” that is, by speaker's F0 maximum and minimum values across speech styles:8 
$$T = 5 \times \frac{\log_{10} x - \log_{10} \mathrm{max}}{\log_{10} \mathrm{max} - \log_{10} \mathrm{min}} \tag{2}$$
For statistical analyses of F0 mean, minimum, maximum, and range, a normalization procedure different from Eq. (2) was used to avoid scaling most of these measures (except for the first one) in relation to themselves (i.e., F0 extremes). For methodological comparability to previous research on clear speech, formula (3) was used to convert these hertz values to semitones and thus to account for F0 range differences among speakers,6,10
$$\mathrm{F0}_{\mathrm{semitone}} = 12 \times \frac{\ln\left(\mathrm{F0}_{\mathrm{Hz}} / 50\right)}{\ln 2} \tag{3}$$
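The semitone conversion in Eq. (3) can be written as a short Python sketch. The function names are hypothetical; the 50 Hz reference is the constant that appears in Eq. (3).

```python
import math


def hz_to_semitones(f0_hz, ref_hz=50.0):
    """Convert F0 in Hz to semitones relative to ref_hz, following Eq. (3)."""
    return 12.0 * math.log(f0_hz / ref_hz) / math.log(2.0)


def interval_semitones(f0_a_hz, f0_b_hz):
    """Distance between two F0 values in semitones; the reference cancels out."""
    return hz_to_semitones(f0_a_hz) - hz_to_semitones(f0_b_hz)


# Mean F0 in syllable 2 reported in the text: 246 Hz (clear) vs 230 Hz (casual).
print(interval_semitones(246, 230))
# Hypothetical lower-pitched voice with the same 16 Hz difference.
print(interval_semitones(131, 115))
```

The same 16 Hz difference corresponds to about 1.16 semitones at 230 Hz but about 2.26 semitones at 115 Hz, which illustrates why a logarithmic scale is used to compare speakers with different F0 ranges.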

Mixed-effects linear regression modeling was conducted in IBM SPSS Statistics (version 29) to examine the effect of speech style on tone productions, with speech style and tone (also, with preceding tone in Sec. 3.2 and syllable position in Sec. 3.3) as fixed factors, and speaker as a random factor (a random intercept was included in all models). The effects of tone and speaker were significant in all analyses, but are not detailed below, as our focus is on overall differences between clear and casual speech styles.

Characteristics of lexical tones were analyzed first when produced in the same syllable position and the same preceding tone, that is, in syllable 2 (4 tones × 6 sentences, Sec. 3.1). Next, the analyses were repeated in syllable 3, in a less controlled environment where the preceding tone varied (3 tones × 4 preceding tones × 2 sentences, Sec. 3.2). Last, analyses were repeated by tone across different sentence-medial syllable positions and tonal contexts (T2: 14 sentences; T3: 18 sentences; T4: 14 sentences, Sec. 3.3). Syllables 1 and 5 were not included in any of the aforementioned analyses because they carried T1 only by stimulus design.

F0 contours in syllable 2 presented the most contextually controlled tone environment: the preceding syllable always had level T1 (time intervals 1–10 in Fig. 1), while tones in syllable 2 varied (time intervals 11–20). Figure 1 shows that F0 contours were uniformly higher in clear than casual speech. Across tones, syllable 2 in clear speech was longer (339 ms vs 294 ms); higher in mean F0 (246 Hz vs 230 Hz), F0 max (287 Hz vs 263 Hz), and F0 min (196 Hz vs 189 Hz); and had a larger F0 range (89 Hz vs 74 Hz). T4 range appears to be slightly larger in clear than casual speech in Fig. 1, but that difference was not significant (see below). Last, there was no difference in mean intensity between the styles. Tables 1–4 in Appendix B12 provide summaries of untransformed data for female and male speakers; supplemental audio files illustrate individual variation among speakers in using different clear-speech strategies.

Fig. 1.

The effect of speech style on F0 patterns of sentence-initial disyllabic noun phrases (across twenty time-normalized points).

Fig. 2.

Effects of speech style and preceding tone on F0 patterns of sentence-medial monosyllabic words (across ten time-normalized points).

The results of seven mixed-effects linear regression analyses on measures in syllable 2 are summarized in Table 2. These results suggest that clear speech effects were observed for all tones in duration- and F0-related measures. With the exception of the F0 slope44%–100% measure, the interactions between style and tone were not significant, showing that the style effect was similar across the four target tones. Table 4 in Appendix B (supplementary material)12 suggests that the change in F0 slope44%–100% was steeper in clear than casual speech in Tone 1 and Tone 2 contours, that is, in tones that do not involve phonological low-pitch targets.

Table 2.

Effects of speech style and tone on seven measures in syllable 2 (pairwise comparisons with Bonferroni corrections).

| Measure (unit) | Speech style | Tone | Style × tone |
|---|---|---|---|
| Duration (ms) | F(1,632) = 131.80, p < 0.001; Clear > Casual | F(3,632) = 13.18, p < 0.001; T2 > T4 = T3 > T1 | n.s. |
| Intensity (dB) | n.s. | F(3,632) = 29.83, p < 0.001; T1 = T4 > T2 > T3 | n.s. |
| F0 mean (st) | F(1,632) = 80.30, p < 0.001; Clear > Casual | F(3,632) = 257.43, p < 0.001; T1 = T4 > T2 = T3 | n.s. |
| F0 range (st) | F(1,641) = 36.57, p < 0.001; Clear > Casual | F(3,641) = 928.76, p < 0.001; T3 = T4 > T2 > T1 | n.s. |
| F0 max (st) | F(1,631) = 139.63, p < 0.001; Clear > Casual | F(3,631) = 44.42, p < 0.001; T4 > T3 > T2 = T1 | n.s. |
| F0 min (st) | F(1,632) = 7.31, p < 0.001; Clear > Casual | F(3,632) = 371.40, p < 0.001; T1 > T2 > T4 > T3 | n.s. |
| F0 slope44%–100% (Hz/ms) | n.s. | F(3,632) = 192.45, p < 0.001; T2 > T1 > T3 > T4 | F(3,632) = 3.19, p = 0.023 |

In syllable 3, only T1, T2, and T4 were represented, to avoid tone sandhi before a following Tone 3. The preceding tone in syllable 2 influenced F0 patterns, as shown by differently colored lines in Fig. 2. In each panel of Fig. 2, contextual (preceding) tone effects can be observed to some degree for the whole duration of syllable 3. However, these contextual tone effects seem to be, overall, similar for each target tone (1, 2, or 4) across speech styles.

The results of mixed-effects linear regression analyses on measures in syllable 3 are summarized in Table 3. In comparison with the model used for syllable 2 (Sec. 3.1), preceding tone was also included here as a fixed factor in each model, as well as all two- and three-way interactions among the terms (i.e., a full model). In Table 3, only interactions of interest for our research question, that is the interactions between style and other factors, are displayed. Clear speech effects were observed in all measures but F0 slope: in comparison to casual, syllable 3 in clear speech was significantly longer (333 vs 294 ms) and louder (74.2 vs 72.7 dB); higher in mean F0 (234 vs 216 Hz), F0 max (266 vs 242 Hz), and F0 min (187 vs 178 Hz); and had a larger F0 range (79 vs 64 Hz). The two interactions of interest were not significant for any measure, indicating that the style effect was similar across the three target tones even when the preceding tones varied.

Table 3.

Effects of speech style, tone, and preceding (pre) tone on seven measures in syllable 3 (pairwise comparisons for tones with Bonferroni corrections).

| Measure (unit) | Speech style | Tone | Pre Tone | Style × Tone, Style × Tone × pre Tone |
|---|---|---|---|---|
| Duration (ms) | F(1,616) = 86.96, p < 0.001 | F(2,616) = 4.65, p < 0.001; T1 > T4 = T2 | n.s. | n.s., n.s. |
| Intensity (dB) | F(1,616) = 95.67, p < 0.001 | F(2,616) = 60.02, p < 0.001; T1 = T4 > T2 | F(3,616) = 19.05, p < 0.001 | n.s., n.s. |
| F0 mean (st) | F(1,616) = 97.39, p < 0.001 | F(2,616) = 453.04, p < 0.001; T1 = T4 > T2 | F(3,616) = 79.07, p < 0.001 | n.s., n.s. |
| F0 range (st) | F(1,616) = 69.96, p < 0.001 | F(2,616) = 187.46, p < 0.001; T4 > T2 > T1 | F(3,616) = 17.69, p < 0.001 | n.s., n.s. |
| F0 max (st) | F(1,616) = 125.22, p < 0.001 | F(2,616) = 191.30, p < 0.001; T4 > T1 > T2 | F(3,616) = 49.90, p < 0.001 | n.s., n.s. |
| F0 min (st) | F(1,616) = 14.78, p < 0.001 | F(2,616) = 147.99, p < 0.001; T1 > T4 > T2 | F(3,616) = 84.80, p < 0.001 | n.s., n.s. |
| F0 slope44%–100% (Hz/ms) | n.s. | F(2,629) = 221.21, p < 0.001; T2 = T1 > T4 | F(3,629) = 2.92, p = 0.034 | n.s., n.s. |

With the exception of the model examining syllable 3 duration, the effect of preceding tone was significant in all models (p < 0.001 in five models; p = 0.034 for F0 slope). Figure 2 shows that the mean F0, F0 max, and F0 min were higher when the preceding tones were T1 and T2 than when the preceding tones were T3 and T4; F0 range in syllable 3 was larger when it was preceded by tones T3 or T4 than by tones T1 or T2. In addition, intensity in syllable 3 was higher when it was preceded by tones 1 and 2 (74.4 and 73.7 dB) than by the other two tones (72.6 and 73.2 dB).

This last set of analyses investigated the effect of clear speech on lexical tones in the most generic way, across different sentence-medial syllables and preceding tonal contexts. Note that due to the stimulus design, all tones were represented in syllable 2; T1, T2, and T4 in syllable 3; and T1 and T3 only in syllable 4.

The syllable position influenced F0 patterns as shown by differently colored lines in Fig. 3. All tones in four columns of the figure have a larger F0 mean and F0 max in syllable 2 position where they were preceded by the level T1 (black solid lines), than in other syllables, that is where the preceding tonal context varied. As for F0 minimum, it appears that it was also higher in syllable 2 than the other syllables for T1 and T2 which have high-tone targets. T3 and T4 which have low tone targets, however, seem to have a similar F0 min regardless of their syllable position (and preceding tonal context), or even a lower F0 min in syllable 2, when preceded by the level T1. A visual comparison of the top and bottom rows suggests that with the exception of the F0 minimum variation described above, the effect of speech style was largely similar across all target tones: clear-speech contours tended to have higher F0 values than casual-speech contours.

Fig. 3.

Normalized F0 patterns by tone across sentence-medial syllables 2–4.

The results of statistical analyses are summarized in Table 4. The main effects of speech style, tone, and syllable position were significant. Interestingly, among the interactions involving the speech style factor, only two were significant: the style by tone interaction for the F0 minimum measure and for the F0 range measure. The difference between F0 minimum in clear and casual speech styles was 15 Hz for T1, 10 Hz for T2, 2 Hz for T3, and 6 Hz for T4. The difference between F0 range in clear and casual speech styles was 7 Hz for T1, 14 Hz for T2, 10 Hz for T3, and 22 Hz for T4. These findings suggest that while the effect of style was unidirectional across tones and syllable positions (even F0 minimum and F0 range were larger in clear than in casual speech), the style difference was relatively small for F0 minimum in T3 and T4, and it was relatively large for F0 range in T4, as compared to other tones.

Table 4.

Effects of speech style, tone, and syllable position on seven measures in sentence-medial syllables S2, S3, and S4.

| Measure (unit) | Speech style | Tone | Syllable position | Style × Tone, Style × Tone × Syl. position |
|---|---|---|---|---|
| Duration (ms) | F(1,1941) = 290.86, p < 0.001 | F(3,1941) = 5.38, p = 0.001; T2 = T4 > T1 > T3 | F(2,1941) = 35.02, p < 0.001; S2 = S3 > S4 | n.s., n.s. |
| Intensity (dB) | F(1,1927) = 77.07, p < 0.001 | F(3,1927) = 108.73, p < 0.001; T1 = T4 > T2 > T3 | F(2,1927) = 13.54, p < 0.001; S3 > S2 > S4 | n.s., n.s. |
| F0 mean (st) | F(1,1928) = 125.87, p < 0.001 | F(3,1928) = 259.56, p < 0.001; T4 > T1 > T2 = T3 | F(2,1927) = 104.72, p < 0.001; S2 > S3 > S4 | n.s., n.s. |
| F0 range (st) | F(1,1919) = 80.95, p < 0.001 | F(3,1919) = 699.69, p < 0.001; T3 = T4 > T2 > T1 | F(2,1919) = 32.07, p < 0.001; S2 > S3 > S4 | F(2,1919) = 3.68, p = 0.012, n.s. |
| F0 max (st) | F(1,1924) = 189.06, p < 0.001 | F(3,1924) = 65.63, p < 0.001; T4 > T3 = T1 = T2 | F(2,1924) = 146.49, p < 0.001; S2 > S3 > S4 | n.s., n.s. |
| F0 min (st) | F(1,1928) = 33.05, p < 0.001 | F(3,1919) = 583.34, p < 0.001; T1 > T2 = T4 > T3 | F(2,1928) = 108.96, p < 0.001; S2 > S3 > S4 | F(2,1928) = 3.95, p = 0.008, n.s. |
| F0 slope44%–100% (Hz/ms) | n.s. | F(3,1918) = 225.81, p < 0.001; T2 > T1 > T3 > T4 | F(2,1918) = 10.48, p < 0.001; S3 > S2 = S4 | n.s., n.s. |

This study contributes to filling “the paucity of data on clear speech production” in languages other than English,6 especially, tone languages, by investigating the acoustic characteristics of Mandarin tone words embedded in sentences. In comparison to previous studies of clear-speech modifications in lexical tone production,8,10 the effects of preceding tonal context and syllable position on F0 patterns were also investigated here, furthering our understanding of clear speech beyond isolated tone words.

Using seven acoustic measures that proved to be effective in previous research, we replicated many previous findings. We found that in pentasyllabic sentences, regardless of the tone type (T1–T4), preceding tonal context (T1–T4), or syllable position (S3 in monosyllabic words; S2 and S4 in the second or the first syllable of disyllabic words, respectively), Mandarin lexical tones produced in clear-speech style had longer duration; higher intensity (with an exception of S2); larger values of mean F0, F0 range, F0 maximum, and F0 minimum. These findings point to an overall acoustic signal enhancement in clear speech, consistent with the conclusion by Tupper and colleagues8 for monosyllabic Mandarin words produced in isolation in a simulated ASR context and the conclusion by Wong for monosyllabic and disyllabic Mandarin words produced in isolation in child-directed speech. These global clear speech adjustments appear to be language-independent, observed in typologically different languages for the measures of duration,16 intensity,1,4 mean or median F0,2,16 and F0 range.2,4,6 Differently from previous studies,8 the F0 slope measure was not particularly useful in distinguishing clear speech modifications, although it did distinguish lexical tones well.

In all three sets of analyses reported in this study, clear speech effects were similar across all four tones (i.e., non-significant style × tone, style × tone × preceding tone, style × tone × syllable position interactions) with three exceptions involving the F0 slope measure in syllable 2, and the F0 minimum and F0 range measures in syllables 2–4. First, the difference between F0 slope44%–100% in clear and casual speech was larger when syllable 2 had a high-level Tone 1 or a mid-rising Tone 2 than the other two tones. Second, the difference between F0 minimum in clear and casual speech was larger, again, when syllables 2–4 had T1 or T2, which do not have low tone targets, than in the low-dipping T3 and the high-falling T4 syllables, which do have low tone targets. Third, the difference between F0 range in clear and casual speech was the largest for the high-falling T4 in syllables 2–4. These results suggest that low F0 targets are maintained in both speech styles, while F0 range is expanded by the means of increasing F0 maximum, similar to findings in the Croatian language.6 They could be interpreted as phonological contrast enhancement, contrasting low-target Tones 3–4 and high-target Tones 1–2. It is unclear, however, why these variations in style effect by tone were not observed in the syllable 3 analysis (monosyllabic verbs; Sec. 3.2). Recall that by stimulus design, the distribution of tones across syllables 2–4 was unequal; for example, T3 was not represented in syllable 3. It is possible that this imbalance in tone representation in different word structures and, relatedly, prosodic structures (e.g., consider intonational phrasing or focus) resulted in some significant but inconsistent style-by-tone interactions. For example, Wong suggested “a general trend for acoustic measures to be higher in monosyllabic words than in the first and second syllables of disyllabic words,”10 and such trends might have influenced our findings. 
It is also possible that, compared to segmental contrast enhancements in clear speech in tonal and non-tonal languages alike (e.g., vowel contrasts implied from the vowel space expansion in Cantonese,9 English,2,3,6,16 Spanish,3 and Croatian6 clear speech; VOT-based contrasts in English15), tonal contrast enhancements are not prioritized in clear speech.6 Inconsistent style-by-tone effects might be a result of this strategy, further influenced by individual variation, which was not directly investigated in this study but was observed via the consistently significant effect of speaker in mixed modeling in all analyses, and via observations of individual differences in clear-speech enhancement strategies (Appendix B and accompanying audio files12).

One measure that is often used in clear-speech research is F0 range. For tonal languages, if F0 range is expanding more in falling than rising tones, or if it is expanding via F0 minimum in low-target tones but via F0 maximum in high-target tones, this could be interpreted as tonal contrast enhancement, but it can be observed only via an examination of style-by-tone interactions such as described above. Tupper and colleagues8 found no effect of clear speech in Mandarin on F0 range of lexical tones (with one possible exception of Tone 4) and a limited effect on F0 mean and F0 slope adjustment (Tones 2 and 3 only). They concluded that their speakers relied consistently on duration and intensity modifications for clear speech rather than F0-related modifications. This study's results show that in multi-syllabic Mandarin utterances, F0-related adjustments were made for tonal contours in clear speech, similar to F0-related adjustments in child-directed speech by Wong.10 Inconsistent results among these three studies of Mandarin tone production are likely attributable to methodological choices: multisyllabic vs monosyllabic utterances and clear-speech elicitation techniques. These could be determining which acoustic measures effectively differentiate clear from casual speech in the data corpus. Last, we could note that the average F0 range expansion in lexical tones in this study (13 Hz in syllable 2, 15 Hz in syllable 3, and 13 Hz in syllables 2–4) was comparable to the average expansion of 18 Hz in non-tonal languages like English and Croatian.6 

In addition to the results related to clear speech in this study, readers may be interested in the findings related to the role of the target tone type and the preceding tone type on the duration, intensity, and F0-related characteristics of tonal contours (as detailed in Tables 2–4). Possible limitations to our results and their interpretation include a medium sample size (fourteen speakers), the choice of stimulus sentences after Xu7 and, consequently, an unbalanced number of tones across sentence-medial syllables 2–4. Future work should also examine individual variation in clear speech modifications of lexical tones and cue weighting. Our observations in this study suggest that for clear speech production, speakers may employ “do little,” “do longer,” “do louder,” “do higher,” and/or “enhance a low-target tone” strategies to different degrees (Appendix B12). Last but not least, perception research needs to demonstrate that clear-speech modifications facilitate intelligibility, reduce listener effort, or serve as speech enhancement for listener benefit in some other way.

Among different acoustic adjustments in clear-speech style, prosodic modifications are used by speakers of tone languages such as Mandarin, similar to speakers of stress languages such as English. Most clear-speech modifications of Mandarin lexical tones suggested shifting of F0 contours to a higher F0 range, more so for F0 maxima of high-tone targets than for F0 minima of low-tone targets, as reflected in F0 range expansion. This strategy may not be used by all tone-language speakers and in all prosodic positions, but it is substantial enough to reveal itself as a significant trend in tone-word production. The relative benefit of this strategy for lexical tone distinction and overall intelligibility in clear speech is yet to be investigated for tone languages.

See the supplementary material for the list of stimulus sentences (Appendix A), means of individual untransformed data (Appendix B), and audio examples of clear-speech strategies.

This research was supported by an LSU ASPIRE grant. We are grateful to Wayne Lian and Kevin Yau for their help with Mandarin instructions and recruitment materials; Gabe Feinn for beta-testing the online project; Yi Xu for customized Praat scripts; Yejun Wu for help with English translation of stimulus sentences; and the Editor and two reviewers for their valuable feedback on the manuscript.

The authors have no conflicts of interest to disclose.

The supporting data are available in the Open Science Framework repository at https://osf.io/cv6kp/.

1. M. A. Picheny, N. I. Durlach, and L. D. Braida, “Speaking clearly for the hard of hearing. II. Acoustic characteristics of clear and conversational speech,” J. Speech. Lang. Hear. Res. 29, 434–446 (1986).
2. A. R. Bradlow, N. Kraus, and E. Hayes, “Speaking clearly for learning-impaired children: Sentence perception in noise,” J. Speech. Lang. Hear. Res. 46, 80–97 (2003).
3. A. R. Bradlow, “Confluent talker- and listener-oriented forces in clear speech production,” Lab. Phon. 7, 241–274 (2008).
4. H. J. Han, B. Munson, and R. S. Schlauch, “Fundamental frequency range and other acoustic factors that might contribute to the clear-speech benefit,” J. Acoust. Soc. Am. 149, 1685–1698 (2021).
5. R. Smiljanić and A. Bradlow, “Speaking and hearing clearly: Talker and listener factors in speaking style changes,” Language Linguist. Compass 3, 236–264 (2009).
6. R. Smiljanić and A. Bradlow, “Production and perception of clear speech in Croatian and English,” J. Acoust. Soc. Am. 118, 1677–1688 (2005).
7. Y. Xu, “Effects of tone and focus on the formation and alignment of f0 contours,” J. Phonetics 27, 55–105 (1999).
8. P. Tupper, K. W. Leung, Y. Wang, A. Jongman, and J. A. Sereno, “The contrast between clear and plain speaking style for Mandarin tones,” J. Acoust. Soc. Am. 150, 4464–4473 (2021).
9. A. Cooper, A. Bradlow, and Y. Wang, “Acoustic-phonetic consequences of clear speech in Cantonese vowels and lexical tones,” J. Acoust. Soc. Am. 144, 1719 (2018).
10. P. Wong, “Mothers do not enhance tonal contrasts in child-directed speech: Perceptual and acoustic evidence from child-directed Mandarin lexical tones,” J. Acoust. Soc. Am. 143, 3169–3183 (2018).
11. A. R. Bradlow and T. Bent, “The clear speech effect for non-native listeners,” J. Acoust. Soc. Am. 112, 272–284 (2002).
12. I. A. Shport and J. Rittenberry, “Clear speech effects in production of sentence-medial Mandarin lexical tones,” OSF, osf.io/cv6kp (2024).
13. A. L. Anwyl-Irvine, J. Massonié, A. Flitton, N. Kirkham, and J. K. Evershed, “Gorilla in our midst: An online behavioral experiment builder,” Behav. Res. 52, 388–407 (2019).
14. Y. Xu, “ProsodyPro—A tool for large-scale systematic prosody analysis,” in Proceedings of Tools and Resources for the Analysis of Speech Prosody, Aix-en-Provence, France (August 30, 2013), pp. 7–10.
15. Y. Wang, A. Jongman, and J. A. Sereno, “Acoustic and perceptual evaluation of Mandarin tone productions before and after perceptual training,” J. Acoust. Soc. Am. 113, 1033–1043 (2003).
16. S. Granlund, V. Hazan, and R. Baker, “An acoustic-phonetic comparison of the clear speaking styles of Finnish-English late bilinguals,” J. Phonetics 40, 509–520 (2012).
