This study investigates how California English speakers adjust nasal coarticulation and hyperarticulation on vowels across three speech styles: speaking slowly and clearly (imagining a hard-of-hearing addressee), casually (imagining a friend/family member addressee), and speaking quickly and clearly (imagining being an auctioneer). Results show covariation in speaking rate and vowel hyperarticulation across the styles. Results also reveal that speakers produce more extensive anticipatory nasal coarticulation in the slow-clear style, in addition to a slower speech rate. These findings are interpreted in terms of accounts of coarticulation in which speakers selectively tune their production of nasal coarticulation based on speaking style.
1. Introduction
One of the goals of speech communication is intelligibility. Thus, people make different acoustic-phonetic adjustments to be better understood across contexts, such as for different types of listeners (e.g., hard-of-hearing listeners; Picheny et al., 1985, 1986; Scarborough and Zellou, 2013), in background noise (Brumm and Zollinger, 2011), and following a listener's misunderstanding (Cohn et al., 2022). Collectively, the types of adjustments made to improve intelligibility are known as “clear speech,” which is often contrasted with “plain” or “casual” speech like that produced with friends or family members in the absence of a communicative barrier (Bradlow, 2002; Cohn et al., 2021b; Smiljanić and Bradlow, 2007). Perception experiments have shown that the clear speech adjustments speakers make are often helpful for listeners (Aoki et al., 2022; Bradlow and Bent, 2002; Kangatharan et al., 2023; Smiljanić and Bradlow, 2005).
Cross-linguistically, talkers make a range of acoustic modifications when “speaking clearly”; clear speech is often slower and louder, has a higher and wider pitch range, and contains greater vowel space expansion (Smiljanić and Bradlow, 2005; Tang et al., 2017). For example, when imagining talking to an adult with a hearing impairment, US English speakers produce slower speech with more vowel space expansion and less coarticulatory overlap, making each segment more distinct, compared to imagining talking to a friend (Scarborough and Zellou, 2013). One framework for understanding intelligibility-motivated phonetic variation is the Hyper- and Hypo-Articulation (H&H) theory (Lindblom, 1990), which proposes that speakers aim to conserve articulatory effort, producing greater hyperarticulation only when needed (e.g., after an error or for a listener who might have trouble understanding). One question about the H&H model is whether speakers selectively fine-tune different acoustic features when producing hyperarticulated forms, depending on their goals or the type of communicative context, or whether clear speech involves a global hyperarticulation setting.
Examining coarticulation, or temporal overlap of gestural features in the speech signal, is one way to investigate this question. Speech rate accounts of coarticulation (Agwuele et al., 2009; Lindblom et al., 2009; Moon and Lindblom, 1994) propose that the relationship between coarticulatory overlap and hyperarticulation is the result of a speech rate change: Slower speech produces more hyperarticulation and less coarticulation, while faster speech produces the opposite pattern. This has been observed in studies of hypoarticulation, or more reduced speech (Farnetani and Recasens, 2010), which often consists of faster speaking rate, reduced vowel space expansion, and greater coarticulation as articulatory gestures overlap in time (Lindblom et al., 2009). However, a reduction in coarticulation in communicatively challenging contexts is at odds with perception findings, since increased coarticulation can improve intelligibility by making acoustic cues to the word more robust and temporally distributed. For example, listeners are better able to recognize a nasal coda (e.g., “bone” vs “bode”) when hearing vowels that contain more extensive anticipatory coarticulatory cues (Beddor et al., 2013; Zellou and Dahan, 2019). Listener-oriented accounts of coarticulation propose that speakers fine-tune their produced degree of coarticulation based on real (or assumed) demands on their listener (Scarborough, 2013; Scarborough and Zellou, 2013; Zellou and Scarborough, 2015). For instance, speakers produce greater coarticulatory nasalization on words that could be harder for a listener to comprehend, such as those with more phonological competitors (Scarborough, 2013) and words acquired later when talking to an infant (Zellou and Scarborough, 2015). Furthermore, work has shown that speakers selectively enhance vowel hyperarticulation while maintaining degree of coarticulation (Bradlow, 2002; Cohn et al., 2022; Matthies et al., 2001), suggesting the two features can be independently targeted.
The current study tests the degree to which speakers adjust clear speech adaptations across styles varying in speech rate: speaking slowly and clearly (imagining a hard-of-hearing addressee) or speaking quickly and clearly (imagining they are an auctioneer), relative to speaking casually (imagining a friend/family addressee). Clear styles (slow-clear, fast-clear) are predicted to show greater vowel hyperarticulation compared to the casual style. There are two possibilities for coarticulation: On the one hand, a speech rate prediction is that the slower rate in slow-clear speech will result in less coarticulatory overlap. On the other hand, a listener-oriented prediction is that coarticulation will increase in the slow-clear condition.
2. Methods
2.1 Target words
Target words consisted of six CVC-CVN-NVN minimal triplets with non-high vowels (to avoid issues in measuring acoustic nasality with vowel height; Chen, 1997): bad-ban-man, bed-Ben-men, bud-bun-munn, bod-bon-monn, bode-bone-moan, bade-bane-mane.
2.2 Participants and procedure
Participants (n = 27)1 were all native California English speakers (11 female, 14 male, 2 nonbinary; age range = 20–29 years; mean age = 23.59 ± 2.74 years) recruited from Prolific, all of whom indicated they grew up in California and were currently living in California. The majority (n = 20) were monolingual English speakers, while several indicated first languages in addition to English (one Gujarati, one Japanese, one Tagalog, three Spanish, one Vietnamese). All participants completed informed consent to participate and were compensated with $5.25. The study was approved by the UC Davis Institutional Review Board.
Participants completed the study from a quiet room in their homes via a Qualtrics survey. Before the experimental trials, participants completed a sound recording check, where they produced a target phrase on the screen (“This is a test of the sound system.”). Sound was recorded with Pipe2 (as a .wav file, sampling rate = 48 kHz) embedded in Qualtrics.
Next, they were shown the target word list and told that they would be reading these words aloud. In case participants were not familiar with the nonwords, additional information was provided (Note that two of the words, “munn” and “monn,” are nonwords. They rhyme with the other words: “munn” rhymes with “bun”; “monn” rhymes with “bon.”).
Participants then completed three speech style conditions (blocked). Before each block, they were given the instructions for that style: slow-clear (“Speak extremely slowly and clearly, like talking to someone who is hard of hearing and is lip reading.”) [adapted from Garnier (2018); Solé, 1995], casual/conversational (“Speak in a natural, casual manner, like talking to close friends/family.”) [from Cohn et al. (2021b)], and fast-clear (“Speak as fast as you can without making an excessive number of errors. Like an auctioneer.”) [adapted from Hirata (2004); “auctioneer” added in the current study].
Participants always completed the speech style blocks in the same order—slow-clear, casual, fast-clear—to encourage increases in speech rate with familiarity with the stimuli. In each block, they first saw the style instructions. Next, they initiated the recording with Pipe (as a .wav file, 48 kHz sampling rate). After clicking that they were ready to begin, custom JavaScript code presented a loading text (5 s), followed by each sentence presentation and an interstimulus interval (ISI).
To encourage speakers to speed up across the styles, the sentences were presented at shorter durations and with reduced ISIs going from slow-clear (5000 ms presentation, 1000 ms ISI) to casual (3500 ms presentation, 500 ms ISI) to fast-clear (1750 ms presentation, 250 ms ISI). On each trial, participants saw the target word presented in the carrier sentence, “I say ___ again,” wherein the target word was sentence-medial and preceded/followed by vowels to facilitate segmentation. Twenty-four pseudorandomized word presentation lists were generated (with the constraint that nonwords did not occur as the first word in any list). In each Style and Repetition, there were four potential word presentation lists, from which one was randomly selected for each participant. In total, participants produced 108 tokens (18 target words * 3 Styles * 2 Repetitions). The experiment took roughly 30 min.
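For illustration, the list construction can be sketched in R as follows (a minimal sketch, assuming the word set from Sec. 2.1; the object and function names are ours and do not reflect the study's actual code):

    # Sketch: build pseudorandomized presentation lists, with the
    # constraint that a nonword never occurs first (names hypothetical).
    words <- c("bad", "ban", "man", "bed", "Ben", "men",
               "bud", "bun", "munn", "bod", "bon", "monn",
               "bode", "bone", "moan", "bade", "bane", "mane")
    nonwords <- c("munn", "monn")

    make_list <- function() {
      repeat {
        ord <- sample(words)                      # random permutation of the 18 words
        if (!(ord[1] %in% nonwords)) return(ord)  # reject lists starting with a nonword
      }
    }
    presentation_lists <- replicate(24, make_list(), simplify = FALSE)  # 24 lists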
2.3 Acoustic analysis
First, each utterance (n = 2698) was annotated by trained research assistants (RAs). Trials in which RAs indicated an issue were removed (130 trials that contained noise and 83 mispronunciations, leaving n = 2615 observations). Then each utterance was force aligned with the Montreal Forced Aligner (McAuliffe et al., 2017). Next, target vowel boundaries were hand-corrected based on the presence of higher formant structure, a voicing bar, and increased amplitude. Acoustic measurements were made at the sentence and vowel levels in Praat (Boersma and Weenink, 2021). Sentence-level speech rate measurements (mean number of syllables per second) were made with a Praat script (De Jong and Wempe, 2009). Target vowel measurements of the first three formants (F1, F2, F3) were made at seven time-normalized time windows between 0% and 100% of the vowel using Fast Track (Barreda, 2021). Timepoints 1 and 7 were omitted to avoid measurement errors at vowel onset and offset, leaving a total of five timepoints (14.3%–28.6%, 28.6%–42.9%, 42.9%–57.2%, 57.2%–71.5%, 71.5%–85.8%). All formant measurements were log-transformed and then centered by subtracting each speaker's mean (logged) formant values from individual values (i.e., log-mean normalized) (Barreda, 2020; Nearey, 1978). F1-F2 Euclidean distance was calculated on log-mean normalized formant values, reflecting the distance from each speaker's vowel space center. Target vowel acoustic nasality [A1-P0; the difference between the amplitude of the harmonic under F1 (A1) and a low-frequency nasal peak (P0)] was measured at seven time-normalized vowel points from 0% to 100% of the vowel (Chen, 1997; Styler, 2017), excluding Timepoints 1 and 7 (i.e., sampling the same portions of the vowel as the formant measurements described above). Measurements with errors were removed (n = 1998; e.g., no difference in A1-P0 in n = 99 measurements), leaving 9987 observations. A1-P0 values were centered within each speaker based on their mean A1-P0 across CVC, CVN, and NVN tokens.
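For concreteness, the vowel-level normalization steps can be sketched in R (a minimal sketch, assuming a long-format data frame d with one row per measurement and columns speaker, f1, f2, and a1p0; these names are hypothetical, and the project's actual analysis code may differ):

    # Sketch of log-mean normalization, F1-F2 distance, and A1-P0 centering.
    library(dplyr)

    d <- d %>%
      group_by(speaker) %>%
      mutate(
        f1_norm = log(f1) - mean(log(f1), na.rm = TRUE),  # log-mean normalize F1
        f2_norm = log(f2) - mean(log(f2), na.rm = TRUE),  # log-mean normalize F2
        # Center A1-P0 within speaker; assumes d contains the speaker's
        # CVC, CVN, and NVN tokens so the mean spans all token types.
        a1p0_c  = a1p0 - mean(a1p0, na.rm = TRUE)
      ) %>%
      ungroup() %>%
      # After log-mean normalization, each speaker's vowel space center lies
      # at (0, 0), so F1-F2 Euclidean distance from the center is:
      mutate(f1f2_distance = sqrt(f1_norm^2 + f2_norm^2))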
2.4 Statistical analysis
Analysis was conducted on the subset of productions for CVN tokens (n = 811 utterances; n = 3990 formant measurements; n = 3431 nasality measurements). Each acoustic measurement was modeled in a separate linear mixed-effects model with the lme4 R package (Bates et al., 2015). In the speech rate model, fixed effects included Style (three levels: slow-clear, casual, and fast-clear). Random effects included by-Speaker and by-Sentence random intercepts, as well as by-Speaker and by-Sentence random slopes for Style.
In the vowel acoustic nasality model and F1-F2 hyperarticulation models, fixed effects included Style, Timepoint (centered), and their interaction. Random effects included by-Speaker and by-Vowel random intercepts and by-Speaker and by-Vowel random slopes for Style, Timepoint, and their interaction. In all models, Style was treatment coded (reference = casual). In the case of a singularity or convergence error, the random effects structure was simplified by removing predictors that accounted for the least amount of variance until the model fit [following Barr (2013)].3
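For concreteness, the retained model structures (see footnote 3) correspond to lme4 calls of the following form (a sketch; the data frame and column names are assumptions, not the project's actual code):

    # Sketch of the retained models; names hypothetical.
    library(lme4)

    # Treatment code Style (reference = casual) and center Timepoint in each
    # data frame (shown for one; likewise for the others).
    d_formants$Style <- relevel(factor(d_formants$Style), ref = "casual")
    d_formants$Timepoint_c <- d_formants$Timepoint -
      mean(d_formants$Timepoint, na.rm = TRUE)

    m_rate <- lmer(Rate ~ Style +
                     (1 + Style | Participant) + (1 | Sentence),
                   data = d_rate)

    m_distance <- lmer(F1F2_Distance ~ Style * Timepoint_c +
                         (1 + Style + Timepoint_c | Participant) +
                         (1 + Style | Vowel),
                       data = d_formants)

    m_nasality <- lmer(A1P0 ~ Style * Timepoint_c +
                         (1 + Timepoint_c | Participant) +
                         (1 + Style | Vowel),
                       data = d_nasality)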
3. Results
The model output tables are provided in an Open Science Framework repository for the project4 (Tables S1–S3). The speech rate model output is provided in Table S1. The summarized utterance measurements are plotted in Fig. 1. As seen in Fig. 1, there was an effect of Style. Relative to casual speech, slow-clear speech was slower (Coef = −0.61, t = −5.13, p < 0.001), while fast-clear speech was produced with more syllables per second (Coef = 0.61, t = 4.82, p < 0.001).
Fig. 1. Speech rate across speaking styles (slow-clear, casual, fast-clear). Error bars indicate standard error of the mean.
The hyperarticulation model output is provided in Table S2. Summarized vowel space expansion measurements for CVN words are plotted in Fig. 2 (top panel). The vowel hyperarticulation model revealed an effect of Timepoint: Over the course of the vowel, speakers tended to produce less expansion (Coef = −0.01, t = −4.55, p < 0.001). There were also effects of Style: Relative to casual, slow-clear speech was produced with less expansion overall (Coef = −0.05, t = −3.03, p < 0.01), and fast-clear speech was produced with greater expansion (Coef = 0.06, t = 3.23, p < 0.01). However, these Style effects were qualified by interactions with Timepoint. As seen in Fig. 2 (top panel), expansion increased over the course of the vowel for slow-clear speech (Coef = 0.02, t = 7.01, p < 0.001) but decreased over the vowel for fast-clear speech (Coef = −0.01, t = −4.54, p < 0.001), relative to casual speech.
Fig. 2. Mean vowel hyperarticulation (F1-F2 distance from vowel space center) (top panel) and acoustic nasality (A1-P0) (bottom panel) across five time-normalized points of the vowel across speaking styles (slow-clear, casual, fast-clear) for CVN words. Error bars indicate standard error of the mean.
The acoustic nasalization model output is provided in Table S3, while the summarized raw data for vowels in CVN words are plotted in Fig. 2 (lower panel). The model revealed an effect of Style: Slow-clear speech was produced with overall greater nasalization (i.e., a smaller A1-P0 value), relative to casual speech (Coef = −0.83, t = −2.66, p < 0.05). Additionally, there was an effect of Timepoint, wherein nasalization increased over the course of the vowel, with greater anticipatory nasality in portions closer to the nasal coda (Coef = −1.55, t = −11.58, p < 0.001). Finally, there was an interaction between Style and Timepoint, such that compared to casual speech, fast-clear speech showed less nasalization over the course of the vowel (Coef = 0.43, t = 3.16, p < 0.01).
4. Discussion
This study examined acoustic variation in CVN words produced across three speech styles, varying in demands on intelligibility and speech rate. Results show that speakers produce distinct patterns of speech rate, vowel space expansion (hyperarticulation), and nasal coarticulation for different types of addressees and styles, extending prior work that has shown variation for different types of addressees (Scarborough and Zellou, 2013).
At the utterance level, findings reveal the expected effects on speech rate: Speech directed toward an imagined hard-of-hearing addressee is slower (mean = 2.91 syllables/s) than speech in a casual style toward an imagined friend or family member (mean = 3.49 syllables/s). This is consistent with related work showing slowed speech to individuals with hearing impairments (Picheny et al., 1986), as well as with the speech rate differences observed across clear and casual speech in related studies (Smiljanić and Bradlow, 2005). Speech produced as quickly and clearly as possible, like an “auctioneer,” was the fastest in the current study, averaging 4.11 syllables/s.
As in prior work examining clear speech, speakers in the current study show increases in hyperarticulation (here, vowel space expansion) in the clear styles, compared to casual speech (Smiljanić and Bradlow, 2005). In slow-clear speech, hyperarticulation increases more over the course of the vowel than in casual speech, consistent with work showing both greater vowel space expansion and diphthongization in slower speech (Zellou and Scarborough, 2019). Relative to casual speech, fast-clear speech shows greater expansion overall, though expansion decreases over the course of the vowel. This might reflect a strategy to improve intelligibility “up front”: given the reduced vowel duration in fast-clear speech, speakers might concentrate vowel space expansion in earlier portions of the vowel. Taken together, these patterns suggest that speakers take a dynamic approach to conveying vowel identity information based on style.
In terms of vowel nasalization, there is evidence of selective tuning of coarticulatory nasality based on speech style, supporting listener-oriented accounts of coarticulation (Scarborough, 2013; Scarborough and Zellou, 2013; Zellou and Scarborough, 2015). When imagining talking to a hearing-impaired person, speakers produce vowels in CVNs with greater coarticulatory nasalization. Indeed, prior work has shown that the presence of a neighboring nasal consonant makes vowel perception more challenging (Zee, 1981), and speakers appear to tune nasalization to improve intelligibility for a hard-of-hearing addressee. Counter to speech rate accounts of coarticulation (Agwuele et al., 2009; Lindblom et al., 2009; Moon and Lindblom, 1994), there is greater coarticulation in the slowest speech (i.e., slow-clear speech) in the current study.
The patterns of coarticulation and hyperarticulation in the current study differ slightly from related work by Scarborough and Zellou (2013), who found less coarticulatory nasalization toward an imagined hard-of-hearing addressee than toward an imagined friend. One difference between the studies is that here the recordings took place at home, where speakers were in a more comfortable context, compared to the Scarborough and Zellou study, which took place in a laboratory setting. Indeed, work has found that the presence of an experimenter can shape participants' behavior (Belletier and Camos, 2018; Orne, 1962) and that speakers make adaptations for individuals who “overhear” their interactions (Clark and Carlson, 1982). For example, speakers make even greater computer-directed speech adaptations when they are in the lab compared to at home (and not observed) (Cohn et al., 2021a), suggesting that the context of the study itself can shape speech style adjustments.
This study has several limitations that can serve as avenues for future research. First, the current study examined a single variety of English, produced by California English speakers. Coarticulatory nasalization patterns are part of a sound change in California (Brotherton et al., 2019; Zellou et al., 2020). Future work examining patterns of nasal coarticulation and hyperarticulation across other dialects of English, as well as other languages, is needed for a fuller picture of speakers' selective tuning of these features across speech styles. Additionally, we examined a combination of diphthong and monophthong vowels; while pre-nasal vowels are often diphthongized in California dialects, it is possible that vowel hyperarticulation and nasalization might be targeted to certain vowels [e.g., /æ/ in Brotherton et al. (2019)]. Future work examining vowel-specific features and vowel trajectories can shed light on this possibility. Furthermore, while the current study examined nasal coarticulation, examinations of other types of coarticulation can further shed light on the mechanisms of speakers' intelligibility adaptations [e.g., F2 in Krull (1989); whole-spectrum measures in Guo and Smiljanic (2023)]. Here, one possibility is that coarticulation with the stop consonant /b/ could have shaped the degree of expansion in the slow-clear condition, as /b/ is known to lower formant frequencies (Liberman et al., 1956; Liberman et al., 1967). Indeed, relative to casual speech, slow-clear speech shows less expansion overall, but expansion increases more over the course of the vowel. Future work examining other types of consonant onsets (e.g., /b/, /d/, /g/) can test this possibility.
Another potential limitation of the current study is the instructions for eliciting the fast-clear condition: “Speak as fast as you can without making an excessive number of errors. Like an auctioneer.” While there is related work showing that the speech of auctioneers and sports commentators has features of fast-clear speech [e.g., less spectral tilt in Trouvain and Barry (2000)], future work using other types of fast and clear prompts can probe whether this finding is limited to “auctioneer speech” or generalizes to fast-clear styles more broadly. Similarly, the slow-clear speech elicited here, for a hearing-impaired individual who is lip reading, might also vary depending on the type of addressee. For example, speech directed to technology vs human addressees has distinct acoustic-phonetic adjustments [for a review, see Cohn et al. (2022)].
Additionally, the current study used an entirely imagined context to limit the influence an addressee's patterns might have on speakers via accommodation (Gallois and Giles, 2015; Giles, 1973), such as convergence based on coarticulatory nasalization (Zellou and Brotherton, 2021). However, it is possible that speakers might target coarticulation and hyperarticulation differently in an actual interaction (cf. Scarborough and Zellou, 2013), complete with feedback about correct or incorrect reception by their listeners. Future work manipulating the local communicative pressures [e.g., staged misrecognition errors as in Cohn et al. (2022)] can shed light on these possibilities.
Acknowledgments
We thank Editor Irina A. Shport and two anonymous reviewers for their helpful feedback on the paper. We also thank our RAs, who listened to and annotated all the productions: Nishchala Beeram, Benjamin Getz, Mercedes Herrera, Lauren Kim, Mira Kren, Elizabeth Tedrow, and Alexander Tse. This material is based upon work supported by National Science Foundation Grant No. 2140183 awarded to G.Z.
Author Declarations
Conflict of Interest
The authors report funding from the National Science Foundation. No other conflicts of interest are reported.
Ethics Approval
This study was approved by the UC Davis Institutional Review Board, and all participants completed informed consent.
Data Availability
The data that support the findings of this study are openly available in the Open Science Framework repository at https://doi.org/10.17605/OSF.IO/N3FZJ.
1 Originally, 29 participants completed the study. Data for two participants were removed due to technical issues with the recording (e.g., recording static, inconsistent volume) as determined by the research assistants (RAs) who listened to and annotated each recording.
2 Available at https://addpipe.com/.
3 Retained model structures: Rate ∼ Style + (1 + Style | Participant) + (1 | Sentence); F1-F2 Distance ∼ Style * Timepoint + (1 + Style + Timepoint | Participant) + (1 + Style | Vowel); A1-P0 ∼ Style * Timepoint + (1 + Timepoint | Participant) + (1 + Style | Vowel).
4 Available at https://doi.org/10.17605/OSF.IO/N3FZJ.