Individual speakers are often able to modify their speech to facilitate communication in challenging conditions, such as speaking in a noisy environment. Such vocal “enrichments” might include reductions in speech rate or increases in acoustic contrasts. However, it is unclear how consistently speakers enrich their speech over time. This study examined inter-speaker variability in the speech enrichment modifications applied by speakers. The study compared a baseline habitual speaking style to a clear-Lombard style and measured changes in acoustic differences between the two styles over sentence trials. Seventy-eight young adult participants read out sentences in the habitual and clear-Lombard speaking styles. Acoustic differences between speaking styles generally increased nonlinearly over trials, suggesting that speakers require practice before realizing their full speech enrichment potential when speaking clearly in noise with reduced auditory feedback. Using a recent objective intelligibility metric based on glimpses, the study also found that predicted intelligibility increased over trials, highlighting that communicative benefits of the clear-Lombard style are not static. These findings underline the dynamic nature of speaking styles.

Everyday speech communication regularly occurs in the presence of various sources of ambient noise at different intensity levels. Typically, speakers are able to “enrich” their speech to facilitate communication when necessary (for a review, see Cooke et al., 2014). In this paper, the term “speech enrichment” refers to any process by which talkers modify their speech (in an attempt) to make the speech easier for listeners to process. Although we know that speakers generally enrich their speech in noisy situations, it is unclear how consistently speakers apply these modifications over time. This paper aims to address this question.

Lombard speech, elicited by having speakers speak in the presence of loud noise, is known to have characteristics that reflect speakers' increased vocal efforts (e.g., Lombard, 1911; Van Summers et al., 1988; Junqua, 1993). Lombard speech exhibits acoustic features, such as reduced articulation rate, raised fundamental frequency (or F0), enhanced energy in the 1–3 kHz frequency range or flattened spectral tilt, and expanded vowel space (e.g., Junqua, 1993; Bradlow et al., 1996; Cooke and Lu, 2010; Garnier and Henrich, 2014; Lu and Cooke, 2009; Tang et al., 2017; Tuomainen and Hazan, 2016).

There is some debate as to whether Lombard speech is produced as some sort of reflex in response to loud noise and to what extent it might reflect intentional changes (e.g., Garnier et al., 2008). Despite this debate on the automatic or intentional nature of Lombard speech production, studies have reported overlap in the acoustic features of Lombard speech and “instructed clear speech,” with the latter being intended to overcome difficult communication situations (Lam and Tjaden, 2013; Uchanski, 2008). More specifically, clear and Lombard speech exhibit slower articulation rate and enhanced pitch modulation and vowel articulation (e.g., Smiljanić and Bradlow, 2005). Some of these acoustic-phonetic modifications of Lombard and clear speech have been shown to have perceptual benefits for hearing-impaired listeners, non-native listeners, and normal-hearing listeners, particularly under the listening conditions that the modifications were meant to counter (e.g., Bradlow et al., 1996; Bosker and Cooke, 2020; Ferguson and Morgan, 2018; Uchanski et al., 1996).

There are multiple factors that influence the type and amount of speech modifications speakers apply when speaking (clearly) in noise, such as speaker age (e.g., Amazi and Garber, 1982), type and intensity of background noise (e.g., Cooke and Lu, 2010), and the presence or absence of communicative intent (e.g., reading or monologue versus talking to an actual addressee or dialogue). For instance, Hazan and Baker (2011) investigated speakers' acoustic-phonetic characteristics when communicating with a conversation partner under various noise conditions. They found that speakers modified their speech according to the different noise conditions to attend to their interlocutors' needs. In addition, Garnier et al. (2010) conducted a series of experiments to examine Lombard speech production in various sound immersion conditions (with different noise types) with and without interactions with a communication partner. They found that speech modifications in noise were greater when speakers were interacting with an interlocutor, suggesting that speech production (by adult speakers) is listener oriented.

Despite evidence that speakers can generally enrich their speech to make it more intelligible (Ferguson and Morgan, 2018; Lam and Tjaden, 2013; Schum, 1996), large inter-speaker differences in speakers' speech intelligibility and the amount of speech modifications in their clear/Lombard speech production have also been described (e.g., Junqua, 1993). A study by Bradlow et al. (1996) investigated the relationship between intelligibility and speaker-specific acoustic-phonetic characteristics based on data from 20 speakers. They found that F0 range and vowel space were positively correlated with speech intelligibility (Bradlow et al., 1996). Additionally, Hazan and Markham (2004) found total energy in the 1–3 kHz region and word duration to be important predictors of word intelligibility in noise. Gender differences in speakers' speech enrichment success (as reflected by intelligibility) have been observed in multiple studies across different speaking styles (e.g., Junqua, 1993; Bradlow et al., 1996). More specifically, female speakers have been demonstrated to exhibit larger clear-speech benefits than male speakers when their speech was presented in noise as judged by children (Bradlow et al., 2003) and adult listeners (Bradlow and Bent, 2002).

However, few studies have looked at the consistency with which speakers apply their speech enrichment over time. Do speakers get better at enriching their speech with more practice? Or conversely, do they decrease the size or extent of their modifications over time (e.g., due to fatigue)? Additionally, little research has addressed the question of whether experience with clear speech modifies speech enrichment success.

Some studies on prolonged voice usage and vocal fatigue have observed links between increased vocal effort and vocal fatigue and, consequently, decreased vocal function (Bottalico et al., 2016; Solomon, 2008). Given that Lombard speech exhibits acoustic modifications that require increased vocal effort, the results from those studies on prolonged voice usage and vocal fatigue could suggest a decrease in the amount of modifications that speakers apply when producing the effortful Lombard speech over an extended period of time. On the other hand, many speakers are able to maintain an extended conversation in noisy environments, such as busy bars or restaurants. This could be due to interlocutors reminding them to speak clearly or even speakers improving their Lombard speaking style with practice. Therefore, rather than showing vocal fatigue and slackening of attention in the Lombard speaking style, speakers may instead show some learning or practice effect over time. If speakers improve with practice, this also raises the question of what information they actually use to improve the clarity of their own speech if they cannot hear their own speech well because of loud background noise (played via over-ear headphones such that auditory feedback is drastically reduced).

Little research has directly measured the consistency of the acoustic- and articulatory-phonetic changes in Lombard speech over a certain period of time or related it to the consistency of speakers' habitual speaking styles. However, a recent study investigated speakers' use and maintenance of clear speech in an interactive speech task with interlocutors (Lee and Baese-Berk, 2020). In this naturalistic “spot-the-difference” (Diapix) task (Baker and Hazan, 2011), where speakers have to describe their version of a pictured scene to their interlocutor who has a slightly different version of the picture, speakers were not explicitly instructed to use clear speech. Nevertheless, native-English speakers' speech was found to be more intelligible when their interlocutors were non-native (rather than native) English listeners. Additionally, their speech was found to be more intelligible in the early portion of each conversation. Speakers in their study were found to “reset” to clear speech whenever they started their description of a new picture. Thus, Lee and Baese-Berk (2020) concluded that the initiation of clear speech could be listener oriented while the (lack of) maintenance of clear speech is perhaps speaker-driven as speakers can gradually spend less articulatory effort throughout their conversation to still be understood and then reset their speech clarity at topic boundaries.

In our study, we investigated the consistency of speakers' speech enrichment. To have comparable speech materials (in terms of content and quantity) across speakers, we used a sentence-reading task with instructed “clear speech” production rather than spontaneous (less controlled) speech. Specifically, we analyzed the acoustic modifications that speakers apply when moving from a habitual reading style to a clear-Lombard reading style. Our main research question was whether the acoustic differences between habitual and clear-Lombard speech would increase or decrease over the course of an experimental session (i.e., over the sentence list). Note that it is also unclear how much acoustic variation there may be in the habitual speaking style over trials, as any change in the difference between speaking styles could be due to changes in either or both speaking styles. If our results-pattern resembles that of Lee and Baese-Berk (2020), we would expect to observe a gradual decrease in speech clarity. Alternatively, we may find that speakers maintain a consistent clear-Lombard speech style throughout the experimental session if clear-speech behavior differs between spontaneous speech and sentence reading.

Apart from analyzing the acoustic changes in speakers' speech enrichment modifications, it is also critical to understand how acoustic enrichment contributes to speech intelligibility. One way of obtaining speech intelligibility scores is to collect listeners' subjective responses. However, the process of acquiring subjective intelligibility responses can be quite time-consuming and resource-demanding, especially when the number of speakers to be assessed is rather large as in our study (N = 78). Given the constraints on human (subjective) intelligibility responses, several objective intelligibility measures have been proposed. For instance, the articulation index (Kryter, 1962a,b) and speech-transmission index (Steeneken and Houtgast, 1980) were commonly used metrics in earlier studies. More recently, a glimpse-based speech perception model has been proposed (Cooke, 2006). This model uses the amount of suprathreshold target speech surviving energetic masking (masking caused by the interaction of speech and masker at the level of the auditory periphery) as a proxy for intelligibility (e.g., Tang and Cooke, 2012; Valentini-Botinhao et al., 2012). Tang and Cooke (2016) evaluated several extended glimpsing metrics and concluded that an approach based on high-energy glimpses [the high-energy glimpse proportion (HEGP) metric] alone best accounted for listeners' sentence recognition performance.

In our study, we examined the speech enrichment modifications speakers apply when going from baseline (habitual) speech to clear-Lombard speech production. More specifically, by analyzing differences between speaking styles in four acoustic features (articulation rate, median F0, F0 range, and spectral balance), our first research question (RQ 1) investigated how consistently speakers apply their speech enrichment modifications over sentence trials. Relatedly, we investigated whether model predicted intelligibility of speakers' habitual and clear-Lombard speech, as well as the model predicted intelligibility difference between the two styles, changed over sentence trials (RQ 2). Model predicted intelligibility was calculated using an intelligibility metric (the HEGP) that attempts to model human speech perception in noise. Note that like many other objective intelligibility metrics based on acoustics, HEGP only measures the contribution of energetic masking to intelligibility and is not able to capture any higher-level linguistic modifications that speakers may apply in their clear-Lombard speech (such as enhanced phonetic contrasts).

A total of 78 native Dutch speakers [age, M = 22 years 7 months old; standard deviation (SD) = 2 years 10 months old; 61 females] were recruited online through the Radboud Research Participation System. All of the participants were university students or had graduated from university at the time of the experiment. Participants did not report any known history of speech, hearing, or reading disabilities, nor past diagnosis of speech pathology or brain injury at the time of testing. Additionally, they were all reported to have normal or corrected-to-normal vision. The study protocol had been evaluated and approved by the Ethics Assessment Committee Humanities at Radboud University. All of the participants gave informed consent for their data to be analyzed anonymously, and they received either course credits or gift vouchers as compensation for their time.

All of the 78 speakers performed a sentence-reading task in a sound-attenuating recording booth at the Centre for Language Studies Laboratory of Radboud University. Sentences were presented on a 24″ full high-definition (HD) monitor placed on a table in front of the participant. The presentation of the task stimuli was controlled real-time by the experimenter (C.S.) on the stimulus computer outside of the recording booth. Speech recordings were made using a Sennheiser ME 64 cardioid capsule microphone (Wennebostel, Germany) placed around 15 cm away from the speaker's mouth through a pre-amplifier (Audi Ton, Amsterdam, The Netherlands) onto a Roland R-05 WAVE recorder (Shizuoka, Japan). The sampling rate of the resulting .wav format recordings that were used for acoustic analysis was 44.1 kHz with 16-bit resolution.

A multi-talker speech corpus, the Radboud Habitual and Lombard Speech Corpus (RaLoCo), was created for the purpose of this study. The RaLoCo corpus (Shen and Janse, 2018, 2021) contains Dutch sentence-reading material read by the 78 native Dutch speakers with 96 sentences per speaker (i.e., 7488 sentences in total). Each participant read out the same 48 unique sentences twice: once in a habitual style in which they were only told to read out the sentences fluently (i.e., habitual speaking style), and once in a style where they were instructed to read the sentences out as clearly as possible while hearing speech-shaped noise continuously over headphones (i.e., clear-Lombard style). The speech-shaped noise file that participants heard was created based on an average speech spectrum envelope across a balanced mix of male and female voices. The total duration of the noise file was around 25 min, which proved long enough to cover the noise-condition part of the sentence-reading session. Note that there was a brief break between readings of the two conditions as participants needed to read through the instructions for the clear-Lombard style. After the participant indicated that they had understood the clear-Lombard instructions, the experimenter (C.S.) played the noise file at 78 dB sound pressure level (SPL; as calibrated using a Brüel and Kjaer type 4153 artificial ear, Nærum, Denmark) through an HP Probook laptop (Palo Alto, CA) over a pair of closed headphones (Sennheiser HD 215 MKII DJ, Wedemark, Germany) to the participant.

All of the 48 sentences were between 12 and 16 syllables long and syntactically correct and semantically plausible. An example sentence is “Mijn opa had de piep jammer genoeg niet meer gehoord (translation, my grandpa, unfortunately, had just missed the beep).” Speakers were presented with 48 unique sentences in 1 of 8 random orders, followed by the same 48 sentences in a different random order. The 48 unique sentences in 2 speaking styles formed 4 combined long lists of 96 sentences. The four long sentence lists (differing only in order) were rotated over participants such that order (i.e., trial) effects could be isolated from sentence (i.e., item) effects. During the reading task, the habitual style always preceded Lombard style for all of the participants to avoid potential spillover effects from Lombard to habitual speech production.

The production of the sentences was live monitored for hesitations, misarticulations, and other disfluencies. If the experimenter noticed disfluency, she would ask the speaker to repeat the target sentence (by re-presenting the sentence stimulus in the slide show). Overall, participants were able to read out all of the sentences fluently with only occasional repetitions of one or two sentences for some speakers.

To analyze the type of modifications speakers employed when producing the two types of speech, we used standard Lombard speech acoustic measures (e.g., articulation rate, pitch/F0 measures, and spectral balance measures) as reported in previous studies (e.g., Garnier et al., 2010; Cooke and Lu, 2010; Lu and Cooke, 2008). The long audio recordings from the sentence-reading task were labelled and extracted as separate sentence-length audio files for analysis.

For articulation rate, long silences (>200 ms) in the extracted sentences were manually labelled and then removed (3% or 244 out of 7488 sentences) in Praat. Articulation rate was calculated as syllables per second by dividing the number of syllables in the orthographic transcription of a sentence by the actual production duration of that sentence (excluding long silences). Higher values index faster rates.
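As an illustration of this calculation, the following minimal sketch computes articulation rate from a syllable count, a sentence duration, and a list of pause durations. It is not the procedure used in the study (where silences were labelled manually in Praat); the function name, its arguments, and the example values are hypothetical.

```python
def articulation_rate(n_syllables, sentence_duration_s, silences_s=()):
    """Syllables per second, excluding long silences (> 200 ms).

    n_syllables: syllable count from the orthographic transcription.
    sentence_duration_s: total produced duration in seconds.
    silences_s: durations (in seconds) of labelled silent intervals.
    """
    # Only silences longer than 200 ms are removed from the duration.
    long_silences = [d for d in silences_s if d > 0.200]
    speaking_time = sentence_duration_s - sum(long_silences)
    if speaking_time <= 0:
        raise ValueError("speaking time must be positive")
    return n_syllables / speaking_time


# Hypothetical sentence: 14 syllables in 3.0 s, with pauses of 0.5 s and 0.1 s.
# Only the 0.5 s pause counts as a long silence, so the rate is 14 / 2.5.
print(round(articulation_rate(14, 3.0, [0.5, 0.1]), 2))  # 5.6
```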

F0 measures were obtained using a customized script in Praat (Boersma and Weenink, 2017) that calculates estimated F0 values at 10 ms intervals in each individual sentence. The pitch floor was set at 75 and 60 Hz for female and male speakers, respectively, while the pitch ceiling was 500 and 300 Hz, respectively. The script coded unvoiced parts of the audio as “−1,” and these values were subsequently excluded from further analyses. “Doubling” or “halving” errors in pitch tracking (depicted as sudden jumps in the estimated F0 values) were corrected through a customized Python script that detects and deletes values that are above or below a factor of 1.5 compared to the penultimate value (cf., Marcoux and Ernestus, 2019). Median F0 and F0 range measures were then calculated per sentence based on the remaining 96.32% of (cleaned) pitch values (note that all of the speakers produced pitch errors). Instead of using the 100% range for calculating F0 range, we opted for the range between the 10th and 90th percentiles from the original F0 values to avoid extremely high and low values caused by erroneous pitch values that might not have been excluded by the Python script. The observed Hz values for median F0 and F0 range per sentence were next converted to semitones using 1 Hz as the reference, following the formula $s = 12 \log_{2}(f)$, where $f$ is the target frequency in Hz (e.g., Hazan and Baker, 2011; Dichter et al., 2018). Higher values for median F0 index higher or raised voice and higher numbers for F0 range indicate expansion of F0 range.
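The F0 processing steps can be sketched as follows. This is a simplified illustration and not the study's script: all function names are hypothetical, each value is compared with the most recently retained value (one possible reading of the jump criterion), and the F0 range is taken as the semitone distance between the 10th and 90th percentiles computed in Hz.

```python
import math
import statistics


def clean_f0(track_hz, max_factor=1.5):
    """Drop unvoiced frames (coded -1) and sudden jumps in the F0 track."""
    voiced = [f for f in track_hz if f > 0]
    cleaned = []
    for f in voiced:
        # Assumption: compare with the last retained value; a value beyond
        # a factor of 1.5 above or below it is treated as a tracking error.
        if cleaned and not (cleaned[-1] / max_factor <= f <= cleaned[-1] * max_factor):
            continue
        cleaned.append(f)
    return cleaned


def hz_to_semitones(f_hz, ref_hz=1.0):
    """Semitones relative to ref_hz (1 Hz in the text)."""
    return 12.0 * math.log2(f_hz / ref_hz)


def percentile(sorted_vals, q):
    """Percentile (q in 0..100) with linear interpolation."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def f0_summary(track_hz):
    """Return (median F0, 10th-90th percentile F0 range), both in semitones."""
    hz = sorted(clean_f0(track_hz))
    median_st = hz_to_semitones(statistics.median(hz))
    range_st = hz_to_semitones(percentile(hz, 90)) - hz_to_semitones(percentile(hz, 10))
    return median_st, range_st


# Hypothetical track: the 410 Hz frame is an octave ("doubling") error.
print(clean_f0([-1, 200, 410, 205, -1, 198]))  # [200, 205, 198]
```

Because the comparison is made against the last retained value, a tracking error in the very first voiced frame would propagate in this sketch; a production script would need to guard against that case.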

Spectral balance was calculated using the Hammarberg index (Hammarberg et al., 1980) at the individual sentence level. The Hammarberg index captures the relative difference between the energy maxima in the low frequency range (0–2000 Hz) and high frequency range (2000–5000 Hz). The Hammarberg values were calculated automatically using a customized Praat script by extracting the long-term average spectrum (LTAS) of each sentence with a filter bandwidth of 100 Hz and subtracting the energy maximum in the high frequency range from that in the low frequency range. Higher values indicate a steeper spectral roll-off, which implies less vocal effort.
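For illustration, the index can be sketched on a precomputed LTAS. The sketch below is not the study's Praat script: the function name and the input format (a list of frequency and level pairs) are hypothetical, and the sign convention (low-band maximum minus high-band maximum) is chosen so that positive values indicate a steeper roll-off, in line with the positive values in Table I.

```python
def hammarberg_index(ltas, split_hz=2000.0, upper_hz=5000.0):
    """Low-band peak level minus high-band peak level, in dB.

    ltas: iterable of (frequency_hz, level_db) pairs, e.g. 100 Hz wide bins.
    """
    low = [db for f, db in ltas if 0.0 <= f < split_hz]
    high = [db for f, db in ltas if split_hz <= f <= upper_hz]
    if not low or not high:
        raise ValueError("LTAS must cover both frequency bands")
    return max(low) - max(high)


# Hypothetical LTAS bins; the 6000 Hz bin lies outside both bands and is ignored.
example = [(500, 40.0), (1500, 35.0), (2500, 22.0), (4500, 18.0), (6000, 30.0)]
print(hammarberg_index(example))  # 18.0
```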

HEGP-metric predicted intelligibility scores were obtained per sentence (for the 48 unique sentences) in both speaking styles (habitual and Lombard) for all of the speakers using a customized script from Tang and Cooke (2016). The metric was computed at various signal-to-noise ratios (SNRs) using the same speech-shaped noise-masker that had been used to elicit the clear-Lombard speech. The selected SNR for the HEGP metric was −5 dB as differences in estimated intelligibility between habitual and Lombard speech were largest around this SNR level. Following Tang and Cooke (2016), HEGP scores were computed by applying a selection process to raw glimpses of the target speech signal. Raw glimpses are defined as spectro-temporal regions where the target speech is more energetic than the masker. Separate spectro-temporal representations were formed for speech and masker by passing the signal through a bank of 40 gammatone filters with center frequencies spaced equally on an equivalent-rectangular-bandwidth- (or ERB-) rate scale from 100 to 8000 Hz, followed by smoothing the instantaneous envelope at the output of each filter using a temporal integrator with an 8 ms time constant and down sampling to 100 Hz. HEGP scores result from selecting raw glimpses whose energy exceeds the mean speech-plus-masker energy in each frequency channel. The HEGP scores lie between zero and one with higher numbers indicating a higher glimpse proportion escaping energetic masking, thus, suggesting higher predicted intelligibility.
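The glimpse selection logic can be illustrated with a deliberately simplified sketch. It is not the script of Tang and Cooke (2016): it assumes that the spectro-temporal energies of speech and masker have already been computed (the gammatone filterbank, envelope smoothing, and downsampling are omitted), it works on linear energies, and it assumes that the proportion is taken over all spectro-temporal cells. The function name and the example matrices are hypothetical.

```python
def high_energy_glimpse_proportion(speech, masker):
    """Proportion of cells that count as high-energy glimpses.

    speech, masker: lists of channels, each a list of per-frame linear
    energies, with identical shapes (channels x frames).
    """
    n_cells = 0
    n_glimpses = 0
    for s_chan, m_chan in zip(speech, masker):
        # Mean speech-plus-masker energy in this frequency channel.
        mix_mean = sum(s + m for s, m in zip(s_chan, m_chan)) / len(s_chan)
        for s, m in zip(s_chan, m_chan):
            n_cells += 1
            # Raw glimpse: speech more energetic than the masker.
            # High-energy glimpse: it also exceeds the channel mean.
            if s > m and s > mix_mean:
                n_glimpses += 1
    return n_glimpses / n_cells


# Hypothetical 2-channel x 4-frame example with a flat masker.
speech = [[4, 1, 9, 1], [1, 1, 1, 8]]
masker = [[2, 2, 2, 2], [2, 2, 2, 2]]
print(high_energy_glimpse_proportion(speech, masker))  # 0.25
```

In the first channel, the frame with energy 4 is a raw glimpse but falls below the channel mean of 5.75, so only the frame with energy 9 is retained; this is the step that distinguishes high-energy glimpses from raw glimpses.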

To verify that the HEGP-predicted intelligibility ratings would produce reliable objective ratings for our speech material, we compared the HEGP-predicted ratings to human listening effort ratings that had been collected for other purposes (for a small, selected subset of four female and four male speakers). Although intelligibility, even if derived from human transcription of the speech, is not the same as listening effort, research has shown that intelligibility and listening effort (both as perceived by human listeners) are highly related (Krueger et al., 2017b). For this human ratings study, 32 additional normal-hearing young (18–30 years of age) native Dutch female listeners who had not previously participated in the main speech production experiment were recruited online. They completed the rating experiment using the online survey software Qualtrics (Provo, UT). Each listener rated a total of 48 unique sentences produced by the 8 selected speakers (6 unique stimulus sentences per speaker). These 6 sentences per speaker were divided over the 2 speaking styles, resulting in 24 sentences in the habitual style and 24 sentences in the clear-Lombard style per experimental list. Four different experimental lists were created, each with all eight speakers represented but assigned to different unique sentences to rotate speakers over sentences. The 32 listeners were randomly assigned to the 4 experimental lists, resulting in each listener rating each of the 8 speakers. This yielded a total of 1536 ratings (48 sentences × 32 listeners). The sentences presented to the listeners were embedded in the same speech-shaped noise that was used to elicit the Lombard speech at −6 dB SNR (note that although there is a slight, 1 dB difference in SNRs between this perception experiment and the aforementioned HEGP-metric predicted intelligibility ratings, the differences between −5 and −6 dB SNR in HEGP scores are small).

The listeners were instructed to rate the amount of listening effort that they experienced while listening to the speech stimulus presented in noise. Ratings were given on a scale from “1” (no effort) to “7” (extreme effort) using an adapted version of the Adaptive Listening Effort Tests developed by Krueger et al. (2017a). For each of the available speaker-style combinations, an average human rating was calculated based on ratings from eight human listeners.

Pearson correlation coefficients were calculated (at individual item level, i.e., for each combination of speaker-sentence-speaking style) between the average subjective ratings (obtained at −6 dB SNR) and HEGP-predicted intelligibility scores (obtained at −5 dB SNR). A significant and strong correlation was found between the subjective ratings and HEGP scores (r = −0.81, p < 0.001). In other words, higher predicted intelligibility was associated with lower degrees of listening effort, thus, arguing for the use of HEGP-model predicted intelligibility scores as a proxy of speech intelligibility in noise.
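For reference, the Pearson coefficient used in this comparison can be computed as in the sketch below. The function name and the rating and score values are hypothetical and are not data from the study; they only illustrate the expected negative direction of the relationship.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical item-level values: mean effort rating (1-7) and HEGP score.
effort = [5.5, 4.0, 3.0, 2.5]
hegp = [0.40, 0.46, 0.52, 0.58]
r = pearson_r(effort, hegp)
print(r < 0)  # True: higher predicted intelligibility, lower rated effort
```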

To answer the research questions set out for the current study, a number of linear mixed-effects (LME) models were run in RStudio (version 1.2.1335, Boston, MA), using the lme4 package (version 1.1–21; Bates et al., 2015). Detailed data modelling procedures are discussed below.

The consistency of the acoustic modifications that each speaker applied throughout the habitual and clear-Lombard speaking styles over the 48 sentences or trials (i.e., trial represents the stimulus-sentence's position in the sentence list/experimental session) were investigated separately for each acoustic measure, i.e., articulation rate, median F0, F0 range, and spectral balance. Additionally, to check for potential nonlinear trial effects with effects levelling off or even changing direction over time, we added the quadratic trial term (i.e., trial-squared) to each of the four LME models (Bruce and Bruce, 2017). Specifically, for each dependent acoustic measure, one LME model with and one without a quadratic trial term were set up (always in addition to a linear trial effect). For the models with quadratic trial terms, trial (i.e., sentence number), trial-squared (i.e., quadratic term of trial), and speaking style (habitual and Lombard) were entered as fixed effects of interest. Gender of the speaker (female and male) was included as a fixed control predictor (but note that the gender distribution was far from balanced with 61 out of 78 speakers being female). For contrast coding of speaking style and gender (see Brehm and Alday, 2022, for contrast coding choices in LME models), dummy coding was used (with habitual style and female speakers as reference levels). As for the models without the quadratic term of trial, trial-squared was removed while all of the other predictors remained unaltered.

Interactions between speaking style and trial and, if applicable, also between speaking style and trial-squared were included to answer RQ 1 on the consistency of speech enrichment applied by speakers when changing from habitual to clear-Lombard speech over the experimental session. Additionally, an interaction between speaking style and speaker gender was also included to test whether speakers' gender influenced the size of speakers' speech enrichment modifications applied when moving from habitual to clear-Lombard speech. Across all of the models, participant and item (i.e., sentence) were included as two random effects. We also allowed a random by-participant slope for speaking style and trial, acknowledging that individual participants may differ in the acoustic changes moving from their habitual to clear-Lombard speaking style and in their behavior over sentence trials.
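To make the fixed-effect structure concrete, the following sketch builds the predictor values for a single observation under the dummy coding described above. It is an illustration only: the function and column names are hypothetical, trial is left unscaled because no centering is described, and the lme4-style formula in the comment is an assumed rendering of the verbal description, not a formula quoted from the analysis scripts.

```python
# Assumed lme4-style formula for the full quadratic model (an interpretation
# of the description, not taken from the study's code):
#   measure ~ style * (trial + I(trial^2)) + style * gender
#             + (1 + style + trial | participant) + (1 | item)


def fixed_effect_row(trial, style, gender, quadratic=True):
    """Fixed-effect predictors for one observation.

    Dummy coding: habitual = 0, Lombard = 1; female = 0, male = 1.
    """
    if style not in ("habitual", "lombard"):
        raise ValueError("style must be 'habitual' or 'lombard'")
    if gender not in ("female", "male"):
        raise ValueError("gender must be 'female' or 'male'")
    style_lombard = 1 if style == "lombard" else 0
    gender_male = 1 if gender == "male" else 0
    row = {
        "intercept": 1,
        "trial": trial,
        "style_lombard": style_lombard,
        "gender_male": gender_male,
        "style_x_trial": style_lombard * trial,
        "style_x_gender": style_lombard * gender_male,
    }
    if quadratic:
        row["trial_sq"] = trial ** 2
        row["style_x_trial_sq"] = style_lombard * trial ** 2
    return row


# Trial 10 in the Lombard style, male speaker.
print(fixed_effect_row(10, "lombard", "male"))
```

With this coding, the coefficients for trial and trial-squared describe the habitual style, and the style-by-trial interaction terms describe how the trial effect differs in the clear-Lombard style, which is the contrast RQ 1 asks about.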

Model comparisons using the anova( ) function in R between the full models with and without the quadratic trial term (and its interactions) were applied. The full models with better fit were then selected for stepwise model stripping to arrive at the most parsimonious models. More specifically, we took out insignificant interactions first and, next, removed insignificant effects, starting with those with the lowest t-values in the model outputs. Model comparisons were applied after each removal of the least significant predictor or interaction term to verify that the removal of each predictor/interaction term did not result in significant loss of model fit.

Having verified the relationship between HEGP-metric predicted intelligibility and human listeners' subjective ratings of listening effort for a subsample of our materials, we now return to our second research question on the consistency of HEGP-model predicted intelligibility scores over the course of our experimental session. Prior to fitting the LME model, the raw HEGP scores (proportion, p) for each unique sentence token were converted to logits using the following equation: $\mathrm{logit}(p) = \ln\left(\frac{p}{1 - p}\right)$ (Jaeger, 2008). Similar to the setup of the LME models for the four acoustic measures above, the quadratic term of trial (i.e., trial-squared) was also added to the LME model for HEGP logit scores. The same effects and interactions reported above were investigated here as well. We applied the same model comparison and simplification procedures. Specifically, we included HEGP-model predicted intelligibility scores (in logits) as a dependent variable. For the model with the quadratic term of trial: trial, trial-squared, and style were entered as fixed effects of interest. Speaker gender (female and male) was included as a fixed control predictor. Again, interactions between speaking style and trial and (if applicable) speaking style and trial-squared were included to investigate the consistency of the model predicted intelligibility in habitual and clear-Lombard speech over trials. Additionally, an interaction between speaking style and speaker gender was also included to test whether speakers' gender modified the difference in predicted intelligibility scores between speaking styles. For the random structure, we allowed participant and item (i.e., sentence) as two random effects plus random by-participant slopes for speaking style and trial. The full model with better fit was then stripped in a stepwise manner to arrive at the most parsimonious model with model comparisons applied after each removal of the least significant predictor.
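The logit transformation of a proportion can be sketched as follows. The function names are hypothetical, and the two example proportions are simply round values in the range of the HEGP scores, not results of the analysis.

```python
import math


def logit(p):
    """Log-odds of a proportion p, defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map a logit back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


# Proportions symmetric around 0.5 map to logits symmetric around 0.
print(round(logit(0.44), 3), round(logit(0.56), 3))  # -0.241 0.241
```

Working in logits removes the bounds at zero and one, so that a linear model cannot predict impossible proportions.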

Descriptive data for articulation rate, pitch measures (median F0 and F0 range in semitones), spectral balance, and HEGP-predicted intelligibility scores split by speaking style and speaker gender are summarized in Table I. Additionally, speakers' overall speech enrichment success (as predicted by the HEGP intelligibility model) is shown in Fig. 1.

TABLE I.

Summary of the four acoustic measures and HEGP intelligibility scores divided by speaking style (habitual and Lombard^a) and speaker gender. Values are mean (SD).

Measurement                     Habitual                        Lombard
                                Female          Male            Female          Male
Articulation rate (syll/sec)    5.53 (0.70)     5.98 (0.95)     4.53 (0.69)     5.19 (0.81)
Median F0 (semitone)            93.03 (1.95)    82.92 (2.91)    95.09 (2.07)    86.40 (2.75)
F0 range (semitone)             6.38 (1.90)     6.37 (2.12)     8.02 (2.04)     8.47 (1.86)
Spectral balance (dB)           19.96 (4.05)    19.39 (3.39)    12.17 (3.57)    16.34 (4.04)
HEGP intelligibility scores     0.44 (0.04)     0.44 (0.04)     0.56 (0.04)     0.50 (0.06)

^a Note that for the sake of simplicity, the clear-Lombard style in our study is referred to as “Lombard” when presenting statistical results and in graphs.

FIG. 1.

(Color online) Scatterplot illustrating the mean HEGP intelligibility scores (proportion), averaged over sentences per speaker, in the two speaking styles (habitual and Lombard). Data points are color-coded by speaker gender.

To obtain a general view of relationships between different measures in each speaking style, we explored the correlations between the acoustic measures and HEGP intelligibility scores through Pearson correlation coefficients. In addition, we examined how these intercorrelations may differ between speaking styles. Note that we adopted a conservative alpha level for our correlation measures (as we present 20 correlations in total, we opted for the strict alpha level of p < 0.0025).
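As a minimal sketch of the correlation procedure described above, the following Python code computes a Bonferroni-corrected alpha level and a Pearson correlation coefficient. The family-wise alpha of 0.05 is an assumption (it is the value implied by 0.0025 × 20), and the function names and toy values are hypothetical and are not data from the study.

```python
import math
from typing import Sequence


def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test alpha level under a Bonferroni correction."""
    return family_alpha / n_tests


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# 20 correlations in total, assuming a family-wise alpha of 0.05.
alpha_per_test = bonferroni_alpha(0.05, 20)

# Hypothetical toy values, NOT data from the study: lower spectral balance
# paired with higher predicted intelligibility gives a negative correlation.
toy_spectral_balance = [20.0, 18.0, 15.0, 12.0, 10.0]
toy_hegp = [0.42, 0.45, 0.50, 0.54, 0.58]
toy_r = pearson_r(toy_spectral_balance, toy_hegp)
```

Each correlation would then count as significant only if its p value falls below `alpha_per_test`.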

The correlations in Tables II and III show that, in both the habitual and the clear-Lombard speaking style, the strongest correlation was found between HEGP intelligibility scores and spectral balance (r = −0.66 and r = −0.64, respectively). In general, correlations between acoustic measures are stronger in the clear-Lombard style than in the habitual style. In the clear-Lombard style, HEGP intelligibility scores were associated with the pitch measures median F0 and F0 range (r = 0.49 and r = 0.11, respectively). However, note that F0 is gender dependent. When the clear-Lombard data were split by gender, the correlation between HEGP scores and median F0 was stronger in the female speakers (r = 0.27) than in the male speakers (r = 0.16). These results indicate that these acoustic features change hand in hand in (modified) speech production and that the four acoustic measures are associated with HEGP-model predicted intelligibility scores, particularly with predicted intelligibility for the clear-Lombard style.

TABLE II.

Pearson correlation coefficients of the four acoustic measures and HEGP intelligibility scores in the habitual speaking style. The asterisk represents significant correlations after Bonferroni correction for multiple testing.

| Parameter | F0 range | Median F0 | Articulation rate | Spectral balance |
| --- | --- | --- | --- | --- |
| HEGP | 0.09* | 0.01 | −0.13* | −0.66* |
| Spectral balance | −0.16* | 0.06* | 0.02 | |
| Articulation rate | −0.08* | −0.24* | | |
| Median F0 | 0.05* | | | |
TABLE III.

Pearson correlation coefficients of the four acoustic measures and HEGP intelligibility scores in the Lombard speaking style. The asterisk represents significant correlations after Bonferroni correction for multiple testing.

| Parameter | F0 range | Median F0 | Articulation rate | Spectral balance |
| --- | --- | --- | --- | --- |
| HEGP | 0.11* | 0.49* | −0.59* | −0.64* |
| Spectral balance | −0.05 | −0.45* | 0.31* | |
| Articulation rate | −0.09* | −0.39* | | |
| Median F0 | 0.02 | | | |

Outcomes of our statistical modelling procedures are discussed below per dependent variable (acoustic measures followed by HEGP-model predicted intelligibility scores). Figure 2 presents an overview of all of the time-course results (see footnote 1). Results of the statistical models are presented in the Appendix (Tables V–IX). A summary of all of the time-course statistical results is given in Table IV.

FIG. 2.

(Color online) An overview of all of the model-based time-course results, illustrating how articulation rate, median F0, F0 range, spectral balance, and HEGP are associated with speaking style and trial (shading represents 95% confidence interval).

TABLE IV.

An overall summary of all of the time-course results relating to speaking style with coefficient estimates and standard errors (cf. Tables V–IX in the Appendix for the full models that these estimates were taken from). The asterisk denotes significant results.

Cell entries are estimates (standard errors).

| Predictors | Articulation rate | Median F0 | F0 range | Spectral balance | HEGP (logits) |
| --- | --- | --- | --- | --- | --- |
| (Intercept) | 5.25 (0.09)* | 92.80 (0.24)* | 6.61 (0.19)* | 20.34 (0.42)* | −0.22 (0.02)* |
| Style (Lombard) | −0.62 (0.07)* | 1.77 (0.17)* | 1.43 (0.13)* | −7.87 (0.37)* | 0.43 (0.02)* |
| Trial² | −0.0004 (0.000 04)* | −0.0003 (0.0001)* | | 0.0003 (0.0003) | 0.000 02 (0.000 01)* |
| Trial | 0.02 (0.002)* | 0.02 (0.006)* | −0.009 (0.003)* | −0.02 (0.02) | −0.0009 (0.0005) |
| Style (Lombard) × trial² | 0.0005 (0.000 05)* | | | −0.001 (0.0004)* | |
| Style (Lombard) × trial | −0.03 (0.003)* | 0.01 (0.003)* | 0.01 (0.003)* | 0.05 (0.02)* | 0.0006 (0.0002)* |

1. Acoustic measures

a. Articulation rate.

For articulation rate, the model with the quadratic term of trial had a better fit than the model without. Table V shows that, with the habitual speaking style mapped on the intercept, articulation rate is modulated by trial, trial-squared, speaking style, and speaker gender. Additionally, there are significant interactions between speaking style and trial and between speaking style and trial-squared. These results address our RQ 1 on consistency over trials by showing that the rate difference between the two speaking styles is not consistent across trials. Figure 2 shows that speakers initially sped up over trials in the habitual speaking style while slowing down in the clear-Lombard speaking style. As these effects leveled off again at later trials, the overall rate difference between the two speaking styles was largest halfway through the experimental trials and then decreased again. Moreover, female speakers had, in general, slower articulation rates than male speakers.
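To make the shape of this interaction concrete, the following Python sketch evaluates the model-implied rate difference between the two styles from the fixed-effect estimates in Table V. It assumes that trial enters the model as an uncentered, unscaled trial count, which is not stated in this section, and it ignores random effects; the function names are hypothetical.

```python
# Fixed-effect estimates for articulation rate (syll/sec), taken from Table V.
B_STYLE = -0.62449           # Style (Lombard)
B_STYLE_TRIAL = -0.03116     # Style (Lombard) x trial
B_STYLE_TRIAL_SQ = 0.00054   # Style (Lombard) x trial-squared


def style_difference(trial: float) -> float:
    """Predicted Lombard-minus-habitual articulation rate at a given trial.

    Assumes trial is an uncentered trial count (an assumption, see lead-in).
    """
    return B_STYLE + B_STYLE_TRIAL * trial + B_STYLE_TRIAL_SQ * trial ** 2


def trial_of_largest_difference() -> float:
    """Vertex of the quadratic, where the difference is most negative."""
    return -B_STYLE_TRIAL / (2.0 * B_STYLE_TRIAL_SQ)


peak_trial = trial_of_largest_difference()
peak_difference = style_difference(peak_trial)
```

Under these assumptions the vertex lies between trials 28 and 29 of the 48 sentences, which is broadly in line with the observation that the rate difference was largest around the middle of the sentence list.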

TABLE V.

Model predicting articulation rate. Asterisks denote significant results.

Articulation rate

| Predictors | Estimates | Standard error | p |
| --- | --- | --- | --- |
| (Intercept) | 5.253 64 | 0.089 43 | <0.001* |
| Style (Lombard) | −0.624 49 | 0.068 77 | <0.001* |
| Trial² | −0.000 37 | 0.000 04 | <0.001* |
| Trial | 0.021 98 | 0.002 02 | <0.001* |
| Gender (male) | 0.593 13 | 0.138 41 | <0.001* |
| Style (Lombard) × trial² | 0.000 54 | 0.000 05 | <0.001* |
| Style (Lombard) × trial | −0.031 16 | 0.002 64 | <0.001* |
| Random effects (SD) | | | |
| Subject (intercept) | 0.580 30 | | |
| Speech style by subject | 0.557 81 | | |
| Trial by subject | 0.005 09 | | |
| Sentence (intercept) | 0.335 74 | | |
| Residual | 0.293 02 | | |
| N (subjects) | 78 | | |
| N (sentences) | 48 | | |
| Observations | 7488 | | |
| Conditional R² | 0.897 | | |
b. Median F0.

For median F0, the model with the quadratic term of trial had a better fit than the model with only a linear trial effect. Table VI shows that for the habitual speaking style mapped on the intercept, changes in median F0 (in semitones) are related to trial, trial-squared, speaking style, and speaker gender. Furthermore, speaking style interacted with trial (linear trial term only) and speaker gender. These results illustrate that speakers raised their median F0 throughout the experiment session in both speaking styles and the increase in median F0 in the clear-Lombard style was larger than the increase in the habitual style (see Fig. 2). As a result, the difference in median F0 between the two speaking styles was enlarged throughout the experiment session. Additionally, female speakers had higher median F0 than male speakers, and Lombard speaking style exhibited higher median F0 than habitual speaking style. The significant interaction between speaker gender and speaking style reflects that the increase in median F0 in the Lombard style was larger for male speakers than for female speakers.
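The style × gender and style × trial interactions can be read off the Table VI estimates as follows. This Python sketch is only illustrative: it assumes that female speakers and the habitual style are the reference levels (as the text indicates), that trial is an uncentered trial count (not stated in this section), and it ignores random effects; the function name is hypothetical.

```python
# Fixed-effect estimates for median F0 (semitones), taken from Table VI.
B_STYLE = 1.76779          # Style (Lombard), female speakers at the reference trial
B_STYLE_TRIAL = 0.01351    # Style (Lombard) x trial
B_STYLE_GENDER = 1.24737   # Style (Lombard) x gender (male)


def lombard_f0_raise(trial: float, is_male: bool) -> float:
    """Predicted Lombard-minus-habitual median F0 (semitones).

    Assumes trial is an uncentered trial count (an assumption, see lead-in).
    """
    gender_term = B_STYLE_GENDER if is_male else 0.0
    return B_STYLE + B_STYLE_TRIAL * trial + gender_term


female_raise_start = lombard_f0_raise(0, is_male=False)
male_raise_start = lombard_f0_raise(0, is_male=True)
```

With these estimates, the predicted raise at the reference trial is about 1.8 semitones for female speakers and about 3.0 semitones for male speakers, and it grows with trial for both groups.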

TABLE VI.

Model predicting median F0. Asterisks denote significant results.

Median F0

| Predictors | Estimates | Standard error | p |
| --- | --- | --- | --- |
| (Intercept) | 92.803 96 | 0.236 95 | <0.001* |
| Style (Lombard) | 1.767 79 | 0.166 35 | <0.001* |
| Trial | 0.019 22 | 0.006 02 | 0.001* |
| Gender (male) | −10.080 96 | 0.465 93 | <0.001* |
| Trial² | −0.000 32 | 0.000 12 | 0.006* |
| Style (Lombard) × trial | 0.013 51 | 0.002 60 | <0.001* |
| Style (Lombard) × gender (male) | 1.247 37 | 0.316 78 | <0.001* |
| Random effects (SD) | | | |
| Subject (intercept) | 1.725 23 | | |
| Speech style by subject | 1.184 95 | | |
| Trial by subject | 0.011 52 | | |
| Sentence (intercept) | 0.366 66 | | |
| Residual | 1.269 91 | | |
| N (subjects) | 78 | | |
| N (sentences) | 48 | | |
| Observations | 7488 | | |
| Conditional R² | 0.925 | | |
c. F0 range.

For F0 range, the model with the quadratic term of trial did not differ from the model without the quadratic term such that the simpler model (with only a linear trial effect) was the more parsimonious model. Table VII shows that for the habitual speaking style mapped on the intercept, F0 range measures (in semitones) decreased over trials. The significant speaking style effect suggested that speakers exhibit larger F0 range in Lombard than in habitual speaking style. Related to our RQ 1, the interaction between speaking style and trial indicates that the difference in F0 range between Lombard and habitual speaking style actually increased over trials. As F0 range in the clear-Lombard style is rather stable over trials, this increased difference between the two speaking styles over the course of sentence trials is mainly the result of a decline in F0 range toward the end of the sentence list in the habitual style (see Fig. 2). A lack of a gender effect in F0 range suggests that females and males had similar F0 ranges (in semitones) and female and male speakers increased their F0 range to similar degrees when changing speaking styles.
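The per-style trial slopes implied by this interaction can be derived from the Table VII estimates. The Python sketch below is a simple illustration that ignores random effects; the variable names are hypothetical.

```python
# Fixed-effect estimates for F0 range (semitones), taken from Table VII.
B_TRIAL = -0.00932         # Trial (habitual style, the reference level)
B_STYLE_TRIAL = 0.01257    # Style (Lombard) x trial

# Simple slopes: change in F0 range per trial within each speaking style.
habitual_slope = B_TRIAL
lombard_slope = B_TRIAL + B_STYLE_TRIAL
```

The Lombard slope (about 0.003 semitones per trial) is much closer to zero than the habitual slope (about −0.009 semitones per trial), which matches the description that the growing style difference mainly reflects a decline in the habitual style.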

TABLE VII.

Model predicting F0 range. Asterisks denote significant results.

F0 range

| Predictors | Estimates | Standard error | p |
| --- | --- | --- | --- |
| (Intercept) | 6.607 42 | 0.185 23 | <0.001* |
| Style (Lombard) | 1.432 06 | 0.134 47 | <0.001* |
| Trial | −0.009 32 | 0.002 61 | <0.001* |
| Style (Lombard) × trial | 0.012 57 | 0.002 67 | <0.001* |
| Random effects (SD) | | | |
| Subject (intercept) | 1.427 97 | | |
| Speech style by subject | 1.003 38 | | |
| Trial by subject | 0.012 67 | | |
| Sentence (intercept) | 0.483 25 | | |
| Residual | 1.304 63 | | |
| N (subjects) | 78 | | |
| N (sentences) | 48 | | |
| Observations | 7488 | | |
| Conditional R² | 0.637 | | |
d. Spectral balance.

For spectral balance, the model with the quadratic trial term had better fit than the model without. Table VIII displays that relative to the habitual speaking style mapped on the intercept, changes in spectral balance are related to speaking style. As lower values in spectral balance indicate louder voices or more vocal effort/energy, these results suggest that speakers increased their vocal effort/energy when changing from habitual to Lombard speaking style (shown as negative values in Table VIII). Although trial and trial-squared were not significant predictors for the habitual speaking style mapped on the intercept, they significantly interacted with speaking style. These results suggest the over-trial increase in vocal effort/energy was present in the clear-Lombard speaking style and nonlinearly so. Thus, instead of decreasing vocal effort toward the end of the clear-Lombard reading session, speakers in our experiment actually increased their vocal effort/energy more toward the end (see Fig. 2). The effect of gender was not significant for the habitual style, but it interacted with speaking style with males increasing their vocal effort/energy less than females when changing from habitual to Lombard speaking style.
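The nonlinear change in the clear-Lombard style can be sketched by combining the trial terms from Table VIII. The Python code below assumes that trial is an uncentered, unscaled trial count (not stated in this section) and ignores random effects; the function name is hypothetical.

```python
# Fixed-effect estimates for spectral balance (dB), taken from Table VIII.
B_TRIAL = -0.02493            # Trial (habitual style)
B_TRIAL_SQ = 0.00030          # Trial-squared (habitual style)
B_STYLE_TRIAL = 0.04572       # Style (Lombard) x trial
B_STYLE_TRIAL_SQ = -0.00129   # Style (Lombard) x trial-squared


def lombard_change(trial: float) -> float:
    """Predicted change in Lombard spectral balance relative to trial 0.

    Assumes trial is an uncentered trial count (an assumption, see lead-in).
    Negative values indicate more vocal effort/energy.
    """
    linear = (B_TRIAL + B_STYLE_TRIAL) * trial
    quadratic = (B_TRIAL_SQ + B_STYLE_TRIAL_SQ) * trial ** 2
    return linear + quadratic


change_mid = lombard_change(24)
change_end = lombard_change(48)
```

Under these assumptions the predicted change is small over the first half of the list and becomes clearly negative (roughly −1.3 dB by trial 48), consistent with the description that vocal effort/energy increased more toward the end of the clear-Lombard session.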

TABLE VIII.

Model predicting spectral balance. Asterisks denote significant results.

Spectral balance

| Predictors | Estimates | Standard error | p |
| --- | --- | --- | --- |
| (Intercept) | 20.335 56 | 0.424 00 | <0.001* |
| Style (Lombard) | −7.870 21 | 0.372 44 | <0.001* |
| Trial² | 0.000 30 | 0.000 30 | 0.321 |
| Trial | −0.024 93 | 0.015 60 | 0.110 |
| Gender (male) | −0.585 05 | 0.667 21 | 0.381 |
| Style (Lombard) × trial² | −0.001 29 | 0.000 42 | 0.002* |
| Style (Lombard) × trial | 0.045 72 | 0.021 06 | 0.030* |
| Style (Lombard) × gender (male) | 4.660 22 | 0.658 72 | <0.001* |
| Random effects | | | |
| Subject (intercept) | 2.458 18 | | |
| Speech style by subject | 2.354 76 | | |
| Trial by subject | 0.020 58 | | |
| Sentence (intercept) | 1.615 63 | | |
| Residual | 2.343 00 | | |
| N (subjects) | 78 | | |
| N (sentences) | 48 | | |
| Observations | 7488 | | |
| Conditional R² | 0.800 | | |
e. HEGP-model predicted intelligibility.

For HEGP-model predicted intelligibility scores, the model with the quadratic term of trial had a better fit than the model without. Table IX shows that HEGP-model predicted intelligibility scores for the habitual speaking style (mapped on the intercept) are significantly related to the nonlinear term of trial. Predicted intelligibility increased nonlinearly over trials throughout the experiment. The speaking style effect confirmed that clear-Lombard speech indeed had a higher predicted intelligibility in noise than habitual speech. Furthermore, speaking style interacted with trial, suggesting that, at least for female speakers (mapped on the intercept), the speaking style effect (or Lombard intelligibility gain) increased over trials (see Fig. 2). The interaction between speaking style and gender indicates that male speakers had a smaller predicted Lombard intelligibility gain than female speakers. This result echoes our previous results on the consistency of the speech modifications that speakers apply over trials. Speakers generally increased their Lombard speech modifications over trials, resulting in improved (predicted) intelligibility gains in noise over the course of the experiment.
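Because the model is fitted on logits, its estimates can be mapped back to proportions with the inverse logit. The Python sketch below does this for the four style-by-gender cells using the fixed effects in Table IX. It evaluates the model at trial = 0 only, ignores random effects, and assumes that female speakers and the habitual style are the reference levels; the function names are hypothetical.

```python
import math

# Fixed-effect estimates for HEGP scores (logits), taken from Table IX.
B_INTERCEPT = -0.22252      # habitual style, female speakers
B_STYLE = 0.43396           # Style (Lombard)
B_GENDER = 0.00119          # Gender (male)
B_STYLE_GENDER = -0.21864   # Style (Lombard) x gender (male)


def inv_logit(x: float) -> float:
    """Map a logit back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def predicted_hegp(lombard: bool, male: bool) -> float:
    """Predicted HEGP proportion at trial 0 (trial terms and random effects ignored)."""
    logit = B_INTERCEPT
    if lombard:
        logit += B_STYLE
    if male:
        logit += B_GENDER
    if lombard and male:
        logit += B_STYLE_GENDER
    return inv_logit(logit)


female_gain = predicted_hegp(True, False) - predicted_hegp(False, False)
male_gain = predicted_hegp(True, True) - predicted_hegp(False, True)
```

Under these simplifications the back-transformed values are close to the descriptive means in Table I (roughly 0.44 for habitual speech in both genders, and roughly 0.55 and 0.50 for female and male Lombard speech, respectively), and the predicted Lombard gain is larger for female than for male speakers.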

TABLE IX.

Model predicting HEGP-model predicted intelligibility. Asterisks denote significant results.

HEGP (logits)

| Predictors | Estimates | Standard error | p |
| --- | --- | --- | --- |
| (Intercept) | −0.222 52 | 0.017 36 | <0.001* |
| Style (Lombard) | 0.433 96 | 0.018 21 | <0.001* |
| Trial | −0.000 91 | 0.000 46 | 0.051 |
| Gender (male) | 0.001 19 | 0.022 14 | 0.957 |
| Trial² | 0.000 02 | 0.000 01 | 0.015* |
| Style (Lombard) × trial | 0.000 55 | 0.000 20 | 0.005* |
| Style (Lombard) × gender (male) | −0.218 64 | 0.037 36 | <0.001* |
| Random effects (SD) | | | |
| Subject (intercept) | 0.080 55 | | |
| Speech style by subject | 0.136 11 | | |
| Trial by subject | 0.001 11 | | |
| Sentence (intercept) | 0.089 50 | | |
| Residual | 0.093 86 | | |
| N (subjects) | 78 | | |
| N (sentences) | 48 | | |
| Observations | 7488 | | |
| Conditional R² | 0.880 | | |

In this study, we investigated variability in the speech enrichment modifications speakers apply when changing from baseline (habitual) speech to clear-Lombard speech production. More specifically, we analyzed how consistently speakers produced enriched and habitual speech over trials (RQ 1). Furthermore, we investigated the consistency of speakers' model predicted speech intelligibility for the two speaking styles (RQ 2). In general, albeit to a varying degree, virtually all of the speakers applied enrichment modifications in their clear-Lombard speech compared to their habitual style speech (see Fig. 1 in Sec. III). These enrichment modifications are evident from the four acoustic measures, namely, articulation rate, median F0, F0 range, and spectral balance. This confirms that speakers generally enriched their speaking style through the clear-Lombard speech elicitation technique, which enables us to address the research questions for this study.

Most of the acoustic measures (i.e., articulation rate, median F0, and spectral balance) showed changes during the course of the experiment in the clear-Lombard speaking style. These findings suggest that speakers constantly adapted their speech modifications throughout the experiment session regardless of whether these changes were linear or nonlinear. Additionally, any increase in the difference between speaking styles over trials can be attributed mainly to changes in the habitual speaking style as observed in the F0 range or to changes in both styles. Hence, the intelligibility benefit of the clear-Lombard style is far from static. Unlike Lee and Baese-Berk (2020), speakers in our study exhibited mostly increased speech enrichment modifications and, hence, improved intelligibility ratings throughout the experiment. Given the differences in task setup, clear speech production in the more spontaneous conversations as in the study by Lee and Baese-Berk (2020) could be more effortful to begin with, which may explain the differences in clarity/intelligibility changes over time. Possibly, if speakers read rather than formulate sentences spontaneously, they can allocate more attention to the clarity of their pronunciation.

The enrichment modifications in articulation rate that speakers applied in their clear-Lombard speech (i.e., slowing down over Lombard reading trials) showed the opposite pattern to the rate changes in their habitual style, so that the difference in articulation rate between the two speaking styles reached its maximum halfway through the experiment. Thus, speakers initially took some time to adjust their articulation rate for the clear speaking style, and they then gradually returned to rates that they are presumably more comfortable with.

Concerning pitch measures, clear-Lombard speech exhibited significantly higher median F0 and wider F0 range. Additionally, speakers generally increased their pitch enrichment modifications throughout the experiment session. Specifically, they continuously increased their median F0 in the clear-Lombard style. These results show that our speakers were able to continuously raise their pitch throughout the experiment session, which again suggests that speakers need time or practice to achieve their maximally clear style. Although speakers cannot keep raising their pitch, they may do so until a certain “asymptote” is reached, which is similar to that observed in articulation rate. However, without empirical data, we cannot know the maximum amount of pitch modifications possible for our young adult speakers tested here. It seems that from comparing the rate and pitch patterns over trials, speakers need more time or practice for pitch adaptation than rate adaptation in clear-Lombard speech production. Alternatively, speakers may assign greater value to pitch adaptation such that they strive to continue improving it. For future work, it could be informative to test speakers' maximum pitch modifications capacity in clear-Lombard speech production using longer experimental sessions.

For spectral balance, changes over trials were greater in the clear-Lombard style than in the habitual style, especially toward the end of the experiment session. These results suggest that speakers were able to continuously apply this speech enrichment adaptation in their clear-Lombard speech production, and some practice is needed for them to realize the more “enriched” speech. Whereas speakers seemed to exert less effort over trials in their habitual style, as evident from changes in pitch and rate, their vocal effort remained more or less constant in their habitual reading style. However, to test speakers' maximum capacity in speech enrichment modifications, a longer experiment session would be required given that vocal fatigue has generally been observed after prolonged periods of voice use (e.g., Novak et al., 1991; Gelfer et al., 1991).

These speech enrichment modifications that speakers applied in the clear-Lombard speaking style are consistent with previous literature (e.g., Van Summers et al., 1988; Cooke and Lu, 2010; Garnier and Henrich, 2014; Junqua, 1993). The novel results on the consistency of speakers' speech enrichment modifications in an experiment session indicate that speakers (at least the healthy and young adult speakers tested in our study) may need some practice to reach their full potential in producing clear and/or Lombard speech. It is unclear whether this practice result, like our overall pattern of results, should be attributed to the noise aspect of our clear-Lombard condition or to the explicit instructions to speak clearly. The combination of speaking in noise and clear-speech instructions may have boosted the modifications observed in our study and driven the practice effect. Hence, it remains for further investigation whether our pattern of results challenges the idea that the Lombard reflex is an automatic change. Nevertheless, this “practice effect,” or the fact that speakers need some time to adjust their speech production, has also been found in studies in which speakers' auditory feedback was altered (e.g., Purcell and Munhall, 2006) and in which their articulation was disturbed (e.g., Fowler and Turvey, 1980). Note, again, that in our study, speakers adapted their speaking style while being exposed to loud noise played through headphones, which strongly reduced their auditory feedback. Possibly, on receiving such limited feedback on their speech production, speakers still realized that they could do more and, thus, kept trying to overcome the noise to monitor their own speech better by extracting useful (auditory and/or somatosensory) information about how well their articulation targets were being met.
Thus, our finding that speakers generally became better at enriching their speech over time without actually being able to hear their own speech well raises an interesting question for future studies: what (somatosensory) cues did speakers use to improve their speech clarity? Possibly, this very limited auditory feedback caused speakers to be relatively slow at adjusting their speech output during clear-Lombard speech production compared to a condition in which they would have heard their own speech better. Future studies are needed to address this.

In general, higher HEGP scores, particularly in the clear-Lombard style, were associated with reduced articulation rate, raised median F0, expanded F0 range, and reduced spectral balance (i.e., increased vocal effort/energy). Amongst these correlations, HEGP intelligibility scores and spectral balance measures displayed the strongest relationship, suggesting that increased vocal effort/energy was the most salient contributor to higher (predicted) intelligibility in noise.

In line with the enlarged acoustic differences (over trials) between speaking styles, the predicted intelligibility difference between the two speaking styles also increased over sentence trials. As noted before, such an increase could, in principle, be brought about by (opposite) changes in both styles or a change in either style. For predicted intelligibility, it seems that this enlarged difference over trials was mainly driven by speakers' increased HEGP scores over trials in the clear-Lombard speaking style. These scores are likely associated with alterations in spectral balance and articulation rate throughout the Lombard reading session (see, also, Table III with correlational data).

Interestingly, gender differences in speakers' speech enrichment success, as reflected by HEGP-model predicted intelligibility, were also found in our speaker group. Female speakers had a larger (predicted) Lombard intelligibility gain compared to male speakers (see, e.g., Ferguson and Morgan, 2018; Junqua, 1993; Bradlow et al., 1996). This is in line with the acoustic finding that female speakers increased their vocal effort/energy more than male speakers in their clear-Lombard speech. Male speakers raised their pitch more than female speakers but did not increase their vocal effort as much as female speakers when enriching their speech for the clear-Lombard style. These results also link with the observation that vocal effort/energy (as indexed by spectral balance) has the strongest correlation with HEGP intelligibility scores. The differences in speech enrichment modifications between female and male speakers may have contributed to the higher (HEGP-model predicted) intelligibility scores for female speakers, particularly in the clear-Lombard speaking style.

The current study elicited read sentences in isolation rather than in the presence of an addressee or via a communicative task. Additionally, a fixed elicitation order (habitual followed by clear-Lombard) was used. Further studies are required to determine the extent to which the findings of the current study generalize under different elicitation conditions. It is also possible that longer sentence lists would reveal articulatory fatigue effects. Although the choice of 200 ms as a threshold for removing silences in articulation rate measures is in line with previous studies (e.g., Cucchiarini et al., 2002; Gustafson and Goldrick, 2018), future clear-Lombard/plain speech comparisons would benefit from an in-depth analysis to determine how (other) acoustic and predicted intelligibility measures are affected by this choice of threshold.

Our study was also not specifically designed to address research questions about gender, and, because of the imbalanced gender distribution in our speaker sample, any gender-related results should be interpreted with caution. In addition, even though we take the HEGP metric to reflect speech intelligibility in noise, listening tests with human listeners are needed to show how the acoustic changes over sentence trials affect human listener ratings. After all, the HEGP model does not “perceive” speech intelligibility in the same way as human listeners do, as it focuses on “low-level” energetic masking (or release thereof) while ignoring segmental changes that could be beneficial to listeners, such as vowel space expansion. However, it is worth noting that model predicted intelligibility scores did correlate highly with human listeners' listening effort ratings in a subsample of speakers. These limitations suggest the need for future studies to deepen our understanding of how acoustic changes affect communicative behavior.

In summary, our study showed that young adult speakers may need some practice to reach their full speech enrichment potential when asked to speak clearly in the presence of loud background noise, which drastically reduced their auditory feedback. This was also reflected by the HEGP-model predictions of their speech intelligibility in noise. All in all, our results demonstrate that habitual and clear-Lombard speaking styles are not constant across a sentence list, highlighting the dynamic nature of speaking styles.

This project has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Grant No. 675324 (ENRICH). We would like to thank Mirjam Ernestus and Valerie Hazan for their valuable input and comments.

See Tables V–IX for detailed results of the statistical models for articulation rate, median F0, F0 range, spectral balance, and HEGP.

1 See supplementary material at https://www.scitation.org/doi/suppl/10.1121/10.0017769 for visualization of all of the time-course measurements with raw data overlaid on the model curves (color). Note that for a clearer illustration of overlapping data points, a random jitter was applied to the raw data to improve the visualization of density.

1. Amazi, D., and Garber, S. (1982). “The Lombard sign as a function of age and task,” J. Speech. Lang. Hear. Res. 25(4), 581–585.
2. Baker, R., and Hazan, V. (2011). “DiapixUK: Task materials for the elicitation of multiple spontaneous speech dialogs,” Behav. Res. Methods 43(3), 761–770.
3. Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). “Fitting linear mixed-effects models using lme4,” J. Stat. Software 67(1), 1–48.
4. Boersma, P., and Weenink, D. (2017). “Praat: Doing phonetics by computer (version 6.0.36) [computer program],” available at http://www.praat.org/ (Last viewed August 5, 2020).
5. Bosker, H. R., and Cooke, M. (2020). “Enhanced amplitude modulations contribute to the Lombard intelligibility benefit: Evidence from the Nijmegen Corpus of Lombard Speech,” J. Acoust. Soc. Am. 147(2), 721–730.
6. Bottalico, P., Graetzer, S., and Hunter, E. J. (2016). “Effects of speech style, room acoustics, and vocal fatigue on vocal effort,” J. Acoust. Soc. Am. 139(5), 2870–2879.
7. Bradlow, A. R., and Bent, T. (2002). “The clear speech effect for non-native listeners,” J. Acoust. Soc. Am. 112(1), 272–284.
8. Bradlow, A. R., Kraus, N., and Hayes, E. (2003). “Speaking clearly for children with learning disabilities: Sentence perception in noise,” J. Speech. Lang. Hear. Res. 46(1), 80–97.
9. Bradlow, A. R., Torretta, G. M., and Pisoni, D. B. (1996). “Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics,” Speech Commun. 20, 255–272.
10. Brehm, L., and Alday, P. M. (2022). “Contrast coding choices in a decade of mixed models,” J. Mem. Lang. 125, 104334.
11. Bruce, P., and Bruce, A. (2017). Practical Statistics for Data Scientists (O'Reilly Media, Inc., Sebastopol, CA), Chap. 4.
12. Cooke, M. (2006). “A glimpsing model of speech perception in noise,” J. Acoust. Soc. Am. 119(3), 1562–1573.
13. Cooke, M., King, S., Garnier, M., and Aubanel, V. (2014). “The listening talker: A review of human and algorithmic context-induced modifications of speech,” Comput. Speech Lang. 28(2), 543–571.
14. Cooke, M., and Lu, Y. (2010). “Spectral and temporal changes to speech produced in the presence of energetic and informational maskers,” J. Acoust. Soc. Am. 128, 2059–2069.
15. Cucchiarini, C., Strik, H., and Boves, L. (2002). “Quantitative assessment of second language learners' fluency: Comparisons between read and spontaneous speech,” J. Acoust. Soc. Am. 111(6), 2862–2873.
16. Dichter, B. K., Breshears, J. D., Leonard, M. K., and Chang, E. F. (2018). “The control of vocal pitch in human laryngeal motor cortex,” Cell 174(1), 21–31.e9.
17. Ferguson, S. H., and Morgan, S. D. (2018). “Acoustic and perceptual correlates of subjectively rated sentence clarity in clear and conversational speech,” J. Speech. Lang. Hear. Res. 61(1), 159–173.
18. Fowler, C. A., and Turvey, M. T. (1980). “Immediate compensation in bite-block speech,” Phonetica 37(5-6), 306–326.
19. Garnier, M., Dohen, M., Loevenbruck, H., Welby, P., and Bailly, L. (2008). “The Lombard effect: A physiological reflex or a controlled intelligibility enhancement?,” in 7th International Seminar on Speech Production, pp. 255–262.
20. Garnier, M., and Henrich, N. (2014). “Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise?,” Comput. Speech Lang. 28(2), 580–597.
21.
Garnier
,
M.
,
Henrich
,
N.
, and
Dubois
,
D.
(
2010
). “
Influence of sound immersion and communicative interaction on the Lombard effect
,”
J. Speech. Lang. Hear. Res.
53
(
3
),
588
608
.
22.
Gelfer
,
M. P.
,
Andrews
,
M. L.
, and
Schmidt
,
C. P.
(
1991
). “
Effects of prolonged loud reading on selected measures of vocal function in trained and untrained singers
,”
J. Voice
5
(
2
),
158
167
.
23.
Gustafson
,
E.
, and
Goldrick
,
M.
(
2018
). “
The role of linguistic experience in the processing of probabilistic information in production
,”
Lang. Cognit. Neurosci.
33
(
2
),
211
226
.
24.
Hammarberg
,
B.
,
Fritzell
,
B.
,
Gauffin
,
J.
,
Sundberg
,
J.
, and
Wedin
,
L.
(
1980
). “
Perceptual and acoustic correlates of abnormal voice qualities
,”
Acta Oto-Laryngol
90
(
5-6
),
441
451
.
25.
Hazan
,
V.
, and
Baker
,
R.
(
2011
). “
Acoustic-phonetic characteristics of speech produced with communicative intent to counter adverse listening conditions
,”
J. Acoust. Soc. Am.
130
(
4
),
2139
2152
.
26.
Hazan
,
V.
, and
Markham
,
D.
(
2004
). “
Acoustic-phonetic correlates of talker intelligibility for adults and children
,”
J. Acoust. Soc. Am.
116
(
5
),
3108
3118
.
27.
Jaeger
,
T. F.
(
2008
). “
Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models
,”
J. Mem. Lang.
59
(
4
),
434
446
.
28.
Junqua
,
J.
(
1993
). “
The Lombard reflex and its role on human listeners and automatic speech recognizers
,”
J. Acoust. Soc. Am.
93
,
510
524
.
29.
Krueger
,
M.
,
Schulte
,
M.
,
Brand
,
T.
, and
Holube
,
I.
(
2017a
). “
Development of an adaptive scaling method for subjective listening effort
,”
J. Acoust. Soc. Am.
141
(
6
),
4680
4693
.
30.
Krueger
,
M.
,
Schulte
,
M.
,
Zokoll
,
M. A.
,
Wagener
,
K. C.
,
Meis
,
M.
,
Brand
,
T.
, and
Holube
,
I.
(
2017b
). “
Relationship between listening effort and speech intelligibility in noise
,”
Am. J. Audiol.
26
,
378
392
.
31.
Kryter
,
K. D.
(
1962a
). “
Methods for the calculation and use of the articulation index
,”
J. Acoust. Soc. Am.
34
(
11
),
1689
1697
.
32.
Kryter
,
K. D.
(
1962b
). “
Validation of the articulation index
,”
J. Acoust. Soc. Am.
34
(
11
),
1698
1702
.
33.
Lam
,
J.
, and
Tjaden
,
K.
(
2013
). “
Intelligibility of clear speech: Effect of instruction
,”
J. Speech. Lang. Hear. Res.
56
(
5
),
1429
1440
.
34.
Lee
,
D.-Y.
, and
Baese-Berk
,
M. M.
(
2020
). “
The maintenance of clear speech in naturalistic conversations
,”
J. Acoust. Soc. Am.
147
(
5
),
3702
3711
.
35.
Lombard
,
É.
(
1911
). “
Le signe de l'elevation de la voix” (“The sign of raising the voice”)
,
Ann. Maladies l'oreille Larynx
37
,
101
109
.
36.
Lu
,
Y.
, and
Cooke
,
M.
(
2008
). “
Speech production modifications produced by competing talkers, babble, and stationary noise
,”
J. Acoust. Soc. Am.
124
(
5
),
3261
3275
.
37.
Lu
,
Y.
, and
Cooke
,
M.
(
2009
). “
The contribution of changes in F0 and spectral tilt to increased intelligibility of speech produced in noise
,”
Speech Commun.
51
,
1253
1262
.
38.
Marcoux
,
K.
, and
Ernestus
,
M.
(
2019
). “
Pitch in native and non-native Lombard speech
,” in
Proceedings of the 19th ICPhS
, pp.
2605
2609
.
39.
Novak
,
A.
,
Dlouha
,
O.
,
Capkova
,
B.
, and
Vohradnik
,
M.
(
1991
). “
Voice fatigue after theater performance in actors
,”
Folia Phoniatr. Logop.
43
(
2
),
74
78
.
40.
Purcell
,
D. W.
, and
Munhall
,
K. G.
(
2006
). “
Adaptive control of vowel formant frequency: Evidence from real-time formant manipulation
,”
J. Acoust. Soc. Am.
120
(
2
),
966
977
.
41.
Schum
,
D. J.
(
1996
). “
Intelligibility of clear and conversational speech of young and elderly talkers
,”
J. Am. Acad. Audiol.
7
(
3
),
212
218
.
42.
Shen
,
C.
, and
Janse
,
E.
(
2018
). “Radboud Lombard Corpus (Dutch) (RaLoCo),”
Zenodo
, available at https://doi.org/10.5281/zenodo.4040685 (Last viewed November 22, 2021).
43.
Shen
,
C.
, and
Janse
,
E.
(
2021
). “Radboud Lombard Corpus (Dutch)—Additional information [data set],”
Zenodo
, available at https://doi.org/10.5281/zenodo.5645385 (Last viewed November 22, 2021).
44.
Smiljanić
,
R.
, and
Bradlow
,
A. R.
(
2005
). “
Production and perception of clear speech in Croatian and English
,”
J. Acoust. Soc. Am.
118
(
3
),
1677
1688
.
45.
Solomon
,
N. P.
(
2008
). “
Vocal fatigue and its relation to vocal hyperfunction
,”
Int. J. Speech-Lang. Pathol.
10
(
4
),
254
266
.
46.
Steeneken
,
H. J. M.
, and
Houtgast
,
T.
(
1980
). “
A physical method for measuring speech-transmission quality
,”
J. Acoust. Soc. Am.
67
(
1
),
318
326
.
47.
Tang
,
P.
,
Xu Rattanasone
,
N.
,
Yuen
,
I.
, and
Demuth
,
K.
(
2017
). “
Phonetic enhancement of Mandarin vowels and tones: Infant-directed speech and Lombard speech
,”
J. Acoust. Soc. Am.
142
(
2
),
493
503
.
48.
Tang
,
Y.
, and
Cooke
,
M.
(
2012
). “
Optimising spectral weightings for noise-dependent speech intelligibility enhancement
,” in
Proceedings of Interspeech
, pp.
955
958
.
49.
Tang
,
Y.
, and
Cooke
,
M.
(
2016
). “
Glimpse-based metrics for predicting speech intelligibility in additive noise conditions
,” in
Proceedings of Interspeech
, pp.
2488
2492
.
50.
Tuomainen
,
O.
, and
Hazan
,
V.
(
2016
). “
Articulation rate in adverse listening conditions in younger and older adults
,” in
Proceedings of Interspeech
, pp.
2105
2109
.
51.
Uchanski
,
R. M.
(
2008
). “
Clear speech
,” in
The Handbook of Speech Perception
, edited by
D. B.
Pisoni
and
R. E.
Remez
(
Wiley
,
New York
), pp.
207
235
.
52.
Uchanski
,
R. M.
,
Choi
,
S. S.
,
Braida
,
L. D.
,
Reed
,
C. M.
, and
Durlach
,
N. I.
(
1996
). “
Speaking clearly for the hard of hearing IV: Further studies of the role of speaking rate
,”
J. Speech. Lang. Hear. Res.
39
(
3
),
494
509
.
53.
Valentini-Botinhao
,
C.
,
Maia
,
R.
,
Yamagishi
,
J.
,
King
,
S.
, and
Zen
,
H.
(
2012
). “
Cepstral analysis based on the glimpse proportion measure for improving the intelligibility of HMM-based synthetic speech in noise
,” in
Proceedings of ICASSP
, pp.
3997
4000
.
54.
Van Summers
,
W.
,
Pisoni
,
D. B.
,
Bernacki
,
R. H.
,
Pedlow
,
R. I.
, and
Stokes
,
M. A.
(
1988
). “
Effects of noise on speech production: Acoustic and perceptual analyses
,”
J. Acoust. Soc. Am.
84
,
917
928
.
