The temporal distribution of acoustic cues in whispered speech was analyzed using the gating paradigm. Fifteen Portuguese participants listened to real disyllabic words produced by four Portuguese speakers. Lexical choices, confidence scores, isolation points (IPs), and recognition points (RPs) were analyzed. Mixed effects models predicted that the first syllable and 70% of the total duration of the second syllable were needed for lexical choices to be above chance level. Fricatives' place, not voicing, had a significant effect on the percentage of correctly identified words. IP and RP values of words with postalveolar voiced and voiceless fricatives were significantly different.
Our ability to perceive speech is shaped by nonlinear characteristics of the auditory system,1 by the phonetic knowledge2 we gain as speakers (e.g., the inventory of speech sounds in our first language shapes speech perception), and by the linguistic structure3 of our first language (e.g., speech perception is influenced by the lexical status of the sound patterns we hear).
It has long been known that the acoustic cues to voicing and place,4,5 distributed along the duration of consonants, are used by listeners to identify words6,7 and that the relationship between acoustic perceptual cues conveyed by the signal, occurring within a specific duration of time, plays a central role in linguistic contrast.8
The gating paradigm9 has been previously used to study the time course of spoken word recognition.7 The first gating experiments9 showed that listeners need over 50% of the duration of the original word to identify the target word above chance level. This paradigm allows for precise control of the acoustic-phonetic information of the stimuli presented to the subjects and can therefore be used to evaluate the amount of acoustic-phonetic information required to identify the stimuli.10
In gating experiments, participants are presented with a spoken language stimulus (e.g., phones, syllables, or words) in segments of increasing duration and are then asked to “guess” the word presented and give a confidence rating.9 Three sets of data are usually collected in this type of study: lexical choice (subjects' responses at each gate), confidence rating (the confidence rating of each lexical choice),7 and isolation point (IP) (the duration from the stimulus onset to the point at which correct identification is achieved and maintained without any change in decision after listening to the remainder of the stimulus).11
The gating paradigm is not a true on-line paradigm,12 but Tyler and Wessels13 showed that it is sensitive to the real-time processes involved in spoken word recognition. Furthermore, Bruno et al.14 showed that gating tasks are suitable for measuring phonological processing because they are independent of a phonemic level of representation, as is the case in other tasks (e.g., categorization tasks).
Warren and Marslen-Wilson7 studied the cues listeners use to make lexical choices based on the voicing and place characteristics of word-final consonants. They suggested that lexical access and selection are primarily controlled by the bottom-up input, concluding that the acoustic-phonetic input is projected onto the lexical level during the process of spoken word recognition through continuous uptake. This means that the speech signal is continuously modulated as the utterance is produced, and as the duration of the syllable increases, the listener's lexical choices reflect these changes, shifting, for example, from voiceless to voiced once a durational criterion is reached.
In everyday life, people rapidly switch from voiced to whispered speech to reduce intelligibility.15 Whispered speech is used primarily to selectively exclude or include potential listeners in a conversation. It is a natural way of reducing speech intelligibility due to the absence of the fundamental frequency and the harmonic structure of voiced speech. A pulmonic egressive airstream mechanism is the source of excitation in whispered speech with the shape of the pharynx adjusted so that vocal folds do not vibrate.16,17 Air passes through a constricted but open larynx, generating turbulent aperiodic airflow, which forms the source for a “rich hushing sound.”18
When listeners attempt to perceive whispered speech,19 they are faced with the difficult challenge of resolving lexical ambiguities20 and missing cues,19,21 but speech perception may still rely on predictability and probabilistic cues, intuitively employed to form expectations and anticipate the next input using a signal-plus-knowledge mechanism.22 Listeners build percepts based on the information that is available in the speech signal plus the knowledge each “listener invokes to modulate the stimulus.”22
A recent production study23 has shown that there are still some viable cues for voicing and place in Portuguese whispered fricatives: Same-place voiceless fricatives were longer than voiced fricatives both in voiced and whispered speech; place of articulation had a significant effect on source strength, in voiced and whispered fricatives; and the same first front cavity resonance frequency shifts with the place of articulation could be observed in whispered and voiced fricatives.
In this study, results from perception tests based on gated whispered words allowed us to define when the listener starts to understand a whole whispered word, more specifically, how much acoustic information is required to recognize a word6 and how alveolar (/s, z/) and postalveolar (/ʃ, ʒ/) places of articulation24 and voiced (/z, ʒ/) and voiceless (/s, ʃ/) consonants5 are perceived in whispered speech.
This led us to formulate the following research questions:
How early in a syllable is acoustic information relevant to the perception of whispered speech available?
What is the role of fricative consonants' place and voicing cues in the signal-plus-knowledge mechanism?
Fifteen female participants [18–23 years of age; mean (M) = 20 years; standard deviation of the mean (SD) = 2 years] from the same dialectal region in Portugal were recruited using convenience sampling in the districts of Aveiro and Coimbra.
2.1 Stimuli construction
Four real disyllabic words [vowel-consonant-vowel (VCV)] produced by four Portuguese speakers (two women) from the same dialectal region as the listeners were selected from a previously recorded [using a Sennheiser (Wedemark, Germany) Ear Set 1 condenser microphone; sampled at 48 000 Hz with 16 bits per sample] whispered speech database:25 [ˈasɐ] ⟨assa⟩/⟨she/he/it bakes/roasts⟩; [ˈazɐ] ⟨asa⟩/⟨wing⟩; [ˈaʃɐ] ⟨acha⟩/⟨she/he thinks/finds that⟩; [ˈaʒɐ] ⟨haja⟩/⟨there is/will be⟩. Ten different gates for each word produced by each one of the four speakers were then generated, resulting in 10 gates × 4 words × 4 speakers = 160 different whispered speech stimuli.
Purpose-built matlab scripts and the mPraat toolbox for matlab version 1.1.326 were used to produce gated speech.27 The first gate consisted of the first syllable and 10% of the total duration of the second syllable, and the second gate included the first syllable and 20% of the total duration of the second syllable, with the following gates using the same increment (10%) up to the tenth gate, where the whole word was reproduced.
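The gating scheme above can be sketched as follows. This is a minimal illustration in Python rather than the matlab/mPraat tooling actually used, and it assumes the boundary between the two syllables is already known as a sample index; the function names are ours, not part of the original scripts.

```python
def gate_endpoints(syll1_end, word_end, n_gates=10):
    """End sample of each gate: the first syllable plus
    10%, 20%, ..., 100% of the second syllable's duration."""
    syll2_len = word_end - syll1_end
    return [round(syll1_end + k / n_gates * syll2_len)
            for k in range(1, n_gates + 1)]

def make_gates(samples, syll1_end):
    """Truncate a word's samples at each gate endpoint
    (the tenth gate reproduces the whole word)."""
    return [samples[:end] for end in gate_endpoints(syll1_end, len(samples))]
```

Applied to all 4 words × 4 speakers, this yields the 160 stimuli described above.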
2.2 Listening conditions
Participants were seated in a quiet room [background noise level of 15.1 dB level A-weighted equivalent (LAeq)], in front of a 24-in. computer screen, wearing Sennheiser HD 600 headphones, connected to a Focusrite Scarlett 6i6 audio interface and a Windows 10 laptop computer running Alvin 3.12 experiment-control software.28 They were instructed to orthographically transcribe (on screen, using a keyboard and mouse) the words they thought were reproduced over the headphones and to provide a confidence score (1–5) on the same window box.
Participants heard, successively, ten gated speech stimuli from the same word, starting with the first syllable plus 10% of the total duration of the second syllable and ending with the whole word. The 16 word tokens produced by the four speakers were presented in the same randomized order to the 15 listeners, generated with a true random generator based on atmospheric noise;29 only the sequence of ten gated stimuli from the same word was presented sequentially (from the shortest stimulus to the full word). This was a self-paced experiment that lasted, on average, 25 min.
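The presentation order can be sketched as follows; this is a hypothetical Python illustration (the experiment itself ran in Alvin with a true random generator based on atmospheric noise, for which a seeded pseudorandom generator stands in here).

```python
import random

def build_trial_order(n_speakers=4,
                      words=("assa", "asa", "acha", "haja"),
                      n_gates=10, seed=None):
    """Randomize the order of the 16 word tokens (4 words x 4 speakers);
    within each token, present its 10 gates from shortest to full word."""
    rng = random.Random(seed)
    tokens = [(s, w) for s in range(1, n_speakers + 1) for w in words]
    rng.shuffle(tokens)  # stand-in for the atmospheric-noise randomizer
    return [(s, w, g) for (s, w) in tokens for g in range(1, n_gates + 1)]
```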
2.3 Data extraction and measurement procedures
Listeners' responses were exported to a raw Alvin results text file that was then imported into Excel version 2108 for data cleanup and coding.37 The 2400 responses (10 gates × 4 words × 4 speakers × 15 listeners) were individually coded and sorted so that the following data were available for analysis: speaker (1–4); word (⟨assa⟩; ⟨asa⟩; ⟨acha⟩; ⟨haja⟩); listener (1–15); gate (1–10); place (0–4); voicing (0–4); lexical choice (0 = incorrect; 1 = correct). The responses regarding place and voicing were coded as in Ref. 7: place = 1 represents an alveolar response to an alveolar stimulus; place = 2 represents a postalveolar response to a postalveolar stimulus; place = 3 represents an alveolar response to a postalveolar stimulus; place = 4 represents a postalveolar response to an alveolar stimulus; voicing = 1 represents a voiced response to a voiced stimulus; voicing = 2 represents a voiceless response to a voiceless stimulus; voicing = 3 represents a voiced response to a voiceless stimulus; voicing = 4 represents a voiceless response to a voiced stimulus; 0 represents other responses.
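The coding scheme above can be expressed directly. This is a sketch assuming the stimulus and response fricatives have been reduced to their consonant symbols; the helper names and tables are ours, not part of the original Excel pipeline.

```python
# Place and voicing of the four target fricatives.
PLACE = {"s": "alveolar", "z": "alveolar",
         "ʃ": "postalveolar", "ʒ": "postalveolar"}
VOICED = {"s": False, "z": True, "ʃ": False, "ʒ": True}

def code_place(stim, resp):
    """Place codes 1-4 as in Sec. 2.3 (after Ref. 7); 0 = other response."""
    if resp not in PLACE:
        return 0
    s, r = PLACE[stim], PLACE[resp]
    if s == "alveolar":
        return 1 if r == "alveolar" else 4  # 4: postalveolar resp., alveolar stim.
    return 2 if r == "postalveolar" else 3  # 3: alveolar resp., postalveolar stim.

def code_voicing(stim, resp):
    """Voicing codes 1-4 as in Sec. 2.3; 0 = other response."""
    if resp not in VOICED:
        return 0
    s, r = VOICED[stim], VOICED[resp]
    if s and r:
        return 1              # voiced response to voiced stimulus
    if not s and not r:
        return 2              # voiceless response to voiceless stimulus
    return 3 if (not s and r) else 4
```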
The shortest duration, from onset, required to correctly identify whispered speech targets, known as IP, was measured.11,30 The IP was the first stimulus of a sequence of at least three consecutive (lexically) correct responses. The recognition point (RP), that defines a stricter criterion, was also measured based on Ref. 7: three consecutive correct responses with at least 80% confidence (a 4 in our 1–5 confidence scale). Both the IP and RP were calculated using Excel version 2108.
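The IP and RP criteria lend themselves to a compact sketch; a hypothetical Python rendering of the run-of-three criterion (the original calculations were done in Excel).

```python
def isolation_point(correct, run=3):
    """IP (operational criterion of Sec. 2.3): the first gate, 1-based,
    that starts a run of at least `run` consecutive correct responses;
    None if the target is never isolated."""
    for i in range(len(correct) - run + 1):
        if all(correct[i:i + run]):
            return i + 1
    return None

def recognition_point(correct, confidence, run=3, min_conf=4):
    """RP: like the IP, but every response in the run must also carry a
    confidence rating of at least 4 on the 1-5 scale (>= 80%)."""
    ok = [c and conf >= min_conf for c, conf in zip(correct, confidence)]
    return isolation_point(ok, run)
```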
2.4 Statistical analysis
All graphs and statistical analyses in this paper were generated using R version 4.2.2 running in RStudio 2022.12.0 + 353 Release. Mixed logistic regression models31 were developed using the glmer function from the lme4 version 1.1-31 package, with correct/incorrect lexical choice, place, and voicing answers as outcome variables, considering gate number as a fixed effect and listener and word as random effects. The resulting models' regression lines and shading spanning the 95% confidence intervals were drawn using the sjPlot 2.8.12 package.
Mixed effects regression models were also developed (using the lmer function also from the lme4 version 1.1–31 package) for the IP and RP values of each place of articulation (alveolar and postalveolar) with a voicing (voiceless/voiced: ⟨assa⟩/⟨asa⟩ or ⟨acha⟩/⟨haja⟩) fixed effect and a listener (15 listeners) varying intercept, using maximum likelihood for estimation.31
This section presents the analysis and discussion of the lexical choice, IPs and RPs, discriminating place and voicing, and type of deviation results.
3.1 Lexical choice
Figure 1 presents the overall results of the lexical choice for the stimuli based on the four different words. One can observe that more than 60% of the total duration of the second syllable had to be presented so that more than 50% (above chance level) of the listeners' responses were correct.
As the experiment progressed, the listeners correctly identified the words earlier on; at the start of the experiment, listeners needed over 70% of the total syllable duration to make the right lexical choice, but this number fell below 60% halfway through the stimuli (especially for words ⟨asa⟩ and ⟨haja⟩ with voiced fricatives). Also, nine (60%) of the listeners heard a fronted vowel [ae] in the second syllable of the words (all with [ɐ]), when 10%–60% of the syllable was played.
A mixed logistic regression model with the lme4 syntax lexical choice ∼ gate + (1+gate|listener) + (1+gate|word) revealed a much higher SD for the by-listener (SD = 1.942 51) than the by-word (SD = 0.590 78) varying intercepts. The estimated correlation between varying intercepts and varying slopes showed that higher intercepts had lower gate slopes both for listener (r = −0.96) and word (r = −0.88) random effects.31 The resulting model, represented in Fig. 2, predicted that listeners needed 70% (gate 7) of the total duration of the second syllable for their lexical choice to be above chance level. It should be noted that we have only focused on the truncation of the second syllable because it has been recently observed that the “relative duration of same-place voiceless fricatives was higher than voiced fricatives both in voiced and whispered speech,”23 which we wanted to test perceptually. However, it has been shown that the perception of voicing can gradually switch as a function of pre-consonantal vowel duration in voiced speech for various languages,32,33 including Portuguese.34
A mixed logistic regression model lexical choice ∼ gate + place + (1+gate|listener) + (1+gate|word) was developed to explore the role of place in the identification of words. Likelihood ratio tests31 of the model with the place effect against the model without it revealed a significant difference between models [χ2(1) = 8.494, p = 0.0036], i.e., there was a significant difference between the percentage of correct responses to words with alveolar and postalveolar fricatives. Words with more posterior fricatives were predicted to have a lower percentage of correct responses than those with more anterior fricatives. This dependence of word recognition on the fricatives' place of articulation is a puzzling result that could be related to “feature-decoding asymmetries”35 used to facilitate speech perception in challenging conditions, such as whispered speech.
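For one degree of freedom, the likelihood-ratio p-values reported here can be reproduced in closed form, since the χ2(1) survival function equals erfc(√(χ2/2)). A small sketch, with log-likelihood values that are hypothetical and chosen only to match the reported statistics:

```python
import math

def lrt_chi2_1(loglik_reduced, loglik_full):
    """Likelihood ratio test for one extra fixed effect (1 df):
    chi2 = 2 * (ll_full - ll_reduced); for 1 df, p = erfc(sqrt(chi2/2))."""
    chi2 = 2.0 * (loglik_full - loglik_reduced)
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p
```

With hypothetical log-likelihoods differing by 8.494/2, this recovers χ2(1) = 8.494 and p ≈ 0.0036, matching the place test above.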
An additional mixed logistic regression model lexical choice ∼ gate + voicing + (1+gate|listener) + (1+gate|word) was used to test the effect of voicing on lexical choice, revealing no significant difference [χ2(1) = 3.154, p = 0.0758] between the percentage of correct words that had voiceless fricatives and those that had voiced fricatives. Even when separate models were developed for responses to word stimuli with alveolar [χ2(1) = 0.150, p = 0.6989] and postalveolar [χ2(1) = 0.7741, p = 0.379] fricatives, the effect of voicing was not significant.
3.2 IPs and RPs
The M and SD of the IP and RP values for each word are presented in Table 1.
[Table 1 columns: Word, M IP, SD IP, M RP, SD RP]
Mixed effects regression models IP/RP ∼ voicing + (1 | listener) predicted a negative effect of voicing in IP and RP values for both places of articulation (alveolar and postalveolar), i.e., that voiced fricatives had lower IP and RP values than same-place voiceless fricatives. Likelihood ratio tests of the models with the voicing effect against the model without the voicing effect revealed a significant difference between models31 only for words with postalveolar fricatives, i.e., there was only a significant difference between IP and RP values of words with postalveolar voiced fricatives and words with postalveolar voiceless fricatives: IP⟨assa⟩/⟨asa⟩ − χ2(1) = 3.59, p = 0.0582; IP⟨acha⟩/⟨haja⟩ − χ2(1) = 9.61, p = 0.00194; RP⟨assa⟩/⟨asa⟩ − χ2(1) = 3.49, p = 0.0616; RP⟨acha⟩/⟨haja⟩ − χ2(1) = 5.31, p = 0.0212.
Previous studies19,20,24,36 on the perception of voicing in whispered fricatives have shown a voiceless bias, but the current results could be interpreted as evidence against it.
3.3 Discriminating place
The total correct and incorrect percentages of place answers were also analyzed as shown in Fig. 3, revealing that when just 20% of the total duration of the second syllable was available to the listeners, they responded above chance level. The same pattern was observed when the responses were broken down by correct alveolar and postalveolar and incorrect alveolar and postalveolar responses.
A mixed logistic regression model with the lme4 syntax place ∼ gate + (1+gate|listener) + (1+gate|word) showed that higher intercepts were associated with lower gate slopes for both listener (r = −0.67) and word (r = −1.00) random effects and predicted that listeners needed less than 20% of the syllable duration to identify the place of articulation, as shown in Fig. 3 (right).
The overall place discrimination curve, obtained by subtracting incorrect from correct place responses at each gate,7 corroborated the modeling results, as did the alveolar discrimination curve (alveolar responses to alveolar stimuli minus alveolar responses to postalveolar stimuli) and the postalveolar discrimination curve (postalveolar responses to postalveolar stimuli minus postalveolar responses to alveolar stimuli).
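The three subtraction-based curves can be computed directly from the Sec. 2.3 place codes; a minimal sketch (the function name and input layout are ours, not part of the original analysis).

```python
def place_discrimination_curves(codes_by_gate):
    """Build the discrimination curves from per-gate lists of place codes:
      alveolar:     #(code 1) - #(code 3)
      postalveolar: #(code 2) - #(code 4)
      overall:      #(codes 1, 2) - #(codes 3, 4)
    Code 0 (other responses) is excluded, as in the analyses."""
    curves = {"alveolar": [], "postalveolar": [], "overall": []}
    for codes in codes_by_gate:
        n = {k: codes.count(k) for k in (1, 2, 3, 4)}
        curves["alveolar"].append(n[1] - n[3])
        curves["postalveolar"].append(n[2] - n[4])
        curves["overall"].append(n[1] + n[2] - n[3] - n[4])
    return curves
```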
3.4 Discriminating voicing
The total correct and incorrect percent voicing responses are shown in Fig. 4. The correct and incorrect response curves are symmetrical for place (see Fig. 3, left) but not for voicing (see Fig. 4, left); this is not because the “other responses” were treated differently but, we believe, because our listeners perceived the voicing cues quite differently from the place cues. What is classified as “other responses” is an amalgamation of responses that could correspond to various phonemes; because we collected orthographic transcriptions from our listeners, we cannot determine exactly which. The coding described in Sec. 2.3 determines the correct and incorrect answers, and the “other responses” are excluded from the analyses. We do not believe the similar response patterns for words with voiceless and voiced fricatives are biased, especially considering the voicing profiles/voicing continua previously shown for Portuguese fricatives.23
A mixed logistic regression voicing ∼ gate + (1+gate|listener) + (1+gate|word) was also used to model the listeners' discrimination of voicing, predicting that after gate 3, answers were above chance level, as shown in Fig. 4 (right). Notice the much wider shading spanning the 95% confidence intervals for the voicing model (shown in Fig. 4, right) than the place model (shown in Fig. 3, right).
3.5 Type of deviation from the word played as stimulus
When the percentages of correct and incorrect responses (lexical choice) were analyzed as a function of the type of deviation (place or voicing) from the word played as stimulus,7 the results again revealed that fricative place and voicing were identified above chance level very early on (before gate 3) and that response patterns were similar for words with voiceless and voiced fricatives.
Speech perception is a highly adaptable process, especially when the acoustic signal is deprived of some of its cues, as in whispered speech; this adaptability was tangible in the learning effect evident during our experiments.
Our results have shown that listeners needed 70% of the total duration of the second syllable, in a disyllabic word, for their lexical choice to be above chance level and that there was a significant difference between the percentage of correct responses of words with alveolar and postalveolar fricatives. There was not a significant difference between the percentage of correct words that had voiceless fricatives and those that had voiced fricatives. There was only a significant difference between IP and RP values of words with postalveolar voiced fricatives and words with postalveolar voiceless fricatives. Listeners needed less than 20% of the second syllable duration to identify the place of articulation and more than 30% to discriminate voicing.
Speech perception evidence presented in this paper builds on recent production studies that have identified cues for voicing and place in Portuguese whispered fricatives, supporting Lindblom's22 theory of the interactions between perception and production. We have contributed new evidence supporting the view that words can be perceived in whispered speech using the available signal-plus-knowledge mechanism. Some of the information usually available in voiced speech is absent in whispered speech signals, but listeners were able to perceive place and voicing features very early on a CV syllable, and this had a relevant role in their lexical choices even before the whole word was available.
This work was financially supported by Project No. PTDC/EMD-EMD/29308/2017–POCI-01-0145-FEDER-029308, funded by FEDER funds through COMPETE2020-POCI and by national funds (PIDDAC) through FCT/MCTES. Support was also received in the context of the projects UIDB/00127/2020 and UIDB/04106/2020. The authors have no conflicts to disclose. Ethical permission (Parecer No. P523-10/2018, dated November 21, 2018) was obtained from an independent ethics committee (Comissão de Ética da Unidade Investigação em Ciências da Saúde–Enfermagem da Escola Superior de Enfermagem de Coimbra, Coimbra, Portugal), and informed consent was obtained from all participants prior to data collection. The 160 different whispered speech stimuli are available from the corresponding author upon reasonable request as .wav (Windows PCM without compression) audio files, sampled at 48 000 Hz with 16 bits per sample.