English vowel recognition in multi-talker babbles mixed with different numbers of talkers

Abstract: The present study examined English vowel recognition in multi-talker babbles (MTBs) in 20 normal-hearing, native-English-speaking adult listeners. Twelve vowels, embedded in the h-V-d structure, were presented in MTBs consisting of 1, 2, 4, 6, 8, 10, and 12 talkers (number of talkers [N]) and a speech-shaped noise at signal-to-noise ratios of −12, −6, and 0 dB. Results showed that vowel recognition performance was a non-monotonic function of N when signal-to-noise ratios were less favorable. The masking effects of MTBs on vowel recognition were most similar to those on consonant recognition and less so to those on word and sentence recognition reported in previous studies. © 2024 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


Introduction
Perceiving speech in the presence of noise can be challenging due to masking (Miller, 1947). Masking effects are typically categorized into two types: energetic masking (EM) and informational masking (IM) (Brungart, 2001). EM refers to masking in which the target sound cannot be properly represented at the auditory nerve due to energy or modulation interference from the masker (Kidd and Colburn, 2017; Stone et al., 2012). In contrast, IM refers to the difficulty in distinguishing the target from the masker due to perceptual similarities when both are audible (Shinn-Cunningham, 2008). Depending on the masker, the total masking effect can be attributed to either type or both. A steady-state noise, for example, exerts mostly EM but little IM (Arbogast et al., 2002). On the other hand, speech babbles sharing a similar fundamental frequency (e.g., Brungart et al., 2001) or linguistic information (e.g., Freyman et al., 2001; Van Engen and Bradlow, 2007) with the target can elicit strong IM.
Masking by speech babbles, also termed the "cocktail party problem" (Cherry, 1953), is particularly intriguing because the amounts of EM and IM vary with the number of talkers (N) comprising the babble. Specifically, EM grows with N as the spectral-temporal fluctuation flattens, reducing the opportunities for "dip listening" (Cooke, 2006; Nittrouer and Lowenstein, 2023). In contrast, IM peaks at relatively small N (i.e., N = 2 or 3) (Brungart et al., 2001), where acoustic and linguistic information in the babble remains extractable (Hoen et al., 2007). Speech recognition performance in multi-talker babbles (MTBs) is therefore a non-monotonic function of N due to the combined effects of EM and IM (Miller, 1947; Rosen et al., 2013; Simpson and Cooke, 2005; Wang and Xu, 2020).
In the study by Simpson and Cooke (2005), English consonant recognition was measured in MTBs with a wide range of N (from 1 to 512). Performance decreased rapidly as N increased from 1 to 8 and reached its minimum at N = 8; further increasing N did not worsen performance relative to a speech-shaped noise. Similar patterns were also seen in word (Miller, 1947) and sentence (Rosen et al., 2013) recognition, although the N at which recognition performance reached its lowest point [i.e., the "breakpoint," defined as the N after which the slope of the performance curve does not change significantly (Rosen et al., 2013)] differed. Hoen et al. (2007) suggested that targets that activate successful lexical access, such as meaningful words and sentences, suffer increased IM from MTBs due to lexical competition, in contrast to targets that do not initiate lexical access, such as consonants and vowels, which experience solely acoustic-phonetic interference. Therefore, one can expect meaningful speech materials to reach their breakpoints at a smaller N than materials that are not meaningful, because lexical interference is strongest at small N, where intelligible information can still be extracted from MTBs.
Interestingly, in a recent study (Wang and Xu, 2020), we examined the recognition of Mandarin tones, which allow lexical access in Mandarin Chinese, in MTBs. We observed a similar non-monotonic trend of recognition performance as a function of N, as with English speech materials, but a greater breakpoint (i.e., N = 8) than other meaningful materials in English. Despite the cross-language difference, it is worth noting that the definition of a breakpoint in Wang and Xu (2020) differed from those in Rosen et al. (2013), Simpson and Cooke (2005), and Miller (1947), the latter two of which did not specifically define a "breakpoint." While Rosen et al. (2013) determined breakpoints using segmented regression, Wang and Xu (2020) used exponential regression, which does not emphasize any possible improvement in recognition performance after the breakpoint. Therefore, a unified definition of a breakpoint must be established before comparing masking effects across speech materials.
We identified two gaps in this line of research. First, the effects of N on speech recognition have been examined for English consonants, words, and sentences, but not vowels, which comprise the other category of speech sounds. Investigating the effects of N on vowel recognition would extend our understanding of the role of vowels, such as carrying suprasegmental (e.g., intonation and rhythm) and intelligibility information in multiple languages, including English and Mandarin Chinese, in challenging listening environments (Kewley-Port et al., 2007; Fogerty and Kewley-Port, 2009; Chen et al., 2013; Chen et al., 2015). Second, the absence of a unified definition of breakpoints has rendered cross-material and cross-language comparisons of masking effects inappropriate. Comparing masking effects among speech materials would unravel the interplay of informational and energetic masking at different levels of speech. Ultimately, this would facilitate the development of improved speech processing strategies against speech-like noise, which remains one of the most frequently reported hard-to-hear scenarios for hearing prosthesis users (Lesica, 2018; Zeng, 2017).
Given that the exact shapes of the non-monotonic functions of N at different signal-to-noise ratios (SNRs), and the breakpoint of each, are not known for English vowel recognition, the present study first examined vowel recognition in MTBs with varying N and SNRs. Next, the breakpoints on the performance functions for previously and currently used speech materials were determined using a unified method, and the masking effects were compared. It was hypothesized that vowel recognition performance changes non-monotonically with N, like other English speech materials, and that the breakpoint occurs at a greater N than those for word and sentence recognition.

Participants
Twenty native-English-speaking adults (16 females and 4 males) with normal hearing, confirmed by pure-tone thresholds of 20 dB hearing level (HL) or better at octave frequencies between 250 and 8000 Hz, participated in this study. The mean age ± standard deviation (SD) of the participants was 23 ± 1.83 years. No participant reported any history of hearing, speech, or cognitive disorders.

Materials and signal processing
Participants were tested in a vowel identification-in-noise task following a forced-choice paradigm in which they selected a word based on the diotically presented stimuli (see Sec. 2.3 below). The stimuli used in the vowel identification test were the original recordings by Hillenbrand et al. (1995). There were 12 vowels (i.e., /æ, ɔ, e, ɛ, ɝ, i, ɪ, ɑ, o, ʊ, ʌ, u/), each presented in an h-V-d context (i.e., had, hawed, hayed, head, heard, heed, hid, hod, hoed, hood, hud, who'd). The recordings were from four talkers (two female and two male); thus, a total of 48 tokens (12 vowels × 4 talkers) were used as the target stimuli.
A corresponding speech-spectrum-shaped noise (SSN) was generated by shaping random noise with the long-term average speech spectrum of all 48 vowel tokens (concatenated). A 10-min-long SSN was pre-generated, from which a segment of the appropriate length was randomly selected as the SSN masker on each trial.
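A minimal sketch of this kind of spectral shaping is shown below, assuming the 48 concatenated vowel tokens are available as a NumPy array `speech` at a 22 050 Hz sampling rate. The function name and the direct FFT-magnitude shaping are illustrative assumptions, not the study's exact implementation.

```python
# Sketch of speech-shaped noise generation: impose the long-term
# average magnitude spectrum of `speech` on random-phase noise.
import numpy as np

def make_ssn(speech, duration_s, fs=22050, seed=None):
    """Shape white noise with the long-term average spectrum of `speech`."""
    rng = np.random.default_rng(seed)
    n = int(duration_s * fs)
    # Long-term average magnitude spectrum of the concatenated speech,
    # interpolated onto the FFT grid of the desired noise length.
    mag = np.abs(np.fft.rfft(speech))
    f_speech = np.fft.rfftfreq(len(speech), 1 / fs)
    f_noise = np.fft.rfftfreq(n, 1 / fs)
    target = np.interp(f_noise, f_speech, mag)
    # Replace the noise's magnitude spectrum while keeping its phase.
    spec = np.fft.rfft(rng.standard_normal(n))
    shaped = np.fft.irfft(spec / np.maximum(np.abs(spec), 1e-12) * target, n=n)
    return shaped / np.max(np.abs(shaped))  # peak-normalize
```

In practice, the long-term spectrum is often smoothed before shaping; the raw-FFT version here is only meant to convey the principle.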
To generate MTBs, speech narratives spoken by 6 females and 6 males were collected from 12 audio book compact discs (CDs). These narratives were downloaded from the CDs as WAV files, checked for peak clipping, and resampled to 22 050 Hz. Following the methodology of Rosen et al. (2013), any excessively long silent gaps (>0.15 s) in the audio files resulting from transitions between chapters or intentionally inserted pauses were removed for each talker before concatenation. This was done to ensure that the recordings were as continuous as possible and that the audible portions sounded equally loud. While normal continuous speech typically does not contain silent gaps longer than 0.15 s, it is worth noting that removing such gaps could affect dip listening to some extent. The final products after concatenation were 12 one-hour-long continuous speech recordings, one from each of the 12 talkers. The peak amplitudes of these recordings were equalized. The N in an MTB masker was 1, 2, 4, 6, 8, 10, or 12. For the one-talker babble condition (N = 1), one of the 12 talkers was randomly selected on each occasion so that the talker was equally likely to be male or female. For conditions with N > 1, care was taken to include equal numbers of male and female talkers. Randomly selected segments of the appropriate length from the long recording of each selected talker were then mixed to form the MTB masker on each trial.
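The talker selection and mixing described above can be sketched as follows. Here `recordings` is assumed to be a list of 12 peak-equalized NumPy arrays with the 6 female talkers first and the 6 male talkers last; this layout and the function name are assumptions for illustration only.

```python
# Sketch of drawing an N-talker babble from the 12 long recordings.
import numpy as np

def make_mtb(recordings, n_talkers, n_samples, seed=None):
    """Mix random segments from n_talkers talkers (equal sexes for N > 1)."""
    rng = np.random.default_rng(seed)
    females, males = recordings[:6], recordings[6:]
    if n_talkers == 1:
        # One talker, equally likely to be male or female.
        chosen = [recordings[rng.integers(12)]]
    else:
        half = n_talkers // 2
        chosen = ([females[i] for i in rng.choice(6, half, replace=False)] +
                  [males[i] for i in rng.choice(6, half, replace=False)])
    babble = np.zeros(n_samples)
    for rec in chosen:
        # Random segment of the required length from each chosen talker.
        start = rng.integers(0, len(rec) - n_samples)
        babble += rec[start:start + n_samples]
    return babble
```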
The root-mean-square (RMS) levels of the target tokens were first equalized. Then, 300 ms of silence was added to the beginning and end of each target token. The SSN or MTB masker was mixed with the target vowel token at the desired SNR (i.e., −12, −6, or 0 dB) by adjusting the RMS of the masker relative to the RMS of the target (note: the target RMS was calculated before inserting any silence). A 50-ms cosine ramp was then applied to the onset and offset of the target-masker mixture.
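A sketch of these mixing steps, with the SNR computed against the pre-padding target RMS as stated above, is given below; the function and variable names are assumptions.

```python
# Pad the target with 300 ms of silence, scale the masker to the
# desired SNR, and apply 50-ms raised-cosine onset/offset ramps.
import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def mix_at_snr(target, masker, snr_db, fs=22050):
    target_rms = rms(target)                     # computed before padding
    pad = np.zeros(int(0.3 * fs))                # 300 ms of silence
    padded = np.concatenate([pad, target, pad])
    masker = masker[:len(padded)]
    # Scale masker so 20*log10(target_rms / masker_rms) equals snr_db.
    masker = masker * target_rms / (rms(masker) * 10 ** (snr_db / 20))
    mix = padded + masker
    n_ramp = int(0.05 * fs)                      # 50-ms cosine ramps
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    mix[:n_ramp] *= ramp
    mix[-n_ramp:] *= ramp[::-1]
    return mix
```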

Procedure
The vowel recognition test was administered in a double-walled sound booth. The stimuli were played binaurally through Sennheiser HD280 Professional headphones (Sennheiser, Wedemark, Germany). Participants were seated in front of a computer screen and provided with a computer mouse to respond. The test was programmed in MATLAB (MathWorks, Natick, MA) and followed a 12-alternative forced-choice procedure in which 12 buttons (i.e., had, hawed, hayed, head, heard, heed, hid, hod, hoed, hood, hud, who'd) were displayed on the screen. Participants were asked to select the word they heard and were encouraged to make a guess if not sure.
To familiarize participants with the test, a preview of all 12 h-V-d words, each spoken once by the 4 talkers (2 females and 2 males), was provided. In the meantime, participants were asked to adjust the volume so that the tokens were heard at their most comfortable level. Once the volume was determined, participants were not allowed to adjust it again in any later tests. On average, the chosen volume was around 65 dBA. The preview was then followed by a practice session in which participants performed vowel recognition with one male and one female target talker under the SSN and MTB conditions (N = 1, 4, 12) at SNRs of 0 and −6 dB. The practice session contained a total of 192 trials (12 vowels × 2 talkers × 4 masker types × 2 SNRs). Feedback was provided during the practice session, which lasted approximately 20 min for each participant.
The test consisted of 24 conditions [3 SNRs × (7 MTBs + 1 SSN)] with 48 tokens (12 h-V-d words × 4 talkers) under each condition. Under each SNR condition, the presentation order was randomized first by masker type and then by token. The presentation order of the SNR conditions was counterbalanced across participants. In total, each participant listened to 1152 tokens in the test session, which took approximately 1.5 to 2.0 h to complete.

Data analysis
The raw data from the vowel perception test were binary (i.e., correct/incorrect) and were thus analyzed with a mixed-effects logistic regression model. The fixed effects included SNR, masker type, and their two-way interaction. By-subject and by-item intercepts and slopes were entered as random effects. Confusion matrices were extracted from the responses of the vowel recognition test (e.g., Peterson and Barney, 1952; Hillenbrand et al., 1995; Parikh and Loizou, 2005).
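The confusion-matrix extraction can be sketched as follows. The labels and trial data below are hypothetical; each row gives, for one stimulus, the percentage of each response, so every row sums to 100% (as in Fig. 3).

```python
# Minimal sketch: build a row-normalized confusion matrix from
# trial-level stimulus/response pairs.
import numpy as np

def confusion_matrix(stimuli, responses, labels):
    idx = {v: i for i, v in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for s, r in zip(stimuli, responses):
        counts[idx[s], idx[r]] += 1
    # Convert each row to percent of responses for that stimulus.
    return 100 * counts / counts.sum(axis=1, keepdims=True)

# Hypothetical two-alternative example.
labels = ["had", "head"]
stim = ["had", "had", "head", "head"]
resp = ["had", "head", "head", "head"]
cm = confusion_matrix(stim, resp, labels)
```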

Results
We first examined the effects of SNR and N on vowel recognition. Significant main effects of SNR [χ²(2, N = 20) = 3436.4, P < 0.001] and N [χ²(7, N = 20) = 1046.4, P < 0.001] were found, as was the SNR-by-N interaction [χ²(14, N = 20) = 374.8, P < 0.001]. The left panel of Fig. 1 shows the mean vowel recognition score as a function of SNR, and the right panel shows it as a function of N. At 0 dB SNR, vowel recognition performance did not change with N but hovered around 80% correct. However, performance continued to decrease from N = 1 to N = 6 at an SNR of −6 dB and from N = 1 to N = 8 at an SNR of −12 dB (all P < 0.05 in Bonferroni-adjusted post hoc pairwise comparisons), suggesting that increasing N worsened vowel recognition performance only when the SNR was less favorable (i.e., −6 and −12 dB). At −6 dB, vowel recognition scores were significantly better in SSN (P < 0.01) than at N = 6, 8, 10, and 12, which did not differ from each other (all P > 0.05 in Bonferroni-adjusted post hoc pairwise comparisons). At −12 dB, no significant differences were observed among vowel recognition scores when N was ≥8 (all P > 0.05 in Bonferroni-adjusted post hoc pairwise comparisons).
Next, the breakpoint on the vowel recognition performance curve was determined (Fig. 1, right panel). The breakpoint was defined as the N at which the lower 10% of the recognition performance range was reached on a fitted exponential curve of the form y = a·e^(b(n+c)). The parameters were estimated using the ordinary least-squares method (as in Wang and Xu, 2020). The fitting was done separately for the −6 and −12 dB SNR conditions, but not for the 0 dB condition. The breakpoints were found at around N = 5 and N = 6 for SNRs of −6 and −12 dB, respectively. To facilitate the comparison of breakpoints across speech materials, the same exponential fitting was applied to consonant (Simpson and Cooke, 2005), word (Miller, 1947), and sentence (Rosen et al., 2013) recognition performance at a common SNR (−6 dB). The breakpoints were identified at around N = 6 for consonant recognition, N = 3 for word recognition, and N = 1 for sentence recognition (Fig. 2; also see supplementary material Fig. 12 for the fitted performance).
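The fitting and breakpoint criterion can be sketched as follows. Because a·e^(b(n+c)) = (a·e^(bc))·e^(bn), the offset c can be folded into the scale factor, allowing a simple log-linear least-squares fit; this simplification, and the score values below, are illustrative assumptions rather than the study's exact procedure or data.

```python
# Sketch: fit y = a'*e^(b*n) by ordinary least squares in log space,
# then find the N at which the fitted curve first reaches the lower
# 10% of its range. Scores are made up for illustration.
import numpy as np

n_talkers = np.array([1, 2, 4, 6, 8, 10, 12], dtype=float)
scores = np.array([62, 50, 38, 32, 30, 30, 29], dtype=float)  # illustrative

# Log-linear ordinary least squares: ln(y) = ln(a') + b*n.
b, log_a = np.polyfit(n_talkers, np.log(scores), 1)

# Evaluate the fitted curve on a fine grid; the breakpoint is the
# smallest n where the curve is within 10% of its range of the minimum.
grid = np.linspace(n_talkers.min(), n_talkers.max(), 1000)
y = np.exp(log_a + b * grid)
criterion = y.min() + 0.1 * (y.max() - y.min())
breakpoint_n = grid[np.argmax(y <= criterion)]
```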
To examine the effects of SNR and N on vowel confusion, we extracted a confusion matrix from the vowel recognition data pooled across all SNRs, N, and participants (Fig. 3). The overall confusion matrix was representative because we observed similar patterns in the confusion matrices at different SNRs and N; the confusion matrices at each SNR or N can be found in the supplementary material. Vowel confusions accounting for at least 10.0% of the errors are noted here. Mutual confusion was found between /ɑ/ and /ɔ/ (/ɑ/ identified as /ɔ/, 27.5%; /ɔ/ identified as /ɑ/, 27.9%) and between /ɛ/ and /æ/ (/æ/ identified as /ɛ/, 10.3%; /ɛ/ identified as /æ/, 13.4%), whereas a substantial portion of /ɑ/ was misperceived as /æ/ (18.1%) and a substantial proportion of /u/ was identified as /ʊ/ (10.0%). The vowel /i/ was least confused with other vowels.

Discussion
The present study examined the recognition of English vowels in MTBs consisting of various N at different SNRs. As hypothesized, vowel recognition performance was a non-monotonic function of N at challenging SNRs, similar to other speech materials. The breakpoint of the performance curve for vowel recognition occurred at a greater N than those for word and sentence recognition when performance was fitted using exponential regression (Fig. 2). Vowel confusions in MTBs occurred mainly between /ɑ/ and /ɔ/, /ɑ/ and /æ/, and /ɛ/ and /æ/ (Fig. 3).
We observed that at challenging SNRs, vowel recognition performance showed a non-monotonic pattern as a function of N, like other speech materials (Figs. 1 and 2), suggesting that vowel materials were subject to the combined effects of EM and IM in a similar way to other speech materials. Furthermore, the overall level and breakpoint of vowel recognition performance were more similar to those of consonant recognition than to those of word and sentence recognition at the same SNR (Fig. 2). This is consistent with the IM account of Hoen et al. (2007), in which materials allowing lexical access (e.g., meaningful words and sentences) suffer stronger IM effects due to lexical interference in addition to the acoustic-phonetic interference received by materials activating little lexical access (e.g., consonants and vowels). Another observation supporting the account of Hoen et al. (2007) is that the breakpoints for word and sentence recognition (N = 3 and 1, respectively) were smaller than those for consonant and vowel recognition (N = 6 and 5, respectively). With regard to EM, listening for a consonant or vowel through the dynamic envelope of a babble (i.e., dip listening) might require a smaller temporal window than for a word or sentence; thus, the EM effect was likely more similar between consonants and vowels than for materials with longer durations.

Fig. 2. Speech recognition performance for Mandarin tones (Wang and Xu, 2020), English consonants (Simpson and Cooke, 2005), vowels (the present study, in red), words (Miller, 1947), and sentences (Rosen et al., 2013) as a function of the number of talkers in the multi-talker babbles.
Interestingly, the breakpoint for the recognition of Mandarin tones was found to be N = 8 (Wang and Xu, 2020), which is greater than those for English consonants and vowels. Such results might seem to contradict the proposal of Hoen et al. (2007), because Mandarin tones can activate lexical access, making tone perception similar to meaningful word recognition. However, the tone-in-noise task in Wang and Xu (2020) could be performed largely without accessing lexical information. Specifically, listeners were asked to select the tone they heard out of four possible tones, which were always carried by the same syllable, meaning that a simple strategy such as discriminating the F0 contour could fulfill the task goal. The much higher performance of tone recognition in noise (~90% correct at −6 dB SNR) corroborates the idea that tone recognition in noise is based on F0 and its harmonics. Further evidence that tone recognition is distinct from consonant, vowel, or sentence recognition comes from studies of listeners with sensorineural hearing loss: in hearing-impaired listeners with up to 70 dB HL of pure-tone average between 0.5 and 4 kHz, tone recognition remained at ~90% correct, whereas consonant, vowel, and sentence recognition was approximately 60% correct (Chen et al., 2020).
We observed that considerable proportions of vowel confusions occurred between /ɑ/ and /ɔ/ and between /ɛ/ and /æ/ across different N and SNRs, consistent with the confusion patterns observed in vowel recognition in quiet. In many North American dialect areas, /ɑ/ and /ɔ/ are not distinguished owing to the /ɑ/-/ɔ/ merger (Labov, 1998). The confusions between /ɛ/ and /æ/ and of /u/ with /ʊ/ were expected, as their spectral separation in static F1-F2 space is relatively poor [see Fig. 4 in Hillenbrand et al. (1995)]. On the other hand, the back vowel /ɑ/ and the front vowel /æ/, which are less often confused in quiet (e.g., Neel, 2008; Hillenbrand et al., 1995), accounted for 18.1% of the errors in recognizing /ɑ/ in the present study (Fig. 3). Parikh and Loizou (2005) showed that at challenging SNRs, F1 information is more reliably preserved than F2 in SSN and MTB, likely serving as the primary cue for vowel recognition in noise. The relatively poor static F1 separation could be one reason for the /ɑ/-/æ/ confusion observed in this study. It is also worth noting that the effects of N and SNR on vowel confusion were similar: both increasing N and decreasing SNR impaired overall recognition by increasing random errors, which were likely due to increased energetic masking.
The impact of the findings from this study is twofold. First, together with previous data on other speech materials, our results highlight the importance of speech material selection in speech-on-speech masking experiments, in that the most efficient babble masker depends on the target material. A two-talker babble, which is most commonly used to simulate a high-IM condition for sentences, may not be the most efficient babble masker for vowels and consonants. Second, our data could benefit the design and development of speech perception training for listeners with speech perception difficulty at the sentence level, for example, due to delayed speech/language development. For listeners who have difficulty identifying particular syllables in noisy listening environments, a personalized recognition program with various difficulty levels (such as by adjusting N in an MTB) may be useful for enhancing the pronunciation-perception mapping for those listeners.
In summary, the present study examined English vowel recognition in SSN and MTBs with N in the MTB varying from 1 to 12 at three SNRs (i.e., 0, −6, and −12 dB). As hypothesized, vowel recognition performance changed with N in the MTB in a non-monotonic pattern. However, this was true only at the more challenging SNRs (i.e., −6 and −12 dB): performance decreased as a function of N up to 5 and 6, respectively, after which it largely remained at the same level as N further increased. At the more favorable SNR (i.e., 0 dB), vowel recognition performance was relatively high across N. Also consistent with our hypothesis, the breakpoint for vowel recognition in MTB (i.e., N = 5) was greater than those for English word and sentence recognition (i.e., N = 3 and 1, respectively) but similar to that for consonant recognition (i.e., N = 6) at a comparable SNR, suggesting an overall similar masking effect of MTBs on English consonant and vowel materials.

Fig. 1. (Left) Group mean vowel recognition scores as a function of SNR. (Right) Group mean vowel recognition performance as a function of N. The error bars indicate standard errors. The black dashed lines represent fitted curves of the exponential form y = a·e^(b(n+c)).

Fig. 3. Vowel confusion matrix generated from the data of all participants, averaged across SNRs and N. The color and value in each cell represent the percentage of responses for a particular stimulus. Each stimulus row sums to 100%.