Although much is known about how normal-hearing listeners process spoken words under ideal listening conditions, little is known about how a degraded signal, such as speech transmitted via cochlear implants, affects the word recognition process. In this study, gated word recognition performance was measured with the goal of describing the time course of word identification by using a noise-band vocoder simulation. The results of this study demonstrate that spectral degradations can impact the temporal aspects of speech processing. These results also provide insights into the potential advantages of enhancing spectral resolution in the processing of spoken words.
Most models of spoken word recognition propose that effective word recognition is achieved through the ongoing interaction between sensory information derived from the speech signal and the listener's linguistic knowledge or prior experience.1,2 The initial segments of a word activate a cohort of lexical candidates that is updated in real time as the word unfolds. The target word can often be identified well before the entire word is “heard” because, at some point, the characteristics of the target word diverge from those of other activated lexical candidates.1 The word gating paradigm has been widely used to assess the assumptions regarding word recognition being an interactive process.3,4 In a typical word gating task, listeners are presented with increasing amounts of word onset information (i.e., a series of increasingly longer gates), and following each gate, they must predict the identity of the target word.3,5 Using the gating task, several studies have consistently reported that in quiet, most target words can be recognized reliably when a little over half of their total duration is presented,3,6–8 supporting the notion that listeners continually compare what they hear to what they already know and do not need to wait until the presentation of final segments to correctly identify a target word.9 The focus of the present study was on how speech degraded by cochlear implant (CI) processing may affect gated word recognition. In this study, the isolation points (IPs) were determined by identifying the specific point within a given word where participants successfully identified the target word for the first time and maintained their response without any subsequent changes after three successive presentations. This method of calculating IPs aligns with the approach used in previous studies investigating gated word recognition.6–8,10,11
The speech processing deficits observed in CI users can be attributed to two primary factors: the inherent limitations in spectral resolution conveyed through CI devices and potential linguistic–cognitive deficits among CI users. Listeners with CIs experience reduced spectral resolution, which impairs their access to acoustic–phonetic cues in speech, leading to significant impairments in speech recognition.12,13 Poor spectral resolution in CIs has been shown to influence cognitive processing of speech in CI users and normal hearing (NH) participants listening to CI simulations by increasing listening effort14 and reducing top-down feedback.14–17 Postlingually deafened adult CI users exhibit delayed IPs compared to individuals with NH,8,18 likely due to reduced spectral resolution.8 In addition to reduced spectral resolution, variations in neurocognitive and linguistic skills have been linked to individual differences in speech recognition among CI users.19,20 Specific neurocognitive operations, such as working memory capacity,21 nonverbal reasoning skills,22 attention/inhibitory control,23 processing speed,24 and verbal learning and memory,25 have been found to be associated with individual differences in speech recognition among CI users. Moradi et al.7 showed that in listeners with NH, working memory and attention processes play complementary roles in gated word recognition, with working memory facilitating the storage and retrieval of speech information and attention processes allocating relevant cognitive resources. However, in CI users, Collison et al.18 showed that linguistic–cognitive abilities, including expressive vocabulary, verbal comprehension, and nonlinguistic cognitive skills, did not have a predictive effect on performance in gated word recognition, though the specific role of working memory and attention was not assessed.
In a recent study, Patro and Mendel8 investigated gated word recognition performance in three groups: postlingually deafened adult CI users, a NH group listening to vocoder (8-channel noise-band vocoder processed) speech, and another NH group listening to full-spectrum speech. The results showed that the CI group and the NH group listening to vocoder processed speech performed similarly on the gated word recognition task. However, both groups performed significantly worse compared to the NH group that listened to full-spectrum speech. Their results support the assumption that postlingually deafened adult CI users' word processing difficulties may be arising primarily due to the poor spectral resolution transmitted through the CI devices. It is also possible that the degradation of the signal poses challenges for CI users in effectively engaging their cognitive abilities during the gating task.
In CI users, reduced spectral resolution is produced by the limited number of electrodes in the CI array and by the excessive current spread (extensive overlap in the electrical fields) produced by neighboring electrodes.26 Theoretically, having fewer physical electrode contacts means that each electrode codes a broader spectral region, which prevents CI users from accessing finer spectral details within the speech spectrum. This limitation may result in poor speech understanding in adverse listening conditions.26 With excessive current spread, increasing the number of electrodes does not necessarily increase the number of functional channels.27 Specifically, increased channel interaction caused by excessive current spread smears the “acoustic landmarks” (or reduces the spectral maxima) in the frequency spectrum, which also leads to poor speech understanding.28 Although it is believed that poor spectral resolution explains the lexical processing deficits in CI users,8,29 it is not known how much spectral resolution would be needed to restore the temporal dynamics of word recognition via CIs. While it is impossible to exactly replicate CI processing acoustically, 8-channel vocoder processing has been shown to yield speech recognition performance similar to that of the best-performing listeners with CIs,26,27,30 indicating that listeners with CIs presumably experience a similar degree of spectral degradation in their everyday life. In this study, we used noise-band vocoder simulations to explore how many channels (experiment 1), and how little channel interaction (experiment 2), are required to achieve performance comparable to acoustic (full spectrum) hearing on the gated word recognition task.
2. Experiment 1
In the first experiment, gated word recognition performance was measured using noise-band vocoder processed speech, where spectral resolution was altered by changing the number of spectral channels. Stimuli were either unprocessed or vocoded with 20, 16, 12, or 8 spectral channels. We hypothesized that listeners would require a greater amount of word-initial information (resulting in prolonged IPs) to predict the target words for speech processed with eight channels, consistent with the findings reported by Patro and Mendel.8 However, we expected the IPs to decrease progressively as spectral resolution improved (by increasing the number of vocoder processing channels).
Seventeen native speakers of American English (three men and 14 women, age range: 20–29 years) who had normal hearing (15 dB HL or better hearing thresholds at the audiometric frequencies of 250–8000 Hz), with no recent (self-reported) history of otological problems, participated in this experiment. Nearly all participants were recruited from undergraduate/graduate programs in the Speech-Language Pathology and Audiology department at Towson University, and most of them were naive to vocoded/gated speech. All listeners were fully informed about the study protocol and gave written informed consent prior to data collection.
2.2 Stimuli and procedure
Gated word stimuli were generated using the procedure described in Patro and Mendel.8 The test stimuli for this experiment consisted of monosyllabic words taken from the consonant–nucleus–consonant (CNC) corpus.31 A total of 100 words were selected and divided into five word lists, each containing 20 words. Each word list was used for one condition: full spectrum, 20, 16, 12, or 8 channels. The selection of these specific conditions was based on preliminary pilot evaluations involving a sample of seven participants. During the pilot testing, participants were presented with 8 channels, and it was observed that they required the entire word to be presented before they could identify the target word. Interestingly, even with exposure to the entire word, some participants still faced challenges in identifying the word with 8 channels. However, as the number of channels increased to 12 and 16, we observed greater variability in the data. With 20 channels, a substantial majority of the participants demonstrated a level of performance comparable to that observed in the full-spectrum condition during the gated word recognition task. The target words were shortlisted based on their lexical properties (neighborhood density and frequency of occurrence). The shortlisted words had medium-to-dense neighborhood size (i.e., 13–35 similar sounding words) and medium-to-high frequency of occurrence (9–27 times), determined using an online calculator developed by Balota et al.32,33 The first phoneme of each word was left unaltered, and gating of the time waveform started from the second phoneme onwards. We used a gate size of 33 ms,6,7 and the speech segments were progressively incremented until the gate contained the entire word. A 5 ms raised cosine ramp was applied to reduce spectral artifacts. The word gating stimuli were generated using MATLAB (Mathworks, Natick, MA). Figure 1 schematically shows the gated word recognition paradigm.
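The gating scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' MATLAB code: the function names `gate_end_times` and `offset_ramp`, the example word durations, the 44.1 kHz sampling rate, and the choice to apply the ramp at gate offset are all hypothetical.

```python
import math

GATE_MS = 33.0  # gate size reported in the text
RAMP_MS = 5.0   # raised-cosine ramp duration reported in the text


def gate_end_times(first_phoneme_ms, word_ms, gate_ms=GATE_MS):
    """Return the end time (ms) of each successive gate.

    The first gate ends at the offset of the first phoneme; every later gate
    adds gate_ms until the gate contains the entire word.
    """
    if not 0 < first_phoneme_ms <= word_ms:
        raise ValueError("first phoneme must end inside the word")
    ends = [first_phoneme_ms]
    while ends[-1] < word_ms:
        ends.append(min(ends[-1] + gate_ms, word_ms))
    return ends


def offset_ramp(n_samples):
    """Raised-cosine gain that falls from just below 1 to 0 over n_samples."""
    return [0.5 * (1.0 + math.cos(math.pi * (i + 1) / n_samples))
            for i in range(n_samples)]


# Hypothetical word: 90 ms first phoneme, 500 ms total duration.
ends = gate_end_times(90.0, 500.0)
# 5 ms ramp at an assumed 44.1 kHz sampling rate (the rate is not given in the text).
ramp = offset_ramp(int(RAMP_MS * 44.1))
print(len(ends), ends[:3], ends[-1])
```

Multiplying the last `len(ramp)` samples of each gated waveform by `ramp` would taper the gate offset; whether the original stimuli were ramped at onset, offset, or both is not stated in the text.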
Noise-band vocoder processing was applied to the gated stimuli using the technique that is widely used to simulate CIs.34 The preprocessing input bandwidth was set at 150–8000 Hz. The speech was bandpass filtered into a variable number of spectral bands, depending on the test condition (20, 16, 12, or 8) using fourth-order Butterworth filters (filter slopes: −24 dB/octave). The band cutoff frequencies were distributed according to the Greenwood matching function.35 The speech envelope was then extracted from each spectral band using half-wave rectification and low-pass filtering with a 160 Hz cutoff frequency (–24 dB/octave roll-off). The extracted envelope from each band was used to modulate a corresponding bandpass-filtered noise. Finally, outputs from each of the channels were combined to create noise-vocoded stimuli. Along with the four vocoder conditions, the participants were tested on a control full-spectrum condition.
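To make the band allocation concrete, the following sketch computes cutoff frequencies that are equally spaced along the cochlea according to the Greenwood frequency-position function, over the 150–8000 Hz input range given above. The constants (A = 165.4, a = 2.1, k = 0.88) are the commonly cited human values and are an assumption here, since the text does not list them; the function names are hypothetical, and the filtering and envelope extraction steps are not shown.

```python
import math

# Greenwood frequency-position function for the human cochlea:
#   F = A * (10 ** (a * x) - k), with x the proportional distance from the apex.
# These constants are assumed; the text does not report them.
A, a, k = 165.4, 2.1, 0.88


def freq_to_place(f_hz):
    """Proportional cochlear place (0 = apex, 1 = base) for a frequency in Hz."""
    return math.log10(f_hz / A + k) / a


def place_to_freq(x):
    """Frequency in Hz for a proportional cochlear place."""
    return A * (10 ** (a * x) - k)


def band_edges(n_channels, f_lo=150.0, f_hi=8000.0):
    """Return n_channels + 1 cutoff frequencies equally spaced in cochlear place."""
    x_lo, x_hi = freq_to_place(f_lo), freq_to_place(f_hi)
    step = (x_hi - x_lo) / n_channels
    return [place_to_freq(x_lo + i * step) for i in range(n_channels + 1)]


# Channel counts used in Experiment 1.
for n in (8, 12, 16, 20):
    print(n, [round(f) for f in band_edges(n)])
```

Because the map is exponential in place, equal steps in place give bands that widen toward high frequencies, so adding channels narrows every band and most strongly subdivides the low-frequency region in absolute terms.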
The participants were seated in a double-walled sound-attenuating booth. The stimuli were played on a personal computer using WavePad (NCH Software, Greenwood Village, CO) and presented monaurally to a randomly chosen ear at 75 dB SPL via a Lynx Hilo sound card (Lynx Studio Technology, Costa Mesa, CA) and Sennheiser HD650 headphones (Sennheiser, Old Lyme, CT). Participants were instructed that they would hear the initial segments of the target words and that progressively longer portions of each word would then be presented. Their task was to predict (and verbally report) the target words after each presentation, regardless of their confidence about the identity of the targets. Presentation of progressive gates was continued until the target was correctly identified. After the correct identification of the target word, three more gates were presented to make sure that the word was identified correctly and not a mere guess. Presentation of the stimuli was continued until the entire target word was presented. The IP was calculated as the point in the word where the listener first identified the target word successfully without a change in their response after three successive presentations.5,8,10,36 If a target word was not identified correctly, even after the presentation of the entire word, the total duration of the word plus the gate size (i.e., 33 ms) was used to estimate the IP.37,38 To account for the differences in target word duration, the obtained IPs (in ms) for each word were converted to IP percentage (percent of total word length at which a target word was correctly identified), using the following formula: $\mathrm{IP}\,(\%) = \dfrac{\mathrm{IP}\,(\mathrm{ms})}{\text{total word duration}\,(\mathrm{ms})} \times 100$. We implemented randomization techniques to achieve unbiased allocation of conditions and word lists. The presentation of conditions was randomized among participants, and the five word lists were randomly assigned to different experimental conditions for each participant.
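The scoring rule for the isolation point can be sketched as follows. This is a hedged illustration rather than the authors' analysis script: the function names, the example responses, and the handling of words identified within the last three gates (confirmation is checked only over the gates that remain) are assumptions.

```python
GATE_MS = 33.0  # gate size reported in the text


def isolation_point_ms(responses, gate_ends_ms, target, n_confirm=3, gate_ms=GATE_MS):
    """Return the isolation point (ms) for one word.

    responses[i] is the listener's guess after the gate ending at gate_ends_ms[i].
    The IP is the end of the first gate at which the target is reported and the
    response stays on the target for the next n_confirm gates (or for all
    remaining gates if fewer are left, which is an assumption). If no gate
    qualifies, the IP is the word duration plus one gate, as described in the text.
    """
    if len(responses) != len(gate_ends_ms):
        raise ValueError("need exactly one response per gate")
    word_ms = gate_ends_ms[-1]
    for i in range(len(responses)):
        window = responses[i:i + n_confirm + 1]
        if all(r == target for r in window):
            return gate_ends_ms[i]
    return word_ms + gate_ms


def ip_percent(ip_ms, word_ms):
    """Express the IP as a percentage of the total word duration."""
    return 100.0 * ip_ms / word_ms


# Hypothetical trial: the target "cab" is first guessed at gate 2, abandoned,
# then reported stably from gate 4 onward.
gate_ends = [90.0, 123.0, 156.0, 189.0, 222.0, 255.0, 288.0]
responses = ["cat", "cab", "cap", "cab", "cab", "cab", "cab"]
ip = isolation_point_ms(responses, gate_ends, "cab")
print(ip, ip_percent(ip, gate_ends[-1]))

# A word that is never identified is scored as word duration + 33 ms.
missed = isolation_point_ms(["cat"] * 7, gate_ends, "cab")
print(missed, ip_percent(missed, gate_ends[-1]))
```

Note that under this rule a word that is never identified receives an IP percentage above 100, because one gate is added to the full word duration.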
A short training session was provided before actual data collection began to familiarize the listeners with the task and the conditions. The entire test took 30–40 min to complete.
Results of the first experiment are displayed in Fig. 2. The results suggested that the gated word recognition performance improved with each incremental enhancement in spectral resolution. A repeated measures analysis of variance (ANOVA) with a Greenhouse–Geisser correction determined that mean IPs differed significantly between spectral resolution conditions [F(3.23, 51.67) = 76.2, p < 0.001, ηp2 = 0.83]. To account for multiple comparisons, a more restrictive α level of 0.01 was employed for the post hoc comparisons. The mean IPs obtained in the 8-channel condition were found to be significantly longer than those in the 12-channel (Mdiff = 17.35, p < 0.01, d = 1.7, 95% CI [9.17, 25.53]), 16-channel (Mdiff = 33.18, p < 0.01, d = 3.2, 95% CI [26.69, 39.66]), 20-channel (Mdiff = 47.23, p < 0.01, d = 4.54, 95% CI [40.86, 53.60]), and full spectrum (Mdiff = 48.19, p < 0.01, d = 5.71, 95% CI [41.95, 54.29]) conditions. Moreover, the mean IPs in the 12-channel condition were significantly longer than those in the 16-channel (Mdiff = 15.82, p = 0.01, d = 1.44, 95% CI [7.28, 24.36]), 20-channel (Mdiff = 29.88, p < 0.01, d = 2.71, 95% CI [22.34, 37.41]), and full spectrum (Mdiff = 30.76, p < 0.01, d = 3.31, 95% CI [24.81, 36.72]) conditions. Additionally, the mean IPs in the 16-channel condition were significantly longer than those in the 20-channel (Mdiff = 14.06, p < 0.01, d = 1.26, 95% CI [7.63, 20.48]) and full spectrum (Mdiff = 14.94, p < 0.01, d = 1.6, 95% CI [8.12, 21.76]) conditions. However, the gated word recognition performance for full-spectrum speech did not exhibit a significant difference compared to that for 20 channels of vocoder processed speech (Mdiff = 0.88, p = 1.0, d = 0.09, 95% CI [ –6.76, 8.52]).
For full-spectrum speech, listeners required just over half of the total duration of the target words for correct identification. This finding is consistent with previous studies.7,8 However, when the number of channels was < 20, listeners generally showed delayed IPs for correct word identification. These findings can be attributed to the influence of degraded signal quality on listeners' perceptual uncertainty regarding initial segments of the target words. Consequently, they are more likely to consider a wider range of potential lexical candidates, which can delay the word recognition process. McMurray et al.39 demonstrated that inaccurate identification of word onset information, under conditions of reduced spectral resolution, can lead to the activation of misleading word candidates in the mental lexicon. Therefore, additional acoustic evidence may be necessary to suppress the activation of incongruent lexical candidates, which can lead to delays in the word recognition process. While it is generally accepted that four spectral channels are sufficient to achieve reasonably good vowel and consonant recognition,34 from our results, it appears that even 16 spectral channels may not be enough to guarantee typical processing of spoken words. With 20 channels, the estimated IPs were comparable to those for full-spectrum speech, possibly because acoustic degradations were too subtle to cause any significant word processing deficits and the listeners could still extract the speech cues necessary for achieving near perfect gated word recognition performance.
3. Experiment 2
Experiment 1 indicated that the minimum number of channels required for typical gated word recognition performance is 20. However, this number may be an overestimate when modelling the consequences of the spectral degradation that CI listeners routinely experience, because the effects of interactions between the channels were not taken into account. In experiment 2, variable synthesis filter slopes were used to simulate different degrees of channel interaction.
Fifteen (two men, 12 women and one other, age range: 20–28 years) native speakers of American English with normal hearing participated in this experiment. The participant recruitment and selection criteria were the same as for Experiment 1, but none of the listeners in Experiment 1 participated in Experiment 2.
3.2 Stimuli and procedure
The five word lists used in experiment 1 were also used here. Word gating was applied to the stimuli using the same procedure described in experiment 1. Vocoder processing parameters were similar to those used in Experiment 1, but the carrier filter slopes of each channel were varied to simulate broad or limited channel interaction.26 Steeply sloping filters typically have more frequency specificity, whereas gradually sloping filters have relatively poorer frequency specificity due to channel overlap.40 The input acoustic signal was bandpass-filtered into either 12 or 16 spectral bands, and the slope of the carrier filters was either −24 dB/octave (“broad” channel interaction) or −48 dB/octave (“narrow” channel interaction) depending on the test condition. The selection of test conditions in this experiment was based on pilot evaluations, which involved using filter slopes ranging from −8 to −48 dB/octave for gated word recognition. The results revealed that achieving performance comparable to full-spectrum speech required as many as 30–48 channels with filter slopes of −8 to −16 dB/octave. However, due to the practical limitations of most modern CIs, which typically offer 12–24 physical electrode contacts, this high number of channels is not feasible. In contrast, when using −24 and −48 dB/octave filter slopes, participants displayed greater variability in performance with 12–16 channels, but most participants showed performance comparable to full-spectrum speech with −48 dB/octave and 16 channels. Therefore, we decided to use the filter slopes of −24 dB/octave and −48 dB/octave in this experiment, as they provided a reasonable compromise between data variability and the limited number of channels available in commercial CIs. Thus, the participants were tested on five listening conditions: full spectrum, 12 channels-broad, 12 channels-narrow, 16 channels-broad, and 16 channels-narrow.
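As a rough illustration of why the carrier filter slope controls channel interaction, the sketch below uses an idealized straight-line skirt to estimate how far a band's output has fallen at a given distance beyond its cutoff. The model, the function name `skirt_attenuation_db`, and the 1000 Hz example edge are assumptions for illustration only; real Butterworth skirts approach these slopes only asymptotically.

```python
import math


def skirt_attenuation_db(f_hz, edge_hz, slope_db_per_octave):
    """Attenuation (dB, non-negative) at f_hz for an idealized filter skirt.

    The skirt is modelled as a straight line on a log-frequency axis that
    starts at the passband edge edge_hz and falls at the given slope.
    """
    octaves = abs(math.log2(f_hz / edge_hz))
    return abs(slope_db_per_octave) * octaves


# Hypothetical band with an upper cutoff at 1000 Hz, probed half an octave
# above the cutoff (about 1414 Hz), i.e. inside a neighboring channel.
edge = 1000.0
probe = edge * 2 ** 0.5
broad = skirt_attenuation_db(probe, edge, -24.0)   # "broad" condition
narrow = skirt_attenuation_db(probe, edge, -48.0)  # "narrow" condition
print(round(broad, 1), round(narrow, 1))
```

Under this simplification, noise from one carrier band leaks into its neighbor at a level roughly twice as many decibels lower in the narrow condition as in the broad condition, which is the sense in which steeper slopes reduce simulated channel interaction.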
The presentation of conditions was randomized among participants, and the five word lists were randomly assigned to different experimental conditions for each participant. The instructions, procedure, and data analysis in Experiment 2 were exactly the same as for Experiment 1.
The results of this experiment are shown in Fig. 3. Overall, the results of this experiment suggested that reducing the spectral spread resulted in a significant reduction in IPs with 16 channels, while a similar improvement was not observed with 12 channels. A repeated measures ANOVA with a Greenhouse–Geisser correction was performed to compare the effect of test condition on the IPs. The results revealed that mean IPs differed significantly between spectral resolution conditions [F(2.4, 33.66) = 42.25, p < 0.001, ηp2 = 0.75]. Post hoc pairwise comparisons (α level was set at 0.01) revealed that the IPs for 12 channels-broad and 12 channels-narrow conditions did not differ significantly (Mdiff = 1.33, p = 1.0, d = 0.07, 95% CI [−1.52, 4.2]). However, the IPs for 16 channels-narrow condition were found to be significantly shorter than those for 16 channels-broad condition (Mdiff = 15.80, p = 0.009, d = 1.6, 95% CI [7.7, 23.89]). Further, the mean IPs across 16 channels-narrow and for full spectrum conditions did not differ significantly (Mdiff = 1.0, p = 1.09, d = 0.1, 95% CI [−5.23, 7.23]).
The results demonstrate a significant improvement in gated word recognition performance when narrowing the spread of excitation using 16 channels of vocoder processing. This improvement can be attributed to the enhanced spectral resolution and reduced channel interaction provided by this configuration. Conversely, employing a broad excitation spread with 16 channels yielded inferior performance, likely due to the detrimental effects of spectral smearing on speech recognition.40 In this experiment, our findings also indicate that listeners need a minimum of 16 independent channels to achieve similar performance as acoustic (full spectrum) hearing in the gated word recognition task.
Further, reducing the spectral spread resulted in a significant reduction in IPs with 16 channels, while a similar improvement in IPs was not observed with 12 channels. The differential effect of narrowing the spectral spread on gated word recognition performance between the 12 channel and 16 channel conditions can be explained by considering the trade-off between spectral resolution and channel interaction. First, the increased number of spectral channels allows for finer spectral resolution and improved discrimination of phonetic cues.41 Additionally, the slope of the carrier filters plays a crucial role in the level difference between formant peaks and spectral valleys.26 By narrowing the filter slopes, the level differences are accentuated, resulting in clearer and more distinct spectral patterns. This heightened contrast may facilitate more precise discrimination of acoustic cues important for speech perception.26 It is apparent from Fig. 4 that reducing spectral spread with 16 channels improves the preservation of formant structure, including formant frequencies and the valley between the formants (2nd and 3rd formants, in this example). The improvements in formant structure achieved by reducing spectral spread can enhance listeners' ability to resolve and discriminate peak formant frequencies, which are crucial for identifying vowels and consonants.26 This improved resolution can also have a positive impact on recognizing word onset information and potentially enhance gated word recognition performance. On the contrary, narrowing the spectral spread with 12 channels does not consistently increase the level difference between formant peaks and spectral valleys. This lack of formant differentiation can potentially hinder the recognition of word onset information and its subsequent impact on gated word recognition performance.
4. General discussion
Previous research has indicated that listeners typically require a greater amount of word onset information and exhibit delayed IPs for accurate word identification in the presence of background noise.6,7 These findings suggest that listeners rely heavily on word-final information to successfully recognize the target words. This reliance on word-final information can be attributed to the reduced intelligibility of acoustic features that distinguish the target from lexical competitors, leading to uncertainty regarding the exact identity of the target words. Our results from both experiments corroborate previous findings that IPs are generally delayed for spectrally degraded speech.6,7 Identifying the specific factors that contribute to such word processing delays based solely on data obtained from the gating task is challenging. However, studies using the visual world recognition paradigm, which provides information about the dynamics of real-time lexical processing, have revealed important findings. The results have shown that listeners tend to show increased lexical competitor activation (e.g., of words that rhyme with the target word) and/or reduced suppression of competitors in the mental lexicon (i.e., they keep the competing lexical candidates accessible in case later occurring information requires a revision of their decision).29 Such changes in lexical processing for spectrally degraded speech could help listeners correct their misperceptions using later occurring word segments, but they may delay the overall word recognition process. In a recent study by Winn and Teece,42 CI listeners were presented with sentences containing distorted or masked words, allowing for mental repair based on contextual cues. Pupil dilation, synchronized with specific perceptual landmarks, served as an objective measure of listening effort. Their findings demonstrated a notable increase in cognitive effort when mentally repairing a misperceived word, even in cases where the verbal response was ultimately accurate. Their results support the notion that the correction of misperceptions is associated with a delay in processing, highlighting the impact of cognitive resources on speech perception in CI users.
Previous studies have demonstrated that individuals with NH can achieve high levels of word intelligibility (> 90%) with as few as four channels of noise-band vocoder stimulation.34,43 However, it is important to note that word intelligibility scores alone may not fully capture the specific word processing deficits caused by spectral degradation. The results of this study provide evidence that higher spectral resolution may be required to restore the temporal aspects of word processing. Tasks such as gated word recognition shed light on the temporal dynamics of word recognition and thereby support a more comprehensive understanding of the mechanisms that govern language processing. Most modern CIs have between 12 and 24 active channels with a current spread equivalent to filter slopes of between −8 and −24 dB/octave (depending on the mode of stimulation).40,44 It seems unlikely that current CI devices will meet the constraints identified in this study, considering that the minimum number of channels required for restoring typical gated word recognition performance appears to be at least 20 (with filter slopes of −24 dB/octave). As a result, it is probable that most CI users experience notable delays in lexical access and may need to wait until they have heard the complete word before successfully identifying it.
It is worth noting that although vocoder simulations attempt to emulate the experience of listening via CIs, they do not account for several factors that are specific to CI processing, such as pulsatile stimulation, channel peak-picking, and reduced dynamic range. In addition, unlike actual CI users, the participants with NH listening to vocoder simulations have not had experience processing spectrally degraded signals. Therefore, the results of this study may not provide an accurate estimate of CI users' gated word recognition performance. Furthermore, the influence of age at deafness onset on gated word recognition performance among CI users is a critical factor that demands careful consideration. Individuals with prelingual deafness, who have experienced hearing loss from a very early age or since birth, may demonstrate distinctive lexical access patterns compared to those with postlingual deafness, where hearing loss occurs after acquiring some language skills. These differences in age at deafness onset could potentially impact the strategies and effectiveness of word identification during gated word recognition tasks for CI users. It should also be noted that the derived IPs reveal little about the nature of real-time lexical competition, the suppression of lexical competitors, and/or the exact time required for lexical access. Tasks such as the visual world recognition paradigm may be an effective tool for precisely characterizing the processing of spoken words. Finally, it is important to acknowledge several stimulus-related factors that were not adequately addressed in our study. One significant limitation was the absence of phonemic/phonetic balance in the word lists across different conditions. This oversight may have resulted in an uneven distribution of speech sounds or phonetic properties in the stimuli, potentially introducing biases into the experimental outcomes.
Another limitation pertains to the potential impact of lexical competition, specifically related to phoneme position, on the determination of IPs. Although we used the initial phoneme as the first gate, we did not specifically investigate situations where all competitors solely differed in the initial phoneme compared to those differing only in the final phoneme. Consequently, the investigation of potential earlier IPs associated with variations in the initial phoneme was not directly conducted. It is crucial for future research to address these limitations comprehensively to enhance the scientific rigor and validity of the findings. Future work on gated word recognition with vocoder speech could involve analyzing error patterns based on verbal recordings. By examining the different ways participants respond to the stimuli, researchers can gain valuable insights into how speech processing unfolds and what specific aspects of the signal are being accurately or inaccurately perceived. Analyzing participant responses can also provide indications of coarticulation cue perception, accuracy of consonant and vowel features, lexical activation, and potential errors in word recognition. By examining the nature and frequency of these errors, researchers can gain a deeper understanding of the challenges and limitations associated with speech perception in vocoder simulations.
The aim of this study was to investigate the effect of spectral resolution on gated word recognition. Two experiments were conducted to explore the number of vocoder channels (experiment 1) and the level of spectral spread (experiment 2) required to achieve performance comparable to full spectrum speech in the gated word recognition task. Results from the first experiment suggested that gated word recognition performance improved as spectral resolution was increased incrementally, and the listeners required 20 channels of spectral resolution to achieve performance levels similar to full-spectrum speech in the gated word recognition task. The results of the second experiment demonstrated that listeners needed a minimum of 16 channels with a narrow spread of excitation to achieve performance levels similar to those observed with full spectrum speech on the gated word recognition task. Overall, our results indicate that degradations in signal acoustics can affect the temporal aspects of speech processing, which are not typically captured by speech intelligibility performance alone.
The authors thank Dr. Matthew Winn for very helpful discussions of this research.