Speech recognition was measured in 24 normal-hearing subjects for unprocessed speech and for speech processed by a cochlear implant Advanced Combination Encoder (ACE) coding strategy, in quiet and at various signal-to-noise ratios (SNRs). All signals were low- or high-pass filtered to avoid ceiling effects. Surprisingly, speech recognition performance plateaued at approximately 22 dB SNR for both speech types, implying that ACE processing has no effect on the upper limit of the effective SNR range. Speech recognition improved significantly above 15 dB SNR, suggesting that the upper limit used in the Speech Intelligibility Index should be reconsidered.
1. Introduction
Cochlear implant (CI) users often have difficulty understanding speech in noise, even those with good speech recognition (SR) abilities in quiet (Kaandorp et al., 2015). In many real-life situations considered “quiet,” a low-intensity background noise is still present. For normal-hearing (NH) listeners, low-intensity noises do not affect SR scores (i.e., no energetic masking). Thus, SR scores at these signal-to-noise ratios (SNRs) equal SR scores in quiet. Common CI speech coding strategies like ACE (Advanced Combination Encoder), an n-of-m channel selection algorithm, are sensitive to noise due to erroneous maxima selection and the introduction of noise in the speech gaps (Qazi et al., 2013). This could have a detrimental effect on SR, even at high SNRs. Therefore, we speculated that low-intensity background noises may hamper SR for CI-processed speech to a greater extent than for normal speech, and that the SNR at which SR scores equal performance in quiet might differ between CI users and NH listeners.
Both the width and limits of the speech dynamic range have been studied extensively. The widely used value of 30 dB for the speech dynamic range was first proposed by Beranek (1947). He based this value on the short-term speech spectrum measurements (i.e., estimates of the physical intensity range of speech in quiet) performed by Dunn and White (1940). The 30-dB range was adopted in the Articulation Index (ANSI S.3.5-1969), the Speech Intelligibility Index (SII) (ANSI S.3.5-1997), and the Speech Transmission Index (Steeneken and Houtgast, 1980).
The upper limit of the dynamic range has been assumed either to be 12 or 15 dB above the root mean square (rms) speech level for all frequencies (Beranek, 1947; ANSI S.3.5-1969, 1969; ANSI S.3.5-1997, 1997) or to vary across the frequency spectrum (French and Steinberg, 1947). The SII model assumes a uniform distribution of speech information across the 30-dB range, with the lower and upper limit of the effective dynamic range of speech (EDRS) at −15 and 15 dB relative to the long-term rms speech level at all frequencies. This implies that when mixed with steady-state long-term average speech spectrum (LTASS) noise, all acoustic cues are accessible to a NH listener at 15 dB SNR, given that the presented stimulus is loud enough to be audible. Similarly, below the SNR of −15 dB, all acoustic information is obscured by noise and SR should be impossible, resulting in a 30-dB wide effective SNR range. The word “effective” refers to these values being based on changes in SR performance rather than physical properties of the speech signal (Studebaker and Sherbecoe, 2002). The aim of this study is to compare the highest SNR at which SR is hindered by steady-state LTASS noise between unprocessed speech and CI-simulated speech.
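The SII assumption described above can be illustrated with a minimal sketch of the band audibility function: within a band, audibility grows linearly from 0 at −15 dB SNR to 1 at +15 dB SNR, so nothing further is gained above 15 dB SNR. This is a simplified reading of ANSI S3.5-1997; level-distortion and hearing-threshold terms are omitted.

```python
def sii_band_audibility(snr_db: float) -> float:
    """Band audibility under the SII's 30-dB effective dynamic range:
    0 below -15 dB SNR, 1 above +15 dB SNR, linear in between
    (simplified; level-distortion and threshold terms omitted)."""
    return min(1.0, max(0.0, (snr_db + 15.0) / 30.0))

# Under this assumption, audibility is already maximal at 15 dB SNR:
assert sii_band_audibility(15.0) == sii_band_audibility(30.0) == 1.0
assert sii_band_audibility(-15.0) == 0.0
assert sii_band_audibility(0.0) == 0.5
```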
2. Methods
2.1 Subjects
Twenty-four NH subjects were enrolled in this study (5 male/19 female). The age range of the participants was 18.8 to 30.6 years; median age 22.1 years. All participants reported Dutch as their native language. Pure-tone thresholds were measured for both ears at octave frequencies 0.25–8 kHz in a sound-treated booth with a clinical audiometer (Decos Audiology Workstation, Decos Systems, Noordwijk, the Netherlands). Only the better ear was used in all experiments; pure-tone thresholds were ≤20 dB hearing level for all frequencies in the better ear. All protocols were approved by the Medical Ethics Committee of VU University Medical Center Amsterdam. The participants enrolled in the study voluntarily and provided written informed consent; they all received reimbursement for their participation.
2.2 Stimuli
We selected Dutch sentences from the VU-98 corpus (Versfeld et al., 1999) as speech material. These were short meaningful sentences, either eight or nine syllables in length, uttered by a female speaker. To prevent differences in results between unprocessed speech and CI-simulated speech due to speech information outside the frequency range of the CI processor filterbank, all speech material was bandpass filtered to match the bandwidth of a Cochlear processor [0.183–7.77 kHz]. Sentences were mixed with steady-state masking noise that had a spectral density equal to the average of all speech stimuli presented. The Nucleus Matlab Toolbox (NMT), software emulating an ACE speech coding strategy, was used to convert the mixed sentences into noise-based vocoder simulations. The overall input level of sounds was kept at 65 dB sound pressure level (SPL). Sound processing included an automatic gain control (AGC), pre-emphasis filter, filterbank, and channel selection algorithm (using default settings of 8 maxima, 900 pulses per second per channel, T-SPL 25 dB, C-SPL 65 dB) to simulate electric pulse patterns. These electrodograms were resynthesized into 22-band vocoder stimuli using the “Resynthesizer” function of the NMT. The resynthesizer function modulates a bandpass filtered pink noise with the temporal envelope of the simulated electric pulse sequence of each channel. Before presentation, vocoder stimuli were filtered with the reverse coefficients of the pre-emphasis filter simulated by the NMT software, to restore the (low) frequencies attenuated by the pre-emphasis filter.
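The n-of-m channel selection at the heart of ACE can be sketched in a few lines: in each analysis frame, only the n channels with the largest envelope amplitudes are retained for stimulation. This is an illustrative sketch, not the NMT implementation; frame rate, compression, and mapping stages are omitted.

```python
import numpy as np

def select_maxima(envelopes: np.ndarray, n: int = 8) -> np.ndarray:
    """n-of-m channel selection (sketch): for each analysis frame, keep
    only the n channels with the largest envelope amplitude and zero the
    rest. `envelopes` has shape (m_channels, n_frames)."""
    out = np.zeros_like(envelopes)
    for t in range(envelopes.shape[1]):
        top = np.argsort(envelopes[:, t])[-n:]  # indices of the n maxima
        out[top, t] = envelopes[top, t]
    return out

# Toy example: 22 channels, one frame; only the 8 largest survive.
env = np.arange(22, dtype=float).reshape(22, 1)
kept = select_maxima(env, n=8)
assert np.count_nonzero(kept) == 8
assert kept[21, 0] == 21.0 and kept[13, 0] == 0.0
```

Noise in the speech gaps is harmful under this scheme because it can dominate the frame-wise maxima, so channels carrying noise rather than speech get selected.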
To prevent ceiling effects in SR scores, all stimuli were filtered, using either a low-pass (LP) filter or a high-pass (HP) filter. Filter cutoff frequencies were based on cutoff frequencies of the 22 channels of a default clinical map, and chosen in such a way that SR in quiet was approximately 90% of words correct. The chosen cutoff frequencies are not critical for the results and were based on pilot testing with five NH subjects. For unprocessed speech, the frequency range corresponding to the eight most apical or eight most basal electrodes was used in the LP and HP condition, respectively. Taking into account the initial bandpass filtering of the entire speech set, the corresponding frequency ranges were [0.183–1.17 kHz] for LP and [2.63–7.77 kHz] for HP. The frequency ranges of the CI-simulated speech were larger, corresponding to the bandwidths of 17 electrode channels. Bandwidths for LP and HP were [0.183–3.98 kHz] and [0.796–7.77 kHz], respectively. For the CI simulation, LP or HP filtering was done after vocoding to avoid interference with the channel selection algorithm. Matlab's “fir1” function was used to generate 2048th-order finite-impulse-response (FIR) filters. Combining LP and HP filtering with unprocessed and CI-simulated speech resulted in four different testing conditions used in this study.
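A Python equivalent of the Matlab `fir1` call can be sketched with `scipy.signal.firwin`, which likewise designs a windowed-sinc FIR filter from the filter order and cutoff. The sampling rate below is an assumption (the text does not state it); the cutoff is the LP cutoff of the unprocessed-speech condition.

```python
import numpy as np
from scipy.signal import firwin, freqz

fs = 16000       # assumed sampling rate; not stated in the text
order = 2048     # filter order used in the study

# Low-pass at 1.17 kHz (unprocessed-speech LP condition). Like Matlab's
# fir1(order, Wn), the number of taps is order + 1.
lp = firwin(order + 1, 1170.0, fs=fs)

# Probe the response in the passband and deep in the stopband.
w, h = freqz(lp, worN=8192, fs=fs)
gain_db = 20 * np.log10(np.abs(h) + 1e-12)
assert gain_db[np.argmin(np.abs(w - 500))] > -1    # passband: ~0 dB
assert gain_db[np.argmin(np.abs(w - 3000))] < -50  # stopband: strongly attenuated
```

With 2049 taps the transition band is very narrow, but the Hamming-window design still bounds the stopband attenuation, which is why the discussion section notes that filter slopes may not fully exclude information from outside the passband.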
2.3 Procedure
Stimuli were presented in a sound-treated booth via Sennheiser HDA200 closed-back headphones using a laptop with an external soundcard (Creative SoundBlaster X-Fi-HD). All stimuli were presented monaurally at a fixed sound level of 65 dB SPL. Testing was performed in four successive blocks, one for each condition (unprocessed LP, unprocessed HP, CI-simulated LP, and CI-simulated HP). The order of the blocks was counterbalanced across subjects according to a Latin square experimental design. Prior to each testing block, a short training of 20 practice sentences was employed to account for learning effects in the intelligibility of CI-simulated or filtered speech. SR was measured for each condition at a wide range of SNRs (8, 12, 15, 20, 25, 30 dB) and in quiet. The SNRs of 8 and 12 dB were used rather than 5 and 10 dB because of the expected strong decrease in SR in this SNR range. For each SNR and stimulus type, 16 sentences were presented. Hence all subjects performed four testing blocks each consisting of 112 sentences; each block took about 20 min to complete. SNRs were presented in random order within a testing block, and sentences were counterbalanced across conditions and randomized across SNRs for each subject. SR was scored as the number of words repeated correctly relative to the number of words in the target sentence. The subjects received no feedback on their performance.
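The scoring rule (words repeated correctly relative to words in the target sentence) might be sketched as follows. The treatment of word order and repeated words is an assumption here; the actual clinical scoring rule may differ.

```python
def score_sentence(response: str, target: str) -> float:
    """Proportion of target words repeated correctly (sketch: words
    correct / words in the target sentence). Word order is ignored and
    each response word can match at most one target word; these details
    are assumptions, not the study's documented rule."""
    target_words = target.lower().split()
    response_words = response.lower().split()
    correct = 0
    for word in target_words:
        if word in response_words:
            correct += 1
            response_words.remove(word)  # each response word counts once
    return correct / len(target_words)

# 5 of the 6 target words repeated correctly:
assert score_sentence("het was een warme dag",
                      "het was een ongekend warme dag") == 5 / 6
```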
2.4 Analysis
SR scores were averaged across subjects for each SNR and fit by nonlinear regression using IBM SPSS 22. Data were fit with a piecewise function which consisted of a sigmoid curve and a horizontal line,

SR(SNR) = 100 / (1 + exp(b1 − b2·SNR)) for SNR ≤ SNRmax,   (1)
SR(SNR) = 100 / (1 + exp(b1 − b2·SNRmax)) for SNR > SNRmax.

In this equation, SR is the speech recognition score in percent correct; parameters b1 and b2 determine the shape of the sigmoid curve. The upper bound of the sub-domain of the sigmoid curve is the value SNRmax; this is the SNR above which supposedly SR no longer improves with increasing SNR.
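A sketch of such a piecewise fit is shown below, assuming a logistic parameterisation for the sigmoid (an assumption; the SPSS nonlinear regression may have used a different form). The data here are synthetic, not the study's.

```python
import numpy as np
from scipy.optimize import curve_fit

def piecewise_sr(snr, b1, b2, snr_max):
    """Piecewise model (sketch): a logistic curve in SNR, held constant
    above snr_max. The exact parameterisation is an assumption."""
    x = np.minimum(snr, snr_max)
    return 100.0 / (1.0 + np.exp(b1 - b2 * x))

# Synthetic data (NOT the study's): a plateau starting near 22 dB SNR.
snr = np.array([8.0, 12.0, 15.0, 20.0, 25.0, 30.0, 40.0])
sr = piecewise_sr(snr, b1=4.0, b2=0.35, snr_max=22.0)

params, _ = curve_fit(piecewise_sr, snr, sr, p0=[3.0, 0.3, 20.0])
fitted = piecewise_sr(snr, *params)
assert np.max(np.abs(fitted - sr)) < 0.5  # the model reproduces the data
```

Because the sigmoid is truncated at SNRmax, the fitted plateau can sit below 100%, which matches SR scores deliberately held near 90% by the LP/HP filtering.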
3. Results
Figure 1 shows the average SR scores of all 24 subjects measured at each SNR and the fitted function [Eq. (1)] for the four conditions. SR does not reach the ceiling value of 100% for any condition, which allows SNRmax to be estimated. The values of SNRmax are listed in Table 1. A two-way repeated measures analysis of variance showed no significant effect of processing (unprocessed vs CI-simulated) on SNRmax, F(1,23) = 0.595, p = 0.489. There was also no significant effect of filtering (LP vs HP), F(1,23) = 3.727, p = 0.066; the interaction between processing and filtering was not significant either, F(1,23) = 0.042, p = 0.839. This suggests that the SNR above which SR does not increase with increasing SNR is not different between unprocessed speech and speech which has been processed by the ACE speech coding strategy. It is evident from Fig. 1 that SR performance has not yet reached a maximum at 15 dB SNR for any stimulus type. Four one-sided t-tests established that all values of SNRmax were significantly higher than 15 dB (p < 0.001).
(Color online) SR at different SNRs, averaged across all 24 subjects with error bars showing the standard error of the mean.
Estimated values of SNRmax and confidence intervals for all four testing conditions. There is no statistically significant difference between any of the estimates of SNRmax. All estimates of SNRmax were statistically different from 15 dB.
Condition | SNRmax (dB) | 95% confidence interval SNRmax (dB) | t-value | df | p
---|---|---|---|---|---
Unprocessed LP | 22.1 | 19.0–25.1 | 5.827 | 23 | <0.001
Unprocessed HP | 21.6 | 19.5–23.6 | 9.789 | 23 | <0.001
CI-simulated LP | 23.1 | 21.3–25.0 | 9.217 | 23 | <0.001
CI-simulated HP | 22.6 | 20.8–24.3 | 9.775 | 23 | <0.001
4. Discussion
For both unprocessed speech and the CI simulation, SR plateaus below the ceiling level of 100% due to LP or HP filtering. SR stops improving with decreasing noise levels at approximately the same SNR for both unprocessed speech and CI-simulated speech. The FIR filters in the current study had slopes <2000 dB/octave for the lowest frequencies, probably not steep enough to ensure that no frequencies outside the nominal bandwidth of the passband influenced speech intelligibility (Warren et al., 2004). However, LP and HP filtering was done after mixing speech and noise, so the SNR was the same within and outside the passband; this makes it unlikely that acoustic information from outside the passband influenced the established value of SNRmax. SR scores are lower for the CI simulation than for unprocessed speech at all SNRs, although a much larger portion of the frequency spectrum had been removed for the unprocessed speech than for the CI-simulated speech. If the bandwidths of the unprocessed and CI-simulated speech had been equal, the difference in performance would have been even larger. So it seems that, as expected, SR is negatively impacted by the removal of temporal fine structure and the maxima selection in the CI simulation. However, we did not find a significant change in the established value of SNRmax, which implies that the ACE speech coding strategy with default clinical parameters is not a determinant of the upper limit of the SNR range for speech in stationary noise.
Figure 2 shows electrodograms of a processed sentence in quiet and at 0, 15, and 22.4 dB SNR. The 22.4 dB SNR electrodogram corresponds to the average value of SNRmax, and 15 dB SNR corresponds to the upper limit of the SII model. The latter is clearly different from the electrodogram of speech in quiet. The added noise introduces undesirable envelope modulations in the signal, and distorts the modulations pertaining to speech. Qazi et al. (2013) identified wrong maxima selection and loss of ON/OFF modulations due to noise filling speech gaps as major causes of loss of speech intelligibility in noisy environments for CI users compared to NH listeners presented with vocoded stimuli. The electrodogram at 22.4 dB shows some noise filling speech gaps, but SR at this SNR equals performance in quiet for the NH listeners in this study. However, SR by CI users with an impaired auditory system and reduced temporal resolution might be affected by these low noise levels. The input sound level could have an effect on the upper limit of the SNR range for the CI-simulation. Because the input sound level was fixed at 65 dB SPL in the current study, the level of the noise was approximately 41 dB SPL at SNRmax. For higher SNRs the level of the noise will further decrease and it becomes less likely that the energy in channels exceeds the threshold set by T-SPL. Then no additional electrical stimulation will occur compared to the quiet condition despite the presence of masking noise. Stimulation for CI users in real-life situations can differ significantly from the experimental situation in our study. The interaction between varying input levels and, for example, the AGC and autosensitivity of the CI processor will very likely give different results from our well-controlled study.
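The level reasoning in the paragraph above can be sketched with simple arithmetic: at a fixed overall presentation level, the noise level falls as the SNR rises, and at sufficiently high SNRs it drops below the processor's threshold level. The sketch below treats the overall level as the speech level and compares a broadband noise level directly with T-SPL, both simplifications; the paper's ~41 dB SPL at SNRmax reflects the exact mixture scaling and per-channel levels.

```python
T_SPL = 25.0  # default threshold level (dB SPL) of the simulated processor

def noise_level_spl(input_spl: float, snr_db: float) -> float:
    """Approximate noise level at a fixed overall presentation level
    (simplified: the overall level is treated as the speech level, so
    the noise sits snr_db below it)."""
    return input_spl - snr_db

# At SNRmax (~22 dB) the noise still exceeds T-SPL; at very high SNRs
# it falls below T-SPL and can no longer add electrical stimulation.
assert noise_level_spl(65.0, 22.4) > T_SPL
assert noise_level_spl(65.0, 45.0) < T_SPL
```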
Electrodograms of the Dutch sentence: “Het was een ongekend warme dag,” which translates to “It was a distinctively warm day,” at different SNRs. (a) corresponds to quiet, (b) to 22.4 dB SNR, (c) to 15 dB SNR, and (d) to 0 dB SNR.
The SII model states that SR performance should not increase above 15 dB SNR, because it is assumed that all acoustic cues are already accessible to the listener. In the current study we found values of SNRmax that are significantly higher than 15 dB for all conditions. These findings suggest that either both limits of the SNR range are shifted to more positive SNRs, or the width of the EDRS and the SNR range exceeds the value of 30 dB adopted by models like the SII. Other authors have estimated the EDRS for unprocessed speech to be larger than 30 dB, either based on the physical properties of speech sound or on measurable SR performance (French and Steinberg, 1947; Fletcher and Galt, 1950; Zeng et al., 2002; Lobdell and Allen, 2007; Rhebergen et al., 2009). It has been shown that for some high-level listening conditions, an EDRS of 40 dB is better suited for predicting speech intelligibility in NH listeners (Studebaker et al., 1999). Also, SR performance data have been published that suggest that the upper limit of the SNR range exceeds 15 dB above the long-term rms value of speech (Studebaker and Sherbecoe, 2002; Leclère et al., 2016). Our results indicate that the width of the EDRS used in models like the SII should be re-evaluated.
In conclusion, the estimated upper limit of the SNR range does not differ for speech processed by the ACE speech coding strategy compared to unprocessed speech. Thus, we did not find evidence that intensities of steady-state background noise hamper SR at lower levels for CI users than for NH listeners. For both unprocessed and CI-simulated speech, performance continued to improve significantly above the value of 15 dB SNR predicted by the SII model. This implies that the limits of the dynamic range of speech used in common speech models need re-evaluation.
Acknowledgments
The authors are thankful to all listeners who participated in the experiments. We would also like to thank Birgit Lissenberg-Witte and Imke Adams for their advice on our data analysis, and the Cochlear Technology Center for providing the NMT.