Many cochlear implant (CI) listeners experience decreased speech recognition in reverberant environments [Kokkinakis et al., J. Acoust. Soc. Am. 129(5), 3221–3232 (2011)], which may be caused by a combination of self- and overlap-masking [Bolt and MacDonald, J. Acoust. Soc. Am. 21(6), 577–580 (1949)]. Determining the extent to which these effects decrease speech recognition for CI listeners may inform the design of reverberation mitigation algorithms. This study compared speech recognition under ideal self-masking mitigation, ideal overlap-masking mitigation, and no mitigation. Under these conditions, mitigating either self- or overlap-masking resulted in significant improvements in speech recognition, both for normal hearing subjects listening to an acoustic model and for CI listeners using their own devices.

Reverberant environments are detrimental to speech recognition for cochlear implant (CI) listeners: a linear increase in reverberation time produces an exponential decrease in speech recognition.1 Addressing reverberation in CIs therefore has the potential to significantly improve speech recognition in difficult listening environments. Acoustically, many reverberation mitigation algorithms require an estimate of the room impulse response (RIR) in order to inverse-filter the reverberant signal and obtain an approximation of the original signal. However, RIR estimation is often computationally demanding and must be repeated whenever the room configuration or the source and microphone locations change (e.g., Ref. 2). Alternatively, methods that mitigate the effects of reverberation directly have been developed for CI listeners. These algorithms often use binary masks that reject channels estimated to be reverberation-dominant. Unfortunately, many of these techniques require knowledge of the anechoic signal, knowledge of the future signal, calibration periods, or extremely high signal-to-noise ratios, making them unsuitable for real-time implementation.1,3–5

An alternative approach is to categorize reverberation effects into early reflections, occurring within 50 ms of the source (self-masking), and the subsequent late reflections (overlap-masking) (e.g., Refs. 6 and 7), and to mitigate each effect independently. Self-masking alters the frequency and temporal information within a phoneme, flattening formant transitions and flattening both the F1 and F2 formants, which often causes diphthongs and glides to be misclassified as monophthongs. In contrast, overlap-masking produces temporal smearing, in which one phoneme is masked by the reverberant energy of a preceding phoneme. This effect is especially detrimental to CI listeners because many CI speech processors present only the envelopes with the maximum energy in a given time window (e.g., Ref. 8). Consequently, CI processors often select higher-energy overlap-masking vowel information rather than low-intensity consonant stimuli, resulting in a loss of the consonant information.1

Dividing reverberation effects into two categories allows a mitigation strategy to be developed for each effect independently. Because self-masking mitigation requires altering the amplitude information of active speech, an estimate of the RIR may be required. Mitigating overlap-masking, by contrast, requires removing the masking stimuli entirely, as these segments do not occur during active speech tokens. Hence, addressing each effect of reverberation separately may allow for an optimized mitigation strategy. However, it is unknown whether mitigating these effects independently will improve CI speech recognition. Previous research has suggested that overlap-masking mitigation may provide more benefit than self-masking mitigation, because the reflections responsible for self-masking may not impede speech recognition for CI listeners as greatly as those responsible for overlap-masking.4,7,9 In contrast, Kokkinakis and Loizou10 found that both forms of masking decreased speech recognition, and that self-masking may be more detrimental than overlap-masking. However, their study only approximately isolated the two effects: self-masking was isolated by replacing reverberant consonants with clean consonants, and overlap-masking was isolated by inserting clean vowels in place of reverberant vowels. These approximations assume that overlap-masking effects are fully contained within consonant stimuli, while self-masking effects are fully contained within vowel stimuli.10

The current study investigates the benefits of mitigating either self- or overlap-masking by ideally mitigating each effect separately and comparing the resulting speech recognition to performance in unmitigated reverberation. These conditions were presented to normal hearing (NH) listeners via an acoustic model and to CI listeners using the Nucleus Implant Communicator (NIC), driven by a research computer.

The participants of this experiment were compensated for their time, and the Institutional Review Board of Duke University approved the use of human subjects in the experiments associated with this study.

Four female and six male NH subjects were recruited for an initial study. Prior to data collection, subjects were asked to self-report any hearing impairments. One subject was an employee at Duke University, while the remaining subjects were undergraduate or graduate students. These subjects were presented with reverberant speech processed by an acoustic model that presents the frequency and temporal information that would be available to a CI listener. CI signal processing was simulated using a sine-wave vocoder consisting of 22 frequency channels. Because many CI processors stimulate 8 to 12 electrodes per time window, the 12 frequency bins with the greatest energy were selected for stimulation during each 2 ms time window in this study.
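
To make the channel-selection stage concrete, the following is a minimal sketch of the n-of-m maxima selection described above, assuming NumPy and a matrix of band envelopes already extracted by the 22-channel filterbank; band filtering and sine-carrier resynthesis are omitted, and the function and variable names are illustrative.

```python
import numpy as np

def select_maxima(envelopes, n_select=12):
    """Retain the n_select highest-energy channels in each 2 ms analysis
    window and zero the remainder, mimicking an n-of-m CI strategy.

    envelopes: (n_channels, n_windows) array of band envelope energies,
               e.g., 22 channels for the acoustic model described above.
    """
    selected = np.zeros_like(envelopes)
    for t in range(envelopes.shape[1]):
        top = np.argsort(envelopes[:, t])[-n_select:]  # 12 largest of 22
        selected[top, t] = envelopes[top, t]
    return selected
```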

One female and three male post-lingually deaf subjects, all users of Cochlear Corporation's devices, participated in a subsequent study. Their ages ranged from 50 to 71 years, and all had more than one year of experience with their devices. The Advanced Combination Encoder processing strategy8 was used for the speech recognition tasks, as this was the strategy used by all participants. Only electrodes activated in each subject's clinical map were used in this study.

Reverberant conditions were simulated using RIRs generated with the Modified Image Source Method (ISM). The Modified ISM is based on the ISM, which calculates a given RIR by modeling image sources in “mirror rooms” extending in all three dimensions. The modified technique improves the accuracy of the existing method, for example, by completing computations in the frequency domain to allow time delays that are not integer multiples of the sampling period.11 Reverberant signals were created with various reverberation times, room dimensions of (10.0 × 6.6 × 3.0) m, a source located at (2.4835, 2.0, 1.8) m, and a microphone positioned at (6.5, 3.8, 1.8) m, as used by Champagne et al.12
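
A sketch of how one reverberant condition might be generated under this geometry is shown below; ism_rir stands in for the Modified ISM simulator of Ref. 11, whose implementation is not reproduced here, so its interface is hypothetical.

```python
import numpy as np

def simulate_condition(anechoic, fs, rt60, ism_rir):
    """Convolve an anechoic signal with a simulated RIR.

    ism_rir is assumed to be a callable implementing the Modified ISM
    (Ref. 11); its signature here is hypothetical.
    """
    room = (10.0, 6.6, 3.0)      # room dimensions (m), after Ref. 12
    source = (2.4835, 2.0, 1.8)  # source position (m)
    mic = (6.5, 3.8, 1.8)        # microphone position (m)
    h = ism_rir(room, source, mic, rt60=rt60, fs=fs)
    return np.convolve(anechoic, h)  # reverberant signal
```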

Once the reverberant signals were created, three conditions were generated: unmitigated reverberation, and reverberation after ideal self- or overlap-masking mitigation. Unmitigated reverberation required no further processing. Ideal self- and overlap-masking mitigation assumed knowledge of the quiet signal. To label segments of self- and overlap-masking, the uncorrupted stimuli, after CI processing, were smoothed on a channel-by-channel basis with a summing filter spanning five 2 ms time windows and advancing in 2 ms steps. The central pulse of each window was labeled as speech if the window's sum exceeded the subject's psychophysical threshold; otherwise, the pulse was labeled as quiet. For CI listeners, the psychophysical threshold was determined by their clinical settings, while the NH listeners' thresholds were set to a single constant value across all channels and listeners.
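
The labeling step can be sketched as follows, assuming NumPy; the moving-sum smoothing and threshold comparison follow the description above, while the names and array layout are illustrative.

```python
import numpy as np

def label_speech(pulses, threshold, win=5):
    """Label each 2 ms window on one channel as speech (True) or quiet
    (False), using the clean, post-CI-processing pulse train.

    pulses: per-window stimulation energies for a single channel.
    threshold: the listener's psychophysical threshold for that channel.
    """
    # Sum over a sliding block of five 2 ms windows; the central window
    # is labeled speech when the block sum exceeds the threshold.
    smoothed = np.convolve(pulses, np.ones(win), mode="same")
    return smoothed > threshold
```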

This channel-specific smoothing was required to prevent speech segments from being incorrectly labeled as quiet, a consequence of the sporadic nature of CI stimuli, which often alternate between an “on” and “off” state within a speech segment. To ideally mitigate self-masking, speech segments from the non-reverberant stimuli were then inserted into the corresponding segments of the reverberant speech, yielding a signal containing clean speech but retaining reverberant overlap-masking effects. To ideally mitigate overlap-masking, segments that were labeled as quiet in the uncorrupted speech but that contained pulses in the reverberant speech were removed from the reverberant pulse train, yielding a speech token containing self-masking effects but free from overlap-masking.
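
Given those labels, both ideal mitigation operations reduce to simple per-channel substitutions. The sketch below reuses the label_speech output above and represents pulse removal as zeroing; as before, the names are illustrative.

```python
import numpy as np

def ideal_mitigation(clean, reverb, mode, threshold):
    """Apply ideal self- or overlap-masking mitigation to one channel.

    clean, reverb: per-window pulse energies for the anechoic and
    reverberant versions of the same channel.
    """
    speech = label_speech(clean, threshold)  # labels from the clean signal
    out = reverb.copy()
    if mode == "self":
        # Insert clean pulses into active-speech segments; reverberant
        # overlap-masking outside speech is left untouched.
        out[speech] = clean[speech]
    elif mode == "overlap":
        # Remove pulses occurring where the clean signal is quiet;
        # self-masking within active speech is left untouched.
        out[~speech] = 0.0
    return out
```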

An example comparing self-masking mitigation to overlap-masking mitigation is shown in the electrodogram of Fig. 1. To mitigate self-masking, non-reverberant speech tokens would be inserted into the reverberant speech segments represented by the black pulses, while the gray segments containing overlap-masking effects would remain in the signal. To mitigate overlap-masking, the reverberant black segments containing self-masking would remain unaltered, while the gray pulses representing overlap-masking segments would be removed from the stimuli. Figure 2 further illustrates this method. The top left panel of Fig. 2 shows the anechoic stimuli presented in a given channel, while the top right shows a reverberant version of the stimuli. (Segments containing self-masking effects are shown in black, while segments affected by overlap-masking are shown in gray.) The bottom left panel shows speech altered by self-masking mitigation, correcting only the active speech. The bottom right panel shows speech altered by overlap-masking mitigation, as is evident from the removal of the gray overlap-masking pulses. These ideal mitigation strategies are designed to quantify the potential benefit of mitigating self- and overlap-masking independently. Because they require knowledge of both the anechoic and reverberant stimuli, they are not applicable to real-time implementation.

FIG. 1. Sentence stimuli highlighting self-masking and overlap-masking effects.
FIG. 2. One segment of stimuli presented in quiet (top left), reverberation (top right), reverberation after self-masking mitigation (bottom left), and reverberation after overlap-masking mitigation (bottom right).

The NH listeners were presented with sentences from the Hearing in Noise Test (HINT) database13 at reverberation times of 0.5, 1.0, and 1.5 s. No additional noise was added to the signals. For each reverberation time, original reverberant stimuli, reverberant stimuli after ideal self-masking mitigation, and reverberant stimuli after ideal overlap-masking mitigation were presented to the subjects. One list (containing ten sentences) per subject per condition was presented, and the subjects were instructed to type what they were able to hear. Within each reverberation time, the presentation order of the conditions was randomized. Because NH listeners perform similarly to one another, their results can be pooled, so only one repeat per condition per subject was required. Intelligibility scores were graded on the number of correctly identified keywords. Prior to testing, one practice list was presented to each subject in quiet to familiarize the listeners with vocoded speech; during this training session, subjects were instructed to adjust the volume to a comfortable level.

In the subsequent study, four CI listeners were presented with reverberant sentences from the Central Institute for the Deaf (CID) sentence database.14 The HINT database was not used for these listeners, as each had previous experience with those sentences from other studies in our lab. The CID database was selected because, like the HINT database, it consists of everyday sentences. Reverberant sentences after no mitigation, ideal self-masking mitigation, and ideal overlap-masking mitigation were presented at a reverberation time of 1.5 s. No additional noise was added to the stimuli. Subjects were instructed to write what they heard, and percent correct scores were graded on keywords. Three lists (each containing ten sentences) per condition were presented, and the presentation order was randomized. The CI listeners required three repeats per condition because, unlike the NH listeners, their data are often dissimilar and cannot be pooled. The 1.5 s reverberation time was selected because it was the most challenging condition tested with the NH listeners; because the CID sentence database consists of only ten lists, additional reverberation times could not be tested. The CI listening experiment was completed using each subject's clinical parameters and a test speech processor (via the NIC research interface). Because the subjects' everyday parameters were used for testing, no practice list was presented for familiarization.

To analyze the results, a Beta distribution was fit to the data. The Beta distribution describes the probability of correct keyword identification given the total number of correctly identified keywords and the number of keywords presented across the test sentences.15 The results are shown in Figs. 3 and 4. In both figures, the height of each bar represents the mean of the distribution, and the error bars represent the 95% confidence intervals. The lines connecting two bars indicate statistically significant differences in performance between the two conditions. Statistical significance was declared when P(p1 > p2) ≥ 0.95, where pi denotes the probability of correct keyword identification under condition i.
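
One way to compute this comparison is by Monte Carlo sampling from the two Beta posteriors; the sketch below assumes a uniform Beta(1, 1) prior, which may differ from the exact formulation in Ref. 15.

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_better(k1, n1, k2, n2, draws=100_000):
    """Estimate P(p1 > p2), where each p is the posterior probability of
    correct keyword identification given k correct out of n keywords."""
    p1 = rng.beta(k1 + 1, n1 - k1 + 1, draws)  # Beta posterior, condition 1
    p2 = rng.beta(k2 + 1, n2 - k2 + 1, draws)  # Beta posterior, condition 2
    return np.mean(p1 > p2)

# Significance is declared when prob_better(...) >= 0.95.
```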

FIG. 3. NH speech recognition performance resulting from unmitigated reverberant speech and reverberant speech after ideal self-masking mitigation or ideal overlap-masking mitigation.
FIG. 4. CI speech recognition results shown for four subjects (rows 1–4) and for combined subject data (row 5) in unmitigated reverberant speech and reverberant speech after self- or overlap-masking mitigation with a reverberation time (RT60) of 1.5 s.

Figure 3 displays the sentence recognition results for the NH study. Mitigating either self- or overlap-masking produced statistically significant improvements in speech recognition at all three reverberation times. At reverberation times of 1.0 and 1.5 s, mitigating overlap-masking resulted in statistically significantly better speech recognition than mitigating self-masking. In order of increasing reverberation time, unmitigated reverberant speech resulted in mean percent correct scores of 72.8%, 45.3%, and 32.8%; self-masking mitigated tokens resulted in mean scores of 87.2%, 59.0%, and 48.5%; and overlap-masking mitigated speech resulted in mean scores of 87.6%, 73.7%, and 77.8%.

Figure 4 displays speech recognition performance for the CI listeners at a reverberation time of 1.5 s. Results for each subject are shown separately in the first four rows of the figure, while the final row contains the results for the combined subjects' data. For all four subjects and for the combined data, mitigating either self- or overlap-masking produced statistically significant improvements over unmitigated reverberant speech. Further, although both mitigation strategies improved speech perception, self-masking mitigation yielded better performance than overlap-masking mitigation for Subject S1. Although previous studies suggest that self-masking reflections may not be as detrimental to CI speech recognition as overlap-masking reflections,4,7,9 Kokkinakis and Loizou10 found the effects of self-masking to be more detrimental than those of overlap-masking. It is possible, therefore, that some CI listeners experience a greater decrease in speech recognition due to self-masking, while others are more affected by overlap-masking. Unmitigated reverberant speech resulted in mean percent correct scores of 48.0%, 49.0%, 49.0%, and 45.7% for Subjects S1–S4, respectively. Self-masking mitigated speech resulted in mean scores of 82.1%, 69.8%, 81.0%, and 83.0%, and overlap-masking mitigated speech resulted in mean scores of 61.1%, 73.5%, 85.3%, and 78.1% for Subjects S1–S4, respectively.

The results of this study suggest that mitigating either self- or overlap-masking could improve speech recognition for CI listeners in reverberant environments. For three of the four CI listeners, mitigating either effect provided similar improvements relative to unmitigated reverberant speech. For the remaining CI listener, although mitigating either effect improved speech recognition, mitigating self-masking yielded greater performance than mitigating overlap-masking. However, for two of the reverberation conditions presented to the NH listeners, subjects performed better with overlap-masking mitigated speech. These results suggest that both effects are detrimental to speech recognition and that the relative impact of each masking type may vary across subjects.

Because the current study concluded that both self- and overlap-masking are detrimental to CI speech recognition in reverberant environments, speech processing algorithms could benefit from the mitigation of either effect. Self-masking, which alters the temporal and frequency information within a phoneme, may be more difficult to mitigate than overlap-masking, which comprises reflections arriving more than 50 ms after the source signal. Because overlap-masking occurs after a source has terminated, mitigating its effects in CI pulse trains reduces to detecting and removing the overlap-masking pulses. Further, it has been suggested that the late reflections responsible for overlap-masking may be more detrimental to CI speech recognition than the early reflections responsible for self-masking.4,7,9 It may therefore be worthwhile to focus initial development on an overlap-masking mitigation algorithm; however, mitigation of either effect is likely to provide a significant benefit.

The authors would like to thank the subjects who participated in the experiments associated with this study.

1. K. Kokkinakis, O. Hazrati, and P. C. Loizou, “A channel-selection criterion for suppressing reverberation in cochlear implants,” J. Acoust. Soc. Am. 129(5), 3221–3232 (2011).
2. K. Kokkinakis and P. C. Loizou, “Selective-tap blind dereverberation for two-microphone enhancement of reverberant speech,” IEEE Signal Process. Lett. 16(11), 961–964 (2009).
3. O. Hazrati and P. C. Loizou, “Tackling the combined effects of reverberation and masking noise using ideal channel selection,” J. Speech, Lang., Hear. Res. 55(2), 500–510 (2012).
4. O. Hazrati and P. C. Loizou, “Reverberation suppression in cochlear implants using a blind channel-selection strategy,” J. Acoust. Soc. Am. 133(6), 4188–4196 (2013).
5. O. Hazrati, S. O. Sadjadi, P. C. Loizou, and J. H. L. Hansen, “Simultaneous suppression of noise and reverberation in cochlear implants using a ratio masking strategy,” J. Acoust. Soc. Am. 134(5), 3759–3765 (2013).
6. R. H. Bolt and A. D. MacDonald, “Theory of speech masking by reverberation,” J. Acoust. Soc. Am. 21(6), 577–580 (1949).
7. Y. Hu and K. Kokkinakis, “Effects of early and late reflections on intelligibility of reverberated speech by cochlear implant listeners,” J. Acoust. Soc. Am. 135(1), EL22–EL28 (2013).
8. A. E. Vandali, L. A. Whitford, K. L. Plant, and G. M. Clark, “Speech perception as a function of electrical stimulation rate: Using the Nucleus 24 cochlear implant system,” Ear Hear. 21(6), 608–624 (2000).
9. O. Hazrati, J. Lee, and P. C. Loizou, “Blind binary masking for reverberation suppression in cochlear implants,” J. Acoust. Soc. Am. 133(3), 1607–1614 (2013).
10. K. Kokkinakis and P. C. Loizou, “The impact of reverberant self-masking and overlap-masking effects on speech intelligibility by cochlear implant listeners,” J. Acoust. Soc. Am. 130(3), 1099–1102 (2011).
11. E. A. Lehmann and A. M. Johansson, “Prediction of energy decay in room impulse responses simulated with an image-source model,” J. Acoust. Soc. Am. 124(1), 269–277 (2008).
12. B. Champagne, S. Bédard, and A. Stéphenne, “Performance of time-delay estimation in the presence of room reverberation,” IEEE Trans. Speech Audio Process. 4(2), 148–152 (1996).
13. M. Nilsson, S. Soli, and J. A. Sullivan, “Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise,” J. Acoust. Soc. Am. 95(2), 1085–1099 (1994).
14. H. Davis and S. R. Silverman, Hearing and Deafness, 4th ed. (Holt, Rinehart and Winston, New York, 1978), pp. 537–538.
15. S. L. Tantum, L. M. Collins, and C. S. Throckmorton, “Bayesian a posteriori performance estimation for speech recognition and psychophysical tasks,” URL http://www.ece.duke.edu/sites/ece.duke.edu/files/u23/Tantum_etal_BayesianPerformanceEvaluation.pdf (Last viewed October 2013).
Tantum
,
L. M.
Collins
, and
C. S.
Throckmorton
, “
Bayesian a posteriori performance estimation for speech recognition and psychophysical tasks
,” URL http://www.ece.duke.edu/sites/ece.duke.edu/files/u23/Tantum_etal_BayesianPerformanceEvaluation.pdf (Last viewed October 2013).