Theoretical studies demonstrate that the controlled addition of noise can enhance the amount of information transmitted by a cochlear implant (CI). The present study is a proof-of-principle test of whether stochastic facilitation can improve the ability of CI users to categorize speech sounds. Analogue vowels were presented to CI users through a single electrode with independent noise on multiple electrodes. Noise improved vowel categorization, particularly in terms of an increase in the information conveyed by the first and second formants. Noise, however, did not significantly improve vowel recognition: the miscategorizations were simply more consistent, giving the potential for improvement with experience.
1. Introduction
The ability to distinguish between different speech sounds is remarkable given the intense noise intrinsic to signal transduction by hair cells in the inner ear, in particular from the Brownian motion of hair bundles (Denk and Webb, 1989) and the irregular neurotransmitter release at ribbon synapses (Heil et al., 2007). A fundamental question is whether this internal noise limits the categorization of sounds or, rather, enhances it. This question is consequential for the approximately half a million individuals worldwide who rely on cochlear implants (CIs) to hear. In the absence of functioning hair cells, CIs convert sounds into electrical signals to stimulate the cochlear nerve fibres directly [e.g., Clark (2003)]. As well as severely restricting sensory input, the loss of hair cells associated with deafness greatly reduces the amount of intrinsic noise in the peripheral auditory system (Shepherd and Javel, 1997), and it is unknown whether CIs should reintroduce this noise artificially. Generic neural-coding studies have shown that an optimum non-zero noise level can theoretically enhance information transmission, and thereby categorization, by coding the instantaneous amplitude of a signal as the probability of a nerve spike in a population-mediated temporal code (Collins et al., 1996; Levin and Miller, 1996). This phenomenon, where noise (typically additive Gaussian noise) enhances signal transmission, is specific to nonlinear systems and is more broadly referred to as stochastic facilitation (McDonnell and Ward, 2011) or stochastic resonance (Benzi et al., 1982; Stocks, 2000). More specific studies of peripheral coding with physiological (Morse and Evans, 1996) and computational (Morse and Evans, 1999) models of electrical stimulation in CIs suggest that adding optimal noise to electrical inputs might improve the representation of speech by the temporal pattern of nerve discharges.
Such coding, however, is not necessarily in a form that can be decoded to yield improved speech perception; the ability to discriminate sound intensity, for example, might be limited by central auditory processing rather than by the information conveyed by the cochlear nerve (Plack and Carlyon, 1995). Moreover, some CI researchers consider that temporal coding in hearing is limited to frequencies below about 400 Hz (Wilson et al., 1991; Zeng, 2002). It therefore does not follow from the theoretical physiological and computational studies of information transmission that noise will benefit CI users.
Although some studies suggest that noise can enhance the detection (presence/absence) of subthreshold signals in cochlear implantation [e.g., Zeng et al. (2000) and Chatterjee and Robert (2001)], this is distinct from stochastic facilitation of information in general, where there are more than two categories (McDonnell and Ward, 2011). Indeed, for signals that are suprathreshold in the absence of noise, noise levels that theoretically enhance information transmission in a nonlinear system simultaneously degrade detection (Stocks, 2000)—the Shannon-Hartley equation (Shannon and Weaver, 1949) relating signal-to-noise ratio (and therefore detection) to information transmission is applicable only to linear systems. Further, stochastic facilitation of information transmission is theoretically optimized when signals are suprathreshold, and only when information is pooled across nonlinear elements such as nerve fibres (Stocks, 2000). It cannot therefore be inferred from previous behavioural studies on the detection of subthreshold stimuli that noise will enhance speech categorization for suprathreshold stimuli.
Here, we used advances in CI technology to mimic the internal noise present in normal hearing and to determine its effect on vowel categorization by CI listeners. We test the hypothesis that the addition of Gaussian noise—the most common form of noise in studies of stochastic facilitation—improves vowel categorization from a baseline without noise. A characteristic of stochastic facilitation is the existence of an optimal noise level. Because the optimal level was not known in advance, and could vary between participants, we used a range of noise levels as well as a no-noise condition. We asked listeners to categorize vowel sounds when the information-bearing signal was restricted to a single electrode, such that categorization would be based solely on the temporal pattern of evoked spikes, and delivered noise through multiple other electrodes. Because we wished to determine whether noise improves vowel categorization by enhancing fine temporal coding, as in previous physiological and computational studies (Morse and Evans, 1996, 1999), we used simultaneous stimulation on multiple channels (often referred to as analogue stimulation even when stimulus levels are discretized). This contrasts with the speech coding used in most commercially available CI devices, where the signals on the multiple electrodes are temporally interleaved [e.g., Clark (2003)]. Such coding uses envelope extraction, which completely removes the fine temporal structure present in simultaneous stimulation and therefore precludes enhancement of time coding with noise. Channel interaction is thought to be greater for simultaneous stimulation than for interleaved stimulation (Wilson et al., 1991), but in this test paradigm this is mitigated by the use of a single channel for vowel stimulation.
We determined the ability of CI users to categorize the vowels by obtaining confusion matrices from which we measured both the information transmission and the percent correct recognition. We note that these measures are distinct: if, for example, four symbols A, B, C, and D are always perceived as D, C, B, and A, respectively, then categorization is perfect and information transmission is at its maximum level but there is 0% recognition. In such a case, there is essentially an inability to name the categories.
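The distinction between information transmission and percent correct can be made concrete with a short calculation. The following Python sketch (illustrative only; the study's own software was written in C++) computes both measures for the hypothetical four-symbol example above:

```python
import math

def mutual_information(confusion):
    """Mutual information (bits) between stimulus (rows) and response
    (columns) for a confusion matrix of trial counts."""
    total = sum(sum(row) for row in confusion)
    row_p = [sum(row) / total for row in confusion]
    col_p = [sum(row[j] for row in confusion) / total
             for j in range(len(confusion[0]))]
    mi = 0.0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if count:
                p = count / total
                mi += p * math.log2(p / (row_p[i] * col_p[j]))
    return mi

def percent_correct(confusion):
    """Percentage of trials on the diagonal (stimulus named correctly)."""
    total = sum(sum(row) for row in confusion)
    return 100.0 * sum(row[i] for i, row in enumerate(confusion)) / total

# A, B, C, D always perceived as D, C, B, A: categorization is perfectly
# consistent, but no symbol is ever given its correct name
reversed_map = [[0, 0, 0, 20],
                [0, 0, 20, 0],
                [0, 20, 0, 0],
                [20, 0, 0, 0]]

print(mutual_information(reversed_map))  # 2.0 (the maximum for 4 symbols)
print(percent_correct(reversed_map))     # 0.0
```

The reversed mapping transmits the maximum 2 bits per presentation while recognition remains at 0%, which is exactly the dissociation described in the text.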
2. Methods
2.1 Participants
Twelve post-lingually deafened adults (nine females and three males) with at least six months' experience with the HiRes90K CI (Advanced Bionics Corporation, Valencia, CA) participated in this study. The participants were aged between 29 and 81 years, with a mean of 59 years (see supplementary information for full demographics).1 All the participants were unilaterally implanted and none had previous experience with analogue CI coding strategies. The study was approved by the Aston University Ethics Committee and the Research Ethics Committee of the UK National Health Service (RRK 3569).
2.2 Equipment and software
Direct stimulation of the surgically implanted intracochlear electrodes was via a research Platinum Speech Processor and the Advanced Bionics Corporation Streaming Interface (HRStream), which enabled arbitrary simultaneous stimulation of a participant's electrodes from signals generated on a personal computer. The experiment was controlled using our own C++ software together with library functions provided by Advanced Bionics Corporation.
2.3 Stimuli
The stimuli were the four cardinal vowels (/ɑ/, /æ/, /u/ and /i/, as in “cart,” “hat,” “rude,” and “seat,” respectively) and the central vowel /ə/ (as in “the”). The vowels, which were synthesised using a Klatt synthesiser (Klatt, 1980) in cascade mode (20 kHz sample rate, 16-bit resolution), varied in their first and second formant frequencies (F1 and F2, respectively) such that, where possible, vowel pairs shared a common formant [Fig. 1(A)] and recognition of one formant frequency would not uniquely specify an individual vowel. (See supplementary information for further details of the methods, including the Klatt synthesis parameters.1) We intended the vowels to be distinguished only on the basis of timbre, as determined by the formant frequencies. The vowels all had a fundamental frequency of 100 Hz (akin to a male speaker) and a duration of 500 ms. Intensity cues were removed by balancing the cues for loudness and roving the intensity from presentation to presentation. The vowels were charge balanced and presented only to electrode 8 of the 16 possible electrodes [Fig. 1(B)]. Background noise, when used, was applied by charge-balanced simultaneous stimulation of electrodes 5 to 11 with independent Gaussian noise with a sample frequency of 9942 Hz on each electrode [Fig. 1(B)]. The electrodes would be expected to produce slightly different random noise at different locations along the cochlea through position-dependent summation (Morse et al., 2007), thus making stochastic facilitation possible for suprathreshold signals (Stocks, 2000). The relative noise levels were based on behavioural measures, as described below.
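The noise generation step can be sketched as below: independent Gaussian samples are drawn for each of the seven noise electrodes at the 9942-Hz sample rate. This is an illustrative sketch only; unit variance is a placeholder (actual levels were set relative to behavioural thresholds, as described in Sec. 2.4), and charge balancing is omitted.

```python
import random

def make_electrode_noise(n_electrodes=7, fs=9942, dur_s=0.5, seed=None):
    """Independent Gaussian noise for each noise electrode (electrodes
    5 to 11 in the study). Unit variance is an arbitrary placeholder;
    the actual per-electrode levels were behaviourally determined."""
    rng = random.Random(seed)
    n_samples = int(fs * dur_s)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
            for _ in range(n_electrodes)]

noise = make_electrode_noise(seed=0)  # one 500-ms noise burst per electrode
```

Because a single generator is sampled sequentially, the waveforms on different electrodes are statistically independent, which is the property required for suprathreshold stochastic facilitation.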
Stimulus generation and presentation. (A) Vowel space of the stimuli. Five cardinal vowels were synthesized so that there were examples of vowels with low, middle, and high F1 and F2 frequencies. (B). The vowel was presented to electrode 8. Background noise, when used, was applied by simultaneous stimulation of electrodes 5 to 11 with independent Gaussian noise. The red cochlear nerve fibres represent fibres excited by the summed stimulation from multiple electrodes and the blue fibres represent unstimulated fibres.
2.4 Procedures: Setup phase and categorization/recognition task
The experiment was run in two half-day sessions in a sound-proof booth, with the first session used for setting noise and vowel levels and the second for all the vowel categorization runs. The levels of the vowel and additive noise for each participant were based on their behavioural responses, which were collected using a touchscreen monitor. To enable better repeatability, the noise level on each noise electrode was set relative to that electrode's threshold for a 100-Hz sinewave. For each noise electrode in turn, the 100-Hz detection threshold was measured using an adaptive three-interval, three-alternative forced-choice task with an inter-stimulus interval of 1 s; levels were adjusted with a 3-down–1-up procedure tracking the 79.4% correct level (Levitt, 1971). Ten reversals were obtained and the threshold was taken to be the arithmetic mean of the last six reversals. A scaling factor for the noise on each electrode was calculated by dividing the sinewave threshold by the lowest threshold across the noise electrodes. We considered that participants might adapt to the noise, so the overall threshold level for the scaled, simultaneously presented noise was determined using a threshold tracking procedure; analysis of the tracks, however, demonstrated little adaptation to the noise level. In the second part of the setup, the vowels were loudness balanced at approximately the most comfortable loudness level. Loudness balancing was repeated without noise and for all the noise conditions.
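The threshold measurement follows Levitt's (1971) transformed up–down rules. A minimal sketch of a 3-down–1-up track is below; the deterministic responder, starting level, and fixed 2-dB step are illustrative assumptions, not the study's parameters.

```python
def staircase_3down_1up(respond, start_level, step_db=2.0, n_reversals=10):
    """3-down-1-up adaptive track converging on the 79.4% correct point
    (Levitt, 1971). `respond(level)` returns True for a correct trial;
    the threshold estimate is the mean of the last six reversal levels."""
    level, run, direction, reversals = start_level, 0, None, []
    while len(reversals) < n_reversals:
        if respond(level):
            run += 1
            if run == 3:                  # three consecutive correct: step down
                run = 0
                if direction == 'up':
                    reversals.append(level)
                direction = 'down'
                level -= step_db
        else:                             # any incorrect response: step up
            run = 0
            if direction == 'down':
                reversals.append(level)
            direction = 'up'
            level += step_db
    return sum(reversals[-6:]) / 6.0

# Toy check: a responder who is always correct at or above 50 dB
estimate = staircase_3down_1up(lambda level: level >= 50.0, start_level=60.0)
```

With this toy responder the track descends to the 50-dB boundary and then oscillates around it, so the reversal mean lands within one step of the true threshold.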
Following the setup phase, participants categorized the vowels, which were blocked for each of five noise conditions: no noise or noise levels of −2, 0, 2, or 4 dB relative to the noise threshold; the order of the noise conditions was randomized for each participant. The noise was continuous throughout a block and there was a pause of at least 30 s from changing the noise level to the start of a categorization task to enable participants to adapt to the noise. For each noise condition, participants categorized 100 randomly ordered vowels (20 repetitions of each vowel). In each trial, the vowel level was randomly roved from its balance level by up to 10% of a listener's dynamic range to exclude potential loudness cues for categorization. Participants responded by pressing one of five buttons on the touchscreen labelled with the English transcription of the vowel sound (“a,” “aa,” “ee,” “er,” or “oo”) and feedback was given by highlighting the correct button.
The perception of single-channel vowels may be very different from the perception of vowels coded by the participants' everyday speech processor with pulsatile multichannel stimulation. We therefore provided listeners with training in the experimental paradigm. The first type of training was without noise and took place only before the first experimental condition; it was identical to the experimental condition except that only 50 presentations were made, and after each presentation the correct response button was highlighted to help participants learn the association between vowel sounds and the appropriate response. Before each noise condition, listeners completed a practice run with ten repetitions of each vowel at the same noise level as the following test; after each presentation a free response was made by the participant. If the response was incorrect, the participant was able to listen repeatedly to the presented vowel and to the vowel associated with the incorrect response so that they could learn the correct associations.
From the closed set of stimuli and responses at each noise level we calculated a confusion matrix. From this we calculated the percentage correct for vowel recognition and used standard methods from information theory (Shannon and Weaver, 1949; Miller and Nicely, 1955) to calculate the total mutual information (MI) for the vowels. We also calculated the mutual information for F1 and F2 separately. For these calculations, F1 and F2 of each stimulus were coded as “High,” “Mid,” or “Low” depending on their relative positions in the vowel space. We considered both /ə/ and /æ/ to have a middle F2 value. We also calculated the chance levels for the total MI and for the MI of F1 and F2 separately by calculating the mean values for 10 000 randomly generated 5 × 5 confusion matrices, assuming 20 presentations of each of the 5 vowels.
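The chance-level calculation can be sketched as follows: generate random confusion matrices in which each of the 20 presentations of each of the 5 vowels receives a uniformly random response, and average the MI across matrices. This is an illustrative sketch (the seed and the reduced matrix count in the usage line are arbitrary choices, not the study's).

```python
import math
import random

def mutual_information(confusion):
    """MI in bits between stimulus (rows) and response (columns)."""
    total = sum(sum(row) for row in confusion)
    row_p = [sum(row) / total for row in confusion]
    col_p = [sum(row[j] for row in confusion) / total
             for j in range(len(confusion[0]))]
    mi = 0.0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if count:
                p = count / total
                mi += p * math.log2(p / (row_p[i] * col_p[j]))
    return mi

def chance_mi(n_stimuli=5, n_per_stimulus=20, n_matrices=10_000, seed=1):
    """Mean MI over randomly generated confusion matrices. The result is
    small but non-zero because MI estimated from finite counts is biased
    upward; this mean is the chance level used for comparison."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_matrices):
        m = [[0] * n_stimuli for _ in range(n_stimuli)]
        for i in range(n_stimuli):
            for _ in range(n_per_stimulus):
                m[i][rng.randrange(n_stimuli)] += 1  # uniform random response
        total += mutual_information(m)
    return total / n_matrices

level = chance_mi(n_matrices=2000)  # fewer matrices for a quick check
```

The resulting mean is of the same order as the 0.125-bit chance level for total MI quoted in Sec. 3.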
3. Results
We report the ability of CI users to categorize vowels as a whole and the ability to categorize the F1 and F2 formants at different noise levels through measurement of information transmission. We further report the ability of the participants to recognize the vowels.
The total information without noise was above the chance level of 0.125 bits for all 12 participants (range 0.217 to 1.065 bits, mean 0.404 bits). Ten of the participants had F1 MI above the chance level of 0.029 bits (range 0.019 to 0.607 bits, mean 0.248 bits) and seven had F2 MI above the chance level of 0.030 bits (range 0.013 to 0.208 bits, mean 0.061 bits). The addition of low levels of noise (0 and –2 dB noise) had no notable effect on the mean transmission of total information or the mean MI for F1 and F2 [Fig. 2(A)]. The addition of higher levels of noise (+2 and +4 dB noise), however, increased the mean total information conveyed by the vowels, and this paralleled increases in mean MI for F1 and F2. The highest mean total information transmission and the highest mean MI for F1 and F2 were obtained at +2 dB noise. Whilst the mean measures of information were lower with +4 dB noise than with +2 dB noise, they were still greater than the corresponding measures with no noise or low noise (–2 and 0 dB noise). This pattern, in which a moderate level of noise improved vowel categorization compared with no noise or high noise, is characteristic of stochastic facilitation. The MI for F1 was substantially greater than that for F2, even though the chance levels were approximately the same (0.029 and 0.030 bits, respectively).
Results for the 12 post-lingually deafened CI users who identified vowels in a five-alternative forced choice task. Error bars show the standard error about the mean. (A) Mean vowel mutual information (solid red line) at each level, mean F1 mutual information transmission (solid blue line) and mean F2 mutual information transmission (solid black line). The dashed lines plot chance level for the total information (red) and mutual information for F1 and F2 (blue). (B) The difference between the mean vowel mutual information at noise levels of +2 and +4 dB and the baseline calculated by averaging the other three conditions, for each individual listener. (C) Shows the mean percent correct vowel recognition at each noise level. Chance level is shown by the dashed red line.
To investigate further the effect of adding noise to CI stimulation, we averaged the data from the no-noise condition and the two lowest noise levels (which did not lead to notably different results) to generate a more precise baseline for the total MI and for the MI of F1 and F2 separately, and therefore to increase the power of the statistical tests. Performance was variable across listeners, but three distinct patterns can be seen in the total MI with +2 and +4 dB noise relative to this baseline [Fig. 2(B)]: seven listeners benefited most with +2 dB noise (red lines), three listeners benefited most from +4 dB noise (blue lines), and for two listeners the higher noise levels led to a consistent reduction in total information (black lines). A similar pattern was observed for the MI conveyed in F1 and F2 separately. To reduce variability in the data, we averaged the data of the +2 and +4 dB noise conditions to produce a high-noise condition and subtracted the baseline MI from this. The mean difference in total MI between the high-noise and baseline conditions was 0.096 bits, which was significantly different from zero [two-tailed one-sample t-test (t = 2.736, P = 0.019, df = 11) following a Shapiro-Wilk test for normality (W = 0.952, P = 0.661)]. A similar analysis for the MI conveyed by F1 showed that the increase in MI of 0.087 bits from baseline to the high-noise condition was also significantly different from zero (t = 2.633, P = 0.019, df = 11). A Shapiro-Wilk test indicated that the difference in MI conveyed by F2 between the baseline and the high-noise condition was not normally distributed (W = 0.773, P = 0.005); this difference was therefore analyzed using a Wilcoxon signed-rank test, which demonstrated that the increase in MI of 0.061 bits when stochastic noise was added was significantly different from zero (z = −2.353, P = 0.019).
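The per-listener differences were tested against zero with a one-sample t-test. A minimal sketch of the statistic is below; the 12 difference values are placeholders (the individual data are in the supplementary material, not reproduced here), and P-values additionally require the t distribution, e.g. from scipy.stats.

```python
import math
import statistics

def one_sample_t(diffs, mu0=0.0):
    """One-sample t statistic and degrees of freedom for testing whether
    the mean of the per-listener differences equals mu0."""
    n = len(diffs)
    sd = statistics.stdev(diffs)               # sample SD (n - 1 denominator)
    t = (statistics.mean(diffs) - mu0) / (sd / math.sqrt(n))
    return t, n - 1

# Hypothetical high-noise minus baseline MI differences for 12 listeners
diffs = [0.15, 0.08, 0.21, -0.02, 0.12, 0.05,
         0.18, 0.03, 0.09, -0.05, 0.14, 0.17]
t, df = one_sample_t(diffs)                    # df = 11, matching the text
```

Choosing the t-test requires the normality check reported in the text; where the Shapiro-Wilk test fails, as for the F2 differences, a non-parametric Wilcoxon signed-rank test is the appropriate substitute.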
Adding noise to neighbouring channels improved the ability of CI listeners to categorize vowel sounds, but it did not appear to aid vowel recognition per se. Vowel recognition without noise was above the chance level of 20% for all participants (range 22% to 60%, mean 32.6%), and was generally above chance level with added noise from 0 to +4 dB: one participant had a vowel recognition of 19% with 0 dB noise and one had a vowel recognition of 19% with +4 dB noise (see supplementary information for individual data for information transmission and recognition).1 Some improvement in mean vowel recognition across all the vowels was observed with noise [e.g., +2 dB noise in Fig. 2(C)], but a two-tailed one-sample t-test showed that the mean recognition of 36.5% with +2 and +4 dB noise was not significantly different from the mean of 33.3% for the baseline condition (t = 1.591, P = 0.140). Similarly, analysis of vowel recognition for the individual vowels (Fig. 3) demonstrates that, for all vowels, noise at +2 dB led to an increase in the mean vowel recognition compared with the baseline condition, although this increase was marginal for /ɑ/. At +4 dB noise, mean vowel recognition tended to be higher than for the baseline condition (except for /æ/) and lower than for +2 dB noise (except for /ɑ/). A two-way repeated-measures ANOVA, however, showed that neither the noise level (baseline, +2 dB noise, or +4 dB noise) nor the vowel was a significant factor in vowel recognition (F = 2.429, P = 0.111, df = 2 and F = 1.193, P = 0.327, df = 4, respectively).
Mean percentage vowel recognition scores across the 12 participants for baseline noise (average of no noise, −2 and 0 dB) and noise levels of +2 and +4 dB relative to the behavioural noise threshold for each participant. Error bars show the standard error about the mean.
4. Discussion and conclusions
The data indicate that the addition of noise to neighbouring electrodes on a CI array can enhance the overall information provided to CI listeners by vowel sounds presented via a single electrode. Whilst previous studies show noise can theoretically enhance information transmission in CIs [e.g., Morse and Evans (1996, 1999)], to our knowledge this study offers the first proof-of-principle that additive noise can enhance information transmission in CI listeners. This therefore suggests that the application of noise in CIs could be clinically beneficial. The results, however, relate to synthetic vowel stimuli with a 100 Hz fundamental frequency (akin to a male speaker) and further study is required to determine whether these findings generalize to consonants, naturally spoken phonemes, and to female speakers or children. CI performance is typically highly variable [e.g., Clark (2003)]; sources of this variability include, but are not limited to, differences in the degree to which critical structures, notably the auditory nerve fibres, are degenerated [e.g., Clark (2003)], as well as variability in the depth and patency of electrode insertion (Finley and Skinner, 2008). We have also found a large degree of variability in the change in information with the addition of noise and the ability to recognize vowels. Here, variability in internal noise levels following deafness (Shepherd and Javel, 1997) may be an additional factor and the different effect of noise across listeners here is therefore unsurprising. Optimizing levels of added noise for individual listeners might be one way of improving performance across populations of CI users.
However, whilst noise could enhance the overall MI and the MI provided by F1 and F2 separately—listeners could categorize more effectively—this increase in performance did not extend to vowel recognition. This means that with noise participants were still making mistakes recognizing vowels but the mistakes were less random and participants had a greater tendency to confuse a vowel with another vowel that shared the same F1 or F2 frequency. Such errors are unsurprising given that it can take several months to adapt to a new CI coding strategy [e.g., Clark (2003)] and our strategy was markedly different to the normal multichannel strategies used by our listeners. More extensive training, or just greater familiarity with the strategy, might enable participants to exploit the additional information and recognize vowels per se.
This method could be extended to multichannel CI stimulation that exploits both the place and temporal coding of frequency in normal hearing. We note that complete independence of noise at each nerve fibre is currently impossible to achieve with electrical stimulation, given that the deafened ear typically contains about 10 000 nerve fibres (Otte et al., 1978) and current commercial CIs contain up to 22 electrodes [e.g., Clark (2003)]. Correlation between the noise at adjacent fibres would be expected to reduce the effect of stochastic facilitation because it is equivalent to adding noise to the input signal. The development of a method of delivering sufficiently independent noise is therefore a major challenge for this approach. Because of channel interaction, it may also be a challenge to extend the method to simultaneous stimulation of the information-bearing signals. Moreover, it appears that the optimum noise level varies between participants. In this study, information transmission for individual participants changed markedly between 0 and +4 dB noise, with most participants performing best at either +2 or +4 dB noise (the maximum level tested). Retrospectively, our step size of 2 dB for the noise was coarse and the upper level tested may have been too low; it is likely that finer control of the noise level would lead to higher optimum information transmission for each participant. We also note that it is not clear what form the noise should take. Here, we used additive noise because it is the most common form of noise in studies of stochastic facilitation (McDonnell and Ward, 2011). Theoretically, however, multiplicative (signal-dependent) noise, which is also present in the normal auditory system (Evans, 1975), may be advantageous for CIs in terms of lower power requirements and greater information per nerve spike (Morse and Stocks, 2005).
Furthermore, Ruszczynski et al. (2001) have shown increased information transmission in some systems using non-Gaussian noise with a 1/f spectrum, the form of noise inherent in neurones (Derksen and Verveen, 1966).
The findings are also relevant to more general studies of noise in sensory systems. As noted by McDonnell and Ward (2011), physiological experiments have often involved adding exogenous noise to neural systems that already contain sources of endogenous noise [e.g., Douglass et al. (1993) and Levin and Miller (1996)], rather than aiming to control or reduce the endogenous noise. The previously observed improvements in performance due to added noise do not therefore provide evidence for in vivo stochastic facilitation, because it was not demonstrated that performance was better than it would be without any noise at all. Here, we do provide such evidence, because animal experiments suggest that deafness causes a loss of the spontaneous discharge observed in normal hearing (Kiang and Moxon, 1972). Moreover, the vowel stimuli we have used are behaviourally relevant signals, and we demonstrate that the deafened human auditory system likely contains mechanisms in higher brain centres that can decode temporal information that is stochastically facilitated at the electrically stimulated sensory interface.
Considerable evidence exists in normal hearing that the fine-time structure of sounds is critical to performance in many auditory tasks, e.g., for pitch perception (Evans, 1978) and sound localization (McAlpine et al., 2001). It is generally considered, however, that temporal pitch with CI stimulation is limited to frequencies less than 400 Hz (Eddington et al., 1978), although Hochmair-Desoyer et al. (1981) found that some “star” CI listeners can discriminate frequencies up to 2000 Hz. Although some researchers have been cautious in concluding that limits to temporal coding with CIs might be overcome by improved electrical stimulation [e.g., Carlyon (2010)], others have taken the generally low limit of pitch discrimination to mean that CI listeners are innately unable to exploit fine-time structure for speech categorization or recognition [e.g., Wilson et al. (1991)]. Here, we have shown that with only temporal cues, CI listeners can still exploit information represented by F1 and F2, with F1 between 300 and 700 Hz and F2 between 1000 and 2200 Hz, when all intensity and duration cues have been removed. Moreover, we have shown that noise can enhance F1 and F2 cues. It may be that greater levels of intrinsic noise in some CI users extend the pitch saturation limit, and the addition of noise might extend the pitch limit for more typical users as well as increasing the information transmission for speech. It is also possible that noise has no effect on the pitch saturation frequency and that the limit is simply not relevant for a timbral task. Neither possibility detracts from our finding that noise can improve the transmission of information in speech.
Acknowledgments
This research was funded by the U.K. Engineering and Physical Sciences Research Council. We thank the participants and acknowledge technical assistance from Waldo Nogueira and Paddy Boyle (Advanced Bionics Corporation) for implementation of the CI research interface. We also thank Terry Nunn (St Thomas' Hospital, London), Huw Cooper, Claire Fielden, and Alison Riley (University Hospital Birmingham) for participant recruitment, and Nigel Stocks for discussions about stochastic facilitation.
See supplementary material at https://www.scitation.org/doi/suppl/10.1121/10.0010071 for additional methods, participant demographics, and individual participant data.