Adding frequency modulation (FM) cues to vocoded (AM) speech aids speech recognition for younger listeners. However, this may not hold for older listeners, since they have poorer FM detection thresholds. We measured FM detection thresholds of younger and older adults and, in a sentence context, examined whether adding FM cues to vocoded speech would assist older adults. Younger and older participants were presented with vocoded sentences in quiet and in multitalker babble, with and without FM cues. Older adults had elevated FM detection thresholds but received the same-sized FM benefit as younger adults, showing that they have the capacity to benefit from FM speech cues.
1. Introduction
Speech information is transmitted by a rich and complex signal. Surprisingly, a high level of speech recognition in quiet can be achieved from very sparse representations of the speech signal. For example, cochlear implants (CIs) in general only transmit temporal information from amplitude modulations (i.e., AM cues), and CI users are able to successfully recognize speech in quiet, a result that demonstrates the role of temporal information in speech processing. The extent to which temporal information is required for speech recognition has been examined by simulating cochlear implant hearing in normal-hearing participants using vocoded speech. Note that cochlear implant processing strategies are similar to vocoding.1 Noise-excited vocoded speech consists of band-pass filtered noise that has been shaped by the temporal envelope extracted from a small number of frequency bands. Shannon et al.2 tested listeners' recognition of consonants, vowels, and sentences derived from 1 to 4 frequency bands in quiet. They found that high levels of speech recognition could be achieved from the amplitude-modulated (AM) cues extracted from as few as three frequency bands. Xu and Zheng3 also reported plateau-level vowel and consonant recognition in quiet when spectral cues were extracted from 12 to 24 frequency bands.
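The noise-excited vocoding described above can be sketched as follows. This is an illustrative reconstruction, not the exact processing of Refs. 1–3: the filter orders, band edges, and the 160-Hz envelope cutoff are assumptions chosen for clarity.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noise_vocode(x, fs, n_bands=4, lo=100.0, hi=5000.0, env_cutoff=160.0):
    """Replace each band's fine structure with noise shaped by that band's envelope."""
    edges = np.geomspace(lo, hi, n_bands + 1)          # log-spaced band edges
    env_sos = butter(2, env_cutoff, btype="low", fs=fs, output="sos")
    noise = np.random.default_rng(0).standard_normal(len(x))  # noise carrier
    out = np.zeros_like(x)
    for f1, f2 in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [f1, f2], btype="band", fs=fs, output="sos")
        band = sosfilt(band_sos, x)
        env = sosfilt(env_sos, np.abs(hilbert(band)))  # smoothed AM envelope
        env = np.clip(env, 0.0, None)
        carrier = sosfilt(band_sos, noise)             # band-limited noise
        out += env * carrier                           # sum across bands
    return out
```

With only a few bands, the output preserves the temporal envelope of each band while discarding spectral fine structure, which is the manipulation the cited studies exploit.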
Subsequent research has shown that AM cues are much less effective for speech recognition in background noise,4,5 with more frequency bands (12–24) needed for plateau-level vowel recognition, and the same is the case in CI listening.6 However, this disruptive effect of noise on perceiving vocoded speech can be partly offset when the AM signal is combined with slowly varying temporal fine structure (frequency modulation, FM) information, derived by band-limiting the FM rate to below 400 Hz.5,7 Such results suggest that adding FM information to cochlear implant processing has the potential to improve CI listeners' speech perception. However, as research investigating the contribution of adding FM to vocoded speech has only been conducted with younger adults, it is not clear how well older adults might perceive AM and AM + FM speech in quiet and in noise.
Compared to younger adults, older adults are worse at recognizing vocoded speech.8,9 Sheldon et al.8 found that older adults correctly identified fewer monosyllabic words from speech vocoded from eight frequency bands than younger adults. This poorer performance was attributed to older adults' degraded temporal resolution. These studies only examined the recognition of short spoken utterances (syllables and words) and only presented the vocoded speech in quiet. It has yet to be determined whether older listeners will show poorer performance with longer continuous speech stimuli (e.g., sentences).
In addition, older adults may not obtain as much benefit from FM cues (i.e., a reduced FM facilitation effect). Studies have shown that the ability to detect frequency modulations degrades with advancing age; that is, older adults (>60 years) have poorer FM detection ability (elevated FM detection thresholds) compared with younger adults.10,11 For example, He et al.10 showed that older adults with normal hearing had impaired FM detection for both sinusoidal and quasitrapezoidal carrier frequencies. Sheft et al.11 also investigated age differences by examining FM detection thresholds for 1000-Hz pure tones, frequency modulated at 5 Hz, with or without continuous speech babble (16 dB SNR). They showed that older adults had elevated FM detection thresholds, which correlated significantly with poorer speech recognition scores in noise. If older adults are less sensitive to changes in FM signals, then they may not be able to use the FM cues present in speech as effectively as younger adults.
The primary aim of the current study was to determine whether older adults have the capacity to benefit from the addition of FM cues to speech. Understanding how well adults perceive speech with AM and FM cues across the lifespan has important implications for cochlear implant processing strategies. That is, if older adults do not show an FM benefit, this would suggest a more central problem in processing FM cues, a finding that would undermine any effort to deliver FM cues via cochlear implants to older adults. The current study examined older adults' performance in noise, as this is the most challenging condition and so is most likely to reveal any FM facilitation effect. In conducting the study, we took into account the potential floor effect that can arise when vocoded speech is presented in background noise. To avoid floor effects, we used auditory-visual presentation in which the face of the talker could be seen (i.e., visual speech), since visual speech cues enhance the intelligibility of AM speech in noise.11 Note that with younger adults, Kim et al.12 found that visual speech facilitation (an AV effect) was the same size for the AM and AM + FM cue conditions. More recently, Meister et al.13 found that the AV effect was even larger for AM + FM speech than for AM-only speech, although it should be noted that with unfiltered intact speech, older adults show levels of AV facilitation equivalent to those of younger adults.14 Taken together, these findings support the use of visual speech with AM or AM + FM speech to enhance intelligibility and reduce floor effects in noise by boosting performance.
In summary, the current study compared younger and older adults' recognition of sentences filtered to retain AM or AM + FM cues, in quiet and in noise, when visual speech cues were available. Past research has found that older adults are poor at identifying syllables and words from AM cues, and that older adults have higher thresholds for detecting FM differences in non-speech stimuli. Based on these data, it was expected that, compared to younger adults, older adults might show poorer recognition of AM sentences and a reduced FM benefit.
2. Method
2.1 Ethics
The study procedures were approved by the Human Research Ethics Committee of the Western Sydney University. Written informed consent was obtained from each participant prior to the experiment.
2.2 Participants
Twenty native-English-speaking younger adults (mean age: 27.5 years, age range 18–40 years, 10 males, 10 females) and 20 older adults (mean age: 73.1 years, age range 68–78 years, 11 males, 9 females) participated in this experiment. All younger participants had normal hearing sensitivity in both ears [<20 dB hearing level (HL)] at 500, 1000, 2000, 3000, and 4000 Hz. Older adults with hearing thresholds ≤30 dB HL at 500, 1000, and 2000 Hz and ≤50 dB HL at 3000 and 4000 Hz participated in the experiments. All participants had normal or corrected-to-normal vision. Participants from both groups were screened with the Mini-Mental State Examination15 and each scored 26 or greater, indicating no cognitive impairment. Of the 20 participants in each group, 17 older adults (9 females) and 17 younger adults were tested on an FM detection task centered on a carrier frequency of 500 Hz.
2.3 Test materials
Eighty sentences from the 1969-revised IEEE/Harvard list of phonetically balanced sentences were used. The sentences were spoken by a male native Australian English speaker who stood against a uniform gray background, with only the speaker's head and shoulders shown in the video. To create the speech-in-noise condition, a sample from a commercially available multi-talker babble track (Auditec, St. Louis, MO) was used as the competing noise stimulus. This multi-talker babble consisted of three females and one male. The audio sentences and noise were mixed at a −5 dB signal-to-noise ratio (SNR). This level was chosen based on earlier studies using similar speech materials and methods.5,12 The onset of the masking noise (babble) occurred prior to the onset of each sentence and its offset exceeded the sentence offset by 1 s. The FAME algorithm4 was used to extract slowly varying, band-limited AM and FM signals from the sentences mixed with babble noise, using the same parameters as Kim et al.12 To create the AM + FM signal, these signals were then summed (see Ref. 12 for the detailed signal processing procedure). FAME processing was applied after the sentences were mixed with the babble noise. Once the sentences were processed with FAME, they were dubbed back onto the video of the talker using the program VirtualDub. Eight frequency bands were used to ensure a sizeable FM benefit in sentence recognition in quiet and in noise.5
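The SNR mixing and the AM/FM decomposition described above can be sketched as follows. This is an illustrative reconstruction in the spirit of FAME using the analytic signal; the filter orders, the clipping of the FM excursion to the band edges, and the helper names are assumptions, not the published parameters of Ref. 4 or Ref. 12.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def mix_at_snr(speech, noise, snr_db=-5.0):
    """Scale the noise so the speech-to-noise power ratio equals snr_db."""
    p_s, p_n = np.mean(speech ** 2), np.mean(noise ** 2)
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

def am_fm_band(x, fs, f1, f2, fm_cutoff=400.0):
    """Extract the slowly varying AM and band-limited FM of one analysis band."""
    band = sosfilt(butter(4, [f1, f2], btype="band", fs=fs, output="sos"), x)
    analytic = hilbert(band)
    am = np.abs(analytic)                              # temporal envelope (AM)
    phase = np.unwrap(np.angle(analytic))
    inst_f = np.gradient(phase) * fs / (2.0 * np.pi)   # instantaneous frequency
    fc = 0.5 * (f1 + f2)                               # nominal band centre
    lp = butter(2, fm_cutoff, btype="low", fs=fs, output="sos")
    fm = np.clip(sosfilt(lp, inst_f - fc), f1 - fc, f2 - fc)  # limit FM rate/range
    return am, fm, fc

def resynth_band(am, fm, fc, fs):
    """AM + FM resynthesis: envelope applied to a carrier tracking fc + fm."""
    phase = 2.0 * np.pi * np.cumsum(fc + fm) / fs
    return am * np.cos(phase)
```

Summing `resynth_band` outputs across the eight analysis bands yields the AM + FM signal; dropping the `fm` term (setting it to zero) yields the AM-only control.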
2.4 Procedure
A total of 40 sentences were presented together with the video of the talker. There were four conditions with 10 sentences in each: 2 noise levels (quiet vs noise) × 2 cues (AM vs AM + FM). The order of blocks, and of sentences within blocks, was randomized. No sentence was repeated across the four different versions of the randomized sentence lists. The premixed sentences at −5 dB SNR were presented at a comfortable listening level for all participants. Following the presentation of each sentence, participants typed in their responses; real-time visual feedback of the typed characters was provided so that any typos could be corrected online. Only the content words from each sentence were scored, and credit was given only if the typed word matched the original stimulus (with the exception of obvious typos). The percentage of words correctly identified in each condition was determined by dividing the total number of correctly identified words by the total number of content words in that condition. These percentage scores were then transformed into rationalized arcsine units (RAU)16 to normalize the distribution of scores across participants and make them suitable for analysis of variance (ANOVA).
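The rationalized arcsine transform of Ref. 16 can be written compactly. The sketch below assumes X words correct out of N scored words and returns values on the roughly −23 to +123 RAU scale, which is why near-ceiling means above 100 RAU appear in the results.

```python
import math

def rau(x_correct, n_total):
    """Studebaker's rationalized arcsine units (Ref. 16).

    x_correct: number of correctly identified content words.
    n_total:   total number of content words scored.
    """
    theta = (math.asin(math.sqrt(x_correct / (n_total + 1)))
             + math.asin(math.sqrt((x_correct + 1) / (n_total + 1))))
    return (146.0 / math.pi) * theta - 23.0
```

The transform is close to linear in the mid-range (50/100 maps to approximately 50 RAU) but stretches the extremes, stabilizing the variance of proportion scores for ANOVA.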
The FM detection task used a carrier frequency of 500 Hz, with the modulation frequency fixed at 5 Hz and an initial modulation depth (frequency deviation) of 10 Hz. The modulation depth was subsequently varied based on the participant's responses. All FM sounds were 500 ms in duration with cosine-squared rise/fall times of 10 ms, and were presented at a constant level of 75 dB sound pressure level to all participants (adjusted to their hearing status). To determine FM detection thresholds, a 2-down, 1-up adaptive psychophysical procedure was combined with a two-alternative forced-choice (2AFC) task, which estimates the 70.7% point on the psychometric function. The geometric mean of the midpoints of the last six reversals was taken as the FM detection threshold. The measurement was repeated twice for each participant, and the average of the two runs was taken as the final FM detection threshold.
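The stimulus and the adaptive track can be sketched as follows. This is a minimal illustration, not the study's test software: the step factor, the stopping rule of eight reversals, and the simulated listener passed to the staircase are all assumptions made for the sketch.

```python
import numpy as np

def fm_tone(fs=44100, dur=0.5, fc=500.0, fm=5.0, depth=10.0, ramp=0.01):
    """500-ms FM tone: 500-Hz carrier, 5-Hz modulator, cosine-squared ramps."""
    t = np.arange(int(fs * dur)) / fs
    phase = 2 * np.pi * fc * t + (depth / fm) * np.sin(2 * np.pi * fm * t)
    x = np.sin(phase)
    n = int(ramp * fs)
    env = np.ones_like(x)
    env[:n] = np.sin(0.5 * np.pi * np.arange(n) / n) ** 2  # cos^2 onset ramp
    env[-n:] = env[:n][::-1]                                # mirrored offset
    return x * env

def run_staircase(p_correct, start=10.0, step=1.26, n_reversals=8,
                  max_trials=2000, rng=None):
    """2-down/1-up track over FM depth (Hz); converges on ~70.7% correct.

    Returns the geometric mean of the last six reversal points as the
    threshold estimate. `p_correct(depth)` simulates a 2AFC listener.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    depth, run, direction, reversals = start, 0, -1, []
    for _ in range(max_trials):
        if len(reversals) >= n_reversals:
            break
        if rng.random() < p_correct(depth):     # simulated trial outcome
            run += 1
            if run == 2:                        # two correct -> make harder
                run = 0
                if direction == +1:
                    reversals.append(depth)     # up-to-down reversal
                direction = -1
                depth /= step
        else:                                   # one error -> make easier
            run = 0
            if direction == -1:
                reversals.append(depth)         # down-to-up reversal
            direction = +1
            depth *= step
    return float(np.exp(np.mean(np.log(reversals[-6:]))))
```

Taking the geometric (rather than arithmetic) mean of reversals matches the multiplicative step size, which is the conventional pairing for staircases that scale the tracked variable.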
3. Results
Before we report the speech recognition in noise results, it is important to note the results of the FM detection task on 17 participants from each group. The results revealed that the older adults had significantly elevated FM detection thresholds (M = 11.31 Hz, SE = 1.75) compared to the younger adults [M = 2.72 Hz, SE = 0.23; t(30) = −4.84, p < 0.001, r = 0.49].
Mean word recognition scores (RAU transformed) for younger and older adults, in quiet (left panel) and in noise (right panel), are shown in Fig. 1. Mean performance (collapsed across AM and AM + FM) in quiet was near ceiling for both age groups (younger adults, M = 108 RAU, SE = 2.81; older adults, M = 107 RAU, SE = 2.81; based on marginal means). However, adding the multitalker-babble masker to the AM and AM + FM filtered speech drastically reduced speech recognition scores for both groups (younger adults, M = 33 RAU, SE = 4.89; older adults, M = 41 RAU, SE = 4.89).
To determine whether the groups differed across conditions, a mixed ANOVA was conducted with noise level and cue as within-subject factors and age group as a between-subjects factor. Because some of the older adults had hearing loss at higher frequencies, their high-frequency pure-tone average (HFA; thresholds at 2000, 3000, and 4000 Hz) was used as a covariate in the ANOVA analyses. Homogeneity of variance across the comparison groups was assessed with Levene's test. The data were homogeneous for all conditions except the AM + FM quiet condition. Parametric tests are quite robust to such mild violations of homogeneity, so this single violation was considered acceptable.
After controlling for the hearing status of the participants, the ANOVA revealed no significant overall difference between the younger and older adults, F(1,37) = 0.63, p = 0.43, ηp² = 0.01. The HFA scores significantly predicted correct word identification across the age groups, F(1,37) = 6.11, p = 0.018, ηp² = 0.14, indicating that hearing ability might have affected word identification in our participants. There was a significant main effect of noise level, F(1,37) = 31.34, p < 0.001, ηp² = 0.46, with superior performance in the quiet compared to the noise condition. There was also a significant main effect of cue type, F(1,37) = 8.94, p = 0.005, ηp² = 0.20, with higher RAU scores in the AM + FM condition than in the AM-only condition. There was no significant interaction between age and cues, F(1,37) = 0.60, p = 0.45, ηp² = 0.01; between age and noise levels, F(1,37) = 0.79, p = 0.39, ηp² = 0.02; or between noise levels and cues, F(1,37) = 1.62, p = 0.21, ηp² = 0.04. These results suggest two main findings: (1) there was no age difference in identifying AM sentences in quiet or in noise; and (2) the size of the FM benefit to speech recognition was similar for the older and younger adults.
4. Discussion
The aim of the present study was to compare younger and older adults' ability to identify vocoded spoken sentences, and to contrast the size of the intelligibility benefit obtained from the addition of FM cues. The older adults tested in the current study had elevated FM detection thresholds. Nevertheless, the results showed that older and younger adults were equally able to correctly identify vocoded sentences, both in quiet and in noise, and obtained a similar-sized FM benefit, despite the older adults' impaired FM detection ability.
These results contrast with previous findings that older adults have greater problems identifying vocoded syllables and words.8,9 This difference may be due to the current study using sentence materials to test speech perception. This proposal is consistent with two previous findings: (1) that recognition of vocoded sentences (in quiet) is better than that of syllables or words;17 and (2) that older adults can make better use of contextual cues present in spoken sentences (continuous speech) than younger adults.18,19 Previous studies have shown that older adults can generally benefit from contextual cues as much as, or even more than, younger adults when listening in challenging environments.18,19 Because older adults are more experienced listeners, they may be able to use their stored knowledge of relevant speech structures for communication more effectively than younger adults.19 It has also been reported that verbal memory skills predict vocoded speech (phoneme) recognition in middle-aged and older adults.20 Thus, top-down support for sentence processing may compensate for problems that older adults experience in auditory processing. It should be noted, however, that this explanation is speculative and was not directly tested in the current study. Indeed, it may be the case that the current older participants would perform just as well as younger adults on a test of syllable/word recognition.
Another factor that would have aided the perception of vocoded speech in the older participants (although not relative to the younger adults) was that the auditory stimuli were presented together with the video of the speaker. The provision of visual speech would have aided speech recognition in noise. It has been shown that older and younger adults show equivalent auditory-visual speech facilitation.14 Indeed, it has recently been reported that the ability to integrate audio-visual speech information is not affected by age, and that combining auditory and visual speech information leads to a stronger representation of speech in older adults.21 In addition, the older adults in the current study with hearing loss might have used visual speech cues more effectively than their normal-hearing peers. It should be pointed out, however, that since visual speech was presented in both the AM and AM + FM conditions, better use of visual speech by some of the older adults does not explain the FM benefit; indeed, if older adults used visual speech more in the harder AM condition, then the FM benefit for older adults may have been even larger.
An explanation for the similar FM benefit for older and younger adults is the considerable information present in the AM + FM vocoded sentence stimuli. That is, the AM + FM vocoded speech contains acoustic information related to both temporal envelope and temporal fine structure cues for speech recognition. The AM speech envelope conveys suprasegmental and segmental information and, depending on the AM cut-off frequency, may also preserve F0-related periodic modulations.22 The FM (fine structure) cues convey speech information regarding the fundamental frequency and the harmonic structure of the voice. The combination of these cues potentially provides speech information about the rate and rhythm of syllable patterns, as well as gap and duration cues.23 Thus, even though AM and FM processing may each be compromised in the older participants, the combination of cues produces a scaffolding effect that may aid speech perception in noise.
The equivalent FM benefit in both groups could also be due to the fact that the FM detection thresholds (2–12 Hz) were well below the FM rate (400 Hz) used for vocoding the sentences. This possibility should be explored further by measuring FM detection ability at higher FM modulation rates. The results of the current study suggest that older adults are able to benefit from the FM cues inherent in speech. Hence, transmitting FM cues along with AM cues in cochlear implant speech coding strategies, together with the assistance of visual speech cues, may improve speech recognition in older adults.
5. Conclusions
The use of more naturalistic and supportive conditions (sentence stimuli where the talker's face can be seen) provides a positive picture of older adults' ability to process vocoded speech (with and without the addition of FM cues). That is, in both quiet and noise conditions, the recognition scores of older adults were similar to those of younger adults. Further, older adults received a similar-sized FM benefit as younger adults from the addition of FM speech cues, even though they had poorer FM sensitivity (as measured by FM detection thresholds). The use of non-speech FM detection tasks, while providing an assessment of sensory and perceptual functioning, should not be taken as a definitive indication of older adults' ability to use FM speech information when aided by visual speech cues. Speech recognition in noise can be enhanced in cochlear implant listening if both the AM and FM cues are transmitted to the listener.
Acknowledgments
This research was supported by Australian Research Council Discovery Grant No. DP150104600.