Older people often complain of difficulties in understanding speech in noisy circumstances. The current study tested the hypothesis that problems segmenting speech may contribute to these difficulties. Segmentation ability was measured in young normal-hearing, older normal-hearing, and older hearing-impaired listeners. Listeners were presented with sentences in competing speech, and the resultant misperceptions were analyzed in terms of their accordance with the metrical segmentation strategy. While strong support for this strategy was found, no difference in its use emerged across the three listener groups, suggesting that older listeners are unlikely to be experiencing segmentation difficulties at the sub-lexical level.

Many people report that as they get older they have increasing trouble understanding speech in noisy or multiple-talker environments. As simply amplifying quiet sounds does not fully resolve these problems (Plomp, 1986), they cannot be explained solely as a loss of sensitivity. Often listeners describe what they hear as “speech-like” but have problems reporting the individual words. We hypothesized that this may be the result of difficulties in determining where words start and stop: Accordingly, the present study tested the idea that older or hearing-impaired adults may have problems in segmenting continuous speech.

There has been considerable research on the strategies used to segment continuous speech, which has shown that the choice of a particular strategy may depend on the listening environment (Mattys, 2004). In optimal listening conditions lexical information is rich and dominates segmentation, but in more impoverished conditions, where lexical information has become degraded, listeners must rely more heavily on sub-lexical cues to guide segmentation (Mattys et al., 2005). If listeners have increasing difficulty utilizing these “fall-back” cues as they get older, this could result in difficulties segmenting, and therefore understanding, speech in noisy environments.

Metrical prosody is one such sub-lexical cue. In English, a large proportion of content words begin with a stressed syllable, so using stressed syllables to guide segmentation would be advantageous. This has been termed the “Metrical Segmentation Strategy” or “MSS” (Cutler and Norris, 1988). Several studies have provided support for the use of stress-based segmentation strategies in degraded listening conditions. For instance, Cutler and Butterfield (1992) found that listeners’ misperceptions of faint speech were often consistent with the MSS, i.e., listeners tended to perceive stressed syllables as word-initial and weak syllables as non-word-initial. This produced a characteristic pattern of segmentation errors: New word boundaries tended to be erroneously inserted before strong, stressed syllables rather than before weak, unstressed syllables, and existing word boundaries tended to be erroneously deleted before weak syllables rather than before strong syllables. Similar results have been found for speech presented in noise (Smith et al., 1989) and poorly produced speech (Liss et al., 1998). Spitzer et al. (2009) showed that cochlear implant users are also able to make use of syllabic stress to help segment speech.

No studies have yet, to our knowledge, examined the possible effect that aging may have on listeners’ ability to segment speech, and in particular on their use of metrical prosody. Reducing the contrast between strong and weak syllables has been shown to decrease adherence to the MSS (Spitzer et al., 2007), and it is possible that in older listeners age-related changes to the auditory system may lead to an equivalent reduction in syllabic contrast. Older listeners have been shown to have poorer temporal processing (Schneider, 1997) and intensity discrimination (He et al., 1998), and these changes may make it harder to detect the variations in duration and intensity that are believed to be important for the detection of prosodic stress (Kochanski and Orphanidou, 2008).

The current study used two complementary approaches to consider whether older listeners’ problems in understanding speech in multiple-talker backgrounds are the result of segmentation difficulties. We tested older listeners both with and without hearing impairment, as well as a control group of younger listeners. First, we used a misperception test with the same sentence stimuli as Cutler and Butterfield (1992) to measure speech segmentation in a background of competing speech as a function of target-to-masker ratio (TMR). If older listeners have problems extracting prosodic information from speech presented in background speech, we would expect to see weaker adherence to the MSS. Second, we measured the difference between the level at which speech could be detected in background competing speech and the level at which it could be fully identified. If older listeners struggle to extract prosodic cues, they would need a greater increase in speech level to successfully segment and identify speech than to simply detect its presence.

Forty-one listeners participated: 16 young adults with normal hearing, 13 older adults with normal hearing, and 12 older adults with hearing impairment. All listeners were native speakers of British English. The younger adults ranged between 20 and 31 years old (mean = 24) and all had self-reported normal hearing. The older adults ranged between 56 and 65 years old (older NH mean = 59, older HI mean = 62). Their hearing levels were assessed using pure-tone audiometry, with a four-frequency average hearing threshold greater than 25 dB HL taken as the criterion for hearing impairment. The average loss was 33 dB for the older hearing-impaired group and 13 dB for the older normal-hearing group.

The experiment consisted of two parts: a boundary misperception test, carried out to assess listeners’ use of prosodic cues in speech segmentation, and a detection test carried out to measure the level at which listeners could just detect target speech in a multiple-talker background.

In the boundary misperception test the stimuli were target sentences presented in a two-talker masker. The target sentences were the same recordings of the 48 six-syllable sentences used by Cutler and Butterfield (1992), spoken by a native British-English male. The sentences all had an alternating rhythm of strong (S) and weak (W) syllables. Half of the sentences had a SWSWSW rhythm (e.g., “BETter BUDget SYStem” and “ANgels PINNED beNEATH it;” strong syllables are capitalized here in place of the original underlining) while the remaining half had a WSWSWS rhythm (e.g., “withIN reVIEWED reSULTS” and “deBATES are GRIM reLIEF”). The two-talker masker consisted of two sentence streams, spoken by a British-English female, combined together. Each sentence stream was generated by joining five BKB sentences together with the gaps removed (Bench and Bamford, 1990). A set of 3-s sections was taken at random positions in each stream, 30-ms gates were applied at the onsets and offsets of these sections, and then the two streams were added together to create the masker. The target sentences lasted 1.5 s on average and were added to the middle of the 3-s masker.
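
The gating and summing steps in the masker construction can be sketched as follows. This is a minimal illustration rather than the authors' code: `gate` and `mix` are hypothetical names, and raised-cosine ramps are assumed for the 30-ms gates.

```python
import math

def gate(samples, fs, ramp_ms=30.0):
    """Apply raised-cosine onset/offset ramps (30-ms gates assumed)."""
    n = int(fs * ramp_ms / 1000.0)                    # ramp length in samples
    out = list(samples)
    for i in range(n):
        w = 0.5 * (1.0 - math.cos(math.pi * i / n))   # weight rises 0 -> 1
        out[i] *= w                                   # fade in at the onset
        out[-1 - i] *= w                              # fade out at the offset
    return out

def mix(a, b):
    """Sum two equal-length streams sample-by-sample to form the masker."""
    return [x + y for x, y in zip(a, b)]

# Gate two sections (short dummy signals here) and add them together.
masker = mix(gate([1.0] * 100, fs=1000), gate([1.0] * 100, fs=1000))
```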

In the detection test the stimuli were two intervals of the two-talker masker, each a 1000-ms segment of the masker described above. On each trial, one of the masker intervals contained a short burst of target speech (a 500-ms segment of one of the Cutler and Butterfield sentences, chosen randomly anew on each trial) placed in the middle of the masker. All stimuli in the study were presented using an EDIROL Audio Capture UA-5 USB audio interface (D/A converter) and Sennheiser HD580 headphones. The maskers were presented at an average level of 70 dB.

In the boundary misperception test, listeners were played target sentences in the two-talker masker. All three groups of listeners were initially played a target sentence at a TMR of −15 dB (chosen as it was considerably below threshold for all listeners) and asked if they could identify any words spoken by the male voice. The experimenter recorded all responses made by the listener. The TMR was progressively increased, in 3 dB steps, on subsequent trials until all the words in the sentence could be successfully identified; this level was taken as the “identification threshold” for the sentence. After correct identification the next sentence was played, again starting at the lowest TMR of −15 dB. Listeners heard 48 sentences in total: 8 practice sentences to familiarize themselves with the task (sentences 6, 11, 12, 25, 39, 40, 41 and 43 in Cutler and Butterfield, 1992), and then 40 test sentences.
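
The ascending procedure for finding a sentence's identification threshold can be sketched as follows (an illustrative sketch only; `identifies_all`, standing for a listener correctly reporting every word at a given TMR, and the function name are hypothetical):

```python
def identification_threshold(identifies_all, start_tmr=-15.0, step=3.0):
    """Raise the TMR in 3 dB steps from -15 dB until the listener reports
    every word in the sentence correctly; that TMR is the sentence's
    identification threshold."""
    tmr = start_tmr
    while not identifies_all(tmr):
        tmr += step
    return tmr

# Idealized listener who needs at least 0 dB TMR to report all words.
threshold = identification_threshold(lambda tmr: tmr >= 0.0)  # -> 0.0
```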

The detection test was a two-interval, two-alternative forced-choice task. Two randomly interleaved adaptive tracks were used, each following a three-down, one-up rule (Levitt, 1971). Each track began at a random TMR between 5 and 10 dB (a level at which the target speech could be clearly identified). The TMR decreased by 5 dB after every three consecutive correct responses and increased by 5 dB after every wrong response. After three reversals of direction the step size was reduced to 2 dB, and each track terminated after five reversals. The threshold for each track was calculated by averaging the TMRs of the last two reversals. Listeners completed two blocks of the detection test (one before and one after the boundary misperception test), giving four thresholds in all, which were averaged.
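
The three-down, one-up track can be sketched as follows. This is a minimal sketch of the procedure as described, not the authors' code; `respond` and `run_staircase` are illustrative names.

```python
def run_staircase(respond, start_tmr, big_step=5.0, small_step=2.0):
    """One three-down, one-up adaptive track (Levitt, 1971).

    `respond(tmr)` returns True for a correct 2AFC response.  The TMR
    drops after three consecutive correct responses and rises after any
    error; after three reversals the 5 dB step shrinks to 2 dB, and the
    track stops after five reversals.  The threshold is the mean TMR of
    the last two reversals.
    """
    tmr = start_tmr
    correct_run = 0
    direction = -1                      # the track starts by descending
    reversals = []
    while len(reversals) < 5:
        step = big_step if len(reversals) < 3 else small_step
        if respond(tmr):
            correct_run += 1
            if correct_run == 3:        # three in a row -> make it harder
                correct_run = 0
                if direction == +1:     # was going up: record a reversal
                    reversals.append(tmr)
                direction = -1
                tmr -= step
        else:                           # any error -> make it easier
            correct_run = 0
            if direction == -1:
                reversals.append(tmr)
            direction = +1
            tmr += step
    return sum(reversals[-2:]) / 2.0

# Idealized listener: always correct above -20 dB TMR, always wrong below.
threshold = run_staircase(lambda tmr: tmr > -20.0, start_tmr=7.5)
```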

Listeners’ responses from the misperception test were analyzed for segmentation errors. Partial responses were included if the listeners were able to indicate where in the sentence they thought the word had occurred. Errors were classified into four types: insertions before weak syllables (IW; e.g., rooster reported as roost to); insertions before strong syllables (IS; e.g., duress reported as to rest); deletions before weak syllables (DW; e.g., expect it reported as expecting); and deletions before strong syllables (DS; e.g., are just reported as adjust). IS and DW errors are consistent with the MSS, whereas IW and DS errors are not. Note that a single misperception can lead to more than one type of boundary error: for example, tender viewers reported as ten reviewers consists of both an IW error and a DS error. As it is unclear which error was made first, all boundary errors in each misperception were counted, as per Cutler and Butterfield (1992).

On average, younger NH listeners made 1.65 boundary errors per sentence, older NH listeners 1.18, and older HI listeners 1.13. Younger listeners appeared more willing to make partial responses than the two older groups, suggesting that the difference in total error counts likely reflects younger listeners making more responses, and therefore accruing more errors, than older listeners.

Figure 1 shows the mean percentage of each type of error made by the three listener groups (to simplify and give an overview of the data, the counts have been averaged over all TMRs tested). The proportion of errors followed the same pattern for all three groups: IS errors formed the largest proportion, followed by DW, IW, and DS errors. To examine whether this pattern conformed to the MSS, the mean number of MSS errors (i.e., the sum of IS and DW errors) made by each group of listeners was calculated, as was the mean number of non-MSS errors (i.e., the sum of IW and DS errors). All three groups of listeners made far more MSS than non-MSS errors: on average 74%, 73%, and 75% of the errors made by young NH, older NH, and older HI listeners, respectively, were MSS errors, and correspondingly 26%, 27%, and 25% were non-MSS errors. A Pearson’s chi-square test carried out on the frequency of each type of error per group demonstrated no relationship between listener group and the number of MSS or non-MSS errors made [χ2(2) = 0.47, p = 0.79]. As the proportion of errors made by all three listener groups agrees with past findings (Cutler and Butterfield, 1992), this suggests that neither age nor hearing impairment affected listeners’ use of the MSS.
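
The group-by-error-type comparison can be reproduced in outline as follows (a self-contained sketch; the observed counts below are hypothetical, chosen only to mirror the roughly 74/26 MSS split reported for every group):

```python
def chi_square_stat(table):
    """Pearson's chi-square statistic for a contingency table
    (rows = listener groups, columns = MSS vs. non-MSS error counts)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical counts with a ~74/26 MSS split in each listener group;
# near-identical proportions across rows give a statistic close to zero.
observed = [[740, 260], [730, 270], [750, 250]]
stat = chi_square_stat(observed)
```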

FIG. 1.

Percentage of each type of error made by young NH, older NH and older HI listeners.


The top panel of Fig. 2 shows the mean percentage of MSS and non-MSS errors made by the three groups of listeners at each value of TMR, and the bottom panel shows the mean percentage of sentences correctly identified by each group at each TMR [note that the range of TMRs is reduced in the top panel compared to the bottom panel because (a) very few responses, and therefore very few errors, were made at the very lowest TMRs, and (b) at the very highest TMRs the segmentation pattern of responses tended to be correct, again resulting in very few errors]. That the psychometric function for the older HI listeners was displaced rightward from those of the other two groups (bottom panel) indicates that their overall threshold for speech identification was higher. But the clear overlap in the proportion of errors made (top panel) suggests that, despite these threshold differences, all three groups of listeners made the same proportion of errors: approximately 2.7 to 3 times more MSS than non-MSS errors. The proportion of MSS to non-MSS errors also remained relatively consistent across all TMRs, including the lowest TMRs, where very few sentences could be identified correctly. Note that at higher TMRs there was a tendency for errors to become even more biased toward the MSS.

FIG. 2.

The top panel shows the mean percentage of errors at each TMR that can be accounted for by either MSS or non-MSS errors for young NH, older NH and older HI listeners. MSS errors appear at the top of the panel and non-MSS errors at the bottom. The bottom panel shows the mean percentage of sentences correctly identified at each TMR by young NH, older NH and older HI listeners.


Table I reports the mean detection and identification thresholds measured for young NH, older NH, and older HI listeners. A 2 × 3 mixed ANOVA demonstrated a significant effect of type of threshold [F(1,38) = 1552.8, p < 0.001] and a significant effect of listener group [F(2,38) = 34.1, p < 0.001]. These results indicate that detection thresholds were lower than identification thresholds for all groups of listeners, but that both thresholds were higher in the HI listeners than in either of the NH groups (post hoc Tukey, p < 0.001). A significant interaction between threshold type and listener group was also found [F(2,38) = 9.70, p < 0.001], indicating that the difference between detection and identification thresholds was not the same for all three listener groups: It was in fact considerably smaller for the older HI group. An inspection of the means showed a relatively large spread of performance across the NH and HI groups for speech detection (a 10.3 dB difference in detection thresholds between young NH and older HI listeners, and a 9.3 dB difference between older NH and older HI listeners), but this spread was much reduced for speech identification (differences of 4.3 dB and 3.2 dB, respectively). As the greatest difference between these groups was seen for speech detection rather than identification, it is likely that this difference in performance is better explained by the hearing-impaired group's loss of sensitivity than by problems accessing the cues needed to segment speech.

TABLE I.
Mean detection and identification thresholds for young normal-hearing listeners, older normal-hearing listeners, and older hearing-impaired listeners.

Listener group    Detection threshold (dB)    Identification threshold (dB)    Difference (dB)
Young NH          −27.8                       −0.8                             27.0
Older NH          −26.8                        0.3                             27.1
Older HI          −17.5                        3.5                             21.0

Our results show that listeners can extract and use metrical prosody to aid segmentation in competing speech despite age or hearing impairment. This provides strong support for the robust nature of metrical cues, demonstrating their use by older and hearing-impaired listeners at target-to-masker ratios well below those needed for full identification of the words in a sentence. Nevertheless, the older listeners tested were a relatively young older group, and this may be why no effect of age was seen. Studying the segmentation strategies of a more elderly group of listeners may therefore be of future interest. It seems unlikely, however, that any dramatic differences would emerge for an even older group, as no decline at all was seen in the current listeners' ability to make use of metrical cues.

We would like to thank Anne Cutler (Max Planck Institute for Psycholinguistics, Nijmegen) and Sally Butterfield (MRC Cognition and Brain Sciences Unit, Cambridge) for kindly lending us the recordings of the sentences. A.W. was funded by a Ph.D. studentship from the Medical Research Council, co-hosted by the Department of Psychology, University of Strathclyde. The Scottish Section of the IHR is supported by intramural funding from the Medical Research Council and the Chief Scientist Office of the Scottish Government.

1. Bench, J., and Bamford, J. (1990). Speech-Hearing Tests and the Spoken Language of Hearing Impaired Children (Academic, London).
2. Cutler, A., and Butterfield, S. (1992). “Rhythmic cues to speech segmentation: Evidence from juncture misperception,” J. Mem. Lang. 31, 218–236.
3. Cutler, A., and Norris, D. (1988). “The role of strong syllables in segmentation for lexical access,” J. Exp. Psychol. Hum. Percept. Perform. 14, 113–121.
4. He, N., Dubno, J. R., and Mills, J. H. (1998). “Frequency and intensity discrimination measured in a maximum-likelihood procedure from young and aged normal-hearing subjects,” J. Acoust. Soc. Am. 103, 553–565.
5. Kochanski, G., and Orphanidou, C. (2008). “What marks the beat of speech,” J. Acoust. Soc. Am. 123, 2780–2791.
6. Levitt, H. (1971). “Transformed up-down methods in psychoacoustics,” J. Acoust. Soc. Am. 49, 467–477.
7. Liss, J. M., Spitzer, S., Caviness, J. N., Adler, C., and Edwards, B. (1998). “Syllabic strength and lexical boundary decisions in the perception of hypokinetic dysarthric speech,” J. Acoust. Soc. Am. 104, 2457–2466.
8. Mattys, S. L. (2004). “Stress versus coarticulation: Toward an integrated approach to explicit speech segmentation,” J. Exp. Psychol. Hum. Percept. Perform. 30, 397–408.
9. Mattys, S. L., White, L., and Melhorn, J. F. (2005). “Integration of multiple speech segmentation cues: A hierarchical framework,” J. Exp. Psychol. Gen. 134, 477–500.
10. Plomp, R. (1986). “A signal-to-noise ratio model for the speech-reception threshold of the hearing impaired,” J. Speech Hear. Res. 29, 146–154.
11. Schneider, B. A. (1997). “Psychoacoustics and aging: Implications for everyday listening,” J. Speech Lang. Pathol. Audiol. 21, 111–124.
12. Smith, M. R., Cutler, A., Butterfield, S., and Nimmo-Smith, I. (1989). “The perception of rhythm and word boundaries in noise-masked speech,” J. Speech Hear. Res. 32, 912–920.
13. Spitzer, M. R., Liss, J. M., Spahr, T., Dorman, M., and Lansford, K. (2009). “The use of fundamental frequency for lexical segmentation in listeners with cochlear implants,” J. Acoust. Soc. Am. 125, EL236–EL241.
14. Spitzer, S. M., Liss, J. M., and Mattys, S. L. (2007). “Acoustic cues to lexical segmentation: A study of resynthesized speech,” J. Acoust. Soc. Am. 122, 3678–3687.