The purpose of this study was to investigate how English-native (EN) and Chinese-native (CN) listeners differ in their use of contextual cues to perceive English speech in quiet and in four-talker babble. Three types of sentences served as speech stimuli: high predictability (semantic and syntactic cues), low predictability (syntactic cues only), and zero predictability. Results showed that CN listeners relied primarily on semantic information when perceiving speech, whereas EN listeners used semantic and syntactic cues more equally. Moreover, the four-talker babble enlarged the group difference similarly across the three types of sentences, indicating that the extent to which noise increases non-native listeners' difficulty relative to native listeners depends on the speech materials.

It is well known that native listeners outperform non-native listeners in speech recognition under certain listening conditions, particularly in noise (Lecumberri and Cooke, 2006). This native advantage has been found across a variety of speech materials, including phonemes (Cutler et al., 2004), words (Bradlow and Pisoni, 1999), and sentences (Mayo et al., 1997). Rogers et al. (2006) reported that Spanish-English bilinguals who had learned English before the age of six and spoke English without a noticeable foreign accent may still have greater difficulty recognizing words in noisy or reverberant listening conditions than English monolingual speakers. These studies indicate that background noise substantially enlarges the speech-recognition gap between native and non-native listeners. Unlike phonetic perception, which primarily involves acoustic-phonetic cues, the perception of running speech such as sentences is affected by a greater number of factors, including acoustic-phonetic, phonological, prosodic, and contextual cues. Because many previous cross-linguistic studies on sentence recognition have focused on overall contextual effects, the primary goal of this study was to investigate how specific contextual cues, i.e., semantic and syntactic information, influence English sentence recognition in quiet and in noise for English- and Chinese-native listeners.

Boothroyd and Nittrouer (1988) examined the contextual effect on sentence recognition using sentences with different levels of predictability and found that semantic constraints had a significant impact on native English listeners' speech recognition in noise. Mayo et al. (1997) reported that English monolingual speakers and English-Spanish early and late bilingual speakers all showed some degree of context dependency under noisy conditions. However, compared with late bilinguals who learned English after the age of 14, English monolinguals and early bilinguals, who had a higher level of English proficiency, derived significantly more benefit from contextual cues in sentence recognition. Bradlow and Alexander (2007) found that non-native listeners could take advantage of semantic information when the acoustic-phonetic cues were sufficiently clear to them. When the context was quite limited, e.g., for phonemes and nonsense words, non-native listeners sometimes allowed top-down strategies to prevail over bottom-up strategies (Field, 2004). Moreover, two recent studies in our laboratories showed that in quiet, Chinese-native listeners' recognition of English sentences with high contextual cues (e.g., HINT sentences; Jin and Liu, 2012) was comparable to that of English-native listeners, whereas their recognition of English sentences with low contextual cues (e.g., IEEE sentences) was markedly worse than that of their English-native counterparts (Guan et al., 2015). Therefore, in the present study we proposed that Chinese-native (CN) listeners might rely on contextual cues to perceive English speech, especially in noise, more than English-native (EN) listeners.

For speech segmentation, Sanders et al. (2002) reported that non-native English listeners employed syntactic cues less effectively than native English listeners. Therefore, in this study we hypothesized that non-native listeners were more likely to rely on top-down strategies, e.g., semantic and/or syntactic information, to perceive sentences. Thus, when semantic or syntactic information was reduced, speech recognition was expected to drop more markedly for non-native listeners than for native listeners. In particular, as contextual cues were progressively reduced, forcing listeners to rely more on acoustic-phonetic cues, the noise-induced enlargement of the group difference was hypothesized to decrease.

Overall, two experimental factors, contextual cues (semantic and syntactic cues) and listening condition (quiet and noise), were selected to investigate their effects on speech recognition for English- and Chinese-native listeners. Three types of sentences, with high, low, and zero predictability, were used as stimuli to manipulate the semantic and syntactic information, while multi-talker babble served as the background noise. Four-talker babble was selected because it usually generates both energetic and informational masking, posing a greater challenge than babble with fewer talkers (e.g., one or two talkers) for both English (Simpson and Cooke, 2005; Humes et al., 2017) and Mandarin Chinese speech recognition (Guan and Liu, 2016).

A group of 20 Chinese-native (CN) listeners and a group of 20 American English-native (EN) listeners participated in this study. All participants were between 18 and 30 years old and had normal hearing. All CN listeners had been living in the United States for less than three years and had Test of English as a Foreign Language (TOEFL) scores of at least 80 on the Internet-based test.

The speech materials in this study were the same as the sentence materials used by Boothroyd and Nittrouer (1988). Three types of sentences were used: (a) zero predictability (ZP) sentences, consisting of random sequences of words (e.g., “girls white car blink”); (b) low predictability (LP) sentences that were syntactically appropriate but semantically anomalous (e.g., “ducks eat old tape”); and (c) high predictability (HP) sentences that were both syntactically and semantically appropriate (e.g., “most birds can fly”). All sentences consisted of four monosyllabic words. A total of 20 HP, 20 LP, and 40 ZP sentences, recorded by a young adult female EN speaker, served as speech stimuli.

Listeners were seated in a sound-treated IAC booth. Speech stimuli were presented to the participant's right ear through a SONY 7506 headphone. There were three listening conditions for each set of sentences (ZP, LP, and HP): quiet and four-talker babble at two signal-to-noise ratios (SNRs). The four-talker babble was generated by mixing recordings of two female and two male EN speakers reading materials from The New Children's Encyclopedia (Lock, 2009), after calibrating the recordings to the same root-mean-square (RMS) level. A total of nine experimental conditions (three listening conditions × three types of sentences) were presented in randomized order for each listener. In each of the three listening conditions, six sentences randomly selected from each sentence type were presented, for a total of 18 sentences (6 HP, 6 LP, and 6 ZP). For a given listener, each sentence was presented only once and was not repeated in any other experimental condition. Considering the relatively large individual variability across listeners, particularly non-native listeners, each listener was tested in all nine experimental conditions, although the ZP sentences were composed of words drawn from the LP and HP sentences. Before the start of each experimental condition, the sentence type for that condition was shown to listeners on an LCD screen.
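As an illustration of the babble construction described above, the following minimal sketch equalizes four talker recordings to a common RMS level and sums them. It assumes equal-length numpy arrays (one per talker); the function names and the target RMS value are hypothetical and for illustration only, not the exact processing used in the study.

import numpy as np

def rms(x):
    # Root-mean-square level of a waveform
    return np.sqrt(np.mean(x ** 2))

def make_four_talker_babble(talker_waveforms, target_rms=0.05):
    # Scale each talker recording to the same RMS level, then sum.
    # talker_waveforms: list of four equal-length numpy arrays (one per talker).
    equalized = [w * (target_rms / rms(w)) for w in talker_waveforms]
    return np.sum(equalized, axis=0)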

Speech stimuli were presented at 70 dB sound pressure level (SPL) in quiet and at 65 and 70 dB SPL in babble; with the babble fixed at 70 dB SPL, these levels yielded SNRs of −5 and 0 dB, respectively. The sound pressure levels of speech and babble were calibrated at the output of the SONY 7506 headphones via an AEC201-A IEC 60318-1 ear simulator with a Larson-Davis sound level meter (model 2800, Larson Davis, Depew, NY) using linear weighting. In the noise conditions, each sentence was presented within a 3-s segment of four-talker babble randomly selected from the 30-s recording. Listeners were required to type what they heard into a text box on an LCD screen within 60 s after the presentation of each sentence. Listeners were instructed to type their response as soon as possible and to guess if they were not sure. Identification scores were evaluated offline. Since each experimental condition included six sentences, resulting in 24 target words, the percent-correct score for each condition was calculated as the number of correctly identified target words divided by 24, expressed as a percentage.
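For clarity, the level and scoring computations described above can be summarized as follows (a restatement of the arithmetic in the text, not an additional analysis):

\[ \mathrm{SNR} = L_{\text{speech}} - L_{\text{babble}}: \quad 65 - 70 = -5~\text{dB}, \qquad 70 - 70 = 0~\text{dB}, \]
\[ \text{Percent correct} = 100 \times \frac{\text{number of correctly identified target words}}{24}. \]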

A three-factor (sentence predictability × listening condition × listener group) analysis of variance (ANOVA) was conducted, with the percent-correct sentence recognition score as the dependent variable. As shown in Fig. 1, the main effects of listener group [F(1, 38) = 95.692, p < 0.001], sentence predictability [F(2, 76) = 21.564, p < 0.001], and listening condition [F(2, 76) = 662.568, p < 0.001] were significant. There were significant two-way interactions between sentence predictability and listener group [F(2, 76) = 3.611, p = 0.032] and between listening condition and listener group [F(2, 76) = 10.373, p < 0.001]. The two-way interaction between sentence predictability and listening condition and the three-way interaction were not significant (both p's > 0.05).

Fig. 1. English-native and Chinese-native listeners' sentence recognition in percentage correct at the three listening conditions (three panels: quiet and SNRs of 0 and −5 dB) for the three types of sentences: HP (left), LP (middle), ZP (right).

To further examine the interaction between sentence predictability and listener group, pairwise comparisons between the two listener groups were conducted for each sentence type, with the alpha level adjusted to 0.017 (0.05/3). EN listeners performed significantly better than CN listeners for all three types of sentences (HP, LP, and ZP: all p < 0.001), and the significant interaction indicated that the size of the EN-CN difference varied across sentence predictability levels. In addition, the effect of sentence type was examined for each listener group with a one-factor ANOVA, with the alpha level adjusted to 0.025 (0.05/2). CN listeners performed better on HP sentences than on LP and ZP sentences (both p < 0.001); however, no significant difference was found between LP and ZP sentences (p > 0.05). These results showed that the removal of semantic cues had a significant impact on CN listeners. EN listeners performed significantly better on HP sentences than on ZP sentences (p = 0.003). No significant differences were found between HP and LP sentences or between ZP and LP sentences (both p's > 0.05).

The interaction between listening condition and listener group was further examined with pairwise comparisons. At every listening condition (quiet and SNRs of 0 and −5 dB), EN listeners performed significantly better than CN listeners (all p < 0.001). Thus, the significant interaction between listening condition and listener group indicated a greater noise effect on CN listeners.

The purpose of this study was to assess how semantic and syntactic cues affected native and non-native English listeners' English sentence recognition and, in particular, whether the group difference in the contextual effect, if any, was influenced by listening condition. Although previous studies have investigated semantic effects in depth, few have examined the effects of both semantic and syntactic cues. In many studies, semantic-contextual cues were manipulated by using two types of sentences, in which the keywords were either highly or hardly predictable from the context (Bradlow and Alexander, 2007; Kalikow et al., 1977; Mayo et al., 1997). In this study, EN and CN listeners listened to three types of sentences: high predictability (HP) with both semantic and syntactic information, low predictability (LP) with only syntactic information, and zero predictability (ZP) with neither semantic nor syntactic information (Boothroyd and Nittrouer, 1988).

The results of this study showed that non-native English listeners were affected primarily by semantic cues, consistent with previous studies. This is reflected in the finding that CN listeners' performance was significantly reduced when semantic information was removed from the sentences (i.e., LP and ZP compared with HP sentences). This result supports Field's (2004) argument that non-native listeners are not confident in their ability to process L2 speech using a bottom-up approach. Therefore, when semantic information as a top-down cue is missing, CN listeners' English perception decreases significantly more than that of EN listeners.

On the other hand, syntactic information did not make a difference in CN listeners' speech identification, possibly because of their limited ability to process speech when semantic cues were minimized. In fact, in the quiet condition, the difference between EN and CN listeners was much larger for the recognition of English sentences with low predictability (Guan et al., 2015) than for the recognition of English sentences with high predictability (Jin and Liu, 2012); however, CN listeners' scores for English phonemic perception (Mi et al., 2013) and for sentence recognition with low predictability were comparable.

The results of the current study and the studies noted above indicate that CN listeners rely on semantic cues more heavily than on syntactic and acoustic-phonetic cues to perceive English speech. EN listeners were also affected by contextual cues, as shown by the significant performance difference between HP and ZP sentences; however, it appeared that EN listeners used semantic and syntactic cues with relatively equal weight, i.e., there was no significant difference between LP sentences and either HP or ZP sentences. Overall, these results indicate that CN listeners used semantic cues more effectively than syntactic and/or acoustic-phonetic cues in English speech processing, while EN listeners weighted these cues more equally in speech recognition.

Another purpose of this study was to investigate whether the group difference in the contextual effect became larger from quiet to noisy conditions. In this study, the results suggest that noise enlarged the overall difference between EN and CN listeners (i.e., the native advantage) similarly across the three types of sentences (see Fig. 2). This finding was partially consistent with a previous study that used the hearing in noise test (HINT) as test material: in that study, the recognition difference between EN and CN listeners was markedly enlarged by noise (Jin and Liu, 2012) for HINT sentences, which contain rich linguistic and contextual cues, like the HP sentences of this study. Together, these findings suggest that the noise effect on the group difference depends on the speech materials.

Fig. 2. Boxplots of the native advantage (i.e., the difference between English-native and Chinese-native listeners in sentence recognition) for the three types of sentences (HP, LP, and ZP) at the three listening conditions (quiet and SNRs of 0 and −5 dB).

The four-talker babble in this study produced both energetic and informational masking, either of which may have contributed to the group difference in sentence recognition being larger in noise than in quiet. However, based on previous studies in our laboratories, L2 learners, particularly those with medium and high English proficiency, were more negatively affected by the energetic masking of babble but experienced similar amounts of informational masking in vowel identification (Mi et al., 2013; Xu et al., 2018). The roles of energetic and informational masking in babble on sentence recognition for native and non-native listeners need further investigation, including the effects of the number of talkers in the babble, the speech content of the babble (e.g., the same as or different from the target speech), and the language of the babble.

On the other hand, it was hypothesized that noise would enlarge the group difference most for the HP sentences, because they contained rich semantic cues that could be most affected by noise. However, the results suggest that the noise effect on the difference between EN and CN identification scores did not differ across the three types of sentences. This differs from the findings of previous studies, which showed that as contextual cues were reduced (e.g., from high to low predictability), the noise-induced enlargement of the group difference decreased (Jin and Liu, 2012; Guan et al., 2015).

One possibility for this discrepancy might be the different speech materials used in these studies. As sentence complexity increased (e.g., lower predictability and/or longer sentences), the EN-CN difference in quiet became larger (e.g., above 40%), such that when the group difference was already large in quiet, the addition of noise may not further enlarge the native advantage. Thus, the difference between native and non-native groups in quiet may be a key factor in determining whether that difference is enlarged in noise, and the group difference in quiet depends heavily on the speech materials.

Several features of the speech materials may account for the relatively small group difference in quiet in the present study. First, the sentences contained limited contextual cues: since every sentence consisted of only four words, even the HP sentences may not contain as much contextual, particularly semantic, information as HINT sentences. Second, for the LP and ZP sentences, although contextual cues were reduced, lexical cues were still available. Third, the four words in each sentence were all monosyllabic, whereas the IEEE sentences (Guan et al., 2015) comprised five keywords per sentence with more complicated syllabic structures (e.g., multi-syllabic words).

As native speakers outperform non-native speakers in auditory memory tasks (Olsthoorn et al., 2014), the smaller number of syllables in the LP and ZP sentences of this study compared with the IEEE sentences may have reduced the demand on auditory memory and therefore resulted in an easier perceptual task. As a result, the EN-CN difference in quiet was relatively small (less than 20%; see Fig. 2) owing to the short sentence length, simple syllabic structures, and availability of lexical cues, even for the LP and ZP sentences, and the noise impact on CN and EN listeners was similar across the three types of sentences. One may speculate that as sentences become more complex, e.g., including multi-syllabic words and/or nonsense words, the noise impact on the EN-CN difference will be smaller; such speculation needs further examination.

In summary, for non-native English listeners, the semantic cues in English sentences have a substantial impact on word recognition, possibly due to these listeners' limited competence in using bottom-up strategies to comprehend speech. Although the overall group difference between native and non-native listeners increased from quiet to noise, this noise-induced increase was similar across the three sentence types with different contextual cues, suggesting that whether and how much background noise enlarges the gap between native and non-native listeners depends on the speech materials.

This study was supported by the National Natural Science Foundation of China (Grant No. 31628009). The authors also thank Michelle Schoenecker for her English editing services.

1. Boothroyd, A., and Nittrouer, S. (1988). "Mathematical treatment of context effects in phoneme and word recognition," J. Acoust. Soc. Am. 84(1), 101–114.
2. Bradlow, A. R., and Alexander, J. A. (2007). "Semantic and phonetic enhancements for speech-in-noise recognition by native and non-native listeners," J. Acoust. Soc. Am. 121(4), 2339–2349.
3. Bradlow, A. R., and Pisoni, D. B. (1999). "Recognition of spoken words by native and non-native listeners: Talker-, listener-, and item-related factors," J. Acoust. Soc. Am. 106(4), 2074–2085.
4. Cutler, A., Weber, A., Smits, R., and Cooper, N. (2004). "Patterns of English phoneme confusions by native and non-native listeners," J. Acoust. Soc. Am. 116(6), 3668–3678.
5. Field, J. (2004). "An insight into listeners' problems: Too much bottom-up or too much top-down?," System 32(3), 363–377.
6. Guan, J., and Liu, C. (2016). "Target word identification in noise with formant enhancement for hearing-impaired listeners," J. Acoust. Soc. Am. 140, 3441.
7. Guan, J., Liu, C., Tao, S., Li, M., Wang, W., and Dong, Q. (2015). "Sentence recognition in temporal modulated noise for native and non-native listeners: Effect of language experience," J. Acoust. Soc. Am. 137, 2383.
8. Humes, L., Kidd, G., and Fogerty, D. (2017). "Exploring use of the coordinate response measure in multitalker babble paradigm," J. Speech Lang. Hear. Res. 60, 741–754.
9. Jin, S.-H., and Liu, C. (2012). "English sentence recognition in speech-shaped noise and multi-talker babble for English-, Chinese-, and Korean-native listeners," J. Acoust. Soc. Am. 132(5), EL391–EL397.
10. Kalikow, D. N., Stevens, K. N., and Elliott, L. L. (1977). "Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability," J. Acoust. Soc. Am. 61(5), 1337–1351.
11. Lecumberri, M. L. G., and Cooke, M. (2006). "Effect of masker type on native and non-native consonant perception in noise," J. Acoust. Soc. Am. 119(4), 2445–2454.
12. Lock, D. (2009). The New Children's Encyclopedia (DK Publishing, New York), pp. 8–78.
13. Mayo, L. H., Florentine, M., and Buus, S. (1997). "Age of second-language acquisition and perception of speech in noise," J. Speech Lang. Hear. Res. 40(3), 686.
14. Mi, L., Tao, S., Wang, W., Dong, Q., Jin, S.-H., and Liu, C. (2013). "English vowel identification in long-term speech-shaped noise and multi-talker babble for English and Chinese listeners," J. Acoust. Soc. Am. 133(5), EL391–EL397.
15. Olsthoorn, N. M., Andringa, S., and Hulstijn, J. H. (2014). "Visual and auditory digit-span performance in native and non-native speakers," Int. J. Biling. 18(6), 663–673.
16. Rogers, C. L., Lister, J. J., Febo, D. M., Besing, J. M., and Abrams, H. B. (2006). "Effects of bilingualism, noise, and reverberation on speech perception by listeners with normal hearing," Appl. Psycholing. 27(3), 465–485.
17. Sanders, L. D., Neville, H. J., and Woldorff, M. G. (2002). "Speech segmentation by native and non-native speakers: The use of lexical, syntactic, and stress-pattern cues," J. Speech Lang. Hear. Res. 45(3), 519–530.
18. Simpson, S., and Cooke, M. (2005). "Consonant identification in N-talker babble is a nonmonotonic function of N," J. Acoust. Soc. Am. 118, 2775–2778.
19. Xu, C., Yang, X., Wang, Y., Zhang, H., Ding, H., and Liu, C. (2018). "Informational masking of six-talker babble on Mandarin Chinese vowel and tone identification: Comparison between native Chinese and Korean listeners," Stud. Psychol. Behav. 16, 22–30.