Flattening the fundamental frequency (F0) contours of Mandarin Chinese sentences reduces their intelligibility in noise but not in quiet. It is unclear, however, how the absence of the primary acoustic cue for lexical tones might be compensated for by top-down information from sentence context. In this study, speech intelligibility was evaluated when participants listened to sentences and word lists with or without F0 variations in quiet and in noise. The results showed that sentence context partially explained the unchanged intelligibility of monotonous Chinese sentences in quiet and further indicated that F0 variations and sentence context act in concert during speech comprehension.

Speech perception includes both top-down and bottom-up processes, and phonemic as well as lexical identification is heavily influenced by context in degraded conditions (Hickok and Poeppel, 2007). Although F0 contour, the primary prosodic feature, does not determine segmental consonant or vowel identity by itself, it has a variety of linguistic functions, including marking emphasis and phrase boundaries. Previous studies examining the effects of F0 contours on the intelligibility of English sentences produced by healthy (Wingfield et al., 1984; Laures and Weismer, 1999) or hearing-impaired speakers (Maassen and Povel, 1984) have found that the absence of F0 variation decreases the intelligibility of speech compared with normal F0 variation.

In tonal languages like Mandarin Chinese, lexical tone contrasts are phonologically as important as phonemic contrasts. That is, the F0 contours for Chinese lexical tones distinguish lexical meanings from otherwise identical strings of phonemes. However, recent studies with Mandarin Chinese materials surprisingly showed that monotonous sentences with flattened F0 contours were just as intelligible as normal sentences with natural F0 patterns in a quiet listening condition (Patel et al., 2010; Xu et al., 2013). Because the speech materials in these studies used only normal sentences, it is impossible to determine what cues (e.g., the remaining secondary acoustic cues for lexical tones or higher-level semantic information based on the sentential context) are utilized to comprehend the speech when the normal pitch patterns are flattened.

The first motivation for the present study was to examine the role of sentence context in Mandarin Chinese intelligibility in a quiet listening condition. We manipulated the top-down information variable and obtained intelligibility scores using both normal and word list sentences with natural or flat F0 contours. If high-level semantic information plays an important role, the intelligibility of pitch-flattened word list “sentences” would be substantially lower than that of their normal sentence counterparts, because the word list sentences are designed to be semantically incoherent and thus remove the top-down contextual information. Conversely, if low-level acoustic information is the major determinant of speech perception in quiet, the intelligibility of word list sentences with flattened F0 contours would be comparable to that of the normal sentences.

For speech-in-noise listening conditions, previous studies have consistently demonstrated the importance of dynamic F0 contours to speech intelligibility regardless of whether the target language is a tonal language or not (Laures and Bunton, 2003; Binns and Culling, 2007; Patel et al., 2010; Miller et al., 2010). The results show that naturally varying F0 contours improve speech intelligibility in background noise compared with flat or inverted F0 contours. The explanations proposed for these findings maintain that dynamic changes in F0 direct the listener's attention to the content words of the utterance and assist with the segmentation of words in continuous speech. Unchanging F0 lowers intelligibility of the utterance because it reduces the contrast between words and makes it more difficult to parse continuous speech into meaningful units (Laures and Bunton, 2003; Binns and Culling, 2007). As with the previous studies that examined the intelligibility of speech in quiet, the speech-in-noise investigations did not look into the contribution that semantic information provided by sentence context might make to speech intelligibility.

The objective of the present study was to investigate the role of sentence context and the interaction between sentence context and F0 contours during Mandarin Chinese speech comprehension in both quiet and noise. We hypothesized that sentence context would contribute to the intelligibility of Chinese speech in all conditions, which partially accounts for the unimpaired intelligibility of pitch flattened sentences in quiet. Furthermore, different signal-to-noise ratios (SNRs) associated with the listening conditions (in quiet vs in noise) might modulate the interaction between sentence context and F0 contours.

One hundred and fifty-six undergraduate participants from Beijing Normal University were recruited. Five participants were omitted from the final analysis: three reported a hearing disorder during subject screening, and the other two were omitted because of computer error during data collection. The remaining 151 participants were all native Chinese speakers between the ages of 18 and 23, and all had hearing sensitivity ≤20 dB hearing level for octave frequencies between 250 and 8000 Hz bilaterally. To avoid stimulus order effects, we adopted a mixed design with F0 contours (normal vs flat) and background noise (quiet, SNR = +5 dB, and SNR = −5 dB) as between-subject factors and sentence context as a within-subject factor. The subject distribution in the between-subject conditions was as follows: quiet/normal F0 condition, number of subjects (n) = 26; quiet/flat F0 condition, n = 25; SNR = +5 dB/normal F0 condition, n = 25; SNR = +5 dB/flat F0 condition, n = 24; SNR = −5 dB/normal F0 condition, n = 25; SNR = −5 dB/flat F0 condition, n = 26.

To manipulate the F0 and sentence context effects, four types of target sentences were created: normal sentences and word list sentences with naturally intonated or unnaturally monotonous contours. The normal sentences were 20 declarative Chinese sentences on a variety of topics, and each sentence consisted of 5 to 9 words. Words from the entire pool of the normal sentences were pseudo-randomly selected to form the word list sentences, which were syntactically anomalous and semantically meaningless at the whole-sentence level. They were matched in length (number of syllables) with the normal sentences. The normal sentences and word list sentences were read by a male native speaker of Chinese. F0 was manipulated using Praat (Institute of Phonetic Sciences, University of Amsterdam; downloadable at www.praat.org). A flat F0 contour was created for each sentence at the sentence's mean F0, and the resulting monotonous sentence was resynthesized using the pitch-synchronous overlap-add (PSOLA) method (Fig. 1).
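The contour-level part of the flattening step can be illustrated with a minimal Python sketch. It is not the Praat procedure used in the study: it only replaces each voiced F0 sample with the sentence mean and does not perform the PSOLA resynthesis. The function name `flatten_f0`, the use of `None` for unvoiced frames, the choice to average over voiced frames in Hz, and the example contour are all assumptions made for illustration.

```python
from statistics import mean
from typing import List, Optional


def flatten_f0(contour_hz: List[Optional[float]]) -> List[Optional[float]]:
    """Replace every voiced F0 sample with the mean F0 of the voiced samples.

    Unvoiced frames are represented as None (an assumption of this sketch)
    and are left untouched.
    """
    voiced = [f for f in contour_hz if f is not None]
    if not voiced:
        # A fully unvoiced contour has nothing to flatten.
        return list(contour_hz)
    target = mean(voiced)
    return [target if f is not None else None for f in contour_hz]


# Hypothetical contour in Hz; None marks unvoiced frames.
example = [None, 110.0, 130.0, 150.0, None, 90.0, 120.0]
flat = flatten_f0(example)
```

In this sketch the voicing pattern and the timing of the frames are unchanged, so only the pitch movement is removed, which matches the intent of the manipulation described above.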

FIG. 1.

Acoustic features of sample speech stimuli. Broadband spectrograms (SPG: 0 to 5 kHz), intensity envelopes (INT: 50 to 100 dB), and fundamental frequency contours (F0: 0 to 500 Hz) are displayed for (A) normal sentence and its pitch-flattened counterpart; (B) word list sentence and its pitch-flattened counterpart.

Consonant-misplaced sentences were used as masker stimuli. These sentences were constructed by replacing the onset consonant of each syllable in the normal sentences with another consonant, provided that the replacement did not violate the phonotactic rules of Chinese. These consonant-misplaced sentences were syntactically anomalous and unintelligible at both the lexical and sentential levels. The masker sentences were read by a female native speaker of Chinese. A male target speaker and a female masking speaker were chosen so that the clear instruction “listen to the male speaker” could be used throughout, rather than training the subjects on the identity of the target speaker (Scott et al., 2004).

Each target sentence was combined with the speech masker at two SNR levels (+5 and −5 dB). The level of the target sentences was fixed at 70 dB sound pressure level, and the level of the competing speech masker was varied around the level of the target speech. The masker speech was edited to be, on average, 1 s longer than the target speech (500 ms prior to the beginning and 500 ms after the end of the target sentence) so that no part of the speech target was unmasked.
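The relationship between the fixed target level and the masker level at each SNR can be sketched as follows. This is an illustrative calculation rather than the mixing procedure actually used: the function names `masker_level_db` and `masker_gain` are hypothetical, SNR is assumed to be defined on RMS amplitude, and mapping a digital RMS value to dB sound pressure level would additionally require a playback calibration that is not modeled here.

```python
import math


def masker_level_db(target_level_db: float, snr_db: float) -> float:
    """Masker level that yields the requested SNR for a fixed target level."""
    return target_level_db - snr_db


def masker_gain(target_rms: float, masker_rms: float, snr_db: float) -> float:
    """Linear gain for the masker so that
    20 * log10(target_rms / (gain * masker_rms)) equals snr_db."""
    if target_rms <= 0 or masker_rms <= 0:
        raise ValueError("RMS values must be positive")
    desired_masker_rms = target_rms / (10 ** (snr_db / 20))
    return desired_masker_rms / masker_rms


# Target fixed at 70 dB SPL, as in the text; the masker level moves around it.
levels = {snr: masker_level_db(70.0, snr) for snr in (5.0, -5.0)}

# Hypothetical RMS amplitudes of the recorded target and masker waveforms.
gain = masker_gain(target_rms=0.1, masker_rms=0.2, snr_db=5.0)
achieved_snr = 20 * math.log10(0.1 / (gain * 0.2))
```

Under these assumptions the masker sits at 65 dB SPL for the +5 dB condition and at 75 dB SPL for the −5 dB condition, while the target level stays constant.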

Listeners were tested individually in a sound-attenuated booth facing a computer monitor. Stimuli were presented via loudspeakers (Edifier R18, Edifier Technology Co. Ltd., Beijing, China). Because sentence context was the only within-subject factor, each listener was presented with a total of 40 trials (20 normal sentences and 20 word list sentences), all with either natural or flat F0 contours and all in one background condition (quiet or one SNR level). Listeners were instructed that they would be listening to sentences in quiet or noise, and were asked to write down the words read by the male speaker. The task was self-paced; listeners pressed a key to advance from trial to trial. Each sentence could be heard only once. The first author scored the written responses. Incorrect or omitted words were annotated, and performance scores were checked by an independent auditor blind to the experiment. Practice sentences sampling all conditions were provided before the experiment. After the practice block, the experimenter checked the readability of the participant's handwriting.

Intelligibility was determined by a keyword-correct count (Scott et al., 2004). The number of correct keywords (content words, varied across sentences from 4 to 7) identified by each listener was counted and then converted to the percentage of the total number of words and averaged across listeners. A 2 × 2 × 3 repeated measures analysis of variance (ANOVA), with sentence context as the within-subject factor and F0 contours and background noise as the between-subject factors, was carried out. Results showed that the three main effects were all highly significant [sentence context: F(1, 149) = 414.06, p < 0.001, η2 = 0.741; F0 contours: F(1, 149) = 203.48, p < 0.001, η2 = 0.584; background noise: F(2, 148) = 377.53, p < 0.001, η2 = 0.839], revealing that intelligibility was reduced by the lack of sentence context and natural contours, and in the presence of background noise (Fig. 2).
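The keyword-correct measure can be sketched in a few lines of Python. This is a simplified illustration, not the scoring protocol itself, which relied on a human scorer: the function names, the exact-match criterion that ignores word order, the choice of the sentence's keyword count as the denominator, and the example sentence are assumptions.

```python
from typing import Iterable, List


def keyword_score(keywords: List[str], response_words: Iterable[str]) -> float:
    """Percentage of a sentence's keywords found in the listener's response.

    Matching is exact and ignores word order (an assumption of this sketch).
    """
    if not keywords:
        raise ValueError("a sentence needs at least one keyword")
    reported = set(response_words)
    hits = sum(1 for kw in keywords if kw in reported)
    return 100.0 * hits / len(keywords)


def mean_score(scores: List[float]) -> float:
    """Average of per-sentence or per-listener percentages."""
    if not scores:
        raise ValueError("no scores to average")
    return sum(scores) / len(scores)


# Hypothetical sentence with four keywords; the listener misreports one.
keywords = ["老师", "今天", "检查", "作业"]
response = ["老师", "昨天", "检查", "作业"]
score = keyword_score(keywords, response)
```

A real scoring pass would also need rules for homophones and illegible handwriting, which is presumably why the responses were scored by hand and checked by an independent auditor.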

FIG. 2.

Word-report scores sorted by the main effects of factors. Error bars represent standard deviation across subjects.

Further analyses of all 2- and 3-way interactions showed that they were all significant [p values < 0.001, η2 ≥ 0.130 for all the 2-way interactions and p = 0.002, η2 = 0.08 for the 3-way interaction]. The significant interactions revealed that intelligibility decreased disproportionately with increasing noise for flat-contour sentences versus natural-contour sentences as well as for word list sentences versus normal sentences. Furthermore, intelligibility degraded to a greater extent for word list sentences than for normal sentences when the F0 patterns of the two types of sentences changed from natural contours to flat contours (Fig. 3). Bonferroni-adjusted post hoc pairwise comparisons revealed that intelligibility was significantly different for most contrasting pairs of interest. The four exceptions were normal sentences with natural contours versus normal sentences with flat contours in quiet; normal sentences with natural contours in quiet versus their counterparts in +5 dB SNR background noise; and word list sentences with natural contours versus normal sentences with flat contours, both in quiet and in +5 dB SNR background noise.

FIG. 3.

Word-report scores sorted by the interaction effects.

The present study aimed to investigate the roles of sentence context, F0 contour, background noise, and the interactions among these factors in the intelligibility of Mandarin Chinese speech. The significant main effects indicate that all three factors contribute to the intelligibility of Chinese speech. That is, the lack of sentence context, the lack of natural F0 variations, and the presence of background noise each lead to a decrease in intelligibility. The significant interactions indicate that the contributions that sentence context and F0 contours make to the intelligibility of Chinese speech are modulated by the background noise conditions. Specifically, when presented in quiet, normal sentences with flat F0 contours were as intelligible as their counterparts with natural F0 contours. However, when presented in noise, flattening the F0 contours of normal sentences dramatically reduced intelligibility, compared with the mild decrease in intelligibility for normal sentences with natural F0 contours. These results are consistent with previous findings (Patel et al., 2010; Xu et al., 2013), highlighting the importance of natural F0 contours for sentence intelligibility in noise. At the same time, word list sentences with natural F0 contours were less intelligible than their normal sentence counterparts whether presented in quiet or in noise, highlighting the contribution of sentence context to speech intelligibility irrespective of the background conditions.

One issue of particular interest to the present study is to what extent the unimpaired intelligibility of sentences with flat F0 contours in quiet can be attributed to the remaining phonetic cues or to semantic information from the rest of the sentence. Although F0 contour is the primary acoustic cue for lexical tone perception, other cues (e.g., duration and amplitude) are also utilized by native Chinese speakers to recognize tone types (Wise and Chong, 1957; Liu and Samuel, 2004). It is also well documented that semantic information facilitates spoken word recognition and speech comprehension (McClelland and Elman, 1986; Liu and Samuel, 2007). Because only two types of speech materials, i.e., normal sentences with and without natural F0 contours, were used in the previous studies (Patel et al., 2010; Xu et al., 2013), it was difficult to separate the effects of high-level semantic and low-level acoustic information on intelligibility. In the present study, the finding that the absence of sentence context reduced intelligibility indicates that sentence context partially accounts for the unimpaired intelligibility of monotonous Chinese sentences in quiet. Furthermore, when both sentence context and natural F0 variations were removed, the speech material was least intelligible in both quiet and noise. This finding indicates that F0 variations and sentence context act in concert during Chinese speech comprehension. When word list sentences with natural F0 contours were directly compared with normal sentences with flat F0 contours, their intelligibility was similar in the quiet and +5 dB SNR noise conditions, whereas the intelligibility of word list sentences with natural F0 contours was slightly higher than that of the normal sentences with flat F0 contours in the −5 dB SNR noise condition. This result indicates that the relative importance of sentence context and F0 contours for speech intelligibility depends on background conditions.

The test materials, especially the word list sentences and the sentences with flat F0 contours, were specially designed for this experiment. The limited ecological validity of these materials may constrain the interpretation of our results. However, in a recent study, Feng et al. (2012) found that although lexical tone recognition for sine-wave Chinese monosyllabic words is poor, the recognition accuracy for sine-wave sentences is very high, reflecting the compensatory effect of contextual information when the tonal information is poor. Our results, together with the findings of Feng et al. (2012), indicate that the functional load of lexical tones on sentence comprehension is limited. That is, lexical meaning can be accessed in a sentence context when the surface pitch patterns of tones are altered, even though lexical tones are as important as segmental phonemes in specifying the meaning of a word in a tone language.

In a recent fMRI study, Xu et al. (2013) found that Mandarin Chinese sentences with natural or flat F0 contours elicited similar activation in the lexical-semantic processing areas (e.g., the left insular, middle, and inferior temporal gyri) and at the same time, monotonous sentences elicited greater activation in the left planum temporale than sentences with natural F0 contours. These results demonstrate that lexical meaning can still be accessed in pitch-flattened Chinese sentences, and that this process is realized by automatic recovery of the phonological representations of lexical tones from the altered tonal patterns. Our results are consistent with the findings of Xu et al. (2013), supporting the models which include an important role for top-down information in guiding speech perception (e.g., Hickok and Poeppel, 2007), given that listeners can automatically use additional neural and cognitive resources to recover distorted tonal patterns in sentences. As there are significant interactions among the three factors (i.e., sentence context, F0 variation, and background noise) in the intelligibility of Chinese speech, further studies are needed to better understand the brain mechanisms involved in these cognitive processes.

The research was supported by grants from the Humanities and Social Sciences Foundation (Projects for Young Scholars) of the Chinese Ministry of Education (Grant No. 10YJCZH223) to L.J.Z., and from the Natural Science Foundation of China (Grant No. 31271082), the Natural Science Foundation of Beijing (Grant No. 7132119), and the Fundamental Research Fund for the Central Universities to H.S.

1. Binns, C., and Culling, J. F. (2007). “The role of fundamental frequency contours in the perception of speech against interfering speech,” J. Acoust. Soc. Am. 122(3), 1765–1776.
2. Feng, Y. M., Xu, L., Zhou, N., Yang, G., and Yin, S. K. (2012). “Sine-wave speech recognition in a tonal language,” J. Acoust. Soc. Am. 131(2), EL133–EL138.
3. Hickok, G., and Poeppel, D. (2007). “The cortical organization of speech processing,” Nat. Rev. Neurosci. 8(5), 393–402.
4. Laures, J. S., and Bunton, K. (2003). “Perceptual effects of a flattened fundamental frequency at the sentence level under different listening conditions,” J. Commun. Disord. 36(6), 449–464.
5. Laures, J. S., and Weismer, G. (1999). “The effects of a flattened fundamental frequency on intelligibility at the sentence level,” J. Speech Lang. Hear. Res. 42(5), 1148–1156.
6. Liu, S., and Samuel, A. G. (2004). “Perception of Mandarin lexical tones when F0 information is neutralized,” Lang. Speech 47(2), 109–138.
7. Liu, S., and Samuel, A. G. (2007). “The role of Mandarin lexical tones in lexical access under different contextual conditions,” Lang. Cognit. Processes 22(4), 566–594.
8. Maassen, B., and Povel, D. (1984). “The effect of correcting fundamental frequency on the intelligibility of deaf speech and its interaction with temporal aspects,” J. Acoust. Soc. Am. 76, 1673–1681.
9. McClelland, J. L., and Elman, J. (1986). “The TRACE model of speech perception,” Cogn. Psychol. 18(1), 1–86.
10. Miller, S. E., Schlauch, R. S., and Watson, P. J. (2010). “The effects of fundamental frequency contour manipulations on speech intelligibility in background noise,” J. Acoust. Soc. Am. 128(1), 435–443.
11. Patel, A. D., Xu, Y., and Wang, B. (2010). “The role of F0 variation in the intelligibility of Mandarin sentences,” in Proceedings of Speech Prosody 2010 (Chicago, IL).
12. Scott, S. K., Rosen, S., Wickham, L., and Wise, R. J. (2004). “A positron emission tomography study of the neural basis of informational and energetic masking effects in speech perception,” J. Acoust. Soc. Am. 115(2), 813–821.
13. Wingfield, A., Lombardi, L., and Sokol, S. (1984). “Prosodic features and the intelligibility of accelerated speech: Syntactic versus periodic segmentation,” J. Speech Hear. Res. 27, 128–134.
14. Wise, C. M., and Chong, L. P.-H. (1957). “Intelligibility of whispering in a tone language,” J. Speech Hear. Disord. 22(3), 335–338.
15. Xu, G., Zhang, L., Shu, H., Wang, X., and Li, P. (2013). “Access to lexical meaning in pitch-flattened Chinese sentences: An fMRI study,” Neuropsychologia 51(3), 550–556.