Processing speaker-specific information is an important task in daily communication. This study examined how fundamental frequency (F0) cues are encoded at the subcortical level, as reflected by scalp-recorded frequency-following responses, and how that encoding relates to the listener's ability to process speech stimuli produced by multiple speakers. Using Mandarin tones with distinctive F0 contours, the study found that subcortical frequency-coding errors were significantly correlated with the listener's speaker-variability intolerance for both percent-correct and reaction-time measures. These findings lay a foundation for understanding how speaker information is processed in individuals with normal and impaired auditory systems.

Processing speaker-specific information is an integral part of speech perception. The same word spoken by different speakers can be acoustically very different. For example, in lexical tone languages, a phonologically low tone produced by a female speaker is likely to be acoustically higher in fundamental frequency (F0) than a phonologically high tone produced by a male speaker.1 This is due to the F0 range difference between male and female speakers. Although listeners are usually quite adept at identifying intended tones despite such speaker variability, there is ample evidence that processing multi-speaker tones is less accurate and more time-consuming than processing single-speaker tones,2–4 indicating that adapting to different speakers demands sensory and cognitive resources. The cost of such “speaker normalization”5 has also been reported for non-tone language users with normal hearing6 and listeners with hearing impairment.7 While the effects of speaker variability have been demonstrated through behavioral measures, the neural basis of the listener's ability to adapt to speaker variability remains unclear.

The frequency-following response (FFR) is a scalp-recorded neurophysiological potential that reflects phase-locked neural activity at the subcortical level, synchronized with the frequency content of a stimulus.8–11 Unlike most cortical responses, which can be highly variable and affected by sleep, the FFR is a reliable response12 originating from subcortical neural substrates, primarily in the midbrain, and does not require the listener's attention, alertness, or active participation. Because of these advantages, the FFR has been used to investigate the subcortical neural representation of various features of speech sounds, such as F0 tracking accuracy and timing, in normal-hearing adults13 and individuals with hearing impairment.14 Because the FFR has been shown to track F0 information in speech effectively, and because F0 is the primary acoustic correlate of speaker identity,15 using the FFR to examine the processing of speaker variability provides a unique window into the neural basis of processing linguistic (lexical tone) and indexical (speaker) information in speech perception.

The goal of this study is to examine how English-speaking individuals with normal hearing but without prior knowledge of a tonal language process speaker variability behaviorally and neurophysiologically in Mandarin tone perception. The research question is whether the behavioral and neurophysiological measurements are correlated. To answer this question, a behavioral task and FFR recordings were administered. In the behavioral task, participants were asked to identify Mandarin tones from multi-speaker tone stimuli. On the basis of past research,2–4 it was expected that tone identification would be more accurate and faster when each speaker's stimuli were presented in one test block (i.e., the "blocked-by-speaker" condition) than when multiple speakers' stimuli were mixed together (i.e., the "mixed-across-speakers" condition). To quantify the extent to which a listener could cope with speaker variability, we defined the difference in behavioral task performance between the blocked-by-speaker and mixed-across-speakers presentations as speaker-variability intolerance. In the FFR recording, subcortical frequency-coding errors (i.e., the extent to which the response deviates from the F0 of a stimulus) were measured via scalp-recorded brain waves. Because neural substrates at the subcortical level feed critical information to the auditory cortex and related areas where executive decisions about tone and speaker identity take place, it was hypothesized that subcortical frequency-coding errors would be significantly associated with the listener's speaker-variability intolerance. Using English-speaking participants and Mandarin tone stimuli allows us to examine the processing of speaker variability in linguistically significant tone stimuli without the interference of linguistic knowledge. Data from this participant group are essential because they serve as a baseline against which the performance of other populations (e.g., Mandarin speakers with normal hearing or hearing impairment, or English speakers with hearing impairment) may be compared in the future to clarify the individual and joint contributions of linguistic background and hearing status to processing speaker variability in speech perception.

Research protocols and experimental procedures were approved by the Institutional Review Board at Ohio University. Twenty-one normal-hearing adults (24.9 ± 5.0 years old; 18 females and 3 males) were recruited. All participants were native speakers of English and were primarily college students at Ohio University. Native speakers of English were recruited to avoid ceiling effects in the behavioral and neurophysiological tasks and to rule out possible effects of the listener's linguistic knowledge of Mandarin on the behavioral and neurophysiological responses.

Monosyllabic Mandarin tones, superimposed on the vowel /i/, with four different F0 contours (Tone 1 flat, Tone 2 rising, Tone 3 dipping, and Tone 4 falling) were recorded from six male native Mandarin speakers. Each speaker's utterances of the four tones were digitized at a rate of 40 000 samples/s by using a 16-bit digital voice recorder. Each acoustic stimulus was edited to a duration of 250 ms with 10-ms rising and falling ramps. The mean values of the F0 contours across the six speakers were 161 Hz (ranging from 133 to 184 Hz) for Tone 1, 132 Hz (ranging from 120 to 145 Hz) for Tone 2, 103 Hz (ranging from 94 to 110 Hz) for Tone 3, and 141 Hz (ranging from 127 to 159 Hz) for Tone 4.
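For illustration, this stimulus preparation step can be sketched as follows. This is a minimal sketch, not the authors' processing pipeline: the WAV file name is hypothetical, and the raised-cosine ramp shape is an assumption, since the text does not specify the ramp type.

```python
# Minimal sketch of trimming a digitized utterance to 250 ms and applying
# 10-ms onset/offset ramps. Assumes a mono recording; the raised-cosine
# ramp shape and the file name are illustrative assumptions.
import numpy as np
from scipy.io import wavfile

FS = 40_000                    # sampling rate (samples/s), as in the text
DUR_S, RAMP_S = 0.250, 0.010   # 250-ms stimulus with 10-ms rise/fall ramps

def trim_and_ramp(x: np.ndarray) -> np.ndarray:
    """Trim a waveform to 250 ms and apply 10-ms raised-cosine ramps."""
    x = x[: int(DUR_S * FS)].astype(np.float64)
    n_ramp = int(RAMP_S * FS)
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))  # 0 -> 1
    x[:n_ramp] *= ramp           # rising (onset) ramp
    x[-n_ramp:] *= ramp[::-1]    # falling (offset) ramp
    return x

fs, raw = wavfile.read("speaker1_tone2.wav")   # hypothetical file name
assert fs == FS
stimulus = trim_and_ramp(raw)
```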

Behavioral experiments were conducted in a quiet room where each participant was seated in front of a touch-screen monitor wearing a pair of supra-aural headphones (Logitech H390, Newark, California). Custom-made software written in MATLAB was used to collect the participant's behavioral responses. The behavioral experiments consisted of four sessions: familiarization, training, blocked-by-speaker, and mixed-across-speakers. Each participant first went through a familiarization session in which they could play and listen to the four tones (recorded from speaker 1) as many times as they needed by touching the four pictorial representations of the tones on the computer monitor.

After familiarization, each participant underwent a training session. The training session used the same four tones recorded from speaker 1, but each tone was presented four times in a random order. A one-interval, four-alternative, forced-choice paradigm with visual feedback was used to determine each participant's percent correct and reaction time. That is, after an acoustic stimulus was delivered through the headphones, the participant had to decide which tone was presented by touching the corresponding picture on the monitor. If the participant chose the correct tone, a green check mark was displayed above the picture of the correct tone. If the participant chose an incorrect tone, a red X mark was displayed above the picture chosen by the participant, and a green check mark was simultaneously displayed above the picture of the correct tone that should have been chosen. When a training session was completed, the participant's percent correct and reaction time were displayed on the monitor. All participants were encouraged to repeat the familiarization and training sessions as many times as they needed. Each participant was required to achieve ≥50% correct in the training session before moving on to the two experimental sessions: the blocked-by-speaker and mixed-across-speakers sessions.
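The trial logic of this paradigm can be illustrated with a minimal console-based sketch. It is a stand-in for the authors' MATLAB touch-screen interface, and the play_tone() helper is a hypothetical placeholder for audio playback.

```python
# Minimal console sketch of the one-interval, four-alternative,
# forced-choice (1I-4AFC) training logic with feedback and reaction time.
import random
import time

TONES = [1, 2, 3, 4]   # Tone 1-4: flat, rising, dipping, falling

def play_tone(tone: int) -> None:
    """Hypothetical placeholder for audio playback of a tone stimulus."""
    print(f"[playing Tone {tone}]")

def run_training_block(reps: int = 4) -> tuple[float, float]:
    """Run one feedback block; return (percent correct, mean RT in s)."""
    trials = TONES * reps          # each tone presented `reps` times...
    random.shuffle(trials)         # ...in a random order
    n_correct, rts = 0, []
    for target in trials:
        play_tone(target)
        t0 = time.monotonic()
        response = int(input("Which tone (1-4)? "))
        rts.append(time.monotonic() - t0)
        if response == target:
            n_correct += 1
            print("Correct (green check mark).")
        else:
            print(f"Incorrect (red X); the answer was Tone {target}.")
    return 100.0 * n_correct / len(trials), sum(rts) / len(rts)

if __name__ == "__main__":
    pc, mean_rt = run_training_block()
    print(f"Percent correct: {pc:.1f}%, mean reaction time: {mean_rt:.3f} s")
```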

During the blocked-by-speaker session, the four tones recorded from four different speakers (speakers 2, 3, 4, and 5) were used, and each tone was presented 40 times. Because the four tones recorded from the same speaker were presented in the same block, this resulted in a total of 160 stimuli (40 repetitions/tone × 4 tones/speaker) per block. The four blocks were presented in a random order with no breaks in between. The same computer interface and the same one-interval, four-alternative, forced-choice paradigm were employed, but this time no visual feedback was provided and the participant's performance scores were not displayed on the monitor.

For the mixed-across-speakers session, all procedures were identical to those used in the blocked-by-speaker session, except that the four tones recorded from the four speakers (speakers 2–5) were all mixed together, resulting in a total of 640 stimuli (40 repetitions/tone × 4 tones/speaker × 4 speakers) presented in a random order. The order of the blocked-by-speaker and mixed-across-speakers sessions was randomized across participants. To quantify the effects of speaker variability, the term speaker-variability intolerance was defined and calculated by subtracting the listener's percent correct score obtained in the blocked-by-speaker session from that obtained in the mixed-across-speakers session. The same procedure was applied when calculating speaker-variability intolerance for the reaction time measure. All participants completed the behavioral experiment prior to the neurophysiological recording.
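Computationally, both intolerance measures reduce to simple per-listener paired differences; the sketch below illustrates the definition with made-up numbers, not study data.

```python
# Minimal sketch of the speaker-variability intolerance (SVI) measures:
# SVI = mixed-across-speakers score minus blocked-by-speaker score,
# computed separately for percent correct and reaction time.
import numpy as np

def speaker_variability_intolerance(mixed: np.ndarray,
                                    blocked: np.ndarray) -> np.ndarray:
    """Per-listener SVI: mixed-across-speakers minus blocked-by-speaker."""
    return mixed - blocked

# Percent correct: negative SVI means accuracy drops when speakers are mixed.
pc_blocked = np.array([92.5, 81.3, 75.0])    # hypothetical listeners
pc_mixed = np.array([85.0, 70.6, 72.5])
print(speaker_variability_intolerance(pc_mixed, pc_blocked))  # [-7.5 -10.7 -2.5]

# Reaction time (ms): positive SVI means responses slow when speakers are mixed.
rt_blocked = np.array([950.0, 1100.0, 1240.0])
rt_mixed = np.array([1085.0, 1210.0, 1330.0])
print(speaker_variability_intolerance(rt_mixed, rt_blocked))  # [135. 110. 90.]
```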

FFR recordings were administered in a double-walled sound booth. Each participant was seated in a comfortable recliner with their eyes closed. All participants were encouraged to relax and fall asleep during data collection. Tone 2 recorded from a different speaker (speaker 6) was used to elicit FFRs. Because of time constraints, only one tone was used; Tone 2 was selected because it tends to elicit the most prominent response of the four tones.16

The FFR experimental protocol and data analysis procedures were similar to those of our previous studies.13,17 Stimulus presentation and brain wave recordings were controlled by custom-designed software written in LabVIEW. The acoustic stimuli were presented through an electromagnetically shielded insert earphone (Etymotic ER-3A, Elk Grove Village, Illinois) at 75 dB sound pressure level to the participant's right ear. The silent interval from the offset of one stimulus to the onset of the next was set to 45 ms. Continuous brain waves were recorded through three gold-plated electrodes positioned at the high forehead (non-inverting), right mastoid (inverting), and low forehead (ground). Recorded brain waves were amplified (OptiAmp 8002, gain 50 000), digitized (20 000 samples/s, 16-bit analog-to-digital conversion), and band-pass filtered (90–1500 Hz). A total of 8000 artifact-free recording sweeps were obtained and averaged to derive an FFR for each participant. The artifact rejection criterion was set at 25 μV. Narrow-band sliding-window spectrograms (window size = 50 ms, step size = 1 ms) were used to delineate the F0 information of the brain waves. This resulted in a total of 201 windowed segments to be analyzed, starting from the beginning of an FFR recording. The same procedure was applied to the stimulus. Frequency error was estimated by computing the mean of the absolute values of the F0 differences between the stimulus and an FFR recording. Figure 1 shows the stimulus spectrogram [Fig. 1(A)], an example FFR spectrogram [Fig. 1(B)], and how frequency error was calculated [Fig. 1(C)]. Frequency error provides a quantitative index of the extent to which subcortical frequency coding deviates from the F0 contour of a stimulus.
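A minimal sketch of this frequency-error computation follows, assuming the stimulus has been resampled to the FFR sampling rate. The 90–350 Hz search band and the peak-picking F0 estimator are illustrative assumptions, and the zero-padded FFT stands in for the narrow-band spectrogram described above.

```python
# Minimal sketch (not the authors' analysis code) of the frequency-error
# metric: F0 contours are tracked with a 50-ms window sliding in 1-ms
# steps, and frequency error is the mean absolute F0 difference between
# the stimulus and the averaged FFR.
import numpy as np

FS = 20_000                     # sampling rate shared by FFR and stimulus
WIN_S, STEP_S = 0.050, 0.001    # 50-ms window, 1-ms step (as in the text)
N_FFT = 1 << 15                 # zero-padded FFT for a fine frequency grid
F0_LO, F0_HI = 90.0, 350.0      # assumed F0 search range (Hz)

def f0_contour(x: np.ndarray, fs: int = FS) -> np.ndarray:
    """Track F0 as the strongest spectral peak in each windowed segment."""
    win, step = int(WIN_S * fs), int(STEP_S * fs)
    n_seg = (len(x) - win) // step + 1    # 201 segments for a 250-ms sweep
    freqs = np.fft.rfftfreq(N_FFT, d=1 / fs)
    band = (freqs >= F0_LO) & (freqs <= F0_HI)
    f0 = np.empty(n_seg)
    for i in range(n_seg):
        seg = x[i * step : i * step + win] * np.hanning(win)
        mag = np.abs(np.fft.rfft(seg, n=N_FFT))
        f0[i] = freqs[band][np.argmax(mag[band])]
    return f0

def frequency_error(stimulus: np.ndarray, ffr: np.ndarray) -> float:
    """Mean absolute F0 difference (Hz) between stimulus and FFR contours."""
    return float(np.mean(np.abs(f0_contour(stimulus) - f0_contour(ffr))))
```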

Fig. 1.

(Color online) Estimates of subcortical frequency-coding errors (frequency error). (A) Amplitude spectrogram of the tone 2 stimulus with a rising F0 contour. (B) A typical FFR spectrogram obtained from a normal-hearing participant (subject S003). (C) F0 contours of the stimulus (black curve) and an FFR recording (red curve). Frequency error is computed by finding the mean of the absolute values of the F0 differences between the stimulus and a recording.


To quantify speaker-variability intolerance, Wilcoxon signed-rank tests were performed to determine whether the listeners' behavioral performance declined (i.e., lower percent correct and longer reaction times) when multiple speakers' stimuli were mixed in one block, as opposed to one speaker per block. To further evaluate whether frequency-coding errors at the subcortical level were associated with the listeners' speaker-variability intolerance, Pearson's correlation coefficients were computed between frequency error and speaker-variability intolerance. Linear regressions were conducted to delineate the relationship between the listeners' behavioral and neurophysiological responses.
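This analysis pipeline can be sketched with SciPy in place of the authors' software; the randomly generated arrays below are placeholders for the 21 listeners' data, not the study's results.

```python
# Minimal sketch of the statistical analyses: paired Wilcoxon signed-rank
# tests on blocked vs. mixed scores, then Pearson correlation and ordinary
# least-squares regression between frequency error and SVI.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 21
pc_blocked = rng.uniform(70, 95, n)             # synthetic percent correct
pc_mixed = pc_blocked - rng.uniform(0, 15, n)   # mixing lowers accuracy
rt_blocked = rng.uniform(800, 1400, n)          # synthetic reaction times (ms)
rt_mixed = rt_blocked + rng.uniform(0, 250, n)  # mixing slows responses
freq_error = rng.uniform(1, 6, n)               # synthetic FFR frequency error

# Speaker-variability effect: paired, nonparametric comparisons.
print(stats.wilcoxon(pc_mixed, pc_blocked))     # percent correct
print(stats.wilcoxon(rt_mixed, rt_blocked))     # reaction time

# Brain-behavior relationship: Pearson correlation and linear regression
# between frequency error and speaker-variability intolerance (SVI).
svi_pc = pc_mixed - pc_blocked
r, p = stats.pearsonr(freq_error, svi_pc)
fit = stats.linregress(svi_pc, freq_error)      # FE as a function of SVI
print(f"r = {r:.3f}, p = {p:.3f}; FE = {fit.intercept:.3f} {fit.slope:+.3f}*SVI_PC")
```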

Both behavioral measures showed significant effects of speaker variability. Wilcoxon signed-rank tests demonstrated that the listeners' tone identification accuracy (percent correct) in the mixed-across-speakers session was significantly lower (z = 2.856, p = 0.002) than in the blocked-by-speaker session [Fig. 2(A)]. A complementary pattern was observed for reaction time: the listeners' reaction times in the mixed-across-speakers session were significantly longer (z = −3.024, p = 0.001) than in the blocked-by-speaker session [Fig. 2(B)]. The mean values of speaker-variability intolerance in terms of percent correct and reaction time across all participants were −7.427% and 128 ms, respectively.

Fig. 2.

Speaker-variability intolerance in terms of (A) percent correct and (B) reaction time. For percent correct, the listeners' performance scores were significantly lower in the mixed-across-speakers condition than in the blocked-by-speaker condition (mean difference = −7.427%, p = 0.002). Reaction times obtained in the mixed-across-speakers condition were significantly longer than those obtained in the blocked-by-speaker condition (mean difference = 128 ms, p = 0.001). Speaker-variability intolerance = mixed-across-speakers − blocked-by-speaker scores. The data are displayed in box plots. The upper and lower boundaries of each box indicate the 25th and 75th percentiles, respectively. The solid line within each box represents the median and the dotted line marks the mean. Whiskers above and below each box indicate the 5th and 95th percentiles, respectively. Black dots above and below the whiskers are data points that fall outside the 5th to 95th percentile range.


Pearson's correlation analyses revealed significant correlations between the listeners' subcortical frequency-coding errors and their behavioral performance in processing multi-speaker information. Frequency errors were negatively correlated (r = −0.434, p = 0.025) with speaker-variability intolerance in terms of percent correct [Fig. 3(A)]: larger frequency errors (i.e., more frequency-coding errors at the subcortical level) were associated with greater speaker-variability intolerance (i.e., more negative values) in terms of percent correct, and vice versa. Furthermore, frequency errors were positively correlated (r = 0.396, p = 0.039) with speaker-variability intolerance in terms of reaction time [Fig. 3(B)]. In other words, listeners with more frequency errors at the subcortical level showed longer reaction times in the mixed-across-speakers session and thus greater speaker-variability intolerance in terms of reaction time (i.e., more positive values).

Fig. 3.

Subcortical frequency-coding errors (FFR frequency error) plotted as a function of the listeners' inability to cope with multiple speakers' tone stimuli (speaker-variability intolerance) in terms of (A) percent correct and (B) reaction time. The oblique line and formula in each panel represent the results of the linear regression and Pearson's correlation between FFR frequency error and speaker-variability intolerance.


Linear regression delineated the relationship between frequency error and speaker-variability intolerance. The resultant formulae are

FE = 2.078 − 0.089 × SVI_PC,  (1)

FE = 2.103 + 0.005 × SVI_RT,  (2)

where FE stands for frequency error, SVI_PC stands for speaker-variability intolerance in terms of percent correct, and SVI_RT stands for speaker-variability intolerance in terms of reaction time.
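As a quick worked check, substituting the group means reported above into these equations gives consistent predictions: with SVI_PC = −7.427%, Eq. (1) yields FE = 2.078 − 0.089 × (−7.427) ≈ 2.74 Hz, and with SVI_RT = 128 ms, Eq. (2) yields FE = 2.103 + 0.005 × 128 ≈ 2.74 Hz.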

Significant correlations were observed between subcortical frequency-coding errors (frequency error) and the listener's inability to cope with tonal stimuli produced by multiple speakers (speaker-variability intolerance). These findings provide important information about how individuals with normal hearing use F0 cues to deal with speaker variability, a common challenge in speech perception. They also support the idea that subcortical frequency coding provides basic, yet critical, information to neural structures in the auditory cortex and higher centers where important cognitive decisions are made. These findings have potential implications for how individuals with hearing impairment overcome the challenge of speaker variability in speech perception.

Recent studies have shown that the fidelity and reliability of subcortical neural coding, as reflected by scalp-recorded FFRs, are linked to behavioral speech perception in noise,12,18 consonant differentiation,19 and reading and literacy abilities,20,21 with accurate subcortical neural encoding of acoustic signals predicting better performance. Results of the present study provide further evidence linking subcortical frequency-coding errors to behavioral speaker-variability intolerance and thus fill a gap in our knowledge. In other words, degraded F0 coding fidelity at the subcortical level may propagate to the higher nuclei of the auditory system and eventually lead to poor speaker identification ability.

Successful identification of F0 information across multiple speakers places substantial demands on the sensory and cognitive resources of the listener. This is particularly true when the acoustic stimuli are foreign to the listener. As shown in this study, native speakers of English without prior knowledge of Mandarin exhibited a wide range of subcortical frequency-coding errors, which in turn were linked to their inability to process Mandarin tonal information across multiple speakers. The wide range of performance, while fully expected given the listeners' lack of knowledge of Mandarin, demonstrates substantial individual differences in subcortical speech processing. Importantly, this variability also allows for a valid examination of the relationship between subcortical frequency coding and speaker-variability intolerance. It should be noted that, because of the low-pass nature of FFRs, higher F0s are likely to have weaker representations in the recorded FFRs than lower F0s. To maximize FFRs, this study used tonal stimuli produced by six male speakers, who tend to have lower F0s than female speakers.

In conclusion, this study demonstrates that subcortical frequency-coding errors, as reflected by scalp-recorded FFRs, are linked to the listener's speaker-variability intolerance, suggesting that the FFR may serve as a biological marker of the listener's ability to process speech stimuli across multiple speakers. These findings may help researchers and clinicians better understand the neural mechanisms underlying speech perception in individuals with hearing, speech, or cognitive disorders.

The authors would like to thank all who participated in this study and the six speakers who contributed their voices.

1. C.-Y. Lee, A. Lekich, and Y. Zhang, "Perception of pitch height in lexical and musical tones by English-speaking musicians and nonmusicians," J. Acoust. Soc. Am. 135, 1607–1615 (2014).
2. C.-Y. Lee, L. Tao, and Z. S. Bond, "Speaker variability and context in the identification of fragmented Mandarin tones by native and non-native listeners," J. Phon. 37, 1–15 (2009).
3. C.-Y. Lee, L. Tao, and Z. S. Bond, "Identification of multi-speaker Mandarin tones in noise by native and non-native listeners," Speech Commun. 52, 900–910 (2010).
4. C.-Y. Lee, Y. Zhang, X. Li, L. Tao, and Z. S. Bond, "Effects of speaker variability and noise on Mandarin fricative identification by native and non-native listeners," J. Acoust. Soc. Am. 132, 1130–1140 (2012).
5. K. L. Johnson, T. G. Nicol, and N. Kraus, "Brain stem response to speech: A biological marker of auditory processing," Ear Hear. 26, 424–434 (2005).
6. J. W. Mullennix, D. B. Pisoni, and C. S. Martin, "Some effects of talker variability on spoken word recognition," J. Acoust. Soc. Am. 85, 365–378 (1989).
7. K. I. Kirk, D. B. Pisoni, and R. C. Miyamoto, "Effects of stimulus variability on speech perception in listeners with hearing impairment," J. Speech Lang. Hear. Res. 40, 1395–1405 (1997).
8. N. Russo, T. Nicol, G. Musacchia, and N. Kraus, "Brainstem responses to speech syllables," Clin. Neurophysiol. 115, 2021–2030 (2004).
9. E. Skoe and N. Kraus, "Auditory brain stem response to complex sounds: A tutorial," Ear Hear. 31, 302–324 (2010).
10. B. Chandrasekaran and N. Kraus, "The scalp-recorded brainstem response to speech: Neural origins and plasticity," Psychophysiology 47, 236–246 (2010).
11. G. Musacchia, M. Sams, E. Skoe, and N. Kraus, "Musicians have enhanced subcortical auditory and audiovisual processing of speech and music," Proc. Natl. Acad. Sci. U.S.A. 104, 15894–15898 (2007).
12. J. H. Song, T. Nicol, and N. Kraus, "Test–retest reliability of the speech-evoked auditory brainstem response," Clin. Neurophysiol. 122, 346–355 (2011).
13. F.-C. Jeng, C.-D. Lin, J. T. Sabol, G. R. Hollister, M.-S. Chou, C.-H. Chen, J. E. Kenny, and Y.-A. Tsou, "Pitch perception and frequency-following responses elicited by lexical-tone chimeras," Int. J. Audiol. 55, 53–63 (2016).
14. S. Anderson, A. Parbery-Clark, T. White-Schwoch, S. Drehobl, and N. Kraus, "Effects of hearing loss on the subcortical representation of speech cues," J. Acoust. Soc. Am. 133, 3030–3038 (2013).
15. J. Kreiman and D. Sidtis, Foundations of Voice Studies: An Interdisciplinary Approach to Voice Production and Perception, 1st ed. (Wiley-Blackwell, Malden, MA, 2013).
16. F.-C. Jeng, J. Hu, B. Dickman, C.-Y. Lin, C.-D. Lin, C.-Y. Wang, H.-K. Chung, and X. Li, "Evaluation of two algorithms for detecting human frequency-following responses to voice pitch," Int. J. Audiol. 50, 14–26 (2011).
17. F.-C. Jeng, H.-K. Chung, C.-D. Lin, B. Dickman, and J. Hu, "Exponential modeling of human frequency-following responses to voice pitch," Int. J. Audiol. 50, 582–593 (2011).
18. S. Anderson, E. Skoe, B. Chandrasekaran, and N. Kraus, "Neural timing is linked to speech perception in noise," J. Neurosci. 30, 4922–4926 (2010).
19. J. Hornickel, E. Skoe, S. Zecker, and N. Kraus, "Subcortical differentiation of stop consonants relates to reading and speech-in-noise perception," Proc. Natl. Acad. Sci. U.S.A. 106, 13022–13027 (2009).
20. J. Hornickel and N. Kraus, "Unstable representation of sound: A biological marker of dyslexia," J. Neurosci. 33, 3500–3504 (2013).
21. B. Chandrasekaran, J. Hornickel, E. Skoe, T. Nicol, and N. Kraus, "Context-dependent encoding in the human auditory brainstem relates to hearing speech in noise: Implications for developmental dyslexia," Neuron 64, 311–319 (2009).