Speech perception requires multiple acoustic cues. Cue weighting may differ across individuals but be systematic within individuals. The current study compared individuals' cue weights within and across contrasts. Forty-two listeners performed a two-alternative forced-choice task for four out of five sets of minimal pairs, each varying orthogonally in two dimensions. Individuals' cue weights within a contrast were positively correlated for bet-bat, Luce-lose, and sock-shock, but not for bog-dog and dear-tear. Importantly, individuals' cue weights were also positively correlated across contrasts. This indicates that some individuals are better able to extract and use phonetic information across different dimensions.

While speech perception is often seemingly effortless, there are documented differences in how individuals perform speech perception tasks (e.g., Hazan and Rosen, 1991). These differences have been shown to be stable within a given task over time (Idemaru et al., 2012; Strand et al., 2014; Yu and Lee, 2014; Kong and Edwards, 2016), suggesting that they are not fleeting properties of a participant's mood or mental state at the time of testing. Furthermore, individual differences on speech perception tasks such as speech discrimination do not seem to be related to basic auditory abilities such as pitch discrimination, though they may be related to the ability to recognize familiar sounds (e.g., Surprenant and Watson, 2001; Humes et al., 2013). Thus, individual differences in speech perception are likely consistent properties of individuals and are not reducible to hearing abilities.

One important aspect of speech perception is combining information from different acoustic dimensions. Each phonological contrast is signaled by many phonetic cues (Lisker, 1986), and cue weights measure how much listeners attend to each of these dimensions when identifying speech categories (Holt and Lotto, 2006; Francis et al., 2000). Individual differences in cue weighting have been tied to second language learning (e.g., Chandrasekaran et al., 2010) and cochlear implant use (e.g., Moberly et al., 2014), indicating these differences have consequences for real-world challenges. In particular, studies of second language learners have consistently found that for a given contrast, some individuals tend to use one cue and others another cue [e.g., f0 changes vs f0 height for tone (Chandrasekaran et al., 2010); voice onset time (VOT) vs onset f0 for voicing (Schertz et al., 2015); and formant frequency vs duration for vowels (Kim et al., 2018)].

These results raise the question of whether some individuals may be more reliant on certain dimensions (e.g., f0 or duration) or, alternatively, some individuals may simply be better at picking up phonetic information (e.g., Hazan and Rosen, 1991). Shultz et al. (2012) used regression analysis to find individuals' coefficients for VOT and f0 for a voicing contrast (/ba/-/pa/) and found that coefficients for the two cues were somewhat positively correlated across individuals, but the effect was not statistically reliable. Similarly, Hazan and Rosen (1991) observed that individuals who were influenced by burst cues were also influenced by formant transitions within and across two stop-place contrasts. A positive relationship between cue weights could indicate that individuals differ not in how much they rely on particular dimensions but rather in how well they can use any dimension. However, this should also be tested across contrasts that rely on different cues. The present study tests a much wider range of contrasts and cues to examine correlations between cue weights both within and across contrasts.

Five minimal pairs were chosen, each varying in a different phonological contrast and each signaled by a different combination of phonetic cues (Table 1). The pairs were chosen to cover a range of cues. Furthermore, bet-bat and Luce-lose, as well as bog-dog and sock-shock, use similar cues but vary in which cues are expected to be primary and secondary. Recordings of each pair were made by different native speakers of English (3 male, 2 female) and were manipulated to create stimuli varying in two dimensions. In the first step, TANDEM-STRAIGHT (Kawahara et al., 2008) was used with both frequency and temporal anchors to create 15-step continua from one natural endpoint to the other, varying in both spectral and temporal characteristics. For bog-dog and sock-shock, the initial burst or fricative portions were separated from the rest of the word and then recombined through cross-splicing to create 225 combinations. The PSOLA method in Praat (Boersma and Weenink, 2016) was used for bet-bat and Luce-lose to lengthen or shorten each of the 15 steps, creating 15 duration steps, and for dear-tear the pitch of each token was manipulated, creating 15 f0 contours. All stimuli were normalized to a mean amplitude of 75 dB. After pilot testing, 5 steps in each dimension were chosen to include both clear and ambiguous tokens, leaving 25 stimuli for each word pair. Additional details of stimulus construction can be found in the supplementary materials. All stimuli and additional information are available on the Open Science Framework (Clayards, 2018).1
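As a rough illustration of the continuum logic described above (not of the TANDEM-STRAIGHT or PSOLA resynthesis itself), the sketch below interpolates a single hypothetical cue in 15 equal steps and counts the cross-spliced combinations; the cue and its endpoint values are invented for illustration:

```python
import itertools

# Illustrative values only: interpolate one hypothetical cue (here, a duration
# in ms) in 15 equal steps between two invented natural endpoints.
endpoint_a, endpoint_b = 10.0, 80.0
steps = [endpoint_a + i * (endpoint_b - endpoint_a) / 14 for i in range(15)]
assert len(steps) == 15 and steps[0] == endpoint_a and steps[-1] == endpoint_b

# Cross-splicing every step of one dimension (e.g., burst) with every step of
# the other (e.g., vowel) yields the 225 combinations described above.
combinations = list(itertools.product(range(15), range(15)))
assert len(combinations) == 225
```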

Stimuli were presented using MATLAB 8.20 (MathWorks, Inc.) on an iMac desktop through USB headphones (Logitech H390) at a comfortable listening level. To keep the length of the experiment under one hour, only four of the minimal pairs were presented to each participant, and two versions of the experiment were created. In both versions, participants heard bet-bat, Luce-lose, and bog-dog. In version 1, participants also heard sock-shock; in version 2, they heard dear-tear. Stimuli from the four continua were intermixed in random order, each repeated 5 times, for a total of 500 trials. Participants indicated which of the two words they heard by pressing a key on the keyboard.
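The trial structure can be sketched as follows (in Python rather than the MATLAB used for presentation); the word pairs, the 5 x 5 stimulus grid, and the 5 repetitions come from the text, while the trial representation itself is illustrative:

```python
import itertools
import random

# Word pairs heard in version 1 of the experiment (version 2 swaps
# sock-shock for dear-tear).
pairs = ["bet-bat", "Luce-lose", "bog-dog", "sock-shock"]

# 5 steps per dimension were retained after piloting, giving a 5 x 5 grid.
steps = range(1, 6)

# One stimulus = (pair, cue A step, cue B step); 25 stimuli per pair.
stimuli = [(p, a, b) for p in pairs for a, b in itertools.product(steps, steps)]

# Each stimulus repeated 5 times and fully intermixed in random order.
trials = stimuli * 5
random.shuffle(trials)
assert len(trials) == 500
```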

Twenty-four participants (16 female) were tested in the first version of the experiment and 19 (12 female) in the second version. All participants were tested in Montreal, Canada and were native English speakers with limited exposure to a second language (mean age: 22, range: 18–41) with no history of speech or hearing problems. One participant from version 2 was excluded because they answered at chance for almost all stimuli.

Group responses to the two dimensions as well as individual cue weights are plotted in Fig. 1. The relative weighting of the two cues was estimated using separate mixed-effects logistic regression models for each minimal pair, fit with the lme4 package (Bates et al., 2015) in R (R Core Team, 2016) and combining data from the two versions of the experiment. Each model included a fixed effect and a by-participant random slope for each dimension (centered, continuous variables), as well as by-participant random intercepts and correlations between the random slopes. Cue A had a larger fixed-effect estimate than cue B for each contrast, confirming the status of cue A as more important for perception (Table 2).

Coefficients for each participant, cue, and contrast were calculated using the by-participant random slopes from the mixed-effects models (Kong and Edwards, 2015) and compared using Pearson's product-moment correlations (Shultz et al., 2012) to assess both within- and between-contrast correlations in cue weights. The correlations of random effects (CRE) fit in each model were also examined to assess within-contrast correlations, and model comparison using a chi-square test was used to test whether these correlations improved model fit. Both methods for assessing within-contrast correlations are summarized in Table 2.
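The analysis itself was run in R with lme4; as a simplified stand-in for the random-slope approach, the Python sketch below fits separate per-listener logistic regressions to simulated responses in which both cue weights depend on a shared "skill" factor, then correlates the recovered coefficients across listeners. All data and parameter values here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Plain gradient-ascent logistic regression; returns [intercept, bA, bB]."""
    Xb = np.column_stack([np.ones(len(y)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * (Xb.T @ (y - p)) / len(y)
    return w

def pearson_r(a, b):
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Simulate 42 listeners whose two cue weights share a common "skill" factor,
# then recover per-listener coefficients for each cue.
n_listeners, n_trials = 42, 125
weights_A, weights_B = [], []
for skill in rng.normal(1.0, 0.4, n_listeners):
    cue_A = rng.uniform(-2, 2, n_trials)   # centered primary-cue steps
    cue_B = rng.uniform(-2, 2, n_trials)   # centered secondary-cue steps
    logit = skill * (2.0 * cue_A + 0.8 * cue_B)
    y = (rng.random(n_trials) < 1 / (1 + np.exp(-logit))).astype(float)
    w = fit_logistic(np.column_stack([cue_A, cue_B]), y)
    weights_A.append(w[1])
    weights_B.append(w[2])

# Within-contrast correlation of primary and secondary cue weights.
r_within = pearson_r(weights_A, weights_B)
```

Because both simulated weights scale with the shared factor, the recovered coefficients should correlate positively across listeners, mirroring the within-contrast pattern examined here.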

Cue weights were strongly positively correlated within contrast for bet-bat, sock-shock, and Luce-lose, and negatively correlated for bog-dog (Fig. 1). Further inspection of bog-dog responses revealed that many participants tended to respond “dog” for all stimuli except the first step of the burst dimension, while others tended to respond “bog” for all stimuli except the last step of the vowel dimension. This suggests that there were problems with these stimuli: most steps were not ambiguous for at least some participants. Further evidence that this result is tied to these particular stimuli comes from Hazan and Rosen (1991), who found a trend towards positive correlations for bait-date stimuli that also varied in burst and formant transitions. There was no significant correlation for dear-tear, consistent with Shultz et al. (2012).

Cue A weights across contrasts were also positively correlated for all combinations except those involving bog-dog [bet-bat vs Luce-lose r(42) = 0.40, p < 0.001; bet-bat vs sock-shock r(42) = 0.62, p = 0.001; sock-shock vs Luce-lose r(24) = 0.56, p = 0.005; dear-tear vs Luce-lose, r(18) = 0.58, p = 0.01] and the correlation between bet-bat and dear-tear was not significant [r(18) = 0.40, p = 0.09]. Cue B weights were strongly positively correlated for bet-bat vs Luce-lose [r(42) = 0.61, p < 0.001] and sock-shock vs Luce-lose [r(24) = 0.54, p = 0.006].

Finally, each listener's cue A weight as a proportion of the total of both cue weights (the ratio) was calculated to provide an estimate of how much one cue dominates perception. If some listeners are especially sensitive to duration, for example, we might expect their perception to be slightly more dominated by duration in contrasts that involve duration dimensions, such as bet-bat and Luce-lose. Because duration is the primary cue for Luce-lose, more reliance on duration would mean a larger ratio, and because duration is a secondary cue for bet-bat, more reliance on duration would mean a smaller ratio. We therefore expect the ratio scores to be negatively correlated if some listeners are more reliant on duration than others. The same logic applies to bog-dog and sock-shock if some listeners are more reliant on frication/burst information than on formant transitions. No significant correlations were found between ratio scores across contrasts. Figure 2 compares correlations of regression coefficients and ratios for bet-bat and Luce-lose.
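The ratio measure can be made concrete with a short sketch; the per-listener weights below are invented for illustration:

```python
# Proportion of the total cue weight carried by cue A, per listener.
def cue_ratio(weights_a, weights_b):
    return [a / (a + b) for a, b in zip(weights_a, weights_b)]

# Invented per-listener weights for two contrasts sharing a duration cue:
# duration is primary (cue A) for Luce-lose but secondary (cue B) for bet-bat.
luce_a, luce_b = [3.0, 2.5, 4.0], [1.0, 1.2, 0.8]
bet_a, bet_b = [2.8, 3.1, 2.2], [1.5, 1.1, 2.0]

ratio_luce = cue_ratio(luce_a, luce_b)  # larger for duration-reliant listeners
ratio_bet = cue_ratio(bet_a, bet_b)     # smaller for duration-reliant listeners
# Under the duration-reliance hypothesis these two ratio scores should correlate
# negatively across listeners; no such correlation was found in the data.
```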

We found a strong positive correlation between primary and secondary cue use for three contrasts (bet-bat, Luce-lose, and sock-shock) that varied in the type of contrast (vowel, fricative voicing, fricative place) and the cues involved (vowel duration, formant frequencies, noise spectrum, etc.). Previous work had failed to find a reliable relationship between VOT and onset f0 (Shultz et al., 2012), and our results for this pair were also not reliably correlated. It should be noted that fewer participants categorized this continuum, making it harder to draw strong conclusions from this minimal pair. In contrast, listeners' use of the two cues manipulated in bog-dog was negatively correlated, though this may have been due to issues with stimulus selection: the continua did not contain enough ambiguous steps for some listeners. It is thus difficult to draw conclusions from the minimal pairs that did not exhibit strong positive correlations, but the positive relationship between cue use in the other cases may be informative.

The second main finding was that cue weights were also positively correlated for all cue A weights across contrasts (excluding bog-dog) and for some cue B weights. Together these findings—the positive correlations across different types of cues and different contrasts—indicate that these cue weights are a consistent property of individuals and are not tied to particular dimensions. Further support for this conclusion comes from the observation that the relative use of two cues within a contrast was not related across contrasts. If some listeners tended to use particular dimensions more, we might have observed relationships between the ratios. Thus, individuals seem to differ mostly in how consistently they are able to categorize words for any given acoustic dimension.

An important question for future research will be to better understand what aspect of individuals this cross-cue pattern reflects. Previous research has found that individual differences in speech perception are distinct from abilities such as sensitivity to temporal envelope, loudness, duration, and complex spectral and temporal patterns in non-speech stimuli (Surprenant and Watson, 2001; Kidd et al., 2007) but may instead be linked to recognition of familiar sounds (such as animal noises). Previous research has also found that individuals with better speech perception abilities produce more distinct contrasts (Perkell et al., 2004a; Perkell et al., 2004b), which seems to indicate a general speech aptitude that extends beyond perception. These differences might reflect the ability to form and use detailed representations of speech, which may in turn be influenced by individuals' cognitive abilities (e.g., working memory). Recent work has begun to examine possible links between speech perception and cognitive abilities, with mixed results (Akeroyd, 2008). Some studies have found a link between speech perception and working memory (Benard et al., 2014; Neger et al., 2014; Baese-Berk et al., 2015; Kapnoula et al., 2017).

The picture that emerges from this and previous results is that some individuals are better able to make use of phonetic cues in the speech signal. Future research will need to determine what makes this possible and to what extent this provides a benefit to these individuals in other speech tasks.

This research was supported by SSHRC Grant No. 435-2016-0747 to M.C. The author would like to thank David Fleischer and Melanie Oriana for help with stimulus creation and data collection and Dave Kleinschmidt for a tutorial on using TANDEM-STRAIGHT.

1. See supplementary material at https://doi.org/10.1121/1.5052025E-JASMAN-144-512808 for details of stimulus construction. See also Clayards (2018) for stimuli.

1. Akeroyd, M. A. (2008). “Are individual differences in speech reception related to individual differences in cognitive ability? A survey of twenty experimental studies with normal and hearing-impaired adults,” Int. J. Audiol. 47, S53–S71.
2. Baese-Berk, M. M., Bent, T. T., Borrie, S., and McKee, M. (2015). “Individual differences in perception of unfamiliar speech,” in Proceedings of ICPhS, Glasgow, UK.
3. Bates, D., Maechler, M., Bolker, B., and Walker, S. (2015). “Fitting linear mixed-effects models using lme4,” J. Stat. Softw. 67(1), 1–48.
4. Benard, M. R., Mensink, J. S., and Başkent, D. (2014). “Individual differences in top-down restoration of interrupted speech: Links to linguistic and cognitive abilities,” J. Acoust. Soc. Am. 135(2), EL88–EL94.
5. Boersma, P., and Weenink, D. (2016). “Praat: Doing phonetics by computer” [computer program], version 6.0.22, http://www.praat.org/ (Last viewed August 28, 2018).
6. Chandrasekaran, B., Sampath, P. D., and Wong, P. C. (2010). “Individual variability in cue-weighting and lexical tone learning,” J. Acoust. Soc. Am. 128(1), 456–465.
7. Clayards, M. (2018). “Individual differences in speech perception cue weights,” https://osf.io/369my (Last viewed August 28, 2018).
8. Francis, A. L., Baldwin, K., and Nusbaum, H. C. (2000). “Effects of training on attention to acoustic cues,” Percept. Psychophys. 62(8), 1668–1680.
9. Hazan, V., and Rosen, S. (1991). “Individual variability in the perception of cues to place contrasts in initial stops,” Percept. Psychophys. 49(2), 187–200.
10. Holt, L. L., and Lotto, A. J. (2006). “Cue weighting in auditory categorization: Implications for first and second language acquisition,” J. Acoust. Soc. Am. 119(5), 3059–3071.
11. Humes, L. E., Kidd, G. R., and Lentz, J. J. (2013). “Auditory and cognitive factors underlying individual differences in aided speech-understanding among older adults,” Front. Syst. Neurosci. 7, 55.
12. Idemaru, K., Holt, L. L., and Seltman, H. (2012). “Individual differences in cue weights are stable across time: The case of Japanese stop lengths,” J. Acoust. Soc. Am. 132(6), 3950–3964.
13. Kapnoula, E. C., Winn, M. B., Kong, E. J., Edwards, J., and McMurray, B. (2017). “Evaluating the sources and functions of gradiency in phoneme categorization: An individual differences approach,” J. Exp. Psychol.: Human Percept. Perform. 43(9), 1594–1611.
14. Kawahara, H., Morise, M., Takahashi, T., Nisimura, R., Irino, T., and Banno, H. (2008). “TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2008, pp. 3933–3936.
15. Kidd, G. R., Watson, C. S., and Gygi, B. (2007). “Individual differences in auditory abilities,” J. Acoust. Soc. Am. 122(1), 418–435.
16. Kim, D., Clayards, M., and Goad, H. (2018). “A longitudinal study of individual differences in the acquisition of new vowel contrasts,” J. Phon. 67, 1–20.
17. Kong, E. J., and Edwards, J. (2015). “Individual differences in L2 learners' perceptual cue weighting patterns,” in Proceedings of ICPhS, Glasgow, UK.
18. Kong, E. J., and Edwards, J. (2016). “Individual differences in categorical perception of speech: Cue weighting and executive function,” J. Phon. 59, 40–57.
19. Lisker, L. (1986). “ ‘Voicing’ in English: A catalogue of acoustic features signalling /b/ versus /p/ in trochees,” Lang. Speech 29(1), 3–11.
20. Moberly, A. C., Lowenstein, J. H., Tarr, E., Caldwell-Tarr, A., Welling, D. B., Shahin, A. J., and Nittrouer, S. (2014). “Do adults with cochlear implants rely on different acoustic cues for phoneme perception than adults with normal hearing?,” J. Speech Lang. Hear. Res. 57(2), 566–582.
21. Neger, T. M., Rietveld, T., and Janse, E. (2014). “Relationship between perceptual learning in speech and statistical learning in younger and older adults,” Front. Hum. Neurosci. 8, 628.
22. Perkell, J. S., Guenther, F. H., Lane, H., Matthies, M. L., Stockmann, E., Tiede, M., and Zandipour, M. (2004a). “The distinctness of speakers' productions of vowel contrasts is related to their discrimination of the contrasts,” J. Acoust. Soc. Am. 116(4), 2338–2344.
23. Perkell, J. S., Matthies, M. L., Tiede, M., Lane, H., Zandipour, M., Marrone, N., Stockman, E., and Guenther, F. H. (2004b). “The distinctness of speakers' /s/–/ʃ/ contrast is related to their auditory discrimination and use of an articulatory saturation effect,” J. Speech Lang. Hear. Res. 47(6), 1259–1269.
24. R Core Team (2016). “R: A language and environment for statistical computing,” R Foundation for Statistical Computing, Vienna, Austria, https://www.R-project.org/ (Last viewed August 28, 2018).
25. Schertz, J., Cho, T., Lotto, A., and Warner, N. (2015). “Individual differences in phonetic cue use in production and perception of a non-native sound contrast,” J. Phon. 52, 183–204.
26. Shultz, A. A., Francis, A. L., and Llanos, F. (2012). “Differential cue weighting in perception and production of consonant voicing,” J. Acoust. Soc. Am. 132(2), EL95–EL101.
27. Strand, J., Cooperman, A., Rowe, J., and Simenstad, A. (2014). “Individual differences in susceptibility to the McGurk effect: Links with lipreading and detecting audiovisual incongruity,” J. Speech Lang. Hear. Res. 57(6), 2322–2331.
28. Surprenant, A. M., and Watson, C. S. (2001). “Individual differences in the processing of speech and nonspeech sounds by normal-hearing listeners,” J. Acoust. Soc. Am. 110(4), 2085–2095.
29. Yu, A. C. L., and Lee, H. (2014). “The stability of perceptual compensation for coarticulation within and across individuals: A cross-validation study,” J. Acoust. Soc. Am. 136(1), 382–388.
