Listeners can reliably perceive speech in noisy conditions, but it is not well understood which specific features of the speech they use to do this. This paper introduces a data-driven framework to identify the time-frequency locations of these features. By mixing the same speech utterance with many different noise instances, the framework computes the importance of each time-frequency point in the utterance to its intelligibility. The mixtures have approximately the same global signal-to-noise ratio at each frequency but very different recognition rates. What differs between intelligible and unintelligible mixtures is the alignment between the speech and the spectro-temporally modulated noise, which provides a different combination of “glimpses” of the speech in each mixture. The current results reveal the locations of these important noise-robust phonetic features in a restricted set of syllables. Classification models trained to predict whether individual mixtures are intelligible based on the locations of these glimpses generalize to new conditions, successfully predicting the intelligibility of novel mixtures: novel noise instances, novel productions of the same word by the same talker, novel utterances of the same word spoken by different talkers, and, to some extent, novel consonants.

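The abstract's framework can be illustrated with a minimal sketch, not the authors' implementation: given one utterance mixed with many noise instances and a record of whether listeners recognized each mixture, correlate the presence of a speech "glimpse" at each time-frequency point with recognition, and train a classifier to predict the intelligibility of held-out mixtures from the glimpse pattern alone. The local-SNR glimpse criterion, the linear SVM, and the names `speech_db`, `noise_db_list`, and `correct` are assumptions introduced here for illustration.

```python
# Sketch of a glimpse-based importance map and intelligibility classifier.
# Assumes dB-scale spectrograms of the clean speech and each noise instance,
# all with the same shape, plus a per-mixture correct/incorrect response.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def glimpse_mask(speech_db, noise_db, threshold_db=0.0):
    """Binary time-frequency mask: True where speech exceeds noise by threshold_db."""
    return (speech_db - noise_db) > threshold_db

def importance_map(speech_db, noise_db_list, correct):
    """Point-biserial correlation between glimpse presence and recognition,
    computed independently at every time-frequency point across mixtures."""
    masks = np.stack([glimpse_mask(speech_db, n) for n in noise_db_list])   # (n_mix, F, T)
    correct = np.asarray(correct, dtype=float)                              # (n_mix,)
    m = masks.reshape(len(correct), -1).astype(float)
    m_c = m - m.mean(axis=0)
    c_c = correct - correct.mean()
    denom = m_c.std(axis=0) * c_c.std() * len(correct)
    with np.errstate(invalid="ignore", divide="ignore"):
        r = np.nan_to_num((m_c * c_c[:, None]).sum(axis=0) / denom)
    return r.reshape(speech_db.shape)

def fit_intelligibility_classifier(speech_db, noise_db_list, correct):
    """Linear SVM predicting whether a mixture is intelligible from its glimpse pattern."""
    X = np.stack([glimpse_mask(speech_db, n).ravel() for n in noise_db_list]).astype(float)
    y = np.asarray(correct, dtype=int)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = SVC(kernel="linear", class_weight="balanced").fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)   # held-out accuracy on novel mixtures
```

The classifier's held-out accuracy is a proxy for the generalization tests described in the abstract; testing on mixtures made from a different production or talker would require recomputing the glimpse masks for that new utterance.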