A monaural speech segregation system is presented that estimates the ideal binary mask from noisy speech based on the supervised learning of amplitude modulation spectrogram (AMS) features. Instead of linearly scaled modulation filters with constant absolute bandwidth, an auditory-inspired modulation filterbank with logarithmically scaled filters is employed. To reduce the dependency of the AMS features on the overall background noise level, a feature normalization stage is applied. In addition, a spectro-temporal integration stage exploits the contextual information about speech activity present in neighboring time-frequency units. To evaluate how well the system generalizes to unseen acoustic conditions, it is trained with a limited set of low signal-to-noise ratio (SNR) conditions but tested over a wide range of SNRs up to 20 dB. A systematic evaluation demonstrates that auditory-inspired modulation processing can substantially improve the mask estimation accuracy in the presence of stationary and fluctuating interferers.
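As a point of orientation, the ideal binary mask that the system tries to estimate can be sketched as follows. This is a minimal illustration of the standard definition (a time-frequency unit is labeled 1 when its local SNR exceeds a local criterion, otherwise 0), not the system described above. The function name `ideal_binary_mask`, the parameter `lc_db`, its default of 0 dB, and the toy energy values are assumptions made for the example; the abstract does not state which criterion was used.

```python
import math


def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Label each time-frequency unit 1 (speech-dominated) or 0 (noise-dominated).

    speech_energy, noise_energy: nested lists of the same shape, holding the
    energy of the premixed speech and noise in each time-frequency unit.
    lc_db: local criterion in dB (hypothetical default, not from the source).
    """
    eps = 1e-12  # keeps the ratio finite when a unit is silent
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            local_snr_db = 10.0 * math.log10((s + eps) / (n + eps))
            row.append(1 if local_snr_db > lc_db else 0)
        mask.append(row)
    return mask


# Toy 2x2 example with made-up energies.
speech = [[4.0, 0.1], [1.0, 9.0]]
noise = [[1.0, 1.0], [2.0, 3.0]]
mask = ideal_binary_mask(speech, noise)
```

Computing this mask requires the speech and noise signals separately, which is why a practical system has to estimate it from features of the noisy mixture alone.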
