A listener’s ability to understand a target speaker in the presence of one or more simultaneous competing speakers is subject to two types of masking: energetic and informational. Energetic masking takes place when the target and interfering signals overlap in time and frequency, so that portions of the target become inaudible. Informational masking occurs when the listener is unable to distinguish the target from the interference even though both are audible. A computational model of multitalker speech perception is presented to account for both types of masking. Human perception in the presence of energetic masking is modeled using a speech recognizer that treats the masked time-frequency units of the target as missing data. The effects of informational masking are modeled as errors in target segregation by a speech separation system. In a systematic evaluation, the performance of the proposed model is in broad agreement with the results of a recent perceptual study.
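
The idea of treating energetically masked time-frequency units as missing data can be illustrated with a binary time-frequency mask. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `ideal_binary_mask`, the 0 dB local criterion `lc_db`, and the toy energy values are all hypothetical and chosen only for illustration.

```python
import numpy as np

def ideal_binary_mask(target_energy, interferer_energy, lc_db=0.0):
    """Label each time-frequency unit as reliable (1) or missing (0).

    A unit is labeled reliable when the local target-to-interferer
    ratio, in dB, exceeds lc_db. The 0 dB default is an assumption.
    """
    eps = 1e-12  # guards against division by zero and log of zero
    local_snr_db = 10.0 * np.log10(
        (target_energy + eps) / (interferer_energy + eps)
    )
    return (local_snr_db > lc_db).astype(int)

# Toy energies: rows are frequency channels, columns are time frames.
target = np.array([[4.0, 1.0, 0.5],
                   [2.0, 0.1, 3.0]])
interferer = np.array([[1.0, 2.0, 0.5],
                       [0.5, 1.0, 1.0]])

mask = ideal_binary_mask(target, interferer)
print(mask)
```

Units labeled 0 are those a missing-data recognizer would treat as unreliable. In this picture, informational masking would correspond to errors in deciding which reliable units belong to the target, which this sketch does not attempt to model.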
