Speaker separation is a special case of speech separation in which the mixture signal comprises two or more speakers. Many talker-independent speaker separation methods have been introduced in recent years to address this problem in anechoic conditions. To consider more realistic environments, this paper investigates talker-independent speaker separation in reverberant conditions. To deal effectively with both speaker separation and speech dereverberation, this paper proposes extending the deep computational auditory scene analysis (CASA) approach to a two-stage system, in which reverberant utterances are first separated and the separated utterances are then dereverberated. The proposed two-stage deep CASA system significantly outperforms a baseline one-stage deep CASA method in real reverberant conditions, achieving superior separation performance at the frame level and higher accuracy in assigning separated frames to individual speakers. The proposed system also generalizes successfully to an unseen speech corpus, on which it performs comparably to a talker-dependent system.
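
To illustrate the two-stage structure described above (separate first, then dereverberate each separated stream), here is a minimal sketch. It assumes PyTorch; `MaskNet` and `TwoStageSeparator` are hypothetical placeholder modules with simple spectral-masking internals, not the deep CASA architecture from the paper.

```python
# Hypothetical sketch of a two-stage separate-then-dereverberate pipeline.
# Stage 1 estimates one mask per speaker from the reverberant mixture;
# stage 2 dereverberates each separated stream independently.
import torch
import torch.nn as nn


class MaskNet(nn.Module):
    """Placeholder spectral-masking network (stand-in for a real model)."""

    def __init__(self, n_bins: int, n_out: int):
        super().__init__()
        self.rnn = nn.LSTM(n_bins, 256, batch_first=True)
        self.proj = nn.Linear(256, n_bins * n_out)
        self.n_out = n_out

    def forward(self, mag):  # mag: (batch, frames, n_bins)
        h, _ = self.rnn(mag)
        masks = torch.sigmoid(self.proj(h))
        return masks.view(*mag.shape[:2], self.n_out, mag.shape[-1])


class TwoStageSeparator(nn.Module):
    def __init__(self, n_bins: int = 257, n_speakers: int = 2):
        super().__init__()
        self.separate = MaskNet(n_bins, n_speakers)  # stage 1: separation
        self.dereverb = MaskNet(n_bins, 1)           # stage 2: dereverberation

    def forward(self, mix_mag):
        # Stage 1: apply one estimated mask per speaker to the mixture.
        sep_masks = self.separate(mix_mag)              # (B, T, S, F)
        streams = sep_masks * mix_mag.unsqueeze(2)      # per-speaker magnitudes
        # Stage 2: dereverberate each separated stream on its own.
        outputs = []
        for s in range(streams.shape[2]):
            drv_mask = self.dereverb(streams[:, :, s]).squeeze(2)
            outputs.append(drv_mask * streams[:, :, s])
        return torch.stack(outputs, dim=2)              # (B, T, S, F)


mix = torch.rand(1, 100, 257)   # dummy mixture magnitude spectrogram
out = TwoStageSeparator()(mix)
print(out.shape)                # torch.Size([1, 100, 2, 257])
```

In this sketch, the two stages are trained and applied in sequence; the paper's actual system additionally handles talker-independent frame assignment (permutation handling), which is omitted here for brevity.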
