This study proposes an approach to improve the perceptual quality of speech separated by binary masking through the use of reconstruction in the time-frequency domain. Non-negative matrix factorization and sparse reconstruction approaches are investigated, both of which represent a signal as a linear combination of basis vectors. In this approach, the short-time Fourier transform (STFT) of separated speech is represented as a linear combination of STFTs from a clean speech dictionary. Binary masking for separation is performed using deep neural networks or Bayesian classifiers. The perceptual evaluation of speech quality (PESQ), a standard objective speech quality measure, is used to evaluate the performance of the proposed approach. The results show that the proposed techniques improve the perceptual quality of binary-masked speech and outperform traditional time-frequency reconstruction approaches.
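
The dictionary-based reconstruction described above can be illustrated with a minimal sketch. It assumes magnitude spectrograms, a fixed non-negative dictionary whose columns are clean-speech spectra, and the Euclidean multiplicative update for non-negative matrix factorization. The function names, the iteration count, and the toy data are hypothetical and chosen only for illustration; the abstract does not specify the exact algorithm or parameters used in the study.

```python
import numpy as np


def nmf_activations(V, D, n_iter=200, eps=1e-12, seed=0):
    """Estimate non-negative activations H so that D @ H approximates V.

    V : (F, T) non-negative magnitude spectrogram of binary-masked speech.
    D : (F, K) non-negative dictionary of clean-speech magnitude spectra,
        held fixed here (an assumption of this sketch).
    Only H is updated, using the Euclidean multiplicative rule.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((D.shape[1], V.shape[1])) + eps
    DtV = D.T @ V          # constant across iterations
    DtD = D.T @ D          # constant across iterations
    for _ in range(n_iter):
        H *= DtV / (DtD @ H + eps)   # keeps H non-negative
    return H


def reconstruct(V_masked, D, **kwargs):
    """Return the dictionary-based estimate D @ H of the masked spectrogram."""
    H = nmf_activations(V_masked, D, **kwargs)
    return D @ H


if __name__ == "__main__":
    # Toy data standing in for real STFT magnitudes (illustrative only).
    rng = np.random.default_rng(1)
    D = rng.random((64, 20))                 # hypothetical clean-speech dictionary
    V_clean = D @ rng.random((20, 30))       # signal lying in the dictionary span
    mask = rng.random(V_clean.shape) > 0.3   # stand-in for an estimated binary mask
    V_hat = reconstruct(V_clean * mask, D)
    print(V_hat.shape)
```

In this sketch the reconstructed magnitude would be combined with a phase estimate (for example, the mixture phase) before inverting the STFT; that step is omitted here.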
