An unsupervised single-channel audio separation method is presented from a pattern recognition viewpoint. The proposed method requires no training knowledge, and the separation system is based on non-uniform time-frequency (TF) analysis and feature extraction. Unlike conventional research that concentrates on the spectrogram or its variants, the proposed separation algorithm uses an alternative TF representation based on the gammatone filterbank. In particular, the monaural mixed audio signal is shown to be considerably more separable in this non-uniform TF domain. An analysis of signal separability is provided to verify this finding. In addition, a variational Bayesian approach is derived to learn the sparsity parameters that optimize the matrix factorization. Experimental tests show that extraction of the spectral dictionary and temporal codes is more efficient with sparsity learning, which subsequently leads to better separation performance.
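To make the factorization step concrete, the sketch below shows a sparse non-negative matrix factorization of a TF magnitude matrix V into a spectral dictionary W and temporal codes H via multiplicative updates with an L1 penalty on H. This is a minimal illustrative stand-in with a fixed, hand-set sparsity weight; the paper instead learns the sparsity parameters with a variational Bayesian approach, and the function name and toy data here are assumptions for demonstration only.

```python
import numpy as np

def sparse_nmf(V, rank, sparsity=0.01, n_iter=200, seed=0):
    """Factorize V ~= W @ H with non-negative W, H.

    Multiplicative updates for the Euclidean cost, with an L1
    sparsity penalty on the temporal codes H. Illustrative only:
    the sparsity weight is fixed here, not learned.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + 1e-9   # spectral dictionary
    H = rng.random((rank, n_frames)) + 1e-9  # temporal codes
    for _ in range(n_iter):
        # Update codes: the sparsity term appears in the denominator
        H *= (W.T @ V) / (W.T @ W @ H + sparsity + 1e-9)
        # Update dictionary: standard Euclidean multiplicative rule
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy mixture: two latent "sources" with distinct spectral patterns
rng = np.random.default_rng(1)
W_true = np.abs(rng.standard_normal((64, 2)))
H_true = np.abs(rng.standard_normal((2, 100)))
V = W_true @ H_true
W, H = sparse_nmf(V, rank=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {err:.3f}")
```

In a full separation system, V would be the magnitude of the gammatone-filterbank TF representation of the mixture, and each source estimate would be resynthesized from a subset of dictionary atoms and their codes.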
