An innovative method for single-channel blind source separation is proposed. The proposed method is a complex-valued non-negative matrix factorization with probabilistically optimal L1-norm sparsity. This preserves the phase information of the source signals and enforces the inherent structure of the temporal codes to be optimally sparse, resulting in a more meaningful parts-based factorization. An efficient algorithm with closed-form expressions for computing the parameters of the model, including the sparsity, has been developed. Acoustic mixtures recorded from a single channel are used to verify the effectiveness of the proposed method.

Single-channel blind source separation (SCBSS) has many useful application fields, such as sound communication,1 biomedical signal processing,2 and audio recognition.3 The aim of SCBSS is to extract the individual source signals from a single mixture recording without using training information about the sources. Many SCBSS solutions following various approaches have been proposed, but the dominant approach is non-negative matrix factorization (NMF)4–7 and its variants.8–15 In this paper, we introduce a new method based on complex non-negative matrix factorization (CMF).

CMF is a parts-based representation for complex-valued signals which extracts the recurrent patterns of magnitude spectra as well as phase estimates of the constituent signals. CMF shares with NMF the ability to generate non-negative matrices $W$ and $H$. However, it differs from NMF in that the input matrix is assumed to be complex and the algorithm generates a third-order complex-valued tensor $Z_{f,t}^{k}=e^{j\phi_{f,t}^{k}}$. The short-time Fourier transform (STFT) of a signal, $X_{f,t}$ in frequency bin $f$ and time frame $t$, is composed of $K$ complex-valued elements, $X_{f,t}=\sum_{k=1}^{K}|a_{f,t}^{k}|\,e^{j\phi_{f,t}^{k}}$. Each $a_{f,t}^{k}$ has a magnitude spectrum which is constant up to a gain over time, $|a_{f,t}^{k}|=W_{fk}H_{tk}$ (with $W_{fk}\ge 0\;\forall f,k$ and $H_{tk}\ge 0\;\forall k,t$), and a time-varying phase spectrum $\arg(a_{f,t}^{k})=\phi_{f,t}^{k}$. Thus the CMF model can be expressed as $X_{f,t}=\sum_{k}W_{fk}H_{tk}e^{j\phi_{f,t}^{k}}$, where $W_{fk}$ corresponds to the recurring magnitude spectral pattern, $H_{tk}$ to the time-varying temporal code, and $\phi_{f,t}^{k}$ to the time-varying phase spectrum.
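To make the synthesis model concrete, a minimal numpy sketch of the CMF reconstruction is given below; the array shapes and random values are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

# Illustrative shapes (assumptions): F frequency bins, T frames, K components.
F, T, K = 513, 120, 10
W = np.abs(np.random.rand(F, K))                    # recurring magnitude spectral patterns, W_fk >= 0
H = np.abs(np.random.rand(K, T))                    # time-varying temporal codes, H_tk >= 0
phi = np.random.uniform(-np.pi, np.pi, (K, F, T))   # time-varying phase spectra phi_ftk

# CMF synthesis: X_ft = sum_k W_fk * H_tk * exp(j * phi_ftk)
X = np.einsum('fk,kt,kft->ft', W, H, np.exp(1j * phi))
```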

Given an observed complex spectrum $Y_{f,t}\in\mathbb{C}^{F\times T}$, the problem at hand is to estimate the optimal parameters $\theta=\{W_{fk},H_{tk},\phi_{f,t}^{k}\}$. We derive a new factorization method termed the L1-sparse complex non-negative matrix factorization (L1-SCNMF). The generative model is given by

$$Y_{f,t}=\sum_{k=1}^{K}W_{fk}H_{tk}Z_{f,t}^{k}=X_{f,t}+\epsilon_{f,t},\tag{1}$$

where the reconstruction error $\epsilon_{f,t}\sim\mathcal{N}(0,\sigma^{2})$ is assumed to be independently and identically distributed complex Gaussian white noise with zero mean and variance $\sigma^{2}$, representing the model error. The likelihood of $\theta=\{W,H,\phi\}$ is thus written as $P(Y|\theta)=\prod_{f,t}(1/\pi\sigma^{2})\exp\!\left(-|Y_{f,t}-X_{f,t}|^{2}/\sigma^{2}\right)$. We assume that the prior distributions of $W$, $H$, and $\phi$ are statistically independent, which yields $P(\theta|\lambda)=P(W)P(H|\lambda)P(\phi)$. The prior $P(H|\lambda)$ corresponds to the sparsity cost, for which a natural choice is the generalized Gaussian prior, namely $P(H|\lambda)=\prod_{k,t}\left[p\,\lambda_{tk}/2\Gamma(1/p)\right]\exp\!\left(-(\lambda_{tk})^{p}|H_{tk}|^{p}\right)$, where $\lambda_{tk}$ and $p$ are the shape parameters of the distribution. When $p=1$, $P(H|\lambda)$ promotes L1-norm sparsity, which has been shown to be probabilistically equivalent to the $L_{0}$ pseudo-norm, the theoretical optimum sparsity measure. The maximum a posteriori estimation leads to the following optimization problem with respect to $\theta$:

$$\min_{\theta}\;f(\theta)=\sum_{f,t}\bigl|Y_{f,t}-X_{f,t}\bigr|^{2}+\sum_{k,t}\bigl[\lambda_{tk}|H_{tk}|-\log\lambda_{tk}\bigr]\quad\text{subject to}\;\sum_{f}W_{fk}=1\;(k=1,\ldots,K).\tag{2}$$
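To make the connection explicit, taking the negative logarithm of the posterior $P(\theta|Y)\propto P(Y|\theta)\,P(\theta|\lambda)$ with $p=1$ and dropping terms independent of $\theta$ gives, up to the scaling by $\sigma^{2}$ that Eq. (2) absorbs into the regularization,

$$-\ln P(\theta|Y)=\frac{1}{\sigma^{2}}\sum_{f,t}|Y_{f,t}-X_{f,t}|^{2}+\sum_{k,t}\bigl[\lambda_{tk}|H_{tk}|-\ln\lambda_{tk}\bigr]+\text{const}.$$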

Optimizing Eq. (2) using a gradient method leads to slow convergence; instead, we develop an iterative algorithm by minimizing the auxiliary function below:

$$f^{+}(\theta,\bar{\theta})=\sum_{k,f,t}\frac{|\bar{Y}_{f,t}^{k}|^{2}-2W_{fk}H_{tk}\,\mathrm{Re}\!\bigl[\bar{Y}_{f,t}^{k*}e^{j\phi_{f,t}^{k}}\bigr]+W_{fk}^{2}H_{tk}^{2}}{\beta_{f,t}^{k}}+\sum_{k,t}\lambda_{tk}\!\left(\frac{H_{tk}^{2}}{|\bar{H}_{tk}|}+\bar{H}_{tk}\right)-\sum_{k,t}\log\lambda_{tk},\tag{3}$$

where $\bar{\theta}=\{\bar{Y}_{f,t}^{k},\bar{H}_{tk}\}$. It can be shown that $f^{+}(\theta,\bar{\theta})\ge f(\theta)$ and that $f^{+}(\theta,\bar{\theta})$ is minimized with respect to $\bar{\theta}$ when $\bar{Y}_{f,t}^{k}=W_{fk}H_{tk}e^{j\phi_{f,t}^{k}}+\beta_{f,t}^{k}(Y_{f,t}-X_{f,t})$ and $\bar{H}_{tk}=H_{tk}$. The closed-form solution is given by

$$W_{fk}=\frac{\sum_{t}\frac{H_{tk}}{\beta_{f,t}^{k}}\,\mathrm{Re}\!\bigl[\bar{Y}_{f,t}^{k*}e^{j\phi_{f,t}^{k}}\bigr]}{\sum_{t}\frac{H_{tk}^{2}}{\beta_{f,t}^{k}}},\qquad H_{tk}=\frac{\sum_{f}\frac{W_{fk}}{\beta_{f,t}^{k}}\,\mathrm{Re}\!\bigl[\bar{Y}_{f,t}^{k*}e^{j\phi_{f,t}^{k}}\bigr]}{\sum_{f}\frac{W_{fk}^{2}}{\beta_{f,t}^{k}}+\frac{\lambda_{tk}}{|\bar{H}_{tk}|}},\qquad e^{j\phi_{f,t}^{k}}=\frac{\bar{Y}_{f,t}^{k}}{|\bar{Y}_{f,t}^{k}|}.\tag{4}$$
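As a consistency check on the auxiliary function as reconstructed above, setting $\partial f^{+}/\partial H_{tk}=0$ recovers the $H_{tk}$ update in Eq. (4):

$$\sum_{f}\frac{-2W_{fk}\,\mathrm{Re}\bigl[\bar{Y}_{f,t}^{k*}e^{j\phi_{f,t}^{k}}\bigr]+2W_{fk}^{2}H_{tk}}{\beta_{f,t}^{k}}+\frac{2\lambda_{tk}H_{tk}}{|\bar{H}_{tk}|}=0\;\Longrightarrow\;H_{tk}=\frac{\sum_{f}\frac{W_{fk}}{\beta_{f,t}^{k}}\,\mathrm{Re}\bigl[\bar{Y}_{f,t}^{k*}e^{j\phi_{f,t}^{k}}\bigr]}{\sum_{f}\frac{W_{fk}^{2}}{\beta_{f,t}^{k}}+\frac{\lambda_{tk}}{|\bar{H}_{tk}|}}.$$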

The update formulas are set to $\beta_{f,t}^{k}=W_{fk}H_{tk}/\sum_{n}W_{fn}H_{tn}$ and $H_{tk}\leftarrow H_{tk}/\sum_{n}H_{tn}$ for projection onto the constraint space. The estimation of $\lambda_{tk}$ is more involved. We define the following terms: $\bar{y}=\mathrm{vec}(\{Y_{f,t}\})$, $\bar{h}=\mathrm{vec}(\{H_{tk}\})$, $\bar{\lambda}=\mathrm{vec}(\{\lambda_{tk}\})$, $\bar{W}=[\,I\otimes W_{11}\;\;I\otimes W_{12}\;\cdots\;I\otimes W_{FK}\,]$, $e^{j\bar{\bar{\phi}}_{t}}=[\,e^{j\bar{\phi}_{:,t}^{1}}\;\;e^{j\bar{\phi}_{:,t}^{2}}\;\cdots\;e^{j\bar{\phi}_{:,t}^{K}}\,]$, and $\bar{A}=\mathrm{diag}[\{\bar{W}\odot e^{j\bar{\bar{\phi}}_{t}}\}]$, where "$\otimes$" and "$\odot$" denote the Kronecker and Hadamard products, respectively. The maximum-likelihood estimate of $\lambda_{tk}$ is determined using the variational approach:

$$\bar{\lambda}^{\mathrm{ML}}=\arg\max_{\bar{\lambda}}\;\ln p(\bar{y}|\bar{\lambda},\bar{A},\sigma^{2})=\arg\max_{\bar{\lambda}}\int Q(\bar{h})\,\ln p(\bar{h}|\bar{\lambda})\,d\bar{h},\tag{5}$$

where $Q(\bar{h})=Z_{h}^{-1}\exp[-F(\bar{h})]$ is the posterior distribution of $\bar{h}$ in Gibbs energy representation, $Z_{h}=\int\exp[-F(\bar{h})]\,d\bar{h}$, and $F(\bar{h},\bar{\lambda})=(1/\sigma^{2})\|\bar{y}-\bar{A}\bar{h}\|_{F}^{2}+\bar{\lambda}^{T}\bar{h}-(\log\bar{\lambda})^{T}\bar{1}$. The sparsity of $\bar{h}$ naturally partitions its elements into two distinct subsets: $\bar{h}_{P}$, consisting of components $p\in P$ such that $h_{p}=0$, and $\bar{h}_{M}$, consisting of components $m\in M$ such that $h_{m}>0$. Likewise, this partition leads to $\bar{A}=[\bar{A}_{P}\;\bar{A}_{M}]$ and $\bar{\lambda}=[\bar{\lambda}_{P}\;\bar{\lambda}_{M}]$. Equation (5) is solved using variational calculus, which leads to the following closed-form expression for adapting the sparsity parameter:

$$\lambda_{g}=\begin{cases}\left[\displaystyle\int h_{g}\,Q(\bar{h}_{M})\,d\bar{h}_{M}\right]^{-1}=\dfrac{1}{h_{g}^{\mathrm{MAP}}} & \text{if }g\in M,\\[3mm]\left[\displaystyle\int h_{g}\,Q(\bar{h}_{P})\,d\bar{h}_{P}\right]^{-1}=\dfrac{1}{u_{g}} & \text{if }g\in P,\end{cases}\qquad u_{P}\leftarrow u_{P}\odot\frac{-\hat{b}_{P}+\sqrt{\hat{b}_{P}^{2}+4(\hat{C}\bar{u})_{P}\odot u_{P}}}{2(\hat{C}\bar{u})_{P}},\tag{6}$$

where $h_{g}^{\mathrm{MAP}}$ is given by the closed-form solution in Eq. (4), $\hat{b}_{P}=\bigl(\bar{C}\bar{h}^{\mathrm{MAP}}-(1/\sigma^{2})\bar{W}^{T}\bar{y}+\bar{\lambda}\bigr)_{P}$, $\hat{C}=\bar{C}_{P}+\mathrm{diag}(\bar{C}_{P})$, $\bar{C}=(1/\sigma^{2})\bar{A}^{T}\bar{A}$, and $\bar{C}_{P}=(1/\sigma^{2})\bar{A}_{P}^{T}\bar{A}_{P}$. Similarly, the inference for $\sigma^{2}$ can be computed as

$$\sigma^{2}=\frac{1}{F\times T}\int Q(\bar{h})\,\bigl\|\bar{y}-\bar{A}\hat{\bar{h}}\bigr\|^{2}\,d\bar{h},\qquad\text{where}\quad\hat{h}_{g}=\begin{cases}h_{g}^{\mathrm{MAP}} & \text{if }g\in M,\\u_{g} & \text{if }g\in P.\end{cases}\tag{7}$$
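In code form, the sparsity adaptation of Eq. (6) might look like the following numpy sketch; the variable names (h_map, u_P, b_hat_P, C_hat) and the restriction of all quantities to the zero set P are our assumptions, not the authors' implementation.

```python
import numpy as np

def adapt_sparsity(h_map, u_P, b_hat_P, C_hat, eps=1e-12):
    """Hypothetical rendering of Eq. (6): lambda_g = 1/h_g^MAP on the active set M,
    1/u_g on the zero set P, with a multiplicative fixed-point update for u_P."""
    Cu = C_hat @ u_P                               # (C_hat u)_P, assuming C_hat is the P block
    u_P = u_P * (-b_hat_P + np.sqrt(b_hat_P**2 + 4.0 * Cu * u_P)) / (2.0 * Cu + eps)
    lam_M = 1.0 / (h_map + eps)                    # branch for g in M
    lam_P = 1.0 / (u_P + eps)                      # branch for g in P
    return lam_M, lam_P, u_P
```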

The proposed method can be summarized as follows: Step (1) compute $Y_{f,t}=\mathrm{STFT}(y(t))$; Step (2) initialize $W_{fk}$, $H_{tk}$, and $\phi_{f,t}^{k}$ with non-negative random values; Step (3) update $\beta_{f,t}^{k}=W_{fk}H_{tk}/\sum_{n}W_{fn}H_{tn}$, fixing the phase at $e^{j\phi_{f,t}^{k}}=Y_{f,t}/|Y_{f,t}|$, and update $u_{P}$ using Eq. (6); Step (4) calculate $\lambda_{g}$ and $\sigma^{2}$ using Eqs. (6) and (7); Step (5) update $\bar{\theta}=\{\bar{Y},\bar{H}\}$ using $\bar{Y}_{f,t}^{k}=W_{fk}H_{tk}e^{j\phi_{f,t}^{k}}+\beta_{f,t}^{k}(Y_{f,t}-X_{f,t})$ and $\bar{H}_{tk}=H_{tk}$, then update $\theta=\{W_{fk},H_{tk},\phi_{f,t}^{k}\}$ using Eq. (4), $\beta_{f,t}^{k}=W_{fk}H_{tk}/\sum_{n}W_{fn}H_{tn}$, and $H_{tk}\leftarrow H_{tk}/\sum_{n}H_{tn}$; Step (6) repeat Steps (3)–(5) until convergence is reached; Step (7) obtain an estimate of each source by multiplying the respective rows of the spectral components $W_{fk}$ with the corresponding columns of the temporal code $H_{tk}$ and the time-varying phase spectrum $e^{j\phi_{f,t}^{k}}$; Step (8) convert the time-frequency representation back into the time domain to obtain the separated sources. A schematic implementation is sketched below.
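The following Python/numpy sketch renders Steps (1)–(7). It follows the updates of Eq. (4) as reconstructed above but makes two simplifying assumptions: the sparsity adaptation uses only the active-set branch of Eq. (6), and the projection onto the constraint space is implemented as a W-column normalization with a compensating rescaling of H.

```python
import numpy as np

def l1_scnmf(Y, K, n_iter=200, eps=1e-12):
    """Schematic L1-SCNMF iteration for a complex spectrogram Y of shape (F, T)."""
    F, T = Y.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, K)) + eps                        # Step 2: non-negative random init
    H = rng.random((K, T)) + eps
    E = np.repeat((Y / (np.abs(Y) + eps))[None], K, axis=0)  # e^{j phi_ftk}, seeded with the mixture phase
    lam = np.ones((K, T))                               # sparsity parameters lambda_tk
    for _ in range(n_iter):                             # Steps 3-6
        WH = np.einsum('fk,kt->kft', W, H)
        beta = WH / (WH.sum(axis=0, keepdims=True) + eps)    # beta_ftk = W_fk H_tk / sum_n W_fn H_tn
        X = (WH * E).sum(axis=0)                        # current model X_ft
        Ybar = WH * E + beta * (Y - X)[None]            # auxiliary variable update (Step 5)
        R = np.real(np.conj(Ybar) * E)                  # Re[Ybar^* e^{j phi}]
        W = (np.einsum('kt,kft->fk', H, R / (beta + eps))
             / (np.einsum('kt,kft->fk', H**2, 1.0 / (beta + eps)) + eps))
        H = (np.einsum('fk,kft->kt', W, R / (beta + eps))
             / (np.einsum('fk,kft->kt', W**2, 1.0 / (beta + eps))
                + lam / (np.abs(H) + eps) + eps))       # Eq. (4) with the L1 shrinkage term
        E = Ybar / (np.abs(Ybar) + eps)                 # phase update, Eq. (4)
        scale = W.sum(axis=0) + eps                     # project onto sum_f W_fk = 1 (our choice:
        W, H = W / scale, H * scale[:, None]            # rescale H so the product W H is unchanged)
        lam = 1.0 / (np.abs(H) + eps)                   # Eq. (6), active-set branch only
    S = np.einsum('fk,kt->kft', W, H) * E               # Step 7: per-component complex spectra
    return W, H, E, S
```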

The proposed L1-SCNMF method is tested by separating acoustic sources. The acoustic signals are 4 s in duration and the sampling frequency is 16 kHz. Thirty speech sources (15 male, 15 female) are selected from the TIMIT database, and 45 music sources (15 jazz, 15 piano, 15 drum) are selected from the RWC database. From these, three types of mixture are obtained: (1) music mixed with music, (2) speech mixed with music, and (3) speech mixed with speech; there are 225 different realizations for each mixture type. The number of CMF components is K = 20 for speech signals and K = 10 for music signals. The time-frequency representation is computed by normalizing the time-domain signal to unit power and computing the STFT using a 1024-point Hanning window with 50% overlap. All experiments are conducted on a PC with an Intel Core i5-650 central processing unit at 3.2 GHz and 4 GB of random access memory. Separation performance is evaluated in terms of the Signal-to-Distortion Ratio (SDR), a global measure accounting for both interference and artifacts, and the Source-to-Interference Ratio (SIR).
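The preprocessing just described can be reproduced with a few lines of scipy; the stand-in waveform y is an assumption, while the window and overlap settings follow the text.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000                                  # 16 kHz sampling rate
y = np.random.randn(4 * fs)                 # stand-in for a 4 s single-channel mixture
y = y / np.sqrt(np.mean(y**2))              # normalize to unit power

# 1024-point Hann window with 50% overlap, as in the experiments
f, t, Y = stft(y, fs=fs, window='hann', nperseg=1024, noverlap=512)

# After separation, each estimated complex spectrogram S_k is inverted the same way:
# _, s_hat = istft(S_k, fs=fs, window='hann', nperseg=1024, noverlap=512)
```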

In this evaluation, we compare the proposed method with a similar class of matrix factorization and blind methods: gradient-based CMF,16 SNMF with second-order optimization,17 NMF with Itakura-Saito divergence (NMF-ISD),18 and single-channel independent component analysis (SCICA).19 Figure 1 shows an example of separating a single-channel mixture of piano and drum using the proposed method and the current methods listed above. Panels (a) and (b) show the original piano and drum sources, respectively. Panels (c)–(l) show the estimated sources using CMF, SNMF, NMF-ISD, SCICA, and the proposed method. Panels (c)–(j) show that the other SCBSS methods did not fully separate the music mixture: many spectral and temporal components are missing from the recovered sources, as highlighted by the red boxes in the panels. Panels (k) and (l) show the separation results of the proposed method, in which the sparsity regularization for each element code $\lambda_{tk}$ and the CMF parameters $\{W_{fk},H_{tk},\phi_{f,t}^{k}\}$ are jointly optimized. The other SCBSS methods fail to take into account the specific sparsity associated with each code, which results in ambiguous estimation of each source spectrum and thereby discards temporal information. When the temporal structure and the pitch changes are not properly estimated in the model, the mixing ambiguity remains in each separated source.

Fig. 1. (Color online) Separated signals of the music mixture (piano and drum) in the time domain. (a) and (b) Original piano and drum music. (c) and (d) Estimated sources using CMF. (e) and (f) Estimated sources using SNMF. (g) and (h) Estimated sources using NMF-ISD. (i) and (j) Estimated sources using SCICA. (k) and (l) Estimated sources using the proposed method (L1-SCNMF) (initial K = 10).


Table 1 further gives the SDR and SIR comparison between the proposed method and the four methods above. The improvement of our method over CMF, SNMF, NMF-ISD, and SCICA can be summarized as follows: (1) for music and music, the average improvement per source is 7.6 dB in SDR and 8.4 dB in SIR; (2) for music and speech, the average improvement per source is 5.0 dB in SDR and 4.2 dB in SIR; (3) for male speech and female speech, the average improvement per source is 2.8 dB in SDR and 3.5 dB in SIR. The proposed method exhibits average SDR improvements of approximately 62.9%, 130.2%, 142.8%, and 454.4% over CMF, SNMF, NMF-ISD, and SCICA, respectively. Analyzing the separation results and SDR performance, the proposed method delivers the best separation for both recovered sources. SCICA performs very poorly when the basis functions of the sources overlap significantly, e.g., a mixture of two speech sources whose basis functions are very similar. The parts decomposed by the SNMF and NMF-ISD methods are not adequate to capture the temporal dependency of the frequency patterns within the audio signal. Additionally, if the data do not span the positive octant adequately, a rotation of W with a compensating transformation of H can give the same reconstruction. The conventional CMF method exploits the phase information of the sources, which is inherently ignored by SNMF and NMF-ISD, and this has led to an improved performance of about 2 dB in SDR. However, conventional CMF has no sparsity parameter tuning, so it cannot avoid under- or over-fitting the model; in addition, it suffers from poor convergence.20 Both issues are resolved by the proposed method through closed-form parameter estimation and by automatically adapting the sparsity parameter for each individual temporal code using the L1 norm, leading to a more pronounced performance improvement.

Table 1. Comparison of average SDR and SIR performance (in dB) on three types of mixtures.

| Mixtures | Methods | SDR | SIR |
|---|---|---|---|
| Music and music | Proposed method (L1-SCNMF) | 12.72 | 14.87 |
| | CMF (Ref. 16) | 6.11 | 7.53 |
| | SNMF (Ref. 17) | 5.23 | 7.14 |
| | NMF-ISD (Ref. 18) | 5.17 | 6.31 |
| | SCICA (Ref. 19) | 3.85 | 4.86 |
| Music and speech | Proposed method (L1-SCNMF) | 8.92 | 9.84 |
| | CMF (Ref. 16) | 6.06 | 6.64 |
| | SNMF (Ref. 17) | 4.52 | 6.11 |
| | NMF-ISD (Ref. 18) | 3.55 | 6.62 |
| | SCICA (Ref. 19) | 1.43 | 3.12 |
| Male speech and female speech | Proposed method (L1-SCNMF) | 4.53 | 6.78 |
| | CMF (Ref. 16) | 3.19 | 5.01 |
| | SNMF (Ref. 17) | 1.62 | 3.23 |
| | NMF-ISD (Ref. 18) | 2.06 | 4.27 |
| | SCICA (Ref. 19) | −0.56 | 1.25 |
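As a practical note, SDR and SIR values of the kind reported in Table 1 can be computed with the BSS Eval metrics, e.g., via the mir_eval toolbox (our suggestion; the paper does not specify which implementation was used):

```python
import numpy as np
import mir_eval  # implements the BSS Eval metrics (SDR, SIR, SAR)

rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(64000), rng.standard_normal(64000)  # placeholder true sources (4 s at 16 kHz)
s1_hat, s2_hat = s1 + 0.1 * s2, s2 + 0.1 * s1                    # placeholder separated estimates

reference = np.stack([s1, s2])        # shape (n_sources, n_samples)
estimated = np.stack([s1_hat, s2_hat])
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimated)
print(f"SDR: {sdr} dB, SIR: {sir} dB")
```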

The impetus behind the proposed work is that NMF cannot estimate the phase spectra of the underlying constituent signals, and the sparseness achieved by conventional CMF is not efficient enough. A novel L1-sparse CMF has been proposed to solve this problem. The proposed method decomposes an information-bearing complex matrix into factor matrices that represent a sparse dictionary. The sparsity parameter is learned from the data and automatically adapted for each individual temporal code to obtain the desired sparse decomposition. The proposed method can extract the recurrent patterns of magnitude spectra that underlie observed complex spectra together with the phase estimates of the constituent signals, thus enabling the features of the components to be extracted more efficiently. Experimental evaluations have shown very promising separation results using the proposed method.

1. J. Nix and V. Hohmann, "Sound source localization in real sound fields based on empirical statistics of interaural parameters," J. Acoust. Soc. Am. 119(1), 463–479 (2006).
2. Y. Guo, G. R. Naik, and H. Nguyen, "Single channel blind source separation based local mean decomposition for biomedical applications," in Proceedings of the IEEE 35th Annual International Conference on Engineering in Medicine and Biology Society (EMBC), July 2013, pp. 6812–6815.
3. E. W. Healy, S. E. Yoho, Y. Wang, and D. Wang, "An algorithm to improve speech recognition in noise for hearing-impaired listeners," J. Acoust. Soc. Am. 134, 3029–3038 (2013).
4. D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proceedings of Neural Information Processing Systems, pp. 556–562 (2000).
5. B. Gao, W. L. Woo, and S. S. Dlay, "Adaptive sparsity nonnegative matrix factorization for single channel source separation," IEEE J. Sel. Topics Signal Process. 5(5), 989–1001 (2011).
6. B. Gao, W. L. Woo, and S. S. Dlay, "Single channel blind source separation using EMD-subband variable regularized sparse features," IEEE Trans. Audio, Speech Lang. Process. 19(4), 961–976 (2011).
7. B. Gao, W. L. Woo, and L. C. Khor, "Cochleagram-based audio pattern separation using two-dimensional non-negative matrix factorization with automatic sparsity adaptation," J. Acoust. Soc. Am. 135(3), 1171–1185 (2014).
8. N. Tengtrairat, B. Gao, W. L. Woo, and S. S. Dlay, "Single-channel blind separation using pseudo-stereo mixture and complex 2-D histogram," IEEE Trans. Neural Networks Learn. Sys. 24(11), 1722–1735 (2013).
9. B. Gao, W. L. Woo, and S. S. Dlay, "Unsupervised single channel separation of non-stationary signals using gammatone filterbank and Itakura-Saito nonnegative matrix two-dimensional factorizations," IEEE Trans. Circuits Sys. I 60(3), 662–675 (2013).
10. B. Gao, W. L. Woo, and S. S. Dlay, "Variational regularized two-dimensional nonnegative matrix factorization," IEEE Trans. Neural Networks Learn. Sys. 23(5), 703–716 (2012).
11. B. Gao, W. L. Woo, and B. W.-K. Ling, "Machine learning source separation using Maximum A Posteriori nonnegative matrix factorization," IEEE Trans. Cybernetics 44(7), 1169–1179 (2014).
12. E. M. Grais, M. U. Sen, and H. Erdogan, "Deep neural networks for single channel source separation," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3734–3738 (2014).
13. N. Tengtrairat and W. L. Woo, "Single-channel separation using underdetermined blind method and least absolute deviation," Neurocomputing 147, 412–425 (2015).
14. N. Tengtrairat and W. L. Woo, "Extension of DUET to single-channel mixing model and separability analysis," Signal Process. 96, 261–265 (2014).
15. Q. Wang, W. L. Woo, and S. S. Dlay, "Informed single channel speech separation using HMM-GMM user-generated exemplar source," IEEE Trans. Audio, Speech Lang. Process. 22(12), 2087–2100 (2014).
16. B. J. King and L. Atlas, "Single-channel source separation using complex matrix factorization," IEEE Trans. Audio, Speech, Lang. Process. 19(8), 2591–2597 (2011).
17. R. Zdunek and A. Cichocki, "Non-negative matrix factorization with constrained second-order optimization," Signal Process. 87(8), 1904–1916 (2007).
18. C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis," Neural Comput. 21, 793–830 (2009).
19. M. E. Davies and C. J. James, "Source separation using single channel ICA," Signal Process. 87(8), 1819–1832 (2007).
20. The algorithm is first initialized using NMF (Ref. 4). The CMF is set to run for 2000 iterations with a normalized-power step size, and a stopping criterion based on the rate of gradient change is applied to determine steady-state convergence.