An innovative method for single-channel blind source separation is proposed: a complex-valued non-negative matrix factorization with probabilistically optimal *L*_{1}-norm sparsity. The method preserves the phase information of the source signals and enforces the inherent structure of the temporal codes to be optimally sparse, resulting in a more meaningful parts-based factorization. An efficient algorithm with closed-form expressions for computing the parameters of the model, including the sparsity, has been developed. Real acoustic mixtures recorded from a single channel are used to verify the effectiveness of the proposed method.

## 1. Introduction

Single-channel blind source separation (SCBSS) has applications in a wide variety of fields, such as sound communication,^{1} biomedical signal processing,^{2} and audio recognition.^{3} The aim of SCBSS is to extract the individual source signals from a single mixture recording without using training information about the sources. Many SCBSS solutions based on various approaches have been proposed, but the dominant approach is non-negative matrix factorization (NMF)^{4–7} and its variants.^{8–15} In this paper, we introduce a new method based on complex non-negative matrix factorization (CMF).

CMF is a parts-based representation for complex-valued signals that extracts the recurrent patterns of magnitude spectra as well as phase estimates of the constituent signals. CMF shares with NMF the ability to generate non-negative matrices $W$ and $H$. However, it differs from NMF in that the input matrix is assumed to be complex-valued and the algorithm additionally estimates a third-order complex-valued tensor $Z_{f,t}^{k}=e^{j\varphi_{f,t}^{k}}$. The short-time Fourier transform (STFT) of a signal, $X_{f,t}\in\mathbb{C}$ in frequency bin $f$ and time frame $t$, is modeled as a sum of $K$ complex-valued components, $X_{f,t}=\sum_{k=1}^{K}|a_{f,t}^{k}|\,e^{j\varphi_{f,t}^{k}}$. Each $a_{f,t}^{k}$ has a magnitude spectrum that is constant up to a gain over time, $|a_{f,t}^{k}|=W_{fk}H_{tk}$ (with $W_{fk}\ge 0$ for all $f,k$ and $H_{tk}\ge 0$ for all $k,t$), and a time-varying phase spectrum $\arg(a_{f,t}^{k})=\varphi_{f,t}^{k}$. Thus the CMF model can be expressed as $X_{f,t}=\sum_{k}W_{fk}H_{tk}\,e^{j\varphi_{f,t}^{k}}$, where $W_{fk}$ corresponds to a recurring magnitude spectral pattern, $H_{tk}$ to a time-varying temporal code, and $\varphi_{f,t}^{k}$ to a time-varying phase spectrum.
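For concreteness, the synthesis side of this model can be sketched in a few lines of NumPy. This is an illustrative sketch with random parameters, not the authors' implementation:

```python
import numpy as np

# Sketch of the CMF synthesis model:
#   X[f, t] = sum_k W[f, k] * H[t, k] * exp(j * phi[f, t, k])
rng = np.random.default_rng(0)
F, T, K = 5, 4, 2                            # frequency bins, time frames, components

W = rng.random((F, K))                       # non-negative spectral patterns
H = rng.random((T, K))                       # non-negative temporal codes
phi = rng.uniform(-np.pi, np.pi, (F, T, K))  # time-varying phase spectra

# Synthesize the complex spectrogram by summing the K components.
X = np.einsum('fk,tk,ftk->ft', W, H, np.exp(1j * phi))

assert X.shape == (F, T) and np.iscomplexobj(X)
```

Note that, unlike plain NMF, the model output is complex-valued: each component carries its own phase at every time-frequency point.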

## 2. Proposed *L*_{1}-sparse CMF with closed-form expression

Given an observed complex spectrum, $Y_{f,t}\in\mathbb{C}^{F\times T}$, the problem at hand is to estimate the optimal parameters $\theta=\{W_{fk},H_{tk},\varphi_{f,t}^{k}\}$. We derive a new factorization method termed the *L*_{1}-sparse complex non-negative matrix factorization (*L*_{1}-SCNMF). The generative model is given by

$$Y_{f,t}=\sum_{k=1}^{K}W_{fk}H_{tk}\,e^{j\varphi_{f,t}^{k}}+\epsilon_{f,t},$$

and the reconstruction error $\epsilon_{f,t}\sim\mathcal{N}_{\mathbb{C}}(0,\sigma^{2})$ is assumed to be independently and identically distributed complex Gaussian white noise with zero mean and variance $\sigma^{2}$, which denotes the model error. The likelihood of $\theta=\{W,H,\varphi\}$ is thus written as $P(Y|\theta)=\prod_{f,t}(1/\pi\sigma^{2})\exp(-|Y_{f,t}-X_{f,t}|^{2}/\sigma^{2})$. We assume that the prior distributions for $W$, $H$, and $\varphi$ are statistically independent, which yields $P(\theta|\lambda)=P(W)P(H|\lambda)P(\varphi)$. The prior $P(H|\lambda)$ encodes the sparsity cost, for which a natural choice is the generalized Gaussian prior, $P(H|\lambda)=\prod_{k,t}[p\lambda_{tk}/2\Gamma(1/p)]\exp(-(\lambda_{tk})^{p}|H_{tk}|^{p})$, where $\lambda_{tk}$ is a scale parameter and $p$ the shape parameter of the distribution. When $p=1$, $P(H|\lambda)$ promotes *L*_{1}-norm sparsity, which has been shown to be probabilistically equivalent to the *L*_{0} pseudo-norm, the theoretical optimum sparsity. The *maximum a posteriori* estimation leads to the following optimization problem with respect to $\theta$:
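From the likelihood and the prior above (taking $p=1$ and noting that $H_{tk}\ge 0$, so $|H_{tk}|=H_{tk}$), the negative log-posterior that this optimization minimizes can be written, up to terms independent of $\theta$, as

```latex
f(\theta)
  \;=\; \frac{1}{\sigma^{2}}\sum_{f,t}
        \Bigl|\,Y_{f,t}-\sum_{k}W_{fk}H_{tk}\,e^{j\varphi_{f,t}^{k}}\Bigr|^{2}
  \;+\; \sum_{k,t}\lambda_{tk}\,H_{tk}
```

The first term is the complex-Gaussian data fit and the second is the per-code *L*_{1} penalty; this form is consistent with the energy $F(\bar h,\bar\lambda)$ used in the variational step below.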

Optimizing Eq. (2) using a gradient method leads to slow convergence; instead, we develop an iterative algorithm by minimizing the auxiliary function below:

where $\bar\theta=\{\bar Y_{f,t}^{k},\bar H_{tk}\}$. It can be shown that $f^{+}(\theta,\bar\theta)\ge f(\theta)$, and that $f^{+}(\theta,\bar\theta)$ is minimized with respect to $\bar\theta$ when $\bar Y_{f,t}^{k}=W_{fk}H_{tk}\,e^{j\varphi_{f,t}^{k}}+\beta_{f,t}^{k}(Y_{f,t}-X_{f,t})$ and $\bar H_{tk}=H_{tk}$. The closed-form solution is given by

The update formulas are set to $\beta_{f,t}^{k}=W_{fk}H_{tk}/\sum_{k'}W_{fk'}H_{tk'}$ and $H_{tk}\leftarrow H_{tk}/\sum_{n}H_{nk}$ for projection onto the constraint space. The estimation of $\lambda_{tk}$ is more involved. We define the following terms: $\bar y=\mathrm{vec}(\{Y_{f,t}\})$, $\bar h=\mathrm{vec}(\{H_{tk}\})$, $\bar\lambda=\mathrm{vec}(\{\lambda_{tk}\})$, $\bar W=[\,I\otimes W_{11}\,\vdots\, I\otimes W_{12}\,\vdots\,\cdots\,\vdots\, I\otimes W_{FK}\,]$, $e^{j\bar\varphi_{t}}=[\,e^{j\bar\varphi_{:,t}^{1}}\,\vdots\, e^{j\bar\varphi_{:,t}^{2}}\,\vdots\,\cdots\,\vdots\, e^{j\bar\varphi_{:,t}^{K}}\,]$, and $\bar A=\mathrm{diag}[\{\bar W\circ e^{j\bar\varphi_{t}}\}]$, where "$\otimes$" and "$\circ$" denote the Kronecker and Hadamard products, respectively. The maximum-likelihood estimation of $\lambda_{tk}$ is determined using the variational approach:
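A minimal NumPy sketch of the projection step follows. The normalization indices in the update rules are ambiguous in the text, so this sketch assumes $\beta$ is normalized over the $K$ components and each temporal code is normalized over time; both index choices are our assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
F, T, K = 5, 4, 3
W = rng.random((F, K))
H = rng.random((T, K))

# beta[f,t,k] = W[f,k]H[t,k] / sum_k' W[f,k']H[t,k']
# (assumed normalization over components, so beta sums to one over k)
num = W[:, None, :] * H[None, :, :]            # shape (F, T, K)
beta = num / num.sum(axis=2, keepdims=True)

# Project H onto the constraint space: each temporal code is
# normalized over the time frames (assumed reading of the index).
H = H / H.sum(axis=0, keepdims=True)

assert np.allclose(beta.sum(axis=2), 1.0)
assert np.allclose(H.sum(axis=0), 1.0)
```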

where $Q(\bar h)=Z_{h}^{-1}\exp[-F(\bar h)]$ is the posterior distribution of $\bar h$ in its Gibbs energy representation, $Z_{h}=\int\exp[-F(\bar h)]\,d\bar h$, and $F(\bar h,\bar\lambda)=(1/\sigma^{2})\lVert\bar y-\bar A\bar h\rVert_{F}^{2}+\bar\lambda^{T}\bar h-(\log\bar\lambda)^{T}\bar 1$. The sparsity of $\bar h$ naturally partitions its elements into two distinct subsets, $\bar h_{P}$ and $\bar h_{M}$, consisting of the components $h_{p}=0$ for all $p\in P$ and the components $h_{m}>0$ for all $m\in M$. This partition likewise leads to $\bar A=[\bar A_{P}\,\vdots\,\bar A_{M}]$ and $\bar\lambda=[\bar\lambda_{P}\,\vdots\,\bar\lambda_{M}]$. Equation (5) is solved using variational calculus, which leads to the following closed-form expression for adapting the sparsity parameter:

where $\bar h^{\mathrm{MAP}}$ is given by the closed-form solution in Eq. (4), $\hat b_{P}=(\bar C\bar h^{\mathrm{MAP}}-[1/\sigma^{2}]\bar W^{T}\bar y+\bar\lambda)_{P}$, $\hat C=\bar C_{P}+\mathrm{diag}(\bar C_{P})$, $\bar C=(1/\sigma^{2})\bar A^{T}\bar A$, and $\bar C_{P}=(1/\sigma^{2})\bar A_{P}^{T}\bar A_{P}$. Similarly, the inference for $\sigma^{2}$ can be computed as

The proposed method can be summarized as follows:

1. Compute $Y_{f,t}=\mathrm{STFT}(y(t))$.
2. Initialize $W_{fk}$, $H_{tk}$, and $\varphi_{f,t}^{k}$ with non-negative random values.
3. Update $\beta_{f,t}^{k}=W_{fk}H_{tk}/\sum_{k'}W_{fk'}H_{tk'}$, fixing the phase at $e^{j\varphi_{f,t}^{k}}=Y_{f,t}/|Y_{f,t}|$.
4. Calculate $\lambda_{tk}$ and $\sigma^{2}$ using Eqs. (6) and (7).
5. Update $\bar\theta=\{\bar Y,\bar H\}$ using $\bar Y_{f,t}^{k}=W_{fk}H_{tk}\,e^{j\varphi_{f,t}^{k}}+\beta_{f,t}^{k}(Y_{f,t}-X_{f,t})$ and $\bar H_{tk}=H_{tk}$; then update $\theta=\{W_{fk},H_{tk},\varphi_{f,t}^{k}\}$ using Eq. (4), with $\beta_{f,t}^{k}=W_{fk}H_{tk}/\sum_{k'}W_{fk'}H_{tk'}$ and $H_{tk}\leftarrow H_{tk}/\sum_{n}H_{nk}$.
6. Repeat Steps 3–5 until convergence is reached.
7. Obtain an estimate of each source by multiplying the respective rows of the spectral patterns $W_{fk}$ with the corresponding columns of the temporal code $H_{tk}$ and the time-varying phase spectrum $e^{j\varphi_{f,t}^{k}}$.
8. Convert the time-frequency representation back to the time domain to obtain the separated sources.
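The reconstruction in Steps 7 and 8 can be sketched with SciPy as below. The component-to-source assignment `parts` is hypothetical; in practice it would come from clustering the components or from prior knowledge of which components belong to which source:

```python
import numpy as np
from scipy.signal import istft

rng = np.random.default_rng(2)
F, T, K = 513, 20, 4                 # 513 bins matches a 1024-point one-sided STFT
W = rng.random((F, K))               # stand-ins for the converged parameters
H = rng.random((T, K))
phi = rng.uniform(-np.pi, np.pi, (F, T, K))

parts = {0: [0, 1], 1: [2, 3]}       # hypothetical component-to-source assignment
sources = {}
for s, ks in parts.items():
    # Sum the complex spectrograms of the components assigned to source s.
    S = np.einsum('fk,tk,ftk->ft',
                  W[:, ks], H[:, ks], np.exp(1j * phi[:, :, ks]))
    # Step 8: invert the time-frequency representation back to the time domain.
    _, sources[s] = istft(S, fs=16000, nperseg=1024, noverlap=512)

assert len(sources) == 2 and sources[0].ndim == 1
```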

## 3. Results and analysis

The proposed *L*_{1}-SCNMF method is tested by separating acoustic sources. The acoustic signals are 4 s in duration and the sampling frequency is 16 kHz. Thirty speech sources (15 male, 15 female) are selected from the TIMIT database and 45 music sources (15 jazz, 15 piano, 15 drum) are selected from the RWC database. From these, three types of mixture can be obtained: (1) music mixed with music, (2) speech mixed with music, and (3) speech mixed with speech. Thus, there are 225 different realizations for each mixture type. The number of CMF components is $K=20$ for speech signals and $K=10$ for music signals. The time-frequency representation is computed by normalizing the time-domain signal to unit power and computing the STFT using a 1024-point Hann window with 50% overlap. All experiments are conducted on a PC with an Intel^{®} Core^{™} i5-650 central processing unit at 3.2 GHz and 4 GB of random access memory. Separation performance is evaluated in terms of the Signal-to-Distortion Ratio (SDR), a global measure that subsumes, among other factors, the Source-to-Interference Ratio (SIR).
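The preprocessing described above can be reproduced with SciPy; this is a minimal sketch using a random stand-in for a recording:

```python
import numpy as np
from scipy.signal import stft

# Normalize a 4 s, 16 kHz signal to unit power, then take a
# 1024-point Hann STFT with 50% overlap, as in the experiments.
fs = 16000
y = np.random.default_rng(3).standard_normal(4 * fs)   # stand-in for a recording

y = y / np.sqrt(np.mean(y ** 2))                       # unit-power normalization
f, t, Y = stft(y, fs=fs, window='hann', nperseg=1024, noverlap=512)

assert Y.shape[0] == 513          # 1024-point FFT -> 513 one-sided frequency bins
assert np.iscomplexobj(Y)         # CMF operates on the complex spectrogram
```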

In this evaluation, we compare the proposed method with methods of a similar matrix factorization and blind separation class, namely, gradient-based CMF,^{16} SNMF with second-order optimization,^{17} NMF with Itakura-Saito divergence (NMF-ISD),^{18} and single-channel independent component analysis (SCICA).^{19} Figure 1 shows an example of separating a single-channel mixture of piano and drum using the proposed method, compared with the other current methods. Panels (a) and (b) show the original piano and drum sources, respectively. Panels (c)–(l) show the sources estimated by CMF, SNMF, NMF-ISD, SCICA, and the proposed method. Analyzing panels (c)–(j), it is visually clear that source separation using latent-component methods requires a joint optimization of the sparsity regularization for each element code $\lambda_{tk}$ and the parameters of the CMF, i.e., $\{W_{fk},H_{tk},\varphi_{f,t}^{k}\}$. Panels (k) and (l) show the separation results of the proposed method, where this joint process is optimized, while panels (c)–(j) show that the other SCBSS methods did not fully separate the music mixture. Many spectral and temporal components are missing from the recovered sources; these are highlighted (red boxes) in the panels. The other SCBSS methods fail to take into account the specific sparsity associated with each code, which results in ambiguous estimation of each source spectrum and thereby discards temporal information. When the temporal structure and pitch changes are not properly estimated in the model, mixing ambiguity remains in each separated source.

Table 1 further gives the SDR and SIR comparison between the proposed method and the above four methods. The improvement of our method over CMF, SNMF, NMF-ISD, and SCICA can be summarized as follows: (1) for music and music, the average improvement per source is 7.6 dB in SDR and 8.4 dB in SIR; (2) for music and speech, the average improvement per source is 5.0 dB in SDR and 4.2 dB in SIR; (3) for male and female speech, the average improvement per source is 2.8 dB in SDR and 3.5 dB in SIR. The proposed method exhibits a much better average SDR: approximately 62.9%, 130.2%, 142.8%, and 454.4% improvement over CMF, SNMF, NMF-ISD, and SCICA, respectively. Analyzing the separation results and SDR performance, the proposed method yields the best separation performance for both recovered sources. The SCICA method performs very poorly when the basis functions have significant overlap with each other, e.g., for a mixture of two speech sources whose basis functions are very similar. The parts decomposed by the SNMF and NMF-ISD methods are not adequate to capture the temporal dependency of the frequency patterns within the audio signal. Additionally, if the data does not span the positive octant adequately, a rotation of $W$ with a compensating counter-rotation of $H$ can give the same results. The conventional CMF method exploits the phase information of the sources, which is inherently ignored by SNMF and NMF-ISD, and this leads to an improved performance of about 2 dB in SDR. However, conventional CMF has no sparsity parameter tuning and thus cannot avoid under- or over-fitting the model; in addition, it suffers from poor convergence.^{20} Both issues are resolved by the proposed method through closed-form parameter estimation and by automatically adapting the sparsity parameter for each individual temporal code using the *L*_{1}-norm, leading to a more pronounced performance improvement.
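As a rough illustration, the SDR figure used above can be computed in its simplest form, i.e., without the allowed-distortion filtering of the full BSS Eval definition:

```python
import numpy as np

def sdr_db(s, s_hat):
    # SDR = 10 log10( ||s||^2 / ||s - s_hat||^2 ), in decibels.
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))

rng = np.random.default_rng(4)
s = rng.standard_normal(1000)                  # reference source
s_hat = s + 0.1 * rng.standard_normal(1000)    # estimate with ~10% residual error

# An error of about 10% of the signal amplitude gives an SDR near 20 dB.
assert 18.0 < sdr_db(s, s_hat) < 22.0
```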

Table 1. SDR and SIR (in dB) for each mixture type and method.

| Mixtures | Methods | SDR | SIR |
|---|---|---|---|
| Music and music | Proposed method (*L*_{1}-SCNMF) | 12.72 | 14.87 |
| | CMF (Ref. 16) | 6.11 | 7.53 |
| | SNMF (Ref. 17) | 5.23 | 7.14 |
| | NMF-ISD (Ref. 18) | 5.17 | 6.31 |
| | SCICA (Ref. 19) | 3.85 | 4.86 |
| Music and speech | Proposed method (*L*_{1}-SCNMF) | 8.92 | 9.84 |
| | CMF (Ref. 16) | 6.06 | 6.64 |
| | SNMF (Ref. 17) | 4.52 | 6.11 |
| | NMF-ISD (Ref. 18) | 3.55 | 6.62 |
| | SCICA (Ref. 19) | 1.43 | 3.12 |
| Male speech and female speech | Proposed method (*L*_{1}-SCNMF) | 4.53 | 6.78 |
| | CMF (Ref. 16) | 3.19 | 5.01 |
| | SNMF (Ref. 17) | 1.62 | 3.23 |
| | NMF-ISD (Ref. 18) | 2.06 | 4.27 |
| | SCICA (Ref. 19) | −0.56 | 1.25 |


## 4. Conclusion

The impetus behind the proposed work is that NMF cannot estimate the phase spectra of the underlying constituent signals, and that the sparseness achieved by conventional CMF is not efficient enough. A novel *L*_{1}-sparse CMF has been proposed to solve this problem. The proposed method decomposes an information-bearing matrix into complex-valued factor matrices that represent a sparse dictionary. The sparsity parameter is learned from the data and automatically adapted for each individual temporal code to obtain the desired sparse decomposition. The proposed method can extract the recurrent patterns of magnitude spectra that underlie observed complex spectra together with phase estimates of the constituent signals, thus enabling the features of the components to be extracted more efficiently. Experimental evaluations have shown very promising separation results for the proposed method.

## References and links

The algorithm is first initialized using NMF (Ref. 4). The CMF is run for 2000 iterations with a normalized-power step size, and a stopping criterion based on the rate of gradient change is applied to determine steady-state convergence.