Relative impulse responses (ReIRs) have several applications in speech enhancement, noise suppression and source localization for multi-channel speech processing in reverberant environments. Estimating the ReIRs can be reduced to a system identification problem. A system identification method using an empirical Bayes framework is proposed and its application for spatial source subtraction in audio signal processing is evaluated. The proposed estimator allows for incorporating prior structure information of the system into the estimation procedure, leading to an improved performance especially in the presence of noise. The estimator utilizes the sparse Bayesian learning algorithm with appropriate priors to characterize both the early reflections and reverberant tails. The mean squared error of the proposed estimator is studied and an extensive experimental study with real-world recordings is conducted to show the efficacy of the proposed approach over other competing approaches.
I. INTRODUCTION
System identification has applications across a wide spectrum of engineering problems and has been analyzed in detail (Ljung, 1998; Pillonetto and De Nicolao, 2010). Classical approaches to system identification are based on a parametric model assumption (Chen et al., 2012; Pillonetto et al., 2011), where a set of required parameters is obtained by minimizing an appropriate cost function. Bayesian methods for system identification (Au, 2012; Beck, 2010; Bottegal et al., 2014; Carli et al., 2012) have gained significant interest in recent years. These methods offer the added advantage of including prior knowledge on the system in the estimation procedure through a few additional unknown hyperparameters. In this article, we offer a novel method to estimate one such system, which is linear and can be modeled as a finite impulse response (FIR) filter. We incorporate prior knowledge on the structure of this filter (associated with audio signal processing in reverberant environments) and estimate the channel through the sparse Bayesian learning (SBL) algorithm. The hyperparameters of the prior are estimated from the system measurements themselves using an evidence maximization approach [empirical Bayes (EB) method] (Aravkin et al., 2012; Carli et al., 2012). This eliminates the need for searching and setting optimal values in dynamic acoustic environments.
Relative impulse responses (ReIRs) or their frequency-domain counterparts, the relative transfer functions (RTFs) (Gannot et al., 2001), are important tools in several multichannel audio processing tasks such as speaker extraction, noise reduction, speech enhancement, source localization, etc. (Benesty, 2000; Gannot and Cohen, 2004; He and Yang, 2015; Laufer et al., 2013; MacDonald, 2008). ReIRs represent the impulse response between two microphones calculated when signals are received from a single source at both the microphones. We focus on a two-microphone setup in this article and aim to estimate the ReIR between these two microphones.
RTF information can be incorporated in beamforming algorithms deployed in a generalized sidelobe canceller (GSC) structure (Gannot and Cohen, 2004; Krueger et al., 2011) to produce a noise reference signal useful for adaptive interference cancellation and for improving speech enhancement performance. Reverberation has to be taken into account in the GSC to achieve satisfactory signal cancellation at the output of the blocking matrix. Gannot et al. proposed a variant called the transfer function-GSC (TF-GSC) (Gannot et al., 2001) that relies on estimated RTFs. The performance of the TF-GSC is tied to the quality of the RTF estimate. The RTF, however, is dynamic and changes with movements of the target or the microphones. Fast and accurate RTF estimates are required to prevent the target signal from leaking through the blocking matrix, which causes severe signal distortions at the output of the GSC.
ReIRs can be easily computed in a noiseless environment using a traditional least squares (LS) method as shown in (Krueger et al., 2011). The LS estimate becomes unstable in the presence of noise. There have been many recent attempts to estimate ReIRs accurately in a noisy environment (Koldovsky et al., 2015; Markovich et al., 2009; Schwab et al., 2006; Talmon et al., 2009). Many of these solutions require a long recording, more than 100–200 ms long, to obtain a good ReIR estimate. A method that exploits the non-stationarity of the target speech signal has been proposed in Gannot et al. (2001). This method assumes that the noise and the RTF are stationary, or at least much less dynamic, when compared to the target signal. Malek and Koldovsky (2014) propose a novel assumption that the ReIRs can be replaced by sparse filters, to regularize the LS solution. Sparsity constraints have been used extensively in multi-channel signal processing applications (Gemba et al., 2017; Gerstoft et al., 2015; He and Yang, 2015). ReIRs also exhibit a non-sparse decaying tail (Koldovsky et al., 2015) in reverberant environments. A novel approach of sparsely reconstructing time domain ReIRs from incomplete RTF measurements is proposed in Koldovsky et al. (2015), where estimation involves using high signal-to-noise ratio (SNR) frequency bins. Existing frequency domain (FD) approaches result in a biased estimate due to the inaccuracy of the power spectral density estimate obtained through a finite average (Cohen et al., 2003). This motivates us to consider a time-domain solution instead.
In this paper, we propose an empirical Bayes based approach: structured SBL (S-SBL). We consider a unified framework for handling both sparse early reflections and exponentially decaying reverberation tail via a suitable prior distribution and use an empirical Bayesian framework for robust parameter estimation. Our approach also models any ambient measurement noise and leads to a more robust estimator of the ReIR. Though Benichoux et al. (2014) consider incorporating both the sparse early reflections and exponentially decaying tail in an estimation framework, they require prior information of the reverberation time and the regularization parameters. These parameters are estimated through cross-validation, and thus not suited for real-time applications. Our EB framework estimates the decay rate and the variance of the ambient noise from the measurement itself using an evidence maximization procedure, eliminating the need for heuristically choosing these parameters. We validated this approach in our preliminary study (Giri et al., 2016). We analyze our proposed approach in detail and study its mean squared error (MSE) properties as well. We provide extensive results on SBL-based ReIR estimation by studying its blocking capability.
The remainder of this article is organized as follows. We begin by introducing the problem in Sec. II. Sections III A and III B present popular existing solutions to the problem in the time and FD, respectively. We then present a derivation for the ReIR structure in Sec. IV. This structure is then incorporated into the inference procedure in Sec. V, and an analysis on the MSE performance of the algorithm is provided in Sec. V E. Extensive experimental results over real world recordings are presented in Sec. VI for the blocking matrix construction task. Finally, Sec. VII discusses some future directions for this work and makes concluding remarks.
II. PROBLEM FORMULATION
The ReIR estimation problem is formulated below. Let $h_L$ and $h_R$ denote the room impulse responses (RIRs) between the target and the two microphones [the subscript indicating the left (L) or right (R) microphone, respectively]. $s$ denotes the target speech, while $e_L$ and $e_R$ denote the noise components in the microphone measurements $x_L$ and $x_R$,

$$x_L = h_L * s + e_L, \qquad x_R = h_R * s + e_R,$$

where $*$ denotes the convolution operator. The oracle RTF, denoted as $H$, is the Fourier transform of the ReIR $h$. Given the exact acoustic channels, the ReIR $h$ satisfies

$$h_R = h * h_L.$$

$x_L$ and $x_R$ can then be related as follows:

$$x_R = h * x_L + \epsilon, \qquad \epsilon = e_R - h * e_L. \tag{3}$$
The possible correlation between the noise signals on the microphones in diffuse noise fields has been discussed in Emmanuel et al. (2008). A detailed study using the exact noise characterization for ReIR estimation is given in Srikrishnan et al. (2018). The analysis henceforth assumes that the noise is drawn from an independent white Gaussian distribution. This assumption is implicit in most popular methods and is found to result in satisfactory solutions in practice.
We simplify the notation by replacing $x_R$ with $x$ and employ a matrix formulation to facilitate the least squares formulation. $\mathbf{x}$ denotes the stacked measurement vector of dimensions $N \times 1$ and $\mathbf{S}$ is the convolution matrix of dimensions $N \times L$ constructed from $x_L$. $\mathbf{h}$ is the ReIR of the system truncated to length $L$ and $\boldsymbol{\epsilon}$ denotes the measurement noise vector of size $N \times 1$. We consider an overdetermined case ($N > L$) and aim to recover $\mathbf{h}$ in the following linear inverse problem:

$$\mathbf{x} = \mathbf{S}\mathbf{h} + \boldsymbol{\epsilon}. \tag{4}$$
The system in Eq. (4) is a vector formulation of the convolution operation depicted in Eq. (3). This system may have a non-causal behavior. In order to derive a causal ReIR, we delay xR by D samples such that the entire ReIR is effectively delayed (Koldovsky et al., 2013). In the absence of noise, . The delay is achieved in practice by zero-padding D samples before appending the first (N – D) samples of xR. Note that D samples of xR are lost in the process. The directional symmetry in the estimate is broken since a segment of the measurement from one side is lost. A suggestion to combat this is to use a longer measurement and choose the correct delay-separated segments from each microphone. The first N measurements from xL and the last N measurements from xR are taken from a measurement of (N + D) samples each. An example is shown in Appendix C.
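The matrix formulation and the zero-padding delay described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names `convolution_matrix` and `delay_by_zero_padding` are hypothetical, and samples before the start of the recording are treated as zero.

```python
import numpy as np

def convolution_matrix(x_left, L):
    """Build the N x L matrix S such that (S @ h)[n] = sum_k h[k] * x_left[n - k]."""
    x_left = np.asarray(x_left, dtype=float)
    N = len(x_left)
    S = np.zeros((N, L))
    for k in range(L):
        # Column k holds x_left delayed by k samples (zeros before the start).
        S[k:, k] = x_left[:N - k]
    return S

def delay_by_zero_padding(x_right, D):
    """Delay x_right by D samples: D zeros followed by its first N - D samples."""
    x_right = np.asarray(x_right, dtype=float)
    return np.concatenate([np.zeros(D), x_right[:len(x_right) - D]])
```

With these helpers, `convolution_matrix(x_left, L) @ h` reproduces the first N samples of the convolution of `x_left` with an L-tap filter `h`, and the delayed right channel plays the role of the measurement vector in the linear inverse problem.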
III. IMPULSE RESPONSE ESTIMATION
We summarize some recent popular time domain and FD impulse response estimators below to provide context for and contrast to our proposed method.
A. Time domain estimators
1. Traditional least squares solution
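The true impulse response $\mathbf{h}$ can be estimated using a least squares (LS) approach,

$$\hat{\mathbf{h}}_{\mathrm{LS}} = \arg\min_{\mathbf{h}} \|\mathbf{x} - \mathbf{S}\mathbf{h}\|_2^2. \tag{5}$$

The solution of Eq. (5) obtained through the pseudo-inverse is

$$\hat{\mathbf{h}}_{\mathrm{LS}} = (\mathbf{S}^T\mathbf{S})^{-1}\mathbf{S}^T\mathbf{x}.$$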
This LS solution can be approximated in an online fashion using an adaptive algorithm such as the normalized least-mean-square algorithm. For band-limited signals, the temporal resolution of linear deconvolution algorithms is limited by the near degeneracy of the columns of the convolution matrix S. The ill-conditioning of the matrix is very detrimental to the LS solution: it amplifies any noise present in the system, leading to wildly fluctuating IR estimates (Lin and Lee, 2006).
2. Regularized least squares solution
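A workaround to the ill-conditioning problem described above is to use diagonal loading to make the matrix well conditioned, essentially a ridge regression framework (Marquardt and Snee, 1975). Considering a regularizer $\delta > 0$, the solution then becomes

$$\hat{\mathbf{h}}_{\mathrm{RLS}} = (\mathbf{S}^T\mathbf{S} + \delta\mathbf{I})^{-1}\mathbf{S}^T\mathbf{x}. \tag{7}$$

We can show that $\hat{\mathbf{h}}_{\mathrm{RLS}}$ is actually the solution of the following optimization problem:

$$\hat{\mathbf{h}}_{\mathrm{RLS}} = \arg\min_{\mathbf{h}} \|\mathbf{x} - \mathbf{S}\mathbf{h}\|_2^2 + \delta\|\mathbf{h}\|_2^2.$$

We use the regularized least squares (RLS) method as a baseline in this paper, with a heuristically chosen value of the regularization parameter $\delta$. The optimal value in practice varies with environmental conditions and can be found using a grid search.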
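The LS and diagonally loaded (ridge) estimators discussed above can be compared with a short NumPy sketch. The function names and the symbol `delta` for the loading factor are assumptions made for illustration; `S` and `x` are the convolution matrix and the stacked measurement vector.

```python
import numpy as np

def ls_estimate(S, x):
    """Least squares estimate; lstsq is numerically safer than forming (S^T S)^{-1}."""
    return np.linalg.lstsq(S, x, rcond=None)[0]

def rls_estimate(S, x, delta):
    """Ridge (diagonally loaded) estimate: solve (S^T S + delta * I) h = S^T x."""
    L = S.shape[1]
    return np.linalg.solve(S.T @ S + delta * np.eye(L), S.T @ x)
```

For any `delta > 0` the ridge estimate has a Euclidean norm no larger than the LS estimate, which is how the loading tames the noise amplification caused by an ill-conditioned `S`.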
3. Sparsity inducing penalties
ReIRs have been shown to have a sparse structure. Sparsity constraints have been imposed on the IR in several recent works (Lin et al., 2007; Malek and Koldovsky, 2014), leading to an optimization problem of the form

$$\hat{\mathbf{h}} = \arg\min_{\mathbf{h}} \|\mathbf{x} - \mathbf{S}\mathbf{h}\|_2^2 + \lambda\|\mathbf{h}\|_1.$$

Sparsity is promoted by the $\ell_1$ penalty term, and $\lambda > 0$ controls the amount of sparsity in the solution. In Benichoux et al. (2014) and Koldovsky et al. (2015), the authors impose an exponentially decaying structure to model the reverberant tail. A weighted least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1994) problem is solved,

$$\hat{\mathbf{h}} = \arg\min_{\mathbf{h}} \|\mathbf{x} - \mathbf{S}\mathbf{h}\|_2^2 + \lambda\|\mathbf{w} \odot \mathbf{h}\|_1.$$

Here, $\mathbf{w}$ is a vector of non-negative weights and $\odot$ denotes the Hadamard product. Note that in a weighted LASSO problem, elements of $\mathbf{h}$ with higher weights tend to be closer to zero. To mimic the expected structure of a relative impulse response, the weights are chosen as follows:
where $k_1$, $k_2$, and $k_3$ are positive constants and $D$ is the integer delay. The choice of values for these hyperparameters is discussed in Koldovsky et al. (2015). The values of the weights are small near $i = D$, where the direct path peak is expected. The weights increase beyond this point, forcing the corresponding elements of $\mathbf{h}$ to be small.
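A weighted LASSO problem of this kind can be solved, for example, with the iterative soft-thresholding algorithm (ISTA). The sketch below is an assumption-laden illustration rather than the solver used in the cited works: it minimizes 0.5‖x − Sh‖² + λ Σ wᵢ|hᵢ| (note the factor 0.5, chosen for a clean step size), and `example_weights` is a hypothetical weight profile that is merely small near the delay D, not the weight formula of Koldovsky et al. (2015).

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def weighted_lasso_ista(S, x, w, lam, n_iter=500):
    """ISTA for 0.5 * ||x - S h||^2 + lam * sum(w * |h|)."""
    step = 1.0 / np.linalg.norm(S, 2) ** 2   # 1 / Lipschitz constant of the gradient
    h = np.zeros(S.shape[1])
    for _ in range(n_iter):
        grad = S.T @ (S @ h - x)
        h = soft_threshold(h - step * grad, step * lam * w)
    return h

def example_weights(L, D, slope=0.05):
    """Hypothetical weights: smallest at tap D, growing linearly away from it."""
    return 0.1 + slope * np.abs(np.arange(L) - D)
```

Taps with larger weights face a larger threshold at every iteration, which is the mechanism that drives them toward zero.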
B. FD estimators
1. Traditional FD estimation
We can rewrite Eq. (3) (with no noise), using the short-time Fourier transform (STFT), in the FD. We consider the RTF $H$ to be static for a specific interval and divide that interval into $P$ frames. Let $\theta$ denote the frequency variable and $k$ denote the frame index of the measurements. The time domain convolution can be represented in the FD as

$$X_R(k, \theta) = H(\theta)\, X_L(k, \theta).$$

An estimate of the RTF can then be found as

$$\hat{H}(\theta) = \frac{\frac{1}{P}\sum_{k=1}^{P} X_R(k, \theta)\, X_L^{*}(k, \theta)}{\frac{1}{P}\sum_{k=1}^{P} |X_L(k, \theta)|^2}.$$

The numerator is a sample estimate of the cross power-spectral density (PSD), while the denominator is a sample estimate of the auto-PSD. This baseline method is referred to as FD hereon.
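The FD baseline, a ratio of a sample cross-PSD to a sample auto-PSD, can be sketched as follows. The frame length, hop size, Hann window, and function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def stft_frames(x, frame_len, hop):
    """Windowed FFT frames of x, shape (number of frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def fd_rtf_estimate(x_left, x_right, frame_len=256, hop=128, eps=1e-12):
    """RTF estimate per frequency bin: averaged cross-PSD over averaged auto-PSD."""
    XL = stft_frames(x_left, frame_len, hop)
    XR = stft_frames(x_right, frame_len, hop)
    cross = np.mean(XR * np.conj(XL), axis=0)   # sample cross-PSD
    auto = np.mean(np.abs(XL) ** 2, axis=0)     # sample auto-PSD
    return cross / (auto + eps)
```

When the right channel is an exactly scaled copy of the left one, the estimate equals the scale factor in every bin; with noise and a finite number of frames, the PSD estimates are only approximate.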
2. Non-Stationarity based FD (NSFD) estimation
The NSFD method of Gannot et al. (2001) relies on the assumption that the noise signals are stationary, or at least less dynamic, when compared to the target speech signal. Again, in the STFT domain, the model is represented as

$$X_R(k, \theta) = H(\theta)\, X_L(k, \theta) + E(k, \theta).$$

$E$ denotes the environmental noise spectrum. We divide the interval into $P$ frames. For the $p$th frame, we obtain

$$\Phi^{(p)}_{x_R x_L}(\theta) = H(\theta)\, \Phi^{(p)}_{x_L x_L}(\theta) + \Phi^{(p)}_{e x_L}(\theta),$$

where $\Phi^{(p)}_{x_R x_L}$ denotes the cross spectral density between the output and input in the $p$th frame. The other cross-spectral densities are defined similarly. Since the noise is stationary, we can write $\Phi^{(p)}_{e x_L}(\theta) = \Phi_{e x_L}(\theta)$, and the overdetermined set of equations for $p = 1, \ldots, P$ is solved to estimate $H$. The PSDs above are replaced by their sample estimates in practice.
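The overdetermined system behind the non-stationarity approach can be illustrated with a per-frequency least squares fit. The sketch assumes that per-frame PSD estimates are already available as arrays and that the noise cross-PSD term is constant across frames; the array layout and the function name are hypothetical.

```python
import numpy as np

def nsfd_rtf_estimate(psd_ll, psd_rl):
    """Solve psd_rl[p, f] = H[f] * psd_ll[p, f] + bias[f] over frames p.

    psd_ll: (P, F) real auto-PSD of the left channel per frame.
    psd_rl: (P, F) complex cross-PSD between right and left channels per frame.
    """
    P, F = psd_ll.shape
    H = np.zeros(F, dtype=complex)
    for f in range(F):
        # Unknowns per bin: the RTF H[f] and the stationary noise term bias[f].
        A = np.column_stack([psd_ll[:, f], np.ones(P)]).astype(complex)
        sol = np.linalg.lstsq(A, psd_rl[:, f], rcond=None)[0]
        H[f] = sol[0]
    return H
```

The fit is well posed only if the left-channel PSD actually varies from frame to frame, which is exactly the non-stationarity of speech that the method exploits.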
IV. STRUCTURE OF ReIRs
Modeling of the room impulse response (RIR) as sparse early arrivals to model the specular component of the impulse response, and a random decaying tail to model the reverberation (late arrivals) has been well documented (Georganti et al., 2008; Hashemgeloogerdi and Bocko, 2016). However, the discussion of the appropriateness of the model for ReIRs has been limited. We now present a prediction framework to provide an explanation to the source of the specular and reverberant components of the ReIRs. The ReIR estimation problem can be viewed in a prediction framework as shown in Fig. 1.
Prediction framework: ReIR h predicts the right channel signal sR given the left channel signal sL.
The signal S from the speaker is filtered through the left RIR channel (hL) and the right RIR channel (hR). Each RIR channel is divided into a sparse specular component () and a noise-like reverberant component () (Georganti et al., 2008; Hashemgeloogerdi and Bocko, 2016). The ReIR h predicts the right channel signal (SR) given the left channel signal (SL). A noise signal (reverberant component in one channel) cannot be used to predict another noise signal (reverberant component in the other channel) unless correlated. We make a simplifying assumption that the reverberant components and are independent. The predictor will ideally model only the sparse components in low noise environments. Thus, the diffuse component can never be exactly predicted. Sparsity in the ReIR is an outcome of the sparsity arising out of the specular components in the RIRs. Consider a simplistic case where and , with and . The resultant specular predictor is solved for in the z-transform domain as
Equation (18) represents a decreasing sparse sequence on truncating the Taylor series expansion and provides support for the sparsity in the ReIR. ReIRs are observed to have a dominant peak corresponding to the difference between the direct paths. Thus, sparsity in the channel should be exploited by applying a sparse penalty during the estimation process. We now proceed to analyze the reverberant component of the ReIR.
ReIRs exhibit an exponentially decaying noise-like tail, shown below to be a result of the diffuse components in the RIRs. We characterize the shape of the ReIR obtained from such channels under simplifying assumptions. The prediction framework suggests that one reverberant component ideally cannot predict the other. However, given that we are using measurements from a short frame, a tail still persists in the ReIR. The derivation is discussed in detail in Appendix A and is summarized below. A simple model is considered for this purpose where both left and right RIRs are a delta function followed by reverberant tails. We assume a delta function at the first sample in hL, and the sample in hR. The reverberant RIR tail is composed of samples from independent Gaussian distributions with variance reducing exponentially beyond the delta function. The true ReIR is calculated for these constructed RIR channels. The envelope of the tail can be studied by obtaining the variance of the estimated ReIR. Based on the analysis ( Appendix A), we conclude that the ReIR tail is composed of a peak at the arrival of the early reflection, and decays exponentially on either side. The component of the tail before the end of the early reflections can be added to the sparse early reflection components themselves for estimation purposes. The ReIR can thus be postulated to be similar in structure to the RIR: sparse early reflections followed by a reverberant tail.
V. EMPIRICAL BAYESIAN ESTIMATION WITH STRUCTURED PRIOR
We now present an empirical Bayes based method to estimate the ReIR in the time domain by exploiting prior structure in the ReIR. We consider both the sparse early reflections and the reverberant tail in a unified Bayesian framework. The prior for h is designed considering the ReIR structure arguments posed before.
A. Model
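Consider the model in Eq. (4). Under the Gaussian likelihood assumption, $p(\mathbf{x} \mid \mathbf{h}; \sigma^2) = \mathcal{N}(\mathbf{S}\mathbf{h}, \sigma^2\mathbf{I})$. The prior distribution over $\mathbf{h}$ is chosen as

$$p(\mathbf{h}; \boldsymbol{\gamma}, c_1, c_2) = \mathcal{N}(\mathbf{0}, \boldsymbol{\Gamma}), \qquad \boldsymbol{\Gamma} = \mathrm{diag}\left(\gamma_1, \ldots, \gamma_P,\; c_1 e^{-c_2},\; c_1 e^{-2c_2}, \ldots, c_1 e^{-M c_2}\right), \tag{20}$$

where $\gamma_p$ corresponds to the variance of the $p$th early reflection, and $c_1 e^{-c_2 m}$ corresponds to the variance of the $m$th tap out of the $M$ exponentially decaying reverberant tail components.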
The proposed approach follows the relevance vector machine (RVM)/SBL (Tipping, 2001) framework to incorporate the sparse regularization for the early reflection taps. The hyperparameters γi can be modeled as random variables with a hyperprior. Further insight on how this prior actually enforces sparsity is discussed below.
B. Enforcing sparsity
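The hierarchical nature of the prior allows for tractably enforcing sparsity in the early reflections. To study this further, assume that an inverse gamma (IG) distribution with shape $a$ and scale $b$ has been used as the prior over the hyperparameters $\gamma_i$. Though $p(h_i \mid \gamma_i)$ is Gaussian, we can integrate out $\gamma_i$ to find the true nature of the prior as

$$p(h_i) = \int p(h_i \mid \gamma_i)\, p(\gamma_i)\, d\gamma_i = \frac{b^{a}\, \Gamma_f\!\left(a + \tfrac{1}{2}\right)}{\sqrt{2\pi}\, \Gamma_f(a)} \left(b + \frac{h_i^2}{2}\right)^{-\left(a + \frac{1}{2}\right)}.$$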
This marginal true representation of the prior of the initial P taps corresponds to a Student's t-distribution (Fisher, 1925). Note that Γf denotes the Gamma function. This is a super Gaussian density (heavier tailed than Gaussian) which can be used to promote sparsity and is a special case of a more general class of densities that admit a Gaussian scale mixture (GSM) representation (Andrews and Mallows, 1974). The pdf of a Student's t-distribution (degree of freedom ν = 0.1) is compared with a Gaussian distribution in Fig. 2. Fortunately, in a hierarchical framework, the approach is not very sensitive to the choice of the hyperprior (Lehmann and Casella, 1998). For convenience, in our work we use a uniform hyper-prior on (). Then is an improper prior with infinite probability mass at the origin. For more discussion on the sparsity promoting features of SBL, the reader is referred to Wipf et al. (2011).
(Color online) Tail behavior: Student's t vs Gaussian. The Student's t-distribution is a super Gaussian density which can promote sparsity when used as a prior.
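The heavier tails of the Student's t density relative to a Gaussian can be checked numerically. The sketch below evaluates the standard log-densities using only the Python standard library; the unit-variance Gaussian and the evaluation point are arbitrary choices made for illustration.

```python
import math

def gaussian_logpdf(h, var=1.0):
    """Log-density of a zero-mean Gaussian with variance var."""
    return -0.5 * math.log(2.0 * math.pi * var) - h * h / (2.0 * var)

def student_t_logpdf(h, nu):
    """Log-density of a standard Student's t with nu degrees of freedom."""
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi)
            - (nu + 1.0) / 2.0 * math.log1p(h * h / nu))
```

Far from the origin, for instance at h = 5 with ν = 0.1, the Student's t log-density is much larger than the Gaussian one, so large taps are penalized far less severely, while most of the remaining taps are still pulled toward zero.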
In the proposed SBL variant, we have incorporated the reverberant tail structure in to the regularization by tying the last M diagonal elements of Γ into an exponentially decaying tail. The structured SBL algorithm (S-SBL) effectively reduces the number of parameters to be estimated by incorporating channel structure into the prior.
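The tying of the last M prior variances can be sketched as below. The exact parametrization is an assumption made for illustration: the tail variances are taken to be c1·exp(−c2·m) for m = 1, …, M, and the function name is hypothetical.

```python
import numpy as np

def structured_prior_variances(gamma, c1, c2, M):
    """Diagonal of the prior covariance: P free variances, then a tied decaying tail."""
    m = np.arange(1, M + 1)
    tail = c1 * np.exp(-c2 * m)   # assumed exponential decay of the tail variances
    return np.concatenate([np.asarray(gamma, dtype=float), tail])
```

Under this assumption the prior is described by P + 2 hyperparameters (the P early-reflection variances, c1, and c2) instead of the P + M free variances of an unstructured SBL prior.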
C. Bayesian inference
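We follow the method detailed in Wipf and Rao (2004) to estimate the ReIR $\mathbf{h}$. The resulting procedure and equations are summarized below. The proposed model has hyperparameters $\boldsymbol{\gamma}$, $c_1$, and $c_2$, which, together with the noise variance $\sigma^2$, can be estimated from the data by maximizing the marginal likelihood $p(\mathbf{x}; \boldsymbol{\gamma}, c_1, c_2, \sigma^2)$. The marginal likelihood is referred to as the “evidence for hyperparameters” in MacKay (1999). Given these estimated hyper-parameters, the point estimate of the ReIR can be computed using

$$\hat{\mathbf{h}} = E\left[\mathbf{h} \mid \mathbf{x}; \hat{\boldsymbol{\gamma}}, \hat{c}_1, \hat{c}_2, \hat{\sigma}^2\right]. \tag{22}$$

Considering the GSM representation used, $p(\mathbf{h}; \boldsymbol{\gamma}, c_1, c_2) = \mathcal{N}(\mathbf{0}, \boldsymbol{\Gamma})$, the posterior of $\mathbf{h}$ can be computed as

$$p(\mathbf{h} \mid \mathbf{x}; \boldsymbol{\gamma}, c_1, c_2, \sigma^2) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}),$$

$$\boldsymbol{\mu} = \sigma^{-2}\, \boldsymbol{\Sigma}\, \mathbf{S}^T \mathbf{x}, \tag{24}$$

$$\boldsymbol{\Sigma} = \left(\sigma^{-2}\, \mathbf{S}^T\mathbf{S} + \boldsymbol{\Gamma}^{-1}\right)^{-1}. \tag{25}$$

In the procedure, we approximate the true posterior by $p(\mathbf{h} \mid \mathbf{x}; \hat{\boldsymbol{\gamma}}, \hat{c}_1, \hat{c}_2, \hat{\sigma}^2)$, a Gaussian distribution whose mean and covariance depend on the estimated hyperparameters. Following Eq. (22), $\boldsymbol{\mu}$ is used as the point estimate of the impulse response. Type-II maximum likelihood estimation (Giri and Rao, 2015) as discussed in Wipf and Rao (2004) is used to estimate the hyper-parameters given its superior recovery performance,

$$(\hat{\boldsymbol{\gamma}}, \hat{c}_1, \hat{c}_2, \hat{\sigma}^2) = \arg\min_{\boldsymbol{\gamma}, c_1, c_2, \sigma^2}\; \log|\boldsymbol{\Sigma}_x| + \mathbf{x}^T \boldsymbol{\Sigma}_x^{-1} \mathbf{x}.$$

Here, $\boldsymbol{\Sigma}_x = \sigma^2\mathbf{I} + \mathbf{S}\boldsymbol{\Gamma}\mathbf{S}^T$. Note that terms independent of the minimization have been omitted. Closed form expressions cannot be obtained for the hyper-parameters (Tipping, 2001). We use the expectation-maximization (EM) algorithm [$\mathbf{h}$ as a hidden variable, $(\mathbf{x}, \mathbf{h})$ as complete data] to estimate the hyper-parameters (Wipf and Rao, 2004). In the formulation below, a superscript $t$ is used to indicate the value at iteration $t$. In the E step, the Q function is computed as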
Ignoring terms independent of the minimization,
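For iteration $t$, we need only to compute the following conditional expectation for all taps $i = 1, \ldots, L$:

$$E\left[h_i^2 \mid \mathbf{x}; \boldsymbol{\gamma}^{t}, c_1^{t}, c_2^{t}, (\sigma^2)^{t}\right] = \Sigma_{ii} + \mu_i^2.$$

$\Sigma_{ii}$ is the $i$th diagonal element of $\boldsymbol{\Sigma}$ given in Eq. (25) and $\mu_i$ is the $i$th element of $\boldsymbol{\mu}$ given in Eq. (24). They can be computed using the values of the parameters obtained in iteration $t$. In the M step, maximizing the Q-function with respect to the hyperparameters ($\boldsymbol{\gamma}$, $c_1$, $c_2$, and $\sigma^2$) results in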
We use the estimate of $c_2$ from the previous iteration in Eq. (35). We solve Eq. (36) to obtain the closed form update rule for $c_2$. Representing this as a polynomial in $e^{c_2}$, we can show using Descartes' rule of signs that only one positive root exists. We can therefore update $c_2$ using the logarithm of this root. We iteratively estimate these parameters until convergence. The desired channel is estimated through Eq. (24). In practice, a few iterations of the above S-SBL procedure yield a converged impulse response estimate $\hat{\mathbf{h}}$.
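The EM iteration described above can be sketched in NumPy as follows. This is a minimal illustration under explicit assumptions, not the authors' implementation: the tail variances are assumed to be c1·exp(−c2·m); the M-step updates are derived here directly from the Gaussian model (γᵢ = E[hᵢ²], c1 as the mean of E[h²ₘ]·exp(c2·m) using the previous c2, c2 from the stationarity condition Σ m·E[h²ₘ]·exp(c2·m) = c1·M(M+1)/2, and the standard EM noise update) and may differ in detail from the paper's Eqs. (35) and (36). The function names, the initial values, the bisection solver, and the bounds on c2 are all hypothetical choices.

```python
import numpy as np

def solve_c2(a, m, c1, c2_max, n_bisect=60):
    """Bisection on [0, c2_max] for sum_m m * a_m * exp(c2 * m) = c1 * M * (M + 1) / 2."""
    target = c1 * len(m) * (len(m) + 1) / 2.0
    f = lambda c: np.sum(m * a * np.exp(c * m)) - target   # increasing in c
    lo, hi = 0.0, c2_max
    if f(lo) >= 0.0:
        return lo
    if f(hi) <= 0.0:
        return hi
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if f(mid) > 0.0 else (mid, hi)
    return 0.5 * (lo + hi)

def s_sbl(S, x, P, n_iter=50, floor=1e-10):
    """Sketch of structured SBL: P free variances plus a tied exponential tail."""
    N, L = S.shape
    m = np.arange(1, L - P + 1, dtype=float)      # tail tap index 1..M
    gamma, c1, c2 = np.ones(P), 1.0, 0.05         # arbitrary initial values
    sigma2 = 0.1 * np.var(x) + floor
    for _ in range(n_iter):
        g = np.concatenate([gamma, c1 * np.exp(-c2 * m)])   # diagonal of Gamma
        # E step: posterior mean and covariance in data-space form,
        # which avoids inverting Gamma when some variances shrink to zero.
        SG = S * g                                           # S @ diag(g)
        K = np.linalg.solve(sigma2 * np.eye(N) + SG @ S.T, SG).T
        mu = K @ x
        Sigma = np.diag(g) - K @ SG
        second = mu ** 2 + np.clip(np.diag(Sigma), 0.0, None)   # E[h_i^2 | x]
        # M step
        gamma = np.maximum(second[:P], floor)
        a = second[P:]
        c1 = max(np.mean(a * np.exp(c2 * m)), floor)        # uses the previous c2
        c2 = solve_c2(a, m, c1, c2_max=50.0 / len(m))
        r = x - S @ mu
        sigma2 = max((r @ r + np.sum((S @ Sigma) * S)) / N, floor)
    return mu, gamma, c1, c2, sigma2
```

The bisection replaces an explicit polynomial root finder because the left-hand side of the c2 condition is monotonically increasing in c2, so its single positive root can be bracketed directly; the cap on c2 only guards against numerical overflow in this sketch.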
D. Setting hyper-parameters in S-SBL
The parameters P and M are a measure of the duration of the early reflections and reverberant tail, respectively. P should account for both the causality delay D (Sec. II) and the early reflection terms, so the structure of the solution changes with P. A value of P lower than D causes the estimated exponential tail to begin even before the actual early reflections have begun. The evaluated settings used D = 10 and P = 30. The implementation used 100 iterations of the EM-based hyper-parameter updates.
E. Mean squared error properties of S-SBL
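The empirical Bayes based S-SBL estimator of the ReIR is now studied in terms of its mean squared error properties, a commonly used metric for evaluating the quality of an estimate. Let $\mathbf{h}_0$ be the true ReIR. The MSE (expected quadratic loss) for an estimator $\hat{\mathbf{h}}$ is

$$\mathrm{MSE}(\hat{\mathbf{h}}) = E\left[\|\hat{\mathbf{h}} - \mathbf{h}_0\|_2^2\right]. \tag{38}$$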
Our goal is to minimize the given MSE expression with respect to the hyperparameters γi, c1, and c2. We present the main observations below, with the complete derivation given in Appendix B. We make the simplifying assumption that for a more straightforward derivation. On substituting Γ from Eq. (20) in Eq. (38), for the estimator of the form given in Eq. (24) we can obtain the optimality constraints on γi, c1 and c2 under the no noise assumption (). The MSE estimates and satisfy the optimality conditions given below in Eqs. (39), (40), and (41) [refer to Eqs. (B8), (B12), and (B13) in Appendix B],
As mentioned before, our proposed S-SBL algorithm uses a type II inference procedure where the following cost function is being minimized:
We then simplify the cost function using Sylvester's determinant identity and Woodbury's inverse identity ( Appendix B) and collect terms involving the required parameters
Using Γ from Eq. (20), we separate the cost into sections involving only γ and c1, c2, respectively. We show that the γi that minimizes this cost function must satisfy [Eq. (B23) in Appendix B]
The cost function minimizing c1 and c2 must satisfy the following equations in the no noise condition (first order) [Eqs. (B27) and (B28) in Appendix B]:
We observe that under the orthonormal assumption with , the S-SBL estimates of γi (specular components) converges to an optimal estimator (in terms of MSE). With respect to the diffuse component variance, we see that there is a strong resemblance between Eqs. (45) and (46) and Eqs. (40) and (41). With some manipulations, it can be shown that the hyperparameters and that maximize the asymptotic evidence () in the proposed empirical Bayes framework when [satisfies Eqs. (44), (45), and (46)]; will also minimize a weighted MSE (MSEw) [shown in Eq. (47)], with weights wi = 1 for and for ,
The weighted MSE formulation is intuitively appealing and a similar result on the MSE properties of EB estimators can be found in Carli et al. (2012), where only an exponentially decaying kernel has been considered. Our work extends this result for a unified framework that incorporates both the sparse early reflections and the exponential decaying tail.
F. Connection between S-SBL and RLS
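On simplifying Eq. (24), we get

$$\hat{\mathbf{h}} = \boldsymbol{\mu} = \left(\mathbf{S}^T\mathbf{S} + \sigma^2\boldsymbol{\Gamma}^{-1}\right)^{-1}\mathbf{S}^T\mathbf{x}.$$

Comparing this with solution (7), we see that S-SBL can be viewed as a reweighted $\ell_2$-norm minimization algorithm with a more general penalty weight factor ($\sigma^2\boldsymbol{\Gamma}^{-1}$) being imposed on the taps. Instead of an ad hoc choice, the penalty weights are re-estimated at every iteration through $\gamma_i$, $c_1$, $c_2$, and $\sigma^2$, thus enforcing the desired IR structure through regularization in a systematic manner.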
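The connection to regularized least squares can be verified numerically. The sketch below assumes the standard Gaussian posterior mean for a zero-mean prior with diagonal covariance `diag(g)` and noise variance `sigma2`; the function names are hypothetical.

```python
import numpy as np

def posterior_mean(S, x, g, sigma2):
    """Posterior mean: Sigma S^T x / sigma2, Sigma = (S^T S / sigma2 + diag(1/g))^{-1}."""
    Sigma = np.linalg.inv(S.T @ S / sigma2 + np.diag(1.0 / g))
    return Sigma @ S.T @ x / sigma2

def reweighted_ridge(S, x, g, sigma2):
    """Ridge-type solution with per-tap penalty weights sigma2 / g_i."""
    return np.linalg.solve(S.T @ S + sigma2 * np.diag(1.0 / g), S.T @ x)
```

The two functions return the same vector for any positive `g`, and when all prior variances equal `sigma2 / delta` the penalty reduces to a constant diagonal loading `delta`, i.e., the plain ridge solution.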
VI. S-SBL FOR BLOCKING MATRIX CONSTRUCTION
We now present detailed experimental results for evaluating the performance of several competing algorithms used for ReIR estimation in terms of their target signal blocking ability.
A. Experimental setting
We follow the experimental setting described by Koldovsky et al. (2015), as highlighted in Table I, and use a database of measured impulse responses (Hadad et al., 2014) to generate reverberant recordings. The signal for the target source, a female utterance (10 s long), has been taken from the task of the online signal separation campaign (SISEC 2013) (Ono et al., 2013).
TABLE I. Experimental settings.
Parameter | Value |
---|---|
Sampling frequency | 8 kHz |
SNRin | 0 dB |
Target Angle | 0° |
Directional Noise Angle | −60° |
Microphone Pair | [3 4] (3 cm) |
Distance between source and mic | 2 m |
T60 | 360 ms |
The testing utterance (female talker) is divided into intervals of 1024 samples each (128 ms at 8 kHz). We use a voice activity detector (VAD) to perform ReIR estimation independently on only those intervals with speech present. The average attenuation rate (described in Sec. VI B) is calculated over intervals where speech is present. Note that we estimate an ReIR 512 taps long. The LS method was used to estimate the true ReIR shown in Fig. 3 using a long noise free recording (Koldovsky et al., 2015).
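The interval-wise processing can be sketched as follows. The energy-threshold detector below is only a hypothetical stand-in for the VAD, whose actual design is not specified here; the threshold value and the function name are assumptions.

```python
import numpy as np

def speech_intervals(x, interval_len=1024, threshold_db=-30.0):
    """Split x into non-overlapping intervals and flag those with speech-like energy."""
    n = len(x) // interval_len
    frames = np.reshape(x[:n * interval_len], (n, interval_len))
    energy = np.mean(frames ** 2, axis=1)
    # Energy relative to the loudest interval, in dB (small constants avoid log(0)).
    rel_db = 10.0 * np.log10(energy / (np.max(energy) + 1e-20) + 1e-12)
    return frames, rel_db > threshold_db
```

At a sampling rate of 8 kHz, each 1024-sample interval spans 128 ms; the ReIR would then be estimated independently on each interval flagged as active.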
(Color online) True relative impulse response (calculated using a long noise free recording).
B. Performance metric
We use a widely used performance metric called the Attenuation Rate (Koldovsky et al., 2015) to quantitatively evaluate the competing algorithms. The attenuation rate (ATR) is defined as the ratio between SNRout and SNRin (in dB scale), where
The numerator of SNRout measures the leakage of the target signal, whereas the denominator measures the attenuation of the noise signal. A lower ATR indicates a better blocking performance and suggests a good noise reference signal usable for further processing (such as single-channel postfiltering).
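A hedged sketch of how such an attenuation rate could be computed in simulation, where the target and noise components at both microphones are available separately. The exact definition is an assumption based on the description above: SNRout is taken as the energy of the target leakage at the blocking output divided by the energy of the noise at the blocking output, SNRin is measured at the right microphone, and the ratio is expressed as a difference in dB. The right-channel signals are assumed to be already delayed by D samples, and the function names are hypothetical.

```python
import numpy as np

def blocking_output(h_est, left, right):
    """Residual after predicting the right channel from the left one with h_est."""
    pred = np.convolve(left, h_est)[:len(right)]
    return pred - right

def attenuation_rate_db(h_est, s_left, s_right, n_left, n_right, eps=1e-12):
    """ATR in dB: output SNR of the blocking branch minus the input SNR."""
    leak = np.sum(blocking_output(h_est, s_left, s_right) ** 2)     # target leakage
    noise = np.sum(blocking_output(h_est, n_left, n_right) ** 2)    # noise at the output
    snr_out = 10.0 * np.log10((leak + eps) / (noise + eps))
    snr_in = 10.0 * np.log10((np.sum(s_right ** 2) + eps) / (np.sum(n_right ** 2) + eps))
    return snr_out - snr_in
```

With an all-zero filter the blocking output is just the negated right channel, so the ATR is 0 dB; an accurate ReIR estimate removes most of the target and drives the ATR strongly negative.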
C. Competing algorithms
The following competing algorithms are evaluated in this paper.
Time domain methods: ℓ2 regularized least squares, ℓ1 regularized least squares, weighted ℓ1 regularized least squares, proposed S-SBL estimator.
FD Methods: Traditional FD, NSFD.
D. Results
We now compare the performance of algorithms in the presence of two cases of diffused noise (spatially white, babble) and directional noise (white noise at source, interfering talker).
1. Diffuse noise
We show the average ATR obtained using all competing algorithms in Table II in two diffused noise cases. The first case involves the speech signal contaminated with independent white Gaussian noise generated for each channel. This is an additive noise case, since the noise components at the microphones are uncorrelated. The second case uses a sample of omnidirectional (isotropic) babble noise recorded in the lab using a microphone array (Ono et al., 2013). The algorithms perform better in the presence of white noise when compared against babble noise. The proposed S-SBL approach achieves the best ATR in both the cases, more significantly so in the babble noise case. Informal subjective listening of the output of the blocking matrix showed noticeable differences as well. Note that the performance of the l1 based methods is sensitive to the choice of the regularization parameter λ and weights. We only report the ATR using the optimum λ and the optimal weights given in Koldovsky et al. (2015).
TABLE II. ATR measure in the diffuse noise scenario.
Algorithm | White noise, ATR (dB) | Omni babble noise, ATR (dB) |
---|---|---|
FD | −6.18 | −3.68 |
NSFD | −11.24 | −5.18 |
RLS | −7.36 | −4.35 |
ℓ1 (λ = 0.05) | −8.30 | −5.59 |
Weighted ℓ1 (λ = 0.1) | −11.01 | −6.35 |
S-SBL (proposed) | −12.05 | −7.49 |
2. Directional noise
In Table III, we present the average ATR obtained using all competing algorithms in the presence of directional noise. The first case involves the target speech contaminated with directional white Gaussian noise. In the second case, a male speaking interferer is used instead. This situation is more challenging when compared with the diffused noise case, even more so when the directional noise is a speech interferer. The performance of all the algorithms is reduced in the presence of directional white noise when compared with diffused white noise.
TABLE III. ATR measure in the presence of directional noise.
Algorithm | White noise, ATR (dB) | Talker (with VAD), ATR (dB) |
---|---|---|
FD | −3.98 | −0.86 |
NSFD | −10.37 | −9.63 |
RLS | −7.25 | −11.40 |
ℓ1 (λ = 0.05) | −8.66 | −6.76 |
Weighted ℓ1 (λ = 0.1) | −10.39 | −11.22 |
S-SBL (proposed) | −10.79 | −15.72 |
We show the spectrograms of the clean speech and of the noise reference signals obtained using S-SBL and NSFD (directional white noise) in Figs. 4, 5, and 6, respectively. It is evident from Fig. 6 that the dominant low-frequency speech harmonic structure is still present in the NSFD noise reference estimate. All algorithms struggle with a speech interferer in the absence of a VAD, resulting in a positive ATR. The RTF estimate could be that of the speech interferer instead, since there is no method to distinguish between the two talkers. We therefore present results assuming that an oracle VAD is available for both the target and the interferer. We have conducted such experiments using a realistic conversation database (Woods et al., 2015) and observed encouraging results. It is evident from Table III that even in the presence of directional noise sources, S-SBL surpasses the other competing algorithms.
(Color online) Spectrogram of clean utterance recorded at left microphone.
(Color online) Spectrogram of the noise reference signal obtained using S-SBL (directional white noise).
(Color online) Spectrogram of the noise reference signal obtained using NSFD (directional white noise).
E. Effect of recording length
Figure 7 shows the effect of increasing the recording length on the performance of all the competing algorithms in diffused noise. As expected, the performance of all the algorithms improves slightly as the recording length grows. The same experiment is repeated with directional white noise for varying recording lengths (Fig. 8), and similar behavior is observed. Although longer recordings improve the ReIR estimation performance, the dynamic nature of the ReIR may prove to be a hindrance in practice, since the surrounding acoustic environment, along with the positions of the target and the microphones, may vary during the recording. The proposed S-SBL method provides accurate, noise-robust estimates of the ReIR, which makes it a useful choice in practice despite its higher computational complexity compared to the baseline methods.
(Color online) Attenuation rate vs length of the recording in presence of omnidirectional babble noise.
(Color online) Attenuation rate vs length of the recording in presence of directional white noise.
VII. CONCLUSION
We proposed a novel Bayesian approach for estimating ReIRs from short, noisy, reverberant recordings. The proposed time-domain solution exploits channel structure by employing both a sparsity-inducing prior for the early reflections and an exponentially decaying tail for the reverberation components. We also analyzed the MSE properties of the estimator and showed that the evidence maximization procedure can be interpreted as a weighted MSE minimization problem. Detailed experimental results show consistent improvement of the proposed approach over competing algorithms.
APPENDIX A: REIR TAIL STRUCTURE
Let indicate the ith tap out of channel j
Note that we add a causality delay D in the ReIR by zero-padding the right channel measurement, leading to a delayed channel ,
We model the noise in the RIR tail as a wide sense stationary white Gaussian random variable with reducing variance. The tail is assumed to begin right after the specular component. The shape of the ReIR tail is controlled by the variance of the noise. We assume a variance of the form ; where n is indexed from zero beginning at the spike. The True ReIR () is obtained by solving Eq. (4). Assuming that the source signal is a delta function ( represents hR and S is designed from hL (Toeplitz matrix with first column as hL and first row as . We calculate the two components of the true LS solution and and study the resulting statistics. The mean of the solution will be the noise-free specular component. The variance can then be studied to shed light on the envelope of the ReIR tail.
We approximate the expression by replacing the summation with an expectation, as per the law of large numbers. The off-diagonal terms sum to zero in the limit. In reality, however, the measurement length is short, and the off-diagonal elements die down exponentially,
We now characterize the mean and variance of STx
The mean will be zero in nearly all locations (expectation of a product of independent and identically distributed zero mean Gaussians is zero). A non-zero ReIR tap is obtained at the location where the two delta functions coincide; the sample in this case,
We consider the variance of the solution to study the tail. has already been approximated as a diagonal matrix. We now focus on obtaining the variance of each term in . Note that the inverse of increases along the diagonal,
The envelope of the ReIR tail increases till the arrival of the ReIR delta function. The summation remains constant; while an increasing scaling factor scales up the sum. We have an additional increasing factor from . Note that since factor in each term is extremely low; this growing structure is expected to be seen only close to the ReIR delta function for practical measurements.
The ReIR envelope starts reducing right beyond the delta function. A detailed evaluation considering the increasing diagonal terms with a decreasing variance term showed that the overall variance reduced as a . k(n) was found to be close to one and nearly invariant with n.
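The tail behavior described in this appendix can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes each RIR is a unit spike followed by a white Gaussian tail whose variance decays geometrically, builds the Toeplitz (convolution) matrix from the left channel, and solves the least-squares problem for the ReIR over many random draws. The helper names (`make_rir`, `conv_matrix`, `ls_reir`) and all numerical values (`sigma`, `decay`, spike positions, lengths) are hypothetical choices made for illustration.

```python
import numpy as np

def make_rir(length, spike, sigma, decay, rng):
    """Unit spike at `spike`, then a white Gaussian tail with variance sigma^2 * decay^n."""
    h = np.zeros(length)
    h[spike] = 1.0
    n = np.arange(1, length - spike)            # n = 1 is the first tap after the spike
    h[spike + 1:] = sigma * decay ** (n / 2.0) * rng.standard_normal(n.size)
    return h

def conv_matrix(h, n_cols):
    """Tall Toeplitz matrix S such that S @ g equals np.convolve(h, g)."""
    S = np.zeros((h.size + n_cols - 1, n_cols))
    for k in range(n_cols):
        S[k:k + h.size, k] = h
    return S

def ls_reir(h_left, h_right, reir_len, delay):
    """Least-squares ReIR with the right channel delayed by `delay` samples."""
    S = conv_matrix(h_left, reir_len)
    x = np.zeros(S.shape[0])
    x[delay:delay + h_right.size] = h_right     # causality delay D
    g, *_ = np.linalg.lstsq(S, x, rcond=None)
    return g

rng = np.random.default_rng(0)
length, reir_len, D = 128, 48, 16               # assumed sizes
spike_l, spike_r = 5, 8                         # assumed direct-path taps
sigma, decay, trials = 0.05, 0.95, 200          # assumed tail parameters

est = np.empty((trials, reir_len))
for t in range(trials):
    h_l = make_rir(length, spike_l, sigma, decay, rng)
    h_r = make_rir(length, spike_r, sigma, decay, rng)
    est[t] = ls_reir(h_l, h_r, reir_len, D)

mean_reir = est.mean(axis=0)                    # approximates the specular component
envelope = est.std(axis=0)                      # empirical envelope of the ReIR tail
peak = D + spike_r - spike_l                    # tap where the two spikes coincide
```

Under these assumptions, `mean_reir` is close to a single spike at tap `peak`, and `envelope` is small before the spike and decays after it, which is the qualitative structure the appendix argues for.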
APPENDIX B: SBL MSE DERIVATION
The mean squared error properties of the estimator are derived below. Let be the true impulse response. The MSE (expected quadratic loss) for an estimator is
We now compute the MSE of the Bayes estimator given in Eq. (22), with fixed hyperparameters γi, c1, and c2 and true impulse response . The details of the derivation of the MSE expression can be found in Aravkin et al. (2014),
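As a numerical illustration of what such an MSE expression measures, the sketch below compares a closed-form bias-plus-variance MSE with a Monte Carlo estimate for a linear Bayes estimator. Eq. (22) is not reproduced in this section, so the estimator form used here (posterior mean under a zero-mean Gaussian prior with diagonal covariance Γ and white noise of variance σ²) is an assumption, as are the diagonal layout of Γ (free variances for the early taps, `c1 * exp(-c2 * n)` for the tail) and every numerical value.

```python
import numpy as np

rng = np.random.default_rng(1)
n_taps, n_obs, sigma2 = 8, 40, 0.1              # assumed problem sizes and noise variance
S = rng.standard_normal((n_obs, n_taps))        # stand-in for the Toeplitz data matrix
h0 = np.array([0.0, 1.0, 0.0, 0.5, 0.2, 0.1, 0.05, 0.02])   # hypothetical true response

# Hypothetical prior covariance: free variances on early taps, decaying tail.
gamma_early = np.array([0.5, 0.5, 0.5, 0.5])
c1, c2 = 0.1, 0.5
gamma_tail = c1 * np.exp(-c2 * np.arange(n_taps - gamma_early.size))
Gamma = np.diag(np.concatenate([gamma_early, gamma_tail]))

# Linear estimator h_hat = K x (Gaussian posterior mean, assumed form).
K = Gamma @ S.T @ np.linalg.inv(S @ Gamma @ S.T + sigma2 * np.eye(n_obs))

# Closed form: squared bias plus total variance of the estimator.
bias = (K @ S - np.eye(n_taps)) @ h0
mse_closed = bias @ bias + sigma2 * np.trace(K @ K.T)

# Monte Carlo check over independent noise realizations.
trials = 20000
noise = np.sqrt(sigma2) * rng.standard_normal((trials, n_obs))
X = S @ h0 + noise                              # one measurement vector per row
H_hat = X @ K.T
mse_mc = np.mean(np.sum((H_hat - h0) ** 2, axis=1))
```

The two numbers agree up to Monte Carlo error; changing the entries of `Gamma` changes both the bias and the variance terms, which is the trade-off the hyperparameter optimization in this appendix acts on.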
The given MSE expression is minimized with respect to the hyperparameters, i.e., γi, c1, and c2. We also assume that ,
Minimizing with respect to γi leads to the following optimality condition:
Minimizing with respect to c1 and c2 by setting the corresponding partial derivative to zero, we get
We then get the following optimality conditions for c1 and c2:
In the no noise assumption, letting , we get
The MSE estimates and will satisfy the optimality conditions derived above in Eqs. (B12) and (B13). We use a type II inference (evidence maximization) procedure to estimate the parameters
If the true impulse response is , then from our model, we get
For a long training sequence (Ljung, 1998) where the length , the minimum of the scaled function will be the minimum of its expected value ,
We use Sylvester's determinant identity to simplify this cost function,
Using this, along with the assumption that ,
We also use Woodbury's identity
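For reference, the general forms of the two identities are det(I_m + AB) = det(I_n + BA) for A of size m × n and B of size n × m (Sylvester), and (A + UCV)^{-1} = A^{-1} − A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1} (Woodbury). The short check below verifies both numerically on random matrices; the matrix names and sizes are arbitrary and are not tied to the symbols of the cost function in this appendix.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3

# Sylvester's determinant identity: det(I_m + A B) == det(I_n + B A)
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, m))
det_big = np.linalg.det(np.eye(m) + A @ B)
det_small = np.linalg.det(np.eye(n) + B @ A)

# Woodbury identity with positive-definite pieces so that every inverse exists
A0 = np.diag(rng.uniform(1.0, 2.0, m))
C = np.diag(rng.uniform(1.0, 2.0, n))
U = rng.standard_normal((m, n))
V = U.T
A0_inv = np.linalg.inv(A0)
direct = np.linalg.inv(A0 + U @ C @ V)
woodbury = A0_inv - A0_inv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ A0_inv @ U) @ V @ A0_inv
```

Both identities trade an m × m determinant or inverse for an n × n one, which is what makes them useful when one dimension is much smaller than the other.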
Using these two identities and the assumption in Eq. (B17), and collecting the terms involving γ, c1, and c2, leads to
Using the definition of Γ shown in Eq. (20), we split up the cost function in two parts: (function of γi for ) and (function of c1 and c2),
that minimizes the above cost function must satisfy
The minimizing c1 and c2 must satisfy the following conditions (first order):
Considering we get
APPENDIX C: ALTERNATIVE TO ZERO PADDING
The effect of the suggested modification in Sec. II is shown in Fig. 9. The two ReIRs compared here are designed to be δ-functions; one case with positive delay such that (right shifted) and the other case with the same delay but negative such that (left shifted). Though we expect δ-functions of magnitude 1 in both cases, a distortion in the negative delay case is observed when we zero pad (indicated by dashed line). This is the lost directional symmetry referred to in Sec. II. Using a longer measurement with appropriately selected segments as an alternative to zero-padding ensures directional symmetry in the estimate.
(Color online) Alternative to zero padding: zero padding leads to distortions in the estimated ReIR, which can be corrected using the proposed modification.
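The effect can be reproduced in a toy setting. The sketch below is one possible reading of the modification, not the authors' implementation: the left channel is a white source, the right channel is the same source shifted by `d` samples, and the ReIR is estimated by least squares with a causality delay `D`, once with zeros substituted for the missing samples and once with the delayed segment and the past left-channel samples taken from a longer recording. The function names, the margin of extra samples, and all sizes are assumptions.

```python
import numpy as np

def data_matrix(x, start, n_rows, n_taps):
    """Matrix with entry (n, k) equal to x[start + n - k]."""
    idx = start + np.arange(n_rows)[:, None] - np.arange(n_taps)[None, :]
    return x[idx]

def estimate_reir(S, y):
    g, *_ = np.linalg.lstsq(S, y, rcond=None)
    return g

rng = np.random.default_rng(0)
N, D, taps = 200, 16, 33            # segment length, causality delay, ReIR length (assumed)
margin = 64                         # extra recorded samples on each side (assumed available)
s = rng.standard_normal(N + 2 * margin)

def run(d):
    """d > 0: right channel lags the left by d samples; d < 0: it leads."""
    left = s
    right = np.roll(s, d)           # pure delay; wrap-around stays inside the margin
    t0 = margin                     # first sample of the analysis segment
    g_true = np.zeros(taps)
    g_true[D + d] = 1.0

    # (a) zero padding: short segments, missing samples replaced by zeros
    xl = left[t0:t0 + N]
    xr = right[t0:t0 + N]
    y_zp = np.concatenate([np.zeros(D), xr[:N - D]])
    xl_padded = np.concatenate([np.zeros(taps - 1), xl])
    S_zp = data_matrix(xl_padded, taps - 1, N, taps)

    # (b) longer measurement: use real samples instead of zeros
    y_long = right[t0 - D:t0 - D + N]
    S_long = data_matrix(left, t0, N, taps)

    return estimate_reir(S_zp, y_zp), estimate_reir(S_long, y_long), g_true

g_zp_neg, g_long_neg, g_true_neg = run(-8)   # left-shifted ReIR
g_zp_pos, g_long_pos, g_true_pos = run(+8)   # right-shifted ReIR
```

In this toy case, variant (b) recovers the unit δ-function for both signs of the delay, whereas the zero-padded estimate for the negative delay deviates from it, consistent with the loss of directional symmetry described above.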