Speech localization and enhancement involves sound source mapping and reconstruction from noisy recordings of speech mixtures with microphone arrays. Conventional beamforming methods suffer from low resolution, especially with a limited number of microphones. In practice, there are only a few sources compared to the possible directions-of-arrival (DOA). Hence, DOA estimation is formulated as a sparse signal reconstruction problem and solved with sparse Bayesian learning (SBL). SBL uses a hierarchical two-level Bayesian inference to reconstruct sparse estimates from a small set of observations. The first level derives the posterior probability of the complex source amplitudes from the data likelihood and the prior. The second level tunes the prior towards sparse solutions with hyperparameters which maximize the evidence, i.e., the data probability. The adaptive learning of the hyperparameters from the data auto-regularizes the inference problem towards sparse robust estimates. Simulations and experimental data demonstrate that SBL beamforming provides high-resolution DOA maps outperforming traditional methods especially for correlated or non-stationary signals. Specifically for speech signals, the high-resolution SBL reconstruction offers not only speech enhancement but also effective speech separation.
I. INTRODUCTION
Talker localization and separation are key aspects in computational auditory scene analysis, i.e., the segregation of sources from noisy and reverberant sound mixtures with signal processing. Multi-microphone processing systems are able to exploit both the spatial and spectral information of the wavefield and thus have improved performance compared to single-microphone systems.1,2 Multi-channel speech localization and enhancement algorithms find several applications including robot audition,3,4 tele-conferencing,5 and hearing aids.6,7
The problem of sound source localization in array signal processing is to infer the direction-of-arrival (DOA) of the source signals from noisy measurements of the wavefield with an array of microphones. Beamforming methods based on spatial filtering have low resolution or degraded performance for coherent arrivals, e.g., in reverberant conditions, or for non-stationary signals, when only a few observation windows (snapshots) are available.8 In acoustic imaging, there are usually only a few sources generating the observed wavefield such that the DOA map is sparse, i.e., it can be fully described by only a few parameters. Exploiting the underlying sparsity, sparse signal reconstruction improves significantly the resolution in DOA estimation.9–12 While p-norm regularized maximum likelihood methods, with p ≤ 1, have been proposed to promote sparsity in DOA estimation9–11,13 and wavefield reconstruction,14,15 the accuracy of the resulting sparse estimate is determined by the ad hoc choice of the regularization parameter.12,16
Sparse Bayesian learning (SBL) is a probabilistic parameter estimation approach which is based on a hierarchical Bayesian method for learning sparse models from possibly overcomplete representations resulting in robust maximum likelihood estimates.17,18 Specifically, the Bayesian formulation of SBL allows regularizing the maximum likelihood estimate with prior information on the model parameters. However, instead of explicitly introducing specialized model priors to reflect the underlying structure, SBL uses a hierarchical model which controls the scaling of a multivariate Gaussian prior distribution through individual hyperparameters for each model parameter. The hyperparameters are iteratively estimated from the data selecting the most relevant model features while practically nulling the probability of irrelevant features, hence promoting sparsity.17,19 Since SBL learns the hyperparameters from the data, it allows for automatic regularization of the maximum likelihood estimate which adapts to the problem under study.17,20 The hierarchical formulation of SBL inference offers both a computationally convenient Gaussian posterior distribution for adaptive processing (type-I maximum likelihood) and automatic regularization towards robust sparse estimates determined by the hyperparameters which maximize the evidence (type-II maximum likelihood).21
In array signal processing, SBL is shown to improve significantly the resolution in beamforming22 and in general the accuracy of DOA estimation,23–28 outperforming conventional methods notably in demanding scenarios with correlated or non-stationary signals. Multi-snapshot26 and multi-frequency23,24,27,28 SBL inference exploits the common sparsity profile across snapshots for stationary signals and across frequencies for broadband signals to provide robust estimates by alleviating the ambiguity in the spatial mapping between sources and sensors due to noise and frequency-dependent spatial aliasing, respectively. Accounting for the statistics of modelling errors in SBL estimation, e.g., due to uncertainty in sensor positions or sound speed, or due to basis mismatch, further improves support recovery.28,29
We use the SBL framework to solve the sound source localization problem of speech mixtures in noisy and reverberant conditions. We employ the multi-snapshot, multi-frequency SBL algorithm in Ref. 28 to reconstruct simultaneously the DOA and the complex amplitude of speech signals. SBL beamforming assumes a predefined spatial mapping between the sources and the sensors to infer the DOAs directly from the reconstructed source vector, as opposed to methods (including SBL-based7) which infer the DOA of a single target talker indirectly through the estimation of the relative transfer function between a pair of microphones.4,6 It is demonstrated both with simulations and experimental data that SBL beamforming offers unambiguous source localization outperforming traditional beamforming methods especially for correlated signals and single-snapshot measurements. The high-resolution SBL reconstruction offers not only speech enhancement over noise, but also speech separation between competing talkers.
Herein, vectors and matrices are represented by bold lowercase and uppercase letters, respectively. The superscripts T and H denote the transpose and the Hermitian, i.e., conjugate transpose, operator, respectively, on vectors and matrices. The superscript + denotes the generalized inverse operator on a matrix. A Q × Q identity matrix is denoted IQ. The ℓp-norm of a vector x ∈ C^N is defined as ‖x‖p = (Σ_{n=1}^{N} |xn|^p)^{1/p}. The Frobenius norm of a matrix X ∈ C^{N×L} is defined as ‖X‖F = (Σ_{n=1}^{N} Σ_{l=1}^{L} |xnl|²)^{1/2}.
II. ARRAY SIGNAL MODEL
Assuming narrowband processing, the complex-valued measurements at an M-element array, i.e., the data, are described by the vector

y(l) = [y1(f, l), …, yM(f, l)]^T ∈ C^M,    (1)

where ym(f, l) is the short-time Fourier transform (STFT) coefficient for the fth frequency and the lth time-frame (snapshot) of the recorded signal at the mth sensor, m = 1, …, M. The frequency index f is omitted from the vector's notation for simplicity.
At the far-field of the array, the location of a source is characterized by the DOA, θ, of the associated plane wave with complex amplitude x. Discretizing the angular space of interest into N directions, the vector of the complex-valued sound source amplitudes, i.e., the model parameters, for the fth frequency and the lth snapshot is

x(l) = [x1(f, l), …, xN(f, l)]^T ∈ C^N.    (2)
The array measurements are related to the model parameters with the linear model

Y = A X + N,    (3)

where Y = [y(1), …, y(L)] ∈ C^{M×L} contains the wavefield measurements at M sensors for L snapshots, X = [x(1), …, x(L)] ∈ C^{N×L} contains the unknown source amplitudes at N angular directions for L snapshots and N ∈ C^{M×L} is additive noise which is assumed independent across sensors and snapshots. The sensing matrix,

A = [a(θ1), …, a(θN)] ∈ C^{M×N},    (4)

has as columns the steering vectors a(θn) at each direction θn, n = 1, …, N, which describe the acoustic transfer function from a source at θn to all M sensors on the array. The sensing matrix is determined either analytically for simple array geometries, e.g., uniform linear arrays (ULA),11 spherical arrays baffled on a rigid sphere,30 or experimentally, e.g., from head-related transfer function (HRTF) measurements.31
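For illustration, the following minimal Python/NumPy sketch assembles a ULA sensing matrix and simulates the linear model of Eq. (3); the function name, the parameter values (M = 4, half-wavelength spacing, a 5° grid, two unit-amplitude sources) and the steering-vector convention (anticipating Eq. (31)) are illustrative assumptions, not a reference implementation.

    import numpy as np

    def ula_steering_matrix(M, d_over_lambda, thetas_deg):
        """Sensing matrix A (M x N) of ULA steering vectors, angles from broadside."""
        thetas = np.deg2rad(np.asarray(thetas_deg, dtype=float))
        m = np.arange(M)[:, None]                       # sensor index as a column
        return np.exp(-2j * np.pi * d_over_lambda * m * np.sin(thetas)[None, :])

    # Hypothetical configuration: M = 4 sensors, d/lambda = 1/2, 5-degree DOA grid,
    # L = 8 snapshots of two unit-amplitude sources at 0 and 30 degrees, 20 dB SNR.
    M, L = 4, 8
    grid = np.arange(-90, 91, 5)
    A = ula_steering_matrix(M, 0.5, grid)               # M x N sensing matrix
    X = np.zeros((grid.size, L), dtype=complex)         # N x L source amplitudes
    rng = np.random.default_rng(0)
    X[grid == 0, :] = np.exp(2j * np.pi * rng.random(L))
    X[grid == 30, :] = np.exp(2j * np.pi * rng.random(L))
    sigma2 = 10.0 ** (-20 / 10)                         # noise variance for 20 dB SNR
    noise = np.sqrt(sigma2 / 2) * (rng.standard_normal((M, L))
                                   + 1j * rng.standard_normal((M, L)))
    Y = A @ X + noise                                   # noisy array data, Eq. (3)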
III. DOA ESTIMATION
The problem of DOA estimation and source reconstruction with sensor arrays32 is to recover the sources X, given the sensing matrix A and a set of observations Y. Usually, there are only a few sources K ≪ N generating the acoustic field such that X is sparse in the angular space, i.e., has only a few non-zero components. However, precise localization requires fine angular resolution such that M < N and the problem in Eq. (3) is underdetermined, i.e., has infinitely many solutions.
An estimate can be obtained by spatially filtering the array data Y (beamforming), or by solving Eq. (3) with optimization or probabilistic methods for parameter estimation. For stationary sources, when X has a common row-wise sparsity profile, snapshots can be combined to improve the signal-to-noise ratio (SNR). Otherwise, the problem should be solved independently for each snapshot.
A. Spatial filtering
Spatial filtering of the recorded wavefield refers to applying direction-dependent complex weights to the sensor outputs to allow signals from a specific look-direction to pass undistorted while attenuating wavefield contributions from other directions. Applying a set of spatial weights, one for each look-direction, to steer the beamformer across the angular space yields the DOA estimate,

X̂ = W^H Y,    (5)

where W = [w(θ1), …, w(θN)] ∈ C^{M×N} has as columns the spatial weight vectors at each DOA θn, n = 1, …, N.
Accordingly, the beamformer power at direction θ is

P(θ) = w^H(θ) Sy w(θ),    (6)

where Sy = Y Y^H/L ∈ C^{M×M} is the sample data cross-spectral matrix from L snapshots. Note that, for broadband signals, spatial filtering methods are applied to each frequency separately according to the narrowband signal model Eq. (3).
1. Conventional beamforming
The conventional beamforming (CBF) is the simplest source localization method. The method uses the steering vectors as spatial weights, i.e.,

wCBF(θ) = a(θ) / [a^H(θ) a(θ)],    (7)

to combine the sensor outputs coherently, enhancing the signal at the look-direction over the ubiquitous noise. CBF is robust to noise and can be used even with single-snapshot data, L = 1, but is characterized by low resolution and the presence of sidelobes.
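A minimal sketch of the CBF power spectrum of Eq. (6), reusing the NumPy arrays A, Y, and L from the previous sketch; the distortionless weight normalization w(θ) = a(θ)/(a^H(θ) a(θ)) follows Eq. (7).

    # Sample data cross-spectral matrix from L snapshots, as in Eq. (6).
    Sy = (Y @ Y.conj().T) / L

    # CBF weights: steering vectors normalized for a distortionless response.
    W_cbf = A / np.sum(np.abs(A) ** 2, axis=0)          # w(theta) = a / (a^H a)
    P_cbf = np.real(np.einsum('mn,mk,kn->n', W_cbf.conj(), Sy, W_cbf))
    P_cbf_db = 10 * np.log10(P_cbf / P_cbf.max())       # normalized power spectrum in dB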
2. Minimum variance distortionless response beamforming
The minimum variance distortionless response (MVDR) beamforming33 weight vector is obtained by minimizing the output power of the beamformer under the constraint that the signal from the look direction, θ, remains undistorted,

min_w w^H Sy w  subject to  w^H a(θ) = 1,    (8)

resulting in the optimal weight vector,

wMVDR(θ) = (Sy + β IM)^{−1} a(θ) / [a^H(θ) (Sy + β IM)^{−1} a(θ)],    (9)

where diagonal loading with regularization parameter β is used to regularize the inverse of the sample covariance matrix whenever it is rank deficient. Note that replacing the data sample covariance matrix with the noise sample covariance matrix in Eqs. (8) and (9) results in an equivalent derivation of the MVDR weights.32 However, in practical applications it is more difficult to obtain a robust estimate of the noise separately from the measured data. MVDR beamforming offers high-resolution DOA maps but its performance degrades significantly for snapshot-starved data, L < M, correlated arrivals and low-SNR conditions.
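Continuing the same sketch, the MVDR weights of Eq. (9) with diagonal loading can be computed as follows; the loading β = σ² mirrors the choice used later in Fig. 1 and is otherwise an assumption.

    # MVDR weights with diagonal loading, Eq. (9).
    beta = sigma2                                       # diagonal loading parameter
    Sy_inv_A = np.linalg.solve(Sy + beta * np.eye(M), A)        # (Sy + beta I)^-1 a(theta)
    den = np.real(np.einsum('mn,mn->n', A.conj(), Sy_inv_A))    # a^H (Sy + beta I)^-1 a
    W_mvdr = Sy_inv_A / den
    P_mvdr = np.real(np.einsum('mn,mk,kn->n', W_mvdr.conj(), Sy, W_mvdr))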
B. Probabilistic parameter estimation
The problem of DOA estimation can be formulated in a probabilistic framework by considering both the unknowns X and the observations Y as stochastic processes and solved with Bayesian inference.16
Bayes' theorem,

p(X|Y) = p(Y|X) p(X) / p(Y),    (10)

derives the posterior distribution of the model parameters X, i.e., the complex source amplitudes, conditioned on the data Y, i.e., the sensor measurements, from the data likelihood p(Y|X), the prior distribution of the model parameters p(X) and the marginal distribution of the data p(Y). The maximum a posteriori (MAP) estimate,

X̂MAP = argmax_X p(X|Y) = argmax_X p(Y|X) p(X),    (11)

is used for DOA reconstruction. Here, p(Y) is omitted from the optimization since, being marginalized over X, it does not depend on X.
The probabilistic formulation (11) provides a regularized solution to the DOA estimation problem (3) based on prior information. To demonstrate the effect of prior information on the estimate, consider the single-snapshot case, L = 1, with data y and model parameters x. Assuming that the additive noise is independent and identically distributed (iid) circularly symmetric complex Gaussian with variance σ², the data likelihood is also complex Gaussian distributed,

p(y|x) = CN(y; A x, σ² IM) ∝ exp(−‖y − A x‖₂²/σ²).    (12)
Employing a general expression for the prior p(x) based on the multivariate generalized complex Gaussian distribution,34

p(x) ∝ exp(−‖x‖p^p/νp),    (13)

where νp > 0 is the scaling parameter and p > 0 is the shape parameter, the MAP estimate (11) is expressed as a regularized least-squares (R-LS) problem,

x̂ = argmin_x ‖y − A x‖₂² + μ‖x‖p^p,    (14)
where μ = σ²/νp ≥ 0 is the regularization parameter which controls the relative importance between the data fit and the regularization term. The characteristics of the MAP estimate depend on the choice of the shape parameter p and the regularization parameter μ.12
For example, assuming that the model parameters follow an iid complex Gaussian distribution, p(x) ∝ exp(−‖x‖₂²/ν₂), i.e., p = 2, problem (14) becomes an ℓ2-norm regularized least-squares problem which has an analytic solution,

x̂ℓ2 = A^H (A A^H + μ IM)^{−1} y.    (15)
The ℓ2-norm regularizer penalizes the energy in the solution, hence the estimate (15) is smooth and robust to noise but has low resolution. Note that CBF is related to the ℓ2-norm estimate for large μ,12

x̂ℓ2 ≈ (1/μ) A^H y ∝ x̂CBF.    (16)
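A single-snapshot sketch of the ℓ2-regularized (Tikhonov) estimate of Eq. (15), reusing A, Y, and M from the earlier sketches; the value of the regularization parameter mu is an arbitrary assumption.

    # l2-regularized least-squares estimate, Eq. (15), for a single snapshot y.
    y = Y[:, 0]
    mu = 0.1                                            # hypothetical regularization parameter
    x_l2 = A.conj().T @ np.linalg.solve(A @ A.conj().T + mu * np.eye(M), y)
    # For large mu, x_l2 tends to A^H y / mu, i.e., a scaled CBF map, cf. Eq. (16).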
Contrarily, assuming that the model coefficients follow a Laplacian-like distribution for complex random variables,35

p(x) ∝ exp(−‖x‖₁/ν) = exp(−(1/ν) Σn |xn|),    (17)

the MAP estimate (14) becomes the solution to an ℓ1-norm regularized least-squares problem,

x̂ℓ1 = argmin_x ‖y − A x‖₂² + μ‖x‖₁,    (18)

which is known as the least absolute shrinkage and selection operator36 (Lasso) since the ℓ1-norm regularizer shrinks the model coefficients towards zero as the regularization parameter, μ = σ²/ν, increases.
As opposed to a Gaussian prior, the Laplacian-like prior distribution encourages sparse solutions as it concentrates more probability mass at zero and in the tails. Thus the ℓ1-norm estimate improves significantly the resolution in DOA estimation in the presence of only a few sources.11,12 The ℓ1-norm minimization problem (18) can be solved with convex optimization algorithms37 which can be computationally intensive. Besides, the accuracy of the ℓ1-norm estimate (18) depends on the regularization parameter which determines the degree of sparsity in the estimate and requires knowledge of the hyperparameters, i.e., σ² and ν, of the underlying probability distributions (12) and (17).
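The Lasso problem of Eq. (18) can be solved with any convex optimization solver; as one common choice (not necessarily the solver used in the cited works), the following sketch applies iterative soft-thresholding (ISTA) to complex-valued amplitudes, reusing A and y from above.

    def ista_complex(A, y, mu, n_iter=200):
        """ISTA for min_x ||y - A x||_2^2 + mu ||x||_1 with complex amplitudes."""
        t = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)       # step size = 1 / Lipschitz constant
        x = np.zeros(A.shape[1], dtype=complex)
        for _ in range(n_iter):
            r = x + 2 * t * (A.conj().T @ (y - A @ x))  # gradient step on the data-fit term
            mag = np.maximum(np.abs(r) - t * mu, 0.0)
            x = mag * np.exp(1j * np.angle(r))          # soft-thresholding of the magnitude
        return x

    x_l1 = ista_complex(A, y, mu=0.5)                   # sparse single-snapshot DOA map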
1. Sparse Bayesian learning beamforming
The SBL framework uses a hierarchical approach to probabilistic parameter estimation. Instead of employing specialized prior models, e.g., Eq. (17), to explicitly promote sparse maximum likelihood estimates, e.g., Eq. (18), SBL uses a Gaussian prior, x ~ CN(0, Γ), with diagonal covariance matrix Γ = diag(γ), and controls the sparsity in the estimate by scaling the model parameters, x, with individual hyperparameters, γ = [γ1, …, γN]^T. The hyperparameters are estimated from the data and control the variances of each coefficient in x, i.e., the source powers. Given that the model parameters are independent across snapshots, the multi-snapshot prior distribution is

p(X; γ) = ∏_{l=1}^{L} CN(x(l); 0, Γ).    (19)
Similarly, assuming that the noise is zero-mean complex Gaussian, independent both across sensors and snapshots such that n(l) ~ CN(0, σ² IM) with covariance matrix σ² IM, the multi-snapshot data likelihood is

p(Y|X; σ²) = ∏_{l=1}^{L} CN(y(l); A x(l), σ² IM).    (20)
Given the Gaussian prior (19) and likelihood (20) for independent snapshots, the posterior distribution for X is also Gaussian,

p(X|Y; γ, σ²) = ∏_{l=1}^{L} CN(x(l); μx(l), Σx),    (21)

where

μx(l) = Γ A^H Σy^{−1} y(l),    (22)

Σx = Γ − Γ A^H Σy^{−1} A Γ,    (23)

are the posterior mean and covariance, respectively, and

Σy = σ² IM + A Γ A^H    (24)

is the data covariance matrix. Given the hyperparameters Γ, or simply γ since Γ is considered diagonal, and σ², the MAP estimate (11) is the posterior mean (22), X̂MAP = [μx(1), …, μx(L)]. Note that the sparsity of X̂MAP is dictated by the sparsity profile of the hyperparameters γ, i.e., x̂n = 0 if γn = 0.
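Given current values of the hyperparameters γ and σ², the posterior mean of Eq. (22) follows directly from Eqs. (22)–(24). A minimal sketch, reusing A, Y, M, and sigma2 from the earlier sketches; the flat initialization of gamma is an assumption.

    # Posterior mean of the source amplitudes, Eqs. (22) and (24),
    # for given hyperparameters gamma (length N) and noise variance sigma2.
    gamma = np.ones(A.shape[1])                         # hypothetical current hyperparameters
    Sigma_y = sigma2 * np.eye(M) + (A * gamma) @ A.conj().T               # Eq. (24)
    X_map = (gamma[:, None] * A.conj().T) @ np.linalg.solve(Sigma_y, Y)   # Gamma A^H Sigma_y^-1 Y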
In SBL the hyperparameters γ and σ² are estimated from the evidence, i.e., the unconditional probability distribution of the data marginalized over the model parameters X,

p(Y; γ, σ²) = ∫ p(Y|X; σ²) p(X; γ) dX = ∏_{l=1}^{L} CN(y(l); 0, Σy).    (25)
First, the hyperparameters γ are estimated with a type-II maximum likelihood, i.e., by maximizing the evidence,

γ̂ = argmax_{γ≥0} p(Y; γ, σ²) = argmin_{γ≥0} [Tr(Σy^{−1} Sy) + ln det Σy],    (26)

where Tr(·) and det(·) denote, respectively, the trace and determinant operators on a matrix. The objective function of the resulting minimization problem (26) is non-convex.37 However, problem (26) can be solved approximately by differentiating the objective function to obtain the fixed point updates,26,28

γ̂n^i = γ̂n^{i−1} [a^H(θn) Σy^{−1} Sy Σy^{−1} a(θn)] / [a^H(θn) Σy^{−1} a(θn)],  n = 1, …, N,    (27)

where γ̂n^i is the estimated variance of the nth model parameter, i.e., the estimated power of a source at direction θn, at the ith iteration, and Σy is evaluated with the estimates from iteration i − 1.
Then, the estimation of the hyperparameter σ² is based on a stochastic maximum likelihood procedure,26

σ̂² = Tr[(IM − Aℳ Aℳ^+) Sy] / (M − K),    (28)

where ℳ is the set of the active indices indicating the position of the K largest peaks in γ̂ such that Aℳ = [a(θn)]_{n∈ℳ} ∈ C^{M×K} contains the corresponding steering vectors.
Up to this point, the derivation is based on the narrowband model (3). For broadband signals, we can exploit the common sparsity profile across frequencies to enhance the sparsity of the estimate γ̂. The narrowband estimates Eq. (27) can either be combined incoherently over F frequencies,

γ̂n = (1/F) Σ_{f=1}^{F} γ̂n(f),    (29)

or coherently, assuming a prior with common covariance across frequencies, which results in a unified update rule for all frequencies,28

γ̂n^i = γ̂n^{i−1} [Σ_{f=1}^{F} a^H(θn, f) Σy^{−1}(f) Sy(f) Σy^{−1}(f) a(θn, f)] / [Σ_{f=1}^{F} a^H(θn, f) Σy^{−1}(f) a(θn, f)].    (30)
Table I summarizes the algorithm for SBL DOA estimation. The beamformer power spectrum is readily given by the hyperparameters γ̂ which represent the source powers. For amplitude reconstruction, the unbiased estimate, X̂ℳ = Aℳ^+ Y, is used instead of the MAP estimate as it provides more accurate estimates.38 Nevertheless, highly correlated steering vectors, e.g., at very low frequencies, will increase the condition number of Aℳ and, consequently, the error in the corresponding matrix inversion. For narrowband estimation set F = 1, in which case the update rules (29) and (30) are equivalent, i.e., they reduce to Eq. (27). The details of the derivation of the hyperparameter update rules Eqs. (27) and (28) and of the algorithm for the implementation of the SBL beamformer are given in Refs. 26 and 28.
TABLE I. Algorithm for SBL beamforming.

Inputs: A, Y
Initializations: i = 0, ϵ = 1, γ⁰ = 1
Parameters: Niter, ϵmin, K
1: while i < Niter and ϵ > ϵmin
2:   Update i = i + 1
3:   Compute Σy using (24), with γ^(i−1) and σ²^(i−1)
4:   Update γ^i using (27) [(29) or (30) for broadband signals]
5:   Find ℳ, the set of the K largest peaks in γ^i
6:   Update σ²^i using (28)
7:   Update the convergence rate ϵ
8: end
Output: γ̂, σ̂²
Signal estimate: X̂ℳ = Aℳ^+ Y
Beamformer power: P(θn) = γ̂n
Note that the discretization of the problem (3) onto a predefined angular grid may affect the accuracy of the SBL estimate. This is either due to basis mismatch for grids that are too coarse to capture the true DOAs of the signals or due to high correlation of adjacent steering vectors for dense grids. Such uncertainty can be incorporated into the model as additive or multiplicative noise and the effect of modelling error can be mitigated by tuning the hyperparameters that control its statistics.28,29 In the interest of algorithm simplicity for practical applications, modelling errors are neglected herein. Moreover, we assume K is known in step 5 of the algorithm in Table I; otherwise it can be determined with model order identification methods.23
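For concreteness, the following narrowband (F = 1) sketch implements the iteration of Table I in Python/NumPy. The fixed-point update of γ and the noise estimate follow the forms given in Eqs. (27) and (28) above, while the initializations and the convergence metric are illustrative assumptions rather than the exact implementation of Refs. 26 and 28.

    def sbl_beamformer(A, Y, K, n_iter=20, eps_min=1e-3):
        """Narrowband multi-snapshot SBL beamforming (illustrative sketch of Table I)."""
        M, N = A.shape
        L = Y.shape[1]
        Sy = (Y @ Y.conj().T) / L                       # sample cross-spectral matrix
        gamma = np.ones(N)                              # hyperparameters (source powers)
        sigma2 = 0.1 * np.real(np.trace(Sy)) / M        # assumed noise-variance initialization
        for _ in range(n_iter):
            gamma_old = gamma.copy()
            Sigma_y = sigma2 * np.eye(M) + (A * gamma) @ A.conj().T   # Eq. (24)
            B = np.linalg.solve(Sigma_y, A)             # Sigma_y^-1 A
            # Fixed-point update of the hyperparameters, Eq. (27).
            num = np.real(np.einsum('mn,mk,kn->n', B.conj(), Sy, B))
            den = np.real(np.einsum('mn,mn->n', A.conj(), B))
            gamma = gamma * num / den
            # Active set: the K largest peaks of gamma, step 5 of Table I.
            active = np.argpartition(gamma, -K)[-K:]
            A_act = A[:, active]
            # Stochastic maximum-likelihood noise estimate, Eq. (28).
            P_perp = np.eye(M) - A_act @ np.linalg.pinv(A_act)
            sigma2 = np.real(np.trace(P_perp @ Sy)) / (M - K)
            # Convergence check on the relative change of gamma (illustrative choice).
            eps = np.linalg.norm(gamma - gamma_old, 1) / np.linalg.norm(gamma_old, 1)
            if eps < eps_min:
                break
        X_hat = np.zeros((N, L), dtype=complex)
        X_hat[active] = np.linalg.pinv(A_act) @ Y       # unbiased amplitude estimate
        return gamma, sigma2, X_hat

    # Usage with the simulated ULA data from the sketches of Sec. II:
    gamma_hat, sigma2_hat, X_hat = sbl_beamformer(A, Y, K=2)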
C. Comparison of beamforming methods
Figure 1 compares the DOA power spectra of CBF, MVDR, and SBL beamforming [Eq. (6) and γ̂, respectively] on a simple configuration with a ULA. For a ULA with M sensors, the sensing matrix (4) is defined by the steering vectors,32

a(θn) = [1, e^{−j2π(d/λ) sin θn}, …, e^{−j2π(M−1)(d/λ) sin θn}]^T,    (31)

where d is the uniform inter-sensor spacing, λ is the wavelength and θn is the nth direction of arrival measured from the array broadside. To demonstrate the resolution capabilities of the beamformers, two sources are introduced with equal deterministic amplitude and random phase uniformly distributed in [0, 2π) on a grid with angular spacing 5°. Note that the DOA spectrum is limited within [−90°, 90°] due to the left-right ambiguity of the ULA [i.e., a(θn) = a(180° − θn)].32 The noise variance σ² is determined by the SNR given the average source power across snapshots.
(Color online) CBF, MVDR, and SBL power spectra from L snapshots for two equal-strength sources at 0° and 30° as the clean signal with a uniform linear array with M = 4 sensors and spacing d/λ = 1/2 for (a) SNR = 20 dB, L = 2 M, uncorrelated sources, (b) SNR = 20 dB, L = 1, uncorrelated sources, (c) SNR = 20 dB, L = 2 M, correlated sources, (d) SNR = 0 dB, L = 2 M, uncorrelated sources.
For uncorrelated sources, high SNR and sufficient snapshots, all beamforming methods indicate the presence of the two sources as peaks in the power spectrum, albeit CBF with low resolution and a prominent sidelobe at around −50°, Fig. 1(a). The high-resolution performance of the MVDR beamformer, which involves the inverse of the sample covariance matrix, degrades significantly for single-snapshot data and correlated sources, Figs. 1(b) and 1(c). Regularization of the MVDR weights Eq. (9), here β = σ², smooths the MVDR estimate towards the low-resolution CBF estimate. The sparsity-promoting SBL beamformer invariably offers high-resolution reconstruction, with single-snapshot data and correlated arrivals, even at low SNR, Fig. 1(d). The spurious peaks (e.g., around −45°) in the SBL power spectrum, γ̂, at low SNR, Fig. 1(d), do not affect the unbiased amplitude estimate, X̂ℳ, as the sparsity level is set to K = 2 at step 5 of the algorithm in Table I.
The results in Fig. 1 indicate that the SBL beamformer offers robust DOA estimation, particularly in the case of snapshot-starved data, e.g., for non-stationary signals, and reverberant environments. As opposed to the CBF and MVDR beamformers which are implemented as spatial filters, the SBL beamformer involves an iterative estimation of the likelihood and prior hyperparameters. Figure 2 shows that the convergence rate, ϵ, decreases rapidly with the number of iterations while the CPU time on an Intel Core i5 increases linearly. The computational time for a single SBL iteration is ca. 3 ms compared to ca. 0.07 ms for CBF and ca. 0.1 ms for MVDR. Nevertheless, the gain in reconstruction accuracy justifies the additional computational cost of SBL. Notably, the computational time is almost constant with the number of snapshots.26
(a) Convergence rate ϵ and (b) computational time of the SBL beamformer algorithm per number of iterations, at SNR = 20 dB (solid line) and 0 dB (dashed line).
In the following, the parameters of the SBL algorithm in Table I are set to Niter = 20 and ϵmin = 0.001. These values offer adequate estimation accuracy and computational efficiency (see Fig. 2) for problems of small dimensions, e.g., M = 4, N = 37, which are typical31,39 for the speech processing applications in focus. More iterations might be required for the SBL algorithm to converge for larger problems.26 The sparsity K is set to the number of sources in each case. Since speech is broadband, the multi-frequency update rule (30) is used for the SBL reconstruction.
IV. SIMULATION RESULTS
A listening scenario of interest where speech enhancement and separation is beneficial for speech intelligibility involves focusing at a reference talker in the presence of noise, competing talkers and reverberation. The performance of CBF, MVDR, and SBL beamforming in such conditions is demonstrated, herein, with simulations.
For the simulations, a ULA (31) is considered with M = 4 sensors. The inter-sensor spacing is d = 2.86 cm to avoid spatial aliasing, i.e., d/λ < 1/2, for frequencies up to 6 kHz which is the upper frequency for high speech quality, assuming airborne propagation with sound speed c = 343 m/s. The sources are speech excerpts from the EUROM1 English corpus40 including both male and female talkers of 1 s duration resampled at fs = 16 kHz. The speech excerpts, due to their short duration, have constant voice activity without silent intervals. Hence, the root-mean-square value of the target source, xrms = [(1/T) Σ_{t=1}^{T} x²(t)]^{1/2}, where T is the total number of samples, is used to determine the noise variance in relation to the SNR, σ² = xrms² 10^{−SNR/10}. A DOA grid [−90°: 5°: 90°] is considered.
The signals are processed in 40 ms frames with 10% overlap. Each frame is further divided into 8 ms snapshots with 50% overlap resulting in L = 9 snapshots per frame. This way, the signal per frame can be approximated as stationary while having enough snapshots, L > 2M, for a statistically robust sample data cross-spectral matrix (as in Ref. 41). A Hanning window is applied to each snapshot followed by a STFT. The resulting narrowband signals, for each frequency in the spectrum ranging from 0 to 8 kHz, are processed with the steered beamforming methods for DOA estimation as detailed in Sec. III. Finally, for each direction on the resulting DOA map, an inverse STFT is applied to the reconstructed signals which are resynthesized to the time domain with the overlap-and-add procedure.42
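A sketch of the frame-to-snapshot segmentation described above, using the stated parameters (fs = 16 kHz, 8 ms snapshots with 50% overlap within a 40 ms frame); the helper name and the use of NumPy's FFT are assumptions, and the full analysis-synthesis chain with overlap-and-add is omitted.

    fs = 16000
    frame_len = int(0.040 * fs)                         # 40 ms frame (640 samples)
    snap_len = int(0.008 * fs)                          # 8 ms snapshot (128 samples)
    hop = snap_len // 2                                 # 50% overlap between snapshots

    def frame_to_snapshots(frame):
        """Split one multichannel frame (M x frame_len) into windowed STFT snapshots.

        Returns an array of shape (n_freq, M, L), i.e., one M x L data matrix Y(f)
        per frequency bin (L = 9 snapshots for the parameters above)."""
        window = np.hanning(snap_len)
        starts = range(0, frame.shape[1] - snap_len + 1, hop)
        snaps = np.stack([np.fft.rfft(frame[:, s:s + snap_len] * window, axis=1)
                          for s in starts], axis=-1)    # (M, n_freq, L)
        return snaps.transpose(1, 0, 2)                 # (n_freq, M, L)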
Figure 3 depicts the DOA maps for the simple case of a single talker in the presence of additive noise at SNR = 15 dB, along with the spectrograms of the reconstructed signals at selected directions, calculated over frames of 40 ms duration, Hanning weighting and 50% overlap. Specifically, Fig. 3(a) indicates the actual source distribution across time and DOA. There is a single source of male speech at θ = 50° with frequency spectrum per time frame shown in the spectrogram in Fig. 3(b). The CBF, MVDR, and SBL estimates are depicted in Figs. 3(c), 3(f), and 3(i), respectively. In this case with a single source, additive noise at high SNR and sufficient snapshots, all methods reconstruct accurately the target signal at θ = 50° as shown in the corresponding spectrograms, Figs. 3(d), 3(g), and 3(j).
DOA maps for a single source (male talker) at 50° with additive noise at SNR = 15 dB for (a) the original signal, (c) CBF, (f) MVDR, and (i) SBL reconstruction. Spectrograms of the (b) clean signal, (d) CBF, (g) MVDR, and (j) SBL estimates at θ = 50°. Spectrograms of the (e) CBF and (h) MVDR estimates at θ = −50°.
However, the low-resolution CBF spreads the energy across the whole angular spectrum making DOA estimation very difficult. For example, there is significant energy at θ = −50°, especially at low frequencies, due to the single source at θ = 50°; see Fig. 3(e). This is explained by the coherence of the steering vectors (31) at different frequencies as indicated by the Gram matrices, (1/M) A^H A, in Fig. 4. Note that each row of the Gram matrix, (1/M) a^H(θ) A, is the CBF beampattern for a unit source at θ. At low frequencies the array aperture is too small to detect phase differences of the recorded wavefield across sensors and the CBF estimate is almost omnidirectional, Fig. 4(a). The CBF estimate becomes more directive for higher frequencies, Fig. 4(b), while for d/λ > 1/2 grating lobes appear in the estimate due to spatial aliasing, Fig. 4(c). The directionality characteristics of CBF depicted in Fig. 4 indicate that processing only higher frequencies (e.g., above 2 kHz for the particular configuration) could improve the corresponding DOA estimates. However, this is not a suitable option for short-time processing of speech signals which have little energy (if any) at high frequencies, as DOA estimation would fail due to the absence of signal.
Gram matrices (1/M) A^H A indicating the coherence pattern of the steering vectors (31) for a ULA with M = 4 sensors and d = 2.86 cm uniform spacing at (a) f = 1 kHz, (b) f = 5 kHz and (c) f = 7 kHz.
MVDR improves the resolution, Fig. 3(h), while SBL offers very accurate DOA estimation. Note that the spectrograms for the signal at θ = −50° in the clean and SBL DOA map are omitted since their energy is below the plotted dynamic range.
Figure 5 demonstrates the DOA estimation performance of CBF, MVDR, and SBL beamforming in the case of two sources, namely, a male talker at 0° and a female talker at 30° as shown in Fig. 5(a), and additive noise at SNR = 15 dB. The low-resolution CBF offers smooth DOA reconstruction, Fig. 5(d), which results in poor localization and hence poor signal separation. For example, the CBF estimate at 0° [Fig. 5(e)] contains energy not only from the source at 0° [Fig. 5(b)] but also from the source at 30° [Fig. 5(c)] and vice versa [Fig. 5(f)]. The MVDR estimate has improved resolution [Fig. 5(g)], attenuating more effectively signals from directions other than the focusing one [Figs. 5(h) and 5(i)]. SBL offers high spatial selectivity and hence source separation [Figs. 5(j)–5(l)].
(Color online) DOA maps for a source (male talker) at 0° and a source (female talker) at 30° with additive noise at SNR = 15 dB for (a) the original signal, (d) CBF, (g) MVDR, and (j) SBL reconstruction. Spectrograms of the (b) clean signal, (e) CBF, (h) MVDR, and (k) SBL estimates at θ = 0°. Spectrograms of the (c) clean signal, (f) CBF, (i) MVDR, and (l) SBL estimates at θ = 30°. The blue box indicates an example of a time-frequency region where there is significant energy from the source at 0° and almost no energy from the source at 30° and vice versa within the red box.
Finally, Fig. 6 shows the corresponding results to Fig. 5 when the source at 30° is a replica of the source at 0°. In this case, the sources are correlated, e.g., in the presence of strong reflections due to reverberant listening environments, and the MVDR estimate degenerates, merging the two sources into one and localizing it in between the true source directions [Figs. 6(g)–6(i)]. The SBL beamformer localizes the two coherent sources accurately [Figs. 6(j)–6(l)].
The respective DOA maps and spectrograms as in Fig. 5 replacing the signal at 30° with a replica of the signal at 0°.
A. Performance metrics
The results in Figs. 3, 5, and 6 indicate qualitatively the performance of CBF, MVDR, and SBL DOA estimation in the presence of both uncorrelated and correlated sources under high-SNR listening conditions. To evaluate the performance of CBF, MVDR, and SBL beamforming quantitatively as a function of SNR, the following performance metrics are introduced:
- The relative root-mean-square error at the focusing direction,

rrmse = 20 log10 ( ‖x̂(θf, t) − x(θf, t)‖₂ / ‖x(θf, t)‖₂ ) dB,    (32)

which indicates the relative noise level of the reconstructed signal x̂(θf, t) at the focusing direction θf with respect to the clean signal x(θf, t). The rrmse for the unprocessed data, e.g., the recorded signal at the mth microphone ym(t), indicates the relative noise level in the measurements, yielding the SNR. Hence, the SNR improvement due to the beamforming estimate is the difference between the rrmse of the unprocessed data and the rrmse of the estimate.
- The beamformer's directivity,

D = ‖x̂(θf, t)‖₂² / [(1/N) Σ_{n=1}^{N} ‖x̂(θn, t)‖₂²],    (33)

or equivalently the directivity index DI = 10 log10 D, which indicates the ratio of the power of the reconstructed signal at the focusing direction θf to the mean power of the reconstructed signal over all N directions on the angular grid. Thus, for an omnidirectional reconstruction, i.e., when the mean power over all directions on the grid is equal to the power at the focusing direction, D = 1 or DI = 0 dB. The more a beamformer suppresses the signal from directions other than the focusing one, the larger is its directivity and the more accurate the DOA estimate. For a superdirective beamformer, such that x̂(θn, t) = 0 for all θn ≠ θf, the directivity is maximized, D = N. A minimal computation of these two metrics is sketched after this list.
- The short-time objective intelligibility (STOI) measure,43 which is used to predict the speech intelligibility of the beamformed signal, hence evaluate the perceptual consequences of the beamforming algorithm. STOI receives as inputs a clean reference signal and a degraded version of it due to noise and/or distortion and outputs the correlation coefficient (0 for unintelligible speech, 1 for fully intelligible speech) between the temporal envelopes of the input signals in short-time (384 ms) segments. STOI correlates well with subjective evaluation of speech intelligibility, i.e., with results from listening experiments.
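A minimal NumPy sketch of the first two metrics (the STOI score relies on the reference implementation of Ref. 43 and is not reproduced here); here x_est and x_ref are hypothetical time signals at the focusing direction, and x_hat is the N x T matrix of resynthesized signals over the DOA grid.

    def rrmse_db(x_est, x_ref):
        """Relative root-mean-square error at the focusing direction, Eq. (32), in dB."""
        return 20 * np.log10(np.linalg.norm(x_est - x_ref) / np.linalg.norm(x_ref))

    def directivity_index(x_hat, focus_idx):
        """Directivity index DI = 10 log10(D), Eq. (33), from the N x T matrix x_hat
        of reconstructed time signals over the DOA grid."""
        powers = np.mean(np.abs(x_hat) ** 2, axis=1)    # per-direction signal power
        D = powers[focus_idx] / np.mean(powers)
        return 10 * np.log10(D)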
The performance of CBF, MVDR, and SBL beamforming in reconstructing a target source at 0° in the presence of additive noise at a range of [−5:5:15] dB SNR is evaluated. Two noise types are examined, broadband white noise and babble noise constructed by overlapping speech from six talkers in the EUROM1 English corpus.40 For each noise type and at each SNR, beamforming estimates are obtained for 100 random realizations of speech and noise. The mean statistics of the performance metrics, namely, the rrmse at the focusing direction (32), the directivity (33), and the STOI score, are shown in Fig. 7.
(Color online) Mean values of (a) the rrmse at θf = 0°, (b) the directivity index DI, and (c) the STOI score for CBF, MVDR, and SBL beamforming reconstruction of a target source at 0° in the presence of white (solid lines) or babble noise (dashed lines) as a function of SNR from 100 random realizations. For comparison, the corresponding values for the data, i.e., the unprocessed signal from the first microphone on the array, are depicted.
All beamforming methods improve the SNR when focused at the direction of the target source compared to the SNR of the omnidirectional data for both noise types, Fig. 7(a). Consequently, the speech signal at 0° is enhanced over noise as indicated by the STOI scores in Fig. 7(c). However, the conventional CBF and MVDR beamformers have low directivity, Fig. 7(b), resulting in low resolution DOA maps with energy across the whole angular spectrum; e.g., see Figs. 3(c) and 3(f). Only the superdirective SBL beamformer, Fig. 7(b), offers unambiguous DOA estimation.
V. EXPERIMENTAL RESULTS
The high-resolution DOA estimation and speech separation capabilities of SBL are validated with experimental data in multi-talker, noisy, reverberant listening conditions. The measurement prototype comprises a workshop safety helmet, circularly perforated above the cap, and 8 microphones which are mounted on the front part of the helmet in a semicircular configuration with a uniform angular spacing of 22.5°. The sensing matrix A for this array configuration is determined experimentally through the HRTFs. To obtain the HRTFs, the helmet is fitted on a Knowles electronics mannequin for acoustics research (KEMAR) and placed on a turning base in the anechoic chamber at GN Hearing A/S, Ballerup, Denmark. Impulse responses are recorded for all microphones at a sampling frequency fs = 24 414 Hz, sequentially while rotating KEMAR by 2° until completing a full-circle rotation (θ = 0°: 2°: 360°, N = 181).
The measurement setup involves two speakers, the first exactly in front of KEMAR, at 0°, playing 2 s of male speech and the other towards the left ear, at −90°, playing 2 s of female speech. Both speakers were elevated to the plane of the array and placed at a radial distance of 1 m from KEMAR; see Fig. 8. The arrangement is set in an anechoic chamber, where measurements are taken with the full array as shown in Fig. 8(a) as a reference scenario, as well as in a populated canteen, as a challenging listening environment, where only the four microphones that lie above the ears are considered, as shown in Fig. 8(b). All locations are at the facilities of GN Hearing A/S, Ballerup, Denmark. The signals are processed in single-snapshot, 20 ms frames with 50% overlap. A Hanning window followed by a STFT is applied to each frame and the resulting narrowband signals are beamformed with CBF and SBL. MVDR beamforming is omitted here due to the single-snapshot processing. The resulting steered responses are resynthesized with the overlap-and-add procedure.42
(Color online) Measurement setup and microphone positions for the considered array configurations.
Figure 9 shows the DOA maps of the clean and the recorded signal in anechoic conditions and the CBF and SBL DOA estimates along with the corresponding spectrograms (calculated over frames of 40 ms duration, Hanning weighting and 50% overlap) at the speaker locations, i.e., at 0° and −90°, respectively. The two speech signals, Figs. 9(b) and 9(c), are mixed in the unprocessed single-microphone recording, Figs. 9(e) and 9(f), which does not offer directional information, Fig. 9(d). CBF attributes directivity to the microphone array by attenuating wavefield contributions from directions other than the focusing one, Figs. 9(h) and 9(i), but has low resolution, Fig. 9(g). The high-resolution SBL beamformer not only localizes accurately the two speakers, Fig. 9(j), but also separates the corresponding speech signals, Figs. 9(k) and 9(l), validating the simulation results; compare, e.g., with Fig. 5. Similarly, Fig. 10 demonstrates the corresponding results for measurements in a populated canteen with reverberation time T60 = 0.9 s, at SNR = −6 dB. In this case, the recorded signal is very noisy due to babble, clinking cutlery, reverberation, etc.; thus, both the CBF and SBL DOA estimates deteriorate accordingly. Nevertheless, the SBL beamformer suppresses noise more effectively.
(Color online) DOA maps obtained with the array configuration in Fig. 8(a) for a source (male talker) at 0° and a source (female talker) at −90° in anechoic conditions for (a) the original signals, (d) the recorded signal from the front left microphone, (g) CBF, and (j) SBL reconstruction. Spectrograms of the (b) clean signal, (e) recorded signal, (h) CBF, and (k) SBL estimates at θ = 0°. Spectrograms of the (c) clean signal, (f) recorded signal, (i) CBF, and (l) SBL estimates at θ = −90°. The blue box indicates an example of a time-frequency region where there is significant energy from the source at 0° and almost no energy from the source at −90° and vice versa within the red box.
(Color online) The respective DOA maps and spectrograms as in Fig. 9 for signals recorded with the array configuration in Fig. 8(b) in a canteen.
VI. CONCLUSION
We use a probabilistic sparse signal reconstruction approach to solve simultaneously the sound source localization and speech enhancement problem within the SBL framework. The SBL formulation offers sparse robust DOA estimates by auto-regularizing a hierarchical Bayesian model with adaptive selection of the hyperparameters from the data.
Contrary to established spatial filtering methods, SBL beamforming provides high-resolution acoustic imaging even with correlated arrivals and single-snapshot measurements. Both simulation results with a ULA and experimental measurements with a semi-circular prototype array show that SBL beamforming provides simultaneous sound source localization and separation, offering speech enhancement over noise, reverberation and competing talkers.
ACKNOWLEDGMENTS
This work was supported by the Innovation Fund Denmark, under Grant No. 99-2014-1.