Enhancement algorithms for wireless acoustic sensor networks (WASNs) are indispensable with the increasing availability and usage of connected devices with microphones. Conventional spatial filtering approaches for enhancement in WASNs approximate quantization noise with an additive Gaussian distribution, which limits performance due to the non-linear nature of quantization noise at lower bitrates. This work proposes a postfilter for enhancement based on Bayesian statistics to obtain a multidevice signal estimate, which explicitly models the quantization noise. The experiments using perceptual signal-to-noise ratio, perceptual evaluation of speech quality, and MUSHRA (multistimulus with hidden reference and anchors) scores demonstrate that the proposed postfilter can be used to enhance signal quality in ad hoc sensor networks.
1. Introduction
The emergence of connected, portable devices like smartphones and the rising popularity of voice user interfaces and microphone-equipped devices provide the necessary infrastructure for ad hoc wireless acoustic sensor networks (WASNs). The dense, ad hoc positioning of and collaboration between devices in a WASN lead to efficient sampling of the acoustic space, thereby yielding higher-quality signal estimates than single-channel processing (Bertrand, 2011). Typical applications of ad hoc WASNs use microphones on low-resource devices, such that low-complexity, bandwidth-efficient methods are needed to compress and transmit the acoustic signals. This involves quantization at the encoder, whereby the received signal at the decoder is usually degraded by quantization noise (Bäckström and Fischer, 2016; Bäckström and Fischer, 2017; Bäckström, 2017; Dragotti and Gastpar, 2009; Pradhan and Ramchandran, 2003).
Past works on WASNs often overlook the variability in the maximum capacity of the sensors (Zahedi et al., 2015). However, rate-constrained spatial filtering methods, such as beamforming and multichannel Wiener filtering, have been used in binaural hearing aids (HAs) (Doclo et al., 2009; Dragotti and Gastpar, 2009; Roy and Vetterli, 2008; Srinivasan and Den Brinker, 2009a, 2009b). A study on rate-constrained optimal beamforming showed the advantage of using spatially separated microphones in HAs, although the method assumes that the joint statistics of the signals are available at the processing nodes (Roy and Vetterli, 2008). Subsequently, sub-optimal strategies for noise reduction that do not require the joint statistics at the nodes have been proposed (Amini et al., 2019a, 2019b; Doclo et al., 2009; Roy and Vetterli, 2008; Srinivasan and Den Brinker, 2009a, 2009b). While the above methods are effective in reducing noise, they are either limited to, or most efficient with, only two nodes (the HAs). In a recent work on multinode WASNs, a linearly constrained minimum variance beamformer was used to optimize rate allocation and sensor selection over the nodes, based on spatial location and frequency content (Amini et al., 2019a, 2019b; Zhang et al., 2017). However, due to the dynamic nature of an ad hoc WASN, information about sensor distribution, location, and the number of targets and interference sources may be unavailable, or its exchange between nodes further adds to the bandwidth consumption and communication complexity. Further, the above methods assume an additive quantization noise model, which is accurate only at high bitrates. Lastly, while all the above methods build on Wyner–Ziv coding, their suitability in combination with existing speech and audio codecs has not been demonstrated yet. Their performance in single-channel mode can therefore not compete with conventional single-channel codecs.
In this paper, we propose a Bayesian postfilter for enhancement in ad hoc WASNs, which explicitly models the quantization noise within the optimization framework of the filter and can be applied on top of existing codecs with minimal modifications. The main contribution of the current work is thus the postfilter, which takes quantization into account through truncation while retaining the conventional assumption of additive Gaussian background noise, thereby resulting in a truncated Gaussian representation of the clean speech distribution. To evaluate the proposed methodology, we make the simplifying assumptions that each device is dominantly degraded either by background noise and reverberation or by coding noise due to quantization, and that each device operates at its maximum capacity. In line with past works, we show that by distributing the total available bitrate between the two sensors, the output gain of the WASN signal estimate is higher than the output gain of a low input SNR single sensor transmitting at the full bitrate (Doclo et al., 2009; Roy and Vetterli, 2008; Srinivasan and Den Brinker, 2009a, 2009b). In addition, we present the advantages of incorporating the exact quantization noise model within the optimization framework. To focus on the effect of the postfilter on quantization noise, we apply the proposed method to the output of a codec that is specifically designed for multidevice coding (Bäckström et al., 2018). To the best of our knowledge, this is the first time a complete WASN system has been evaluated with competitive performance also in a single-channel codec mode. Although we have not yet included models of the spatial configuration of the sensors, room impulse responses, or multiple sources, we show that the proposed method already yields large output gains.
2. Methodology
To focus on the novel aspects of the approach, we consider a simple WASN consisting of two devices with microphones: (1) a low-resource device A with high input SNR and (2) a high-resource device B with low input SNR, as illustrated in Fig. 1. An example application is a smartwatch that collaborates with a distant smart speaker. Let $X_{k,t}$ and $N_{k,t}$ be the perceptual domain representations of the speech and noise signals, respectively, at frequency bin k and time frame t (Bäckström, 2017); the perceptual domain representations are computed by dividing the frequency domain signals by the perceptual envelope obtained from the codec (Bäckström, 2017). These signals can be approximated by zero-mean Gaussian distributions with variances $\sigma_X^2$ and $\sigma_N^2$, whereby the random variables are correspondingly $X \sim \mathcal{N}(0, \sigma_X^2)$ and $N \sim \mathcal{N}(0, \sigma_N^2)$ (Kim and Shevlyakov, 2008). Under the assumption of uncorrelated, additive background noise, the noisy signal $Y = X + N$ is Gaussian distributed with zero mean and variance $\sigma_Y^2 = \sigma_X^2 + \sigma_N^2$ (Kim and Shevlyakov, 2008). Our goal is to estimate the distribution of the clean speech conditioned on the noisy observation Y, in other words, the posterior distribution $P(X \mid Y)$ (Särkkä, 2013). We obtain estimates for every time-frequency bin and shall omit the time and frequency subscripts in the rest of the section to aid readability. According to Bayes' rule, the posterior distribution can be expressed in terms of the conditional likelihood and the priors.
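Under the zero-mean Gaussian model above, Eq. (1) presumably takes the standard Bayes form
\[
P(X \mid Y) \;=\; \frac{P(Y \mid X)\, P(X)}{P(Y)},
\qquad (1)
\]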
where $P(X)$ and $P(Y)$ are the prior distributions of the speech and observed signals and $P(Y \mid X)$ is the conditional likelihood. However, our quantized observation $\hat{Y}$ of the noisy signal gives more evidence about X: the true value of the noisy signal Y lies within the quantization bin limits $[l, u]$, and the lower and upper bin limits for the quantization levels in a frame are obtained from the observed quantized spectrum of that frame (Das and Bäckström, 2018). Since the true noisy signal lies in the bounded interval $[l, u]$, we integrate the likelihood over the quantization bin to obtain the posterior distribution of speech.
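Integrating the likelihood over the bin $[l, u]$, Eq. (2) is presumably of the form
\[
P(X \mid \hat{Y}) \;\propto\; P(X) \int_{l}^{u} P(Y \mid X)\, dY,
\qquad (2)
\]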
where $\propto$ signifies equality up to a scaling factor. Eq. (2) can be rewritten as the difference between the cumulative distributions of the conditional likelihood evaluated at the upper and lower bin limits. The conditional likelihood can be represented as $P(Y \mid X) = \mathcal{N}(Y; X, \sigma_N^2)$, thus resulting in the final equation for the posterior distribution.
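Inserting the Gaussian prior and this likelihood, Eq. (3) then presumably reads
\[
P(X \mid \hat{Y}) \;\propto\; \mathcal{N}(X; 0, \sigma_X^2)\left[\operatorname{erf}\!\left(\frac{u - X}{\sqrt{2}\,\sigma_N}\right) - \operatorname{erf}\!\left(\frac{l - X}{\sqrt{2}\,\sigma_N}\right)\right],
\qquad (3)
\]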
where erf(·) is the error function. Note that, due to the use of the exact quantization bin limits, the posterior $P(X \mid \hat{Y})$ corresponds to a truncated Gaussian (Barr and Sherrill, 1999). This is in contrast to past works, where the quantization noise is approximated by an additive Gaussian distribution, which is an accurate approximation only at high bitrates (Amini et al., 2019a, 2019b).
From Eq. (3), the single-channel posterior probability distribution function (PDF) of the clean speech in spatial channel i follows directly.
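With per-channel bin limits $[l_i, u_i]$ and variances $\sigma_{X,i}^2$ and $\sigma_{N,i}^2$, Eq. (4) is presumably the per-channel analogue of Eq. (3),
\[
P(X \mid \hat{Y}_i) \;\propto\; \mathcal{N}(X; 0, \sigma_{X,i}^2)\left[\operatorname{erf}\!\left(\frac{u_i - X}{\sqrt{2}\,\sigma_{N,i}}\right) - \operatorname{erf}\!\left(\frac{l_i - X}{\sqrt{2}\,\sigma_{N,i}}\right)\right].
\qquad (4)
\]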
Here we assume that the speech and noise energies at each channel are estimated in a pre-processing stage, for example, using voice activity detection and minimum statistics (Martin, 2001). Additionally, to focus on the advantage of the proposed enhancement approach, we assumed that the time-delay between the microphones with respect to the desired sources was known at the decoder, whereby the signals from the microphones were synchronized. We shall include time-delay estimation within the enhancement framework in future work. Based on our setup, the environmental degradation and the bitrate are different for the two channels. Hence, we can assume that the background noise realizations and the quantization bin offsets are uncorrelated and independent between the two channels. Therefore, due to the conditional independence between the channels, the joint posterior PDF of speech over the network can be represented as the product of the single-channel posteriors, $P(X \mid \hat{Y}_1, \dots, \hat{Y}_M) \propto \prod_{i=1}^{M} P(X \mid \hat{Y}_i)$, where M is the number of microphones in the WASN. The posterior PDF of speech in a two-microphone network is thus the product of the two truncated Gaussians.
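Multiplying the per-channel truncated Gaussians from Eq. (4), a plausible form of Eq. (5) is
\[
P(X \mid \hat{Y}_1, \hat{Y}_2) \;\propto\; \prod_{i=1}^{2} \mathcal{N}(X; 0, \sigma_{X,i}^2)\left[\operatorname{erf}\!\left(\frac{u_i - X}{\sqrt{2}\,\sigma_{N,i}}\right) - \operatorname{erf}\!\left(\frac{l_i - X}{\sqrt{2}\,\sigma_{N,i}}\right)\right].
\qquad (5)
\]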
We obtain the multidevice signal estimate $\hat{x}$, optimal in the minimum mean squared error (MMSE) sense (Särkkä, 2013), by computing the expectation of the PDF obtained from Eq. (5). Due to the product of error functions in Eq. (5), the expectation does not have a known analytical formulation. Therefore, we approximate the expectation of the PDF via numerical integration (McLeod, 1980); computing the Riemann sum using the midpoint rule over n = 200 intervals provided an approximation with sufficient accuracy in our experiments. Hence, the final estimate is computed as follows.
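Writing the unnormalized posterior from Eq. (5) as $\tilde{P}$ and the midpoints of the $n$ integration intervals as $x_j$, Eq. (6) presumably corresponds to
\[
\hat{x} \;=\; \mathrm{E}\!\left[X \mid \hat{Y}_1, \hat{Y}_2\right] \;\approx\; \frac{\sum_{j=1}^{n} x_j\, \tilde{P}(x_j \mid \hat{Y}_1, \hat{Y}_2)}{\sum_{j=1}^{n} \tilde{P}(x_j \mid \hat{Y}_1, \hat{Y}_2)},
\qquad (6)
\]
where the interval width cancels in the ratio, such that the denominator normalizes the posterior.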
The system block diagram is depicted in Figs. 2(a) and 2(b), where Fig. 2(a) is the overview of the entire system, from acoustic signal acquisition at the sensors to obtaining the time-domain estimate from multidevice signals. Note that the postfilter is placed at the fusion center, directly after the decoder, which provides the decoded perceptual domain signals to the postfilter. Fig. 2(b) shows the internal structure of the postfilter. After receiving the quantization bin limits from the decoded signals, we compute the truncated Gaussian distribution for each channel and then compute the joint posterior distribution as the product of the truncated distributions of the channels. The final point estimate, obtained as the expectation of the posterior distribution, yields the multidevice signal estimate.
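To make the per-bin computation concrete, the following minimal Python sketch implements the fusion step under the assumptions above; the function name and interface, the grid width of four standard deviations, and the numerical floor on the likelihood are illustrative choices rather than part of the proposed codec.

```python
import numpy as np
from scipy.special import erf

def multidevice_estimate(l, u, sigma_x, sigma_n, n_intervals=200, width=4.0):
    """Sketch of the per-bin postfilter: MMSE estimate of the clean speech X
    from the quantization bin limits [l_i, u_i] of the channels and the
    estimated per-channel speech/noise standard deviations."""
    l, u = np.asarray(l, float), np.asarray(u, float)
    sigma_x, sigma_n = np.asarray(sigma_x, float), np.asarray(sigma_n, float)

    # Midpoint grid covering the bulk of the speech prior.
    x_max = width * sigma_x.max()
    edges = np.linspace(-x_max, x_max, n_intervals + 1)
    x = 0.5 * (edges[:-1] + edges[1:])

    # Unnormalized joint posterior: product over channels of the
    # truncated-Gaussian factors N(x; 0, sigma_x_i^2) * [erf(.) - erf(.)].
    post = np.ones_like(x)
    for li, ui, sx, sn in zip(l, u, sigma_x, sigma_n):
        prior = np.exp(-0.5 * (x / sx) ** 2) / (np.sqrt(2.0 * np.pi) * sx)
        lik = 0.5 * (erf((ui - x) / (np.sqrt(2.0) * sn))
                     - erf((li - x) / (np.sqrt(2.0) * sn)))
        post *= prior * np.maximum(lik, 1e-30)  # floor avoids all-zero products

    # Expectation of the normalized posterior via the midpoint rule.
    return float(np.sum(x * post) / np.sum(post))

# Example: two channels observing the same bin, one precise and one unreliable.
x_hat = multidevice_estimate(l=[0.4, -0.2], u=[0.6, 0.8],
                             sigma_x=[1.0, 1.0], sigma_n=[0.1, 1.0])
```

In the full system, this estimate would presumably be computed for every time-frequency bin and mapped back to the time domain as outlined in Fig. 2(a).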
3. Experiments and results
To evaluate the performance of the proposed postfiltering approach, we determined the perceptual signal-to-noise ratio (PSNR) and perceptual evaluation of speech quality (PESQ) scores (Bäckström, 2017) and conducted a subjective listening test using MUSHRA (multistimulus with hidden reference and anchors) (ITU-R, 2014; Schoeffler et al., 2015). We considered two categories of degradation: (1) additive background noise and (2) background noise with reverberation. For the background noise, we extracted the cafeteria scenario with babble noise from the QUT dataset (Dean et al., 2010). The clean speech samples were obtained from the test set of the TIMIT dataset (Zue et al., 1990). We encoded the noisy samples and applied the proposed postfilter to the decoded samples. Hence, the output signal is corrupted by both coding and environmental artefacts. To generate noisy speech with reverberation, we considered a room of dimensions 7.5 × 5 × 2 m³, with one speech source at coordinates (1, 2.5, 0.5) m and three noise sources placed at (6.5, 2.85, 0.5) m, (3.5, 4.5, 0.5) m, and (6, 0, 0.5) m. The locations of the near and distant microphones are, respectively, (1.05, 2.55, 0.5) m and (2.25, 2.85, 0.5) m. An illustration of the setup is presented in Fig. 1. The signals at the microphones for the described acoustic scenario were simulated using Pyroomacoustics (Scheibler et al., 2018).
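As an illustration of the simulation setup, a minimal Pyroomacoustics sketch of the described acoustic scene could look as follows; the sampling rate, the absorption value, the image-source order, and the placeholder signal arrays are assumptions for illustration, and the scaling of the noise sources to the target input SNRs is omitted.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000  # TIMIT sampling rate

# Placeholder signals; in the experiments these would be TIMIT speech and
# QUT babble-noise segments at the sampling rate fs.
speech = np.random.randn(3 * fs)
noise1, noise2, noise3 = (np.random.randn(3 * fs) for _ in range(3))

# Shoebox room of 7.5 x 5 x 2 m with a frequency-flat absorption coefficient.
room = pra.ShoeBox([7.5, 5.0, 2.0], fs=fs,
                   materials=pra.Material(0.5), max_order=17)

# One speech source and three noise sources (coordinates in metres).
room.add_source([1.0, 2.5, 0.5], signal=speech)
for pos, noise in zip([[6.5, 2.85, 0.5], [3.5, 4.5, 0.5], [6.0, 0.0, 0.5]],
                      [noise1, noise2, noise3]):
    room.add_source(pos, signal=noise)

# Near microphone (device A) and distant microphone (device B).
mics = np.array([[1.05, 2.55, 0.5], [2.25, 2.85, 0.5]]).T
room.add_microphone_array(pra.MicrophoneArray(mics, fs))

room.simulate()
sig_a, sig_b = room.mic_array.signals  # one row per microphone
```

The two rows of room.mic_array.signals then serve as the inputs to devices A and B before encoding.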
Let ρ and γ represent the PSNR and PESQ scores, respectively, and R the total bitrate. The postfilter is applied on the output of a codec that is specifically suitable for multidevice coding (Bäckström et al., 2018). For a fair evaluation, the single-channel enhancements from Eq. (6) are used as baselines. Furthermore, we employ the conventional multichannel Wiener filter (MWF) with a diagonalized covariance matrix to evaluate the advantage of the proposed method with respect to a conventionally accepted baseline (Doclo and Moonen, 2002). The notations and their definitions are as follows. (1) $\hat{x}_{MC}$ is the multidevice estimate using device A at the bitrate $R_A$ and device B at the bitrate $R_B$, with $R_A + R_B = R$; the PSNR and PESQ scores of this estimate are $\rho_{MC}$ and $\gamma_{MC}$, respectively. (2) $\hat{x}_B$ is the baseline posterior estimate (from Eq. (6)) at the distant device B, encoding at the full bitrate R; $\rho_B$ and $\gamma_B$ are its objective measures. (3) $\hat{x}_A$ is the baseline posterior estimate (from Eq. (6)) at device A; $\rho_A$ and $\gamma_A$ are its objective measures. (4) $\hat{x}_{MWF}$ is the multichannel Wiener filter estimate using the noisy signals from device A and device B; $\rho_{MWF}$ and $\gamma_{MWF}$ are its objective measures. We show the advantage of the proposed postfilter over the baseline methods using differential PSNR and PESQ scores, defined as (1) $\Delta\rho_B = \rho_{MC} - \rho_B$, (2) $\Delta\rho_A = \rho_{MC} - \rho_A$, (3) $\Delta\rho_{MWF} = \rho_{MC} - \rho_{MWF}$, (4) $\Delta\gamma_B = \gamma_{MC} - \gamma_B$, (5) $\Delta\gamma_A = \gamma_{MC} - \gamma_A$, and (6) $\Delta\gamma_{MWF} = \gamma_{MC} - \gamma_{MWF}$.
The input SNR at device A was fixed to 40 dB, and at device B we used a range of input SNRs. From the test set of the TIMIT dataset, we randomly selected 100 speech samples (50 male and 50 female) and tested the postfilter over all combinations of the total bitrates and the input SNRs for each speech sample. The objective results for the additive noise scenario are presented in Figs. 3(a) and 3(c). $\Delta\rho_B$, $\Delta\rho_A$, and $\Delta\rho_{MWF}$ are shown in Fig. 3(a) for the listed SNRs and total bitrates. We found that the PSNR of the proposed method was better than all three baselines over all SNRs and bitrates. For $\hat{x}_{MC}$ relative to the single-channel estimate $\hat{x}_B$, the highest differential PSNR $\Delta\rho_B$ is obtained at the lowest input SNR and the highest total bitrate. With respect to $\hat{x}_A$, the highest $\Delta\rho_A$ is obtained at 30 dB input SNR and 16 kilobits/s (kbps). In addition, we observe that $\Delta\rho_B$ decreases with an increase in the input SNR at device B; it also increases with an increase in the total bitrate due to lower degradation from coding noise, specifically at device A. In contrast, $\Delta\rho_A$ increases with an increase in the input SNR at device B but decreases with an increase in the total bitrate. In terms of PESQ, the largest differential score $\Delta\gamma_B$ for $\hat{x}_{MC}$ relative to $\hat{x}_B$ is attained at −5 dB input SNR and 32 kbps. However, at 16 kbps and input SNRs above 15 dB, the negative differential Mean Opinion Score (MOS) implied a decrease in quality. With respect to $\hat{x}_A$, the largest $\Delta\gamma_A$ is obtained at 30 dB input SNR at device B. Furthermore, the variations of the differential PESQ scores with respect to the input SNR and bitrate follow trends similar to the differential PSNR scores. Without exception, we observed similar trends for all the listed bitrates. The opposite trends of the differential scores with respect to $\hat{x}_A$ and $\hat{x}_B$ support our expectation that the proposed postfilter optimally merges information from the two channels to obtain an enhanced multidevice estimate.
The test was repeated to include reverberation over a range of absorption coefficients α. The results for this scenario are illustrated in Figs. 3(b) and 3(d). While $\Delta\rho_B$ is positive for both bitrates over all the listed absorption coefficients, $\Delta\rho_A$ is consistently negative. One reason for this could be that while the postfilter reduces environmental noise, as is reflected in the improvement with respect to $\hat{x}_B$, it may introduce some speech distortion or be unable to completely remove reverberation due to the lack of a reverberation model, which shows as a drop in the PSNR with respect to $\hat{x}_A$. Nevertheless, both $\Delta\rho_{MWF}$ and $\Delta\gamma_{MWF}$ are positive over both bitrates and all α, and they follow variation trends similar to the additive noise scenario. Lastly, the positive differential objective scores for both noise types with respect to the MWF indicate that the PSNR and PESQ gains of the proposed postfilter are larger than the gains obtained using the multichannel Wiener filter. This supports our informal observation that Wiener filtering is inefficient at capturing the essential features of speech signals.
The subjective MUSHRA listening test contained eight test items (four male and four female), four of which included background noise with reverberation, while the remaining items comprised background noise only. Each test item consisted of five test conditions and the reference clean speech signal; a hidden reference, a lower anchor (the 3.5 kHz low-pass filtered version of the reference signal), and the estimates $\hat{x}_{MC}$, $\hat{x}_B$, and $\hat{x}_A$ were presented as the test conditions, with the total bitrate fixed. As post-screening, we retained the responses from only those subjects who rated the hidden reference at more than 90 MUSHRA points for all items. Figure 4 presents the consolidated differential MUSHRA scores, represented as η, from the 13 participants who passed the post-screening; the boxplots show the median and interquartile range of η. The items with background noise and reverberation, the background-noise-only items, and the female and male items are indicated in Fig. 4. The differential score with respect to $\hat{x}_B$ was positive for all items, indicating that most subjects preferred $\hat{x}_{MC}$ over $\hat{x}_B$. With respect to $\hat{x}_A$, the variations were found to be gender dependent. While the median points were positive for most male items (mean-M), they were negative for the female items (mean-F). Further analysis of the samples revealed that while the background noise was attenuated in the multidevice estimate, speech distortions were introduced into the estimate, and those distortions were more prominent in the female samples. This problem could potentially be addressed by using more informative speech priors and by modifying the signal model to incorporate the effects of reverberation.
To study the region of optimal performance of the postfilter, we analyzed the average differential scores as a function of the total bitrate and the input SNR, and of the total bitrate and the absorption coefficient α; the resulting contour plots are depicted in Fig. 5. For the additive background noise scenario, the highest gains are obtained at higher bitrates and low input SNRs. Furthermore, the negative values above 20 dB input SNR and below 32 kbps imply that the postfilter performs sub-optimally in this region; in other words, we gain from a multidevice signal estimate when the input SNR at device B is below 20 dB and the total bitrate is greater than 32 kbps. In the presence of reverberation, we observed that while the total bitrate had an impact on the differential scores, the improvement was fairly constant over the range of α at any given bitrate, and the improvement was positive over the considered input SNR range. This implies that the proposed postfilter can also be used to enhance signals degraded by reverberation and is not especially sensitive to the amount of reverberation, despite the fact that the signal model does not explicitly account for distortions from reverberation.
4. Conclusion
In this work, we proposed a postfilter to enhance speech in an ad hoc sensor network. The method explores the feasibility of combining sources degraded by two distinct noise types to obtain an enhanced estimate of the clean speech signal. We demonstrated that by distributing the total available bandwidth between two sensors, we can achieve a signal quality that is higher than that of a single-channel estimate operating at the full bitrate. Although the objective and subjective results are already encouraging, further work is needed to address the classic trade-off between noise reduction and speech distortion, for example, by incorporating a signal model that takes the effects of reverberation into account.