Short-time Fourier transform domain clean speech estimators that incorporate prior information about the clean speech spectral phase were recently proposed for the enhancement of single-channel speech corrupted by acoustic noise. Instrumental measures predict quality improvements for the phase-aware estimators over their conventional phase-blind counterparts. In this letter, these predictions are verified by means of listening experiments. On average, the phase-aware amplitude estimator achieves a stronger noise reduction and is significantly preferred over its phase-blind counterpart in a pairwise comparison, even if the clean spectral phase is estimated blindly from the noisy signal.
1. Introduction
To reduce the detrimental effect of acoustic noise, single-channel speech enhancement algorithms in the short-time discrete Fourier transform (STFT) domain are frequently employed. The vast majority of these algorithms effectively enhance only the amplitude of the complex-valued noisy spectral coefficients; only a few approaches utilize or modify the spectral phase. The focus on the spectral amplitude is frequently motivated by the work of Wang and Lim,1 who reported that for STFT segments of practically relevant lengths and an overlap of 50%, little or no improvement in perceptual speech quality was achieved by only improving the spectral phase. Similarly, combining a conventional clean speech amplitude estimator with an estimate of the clean speech spectral phase instead of the noisy one showed only little effect on the speech enhancement performance in terms of instrumental measures.2 Only for a more redundant STFT setup, e.g., with 32 ms segments overlapping by 7/8 and using zero-padding, has the improvement of the spectral phase been reported to have a considerable influence on the perceptual quality of conventional clean speech amplitude estimators.3 Such a redundant setup, however, increases the computational overhead, which is a critical factor in many applications such as hearing devices.4 Recently, it has been proposed to use the clean speech spectral phase also to facilitate the estimation of the clean speech spectral amplitude.5 For the resulting phase-aware clean speech estimators, instrumental measures predict that phase-aware estimation may push the limits of speech enhancement algorithms further. However, an evaluation of the perceptual quality of these phase-aware estimators by means of listening experiments is still lacking.
Previous studies showed that single-channel approaches in general can hardly improve speech intelligibility.4,6 Nevertheless, recent advances in the area of pre-trained speech enhancement, despite some challenges,7 have led to promising results, and improvements in speech intelligibility have been reported for well-trained ideal ratio mask (IRM) estimators.8 Also, very recently, it has been shown that considering the spectral phase to compute a complex-valued IRM instead of a conventional real-valued IRM can be beneficial in terms of perceptual quality,9 at least if trained for the desired speaker and noise type. In this letter, we show that the general Bayesian phase-aware clean speech estimators,5 which do not rely on prior training or knowledge about the speaker, can lead to improvements in perceptual quality over the corresponding phase-blind estimator. The quality improvements predicted by instrumental measures are verified by means of formal listening experiments for various acoustic scenarios, even for a computationally efficient STFT setup without zero-padding and with adjacent segments overlapping by only 50%. Moreover, the results suggest that using the clean speech spectral phase for an improved amplitude estimation5 is more robust than directly employing it for signal reconstruction in a practical enhancement setup.
2. Algorithms
Noise-corrupted signals, sampled at 16 kHz, were transformed into the STFT domain, using a segment length of 32 ms, a segment shift of 16 ms, no zero-padding, and a square-root Hann window for analysis and synthesis. The noise power spectral density (PSD) and the a priori signal to noise ratio (SNR), which were needed for all algorithms, were estimated using state-of-the-art adaptive estimators as detailed in the original proposal of the phase-aware estimators.5 We introduced a lower limit of −6 dB to the a priori SNR ξ, which is a key quantity in clean speech estimators, effectively limiting the maximum attenuation that can be applied to the noisy signal. Such a lower limit is commonly used in practice to avoid speech distortions and to mask artifacts at the cost of introducing a residual noise floor.
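The framing parameters and the a priori SNR limit described above can be illustrated with a minimal sketch. It assumes a periodic Hann window (the text does not state which variant was used), and the names `seg_len`, `shift`, `xi_min`, and `floor_a_priori_snr` are hypothetical. The sketch only checks that the squared square-root Hann window overlap-adds to one at 50% overlap and converts the −6 dB limit to a linear value; it is not the authors' implementation.

```python
import math

FS = 16_000                  # sampling rate in Hz (from the text)
SEG_MS, SHIFT_MS = 32, 16    # segment length and shift in ms (from the text)

seg_len = FS * SEG_MS // 1000    # samples per segment
shift = FS * SHIFT_MS // 1000    # samples per shift, i.e. 50% overlap

# Square-root Hann window; the periodic form is an assumption of this sketch.
window = [math.sqrt(0.5 * (1.0 - math.cos(2.0 * math.pi * n / seg_len)))
          for n in range(seg_len)]

# The window is applied at analysis and at synthesis, so its square must
# overlap-add to a constant for the unmodified signal to be reconstructed.
ola = [window[n] ** 2 + window[n + shift] ** 2 for n in range(shift)]

XI_MIN_DB = -6.0
xi_min = 10.0 ** (XI_MIN_DB / 10.0)   # linear lower limit (power ratio)


def floor_a_priori_snr(xi):
    """Apply the lower limit to a linear a priori SNR value."""
    return max(xi, xi_min)
```

With these parameters each segment holds 512 samples and adjacent segments are shifted by 256 samples. Any a priori SNR estimate below the limit is replaced by the limit, which bounds the attenuation of the gain function that depends on it.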
In this letter, we investigate three different ways to estimate clean speech.
Phase-blind amplitude estimation (PBA). First, the clean speech spectral amplitude was estimated using the conventional, phase-blind short-time spectral amplitude estimator (STSA).10 The minimum mean square error (MMSE) optimal amplitude estimate was combined with the noisy spectral phase to obtain an estimate of the complex clean speech coefficient.
Phase-aware amplitude estimation (PAA). The clean amplitude was estimated with the phase-aware amplitude estimator5 and combined with the noisy spectral phase. In contrast to PBA, the estimator PAA was derived under the assumption that the clean speech spectral phase is given. The more the observed spectral phase differs from the clean speech phase, the more attenuation is applied by the estimator. Since large deviations from the clean speech phase are more likely in heavily disturbed spectro-temporal regions, the phase information yields additional means to distinguish noise from speech.5
Phase-aware estimation of complex coefficients (PAC). Again, the clean amplitude was estimated via the phase-aware estimator,5 but this time the estimated amplitude is combined with (an estimate of) the clean speech spectral phase instead of the noisy phase.
We set the parameters of the phase-aware estimator5 to β = μ = 1, such that, besides the phase-awareness, the statistical models for the noise and the speech amplitudes are the same for PAA, PBA, and PAC. The estimators PAA and PAC can hence be considered the phase-aware counterparts of the conventional phase-blind estimator PBA, which allows for a fair comparison and an isolated investigation of the impact of phase-awareness.
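The difference between the reconstruction used in PBA and PAA (estimated amplitude with noisy phase) and the one used in PAC (estimated amplitude with clean phase) can be sketched for a single spectral coefficient. The amplitude estimators themselves are not reproduced here; the amplitude value, the noisy coefficient, the clean phase, and the function names are purely illustrative assumptions.

```python
import cmath


def with_noisy_phase(amplitude, noisy_coeff):
    """Combine an amplitude estimate with the phase of the noisy coefficient
    (the reconstruction used in PBA and PAA)."""
    return cmath.rect(amplitude, cmath.phase(noisy_coeff))


def with_clean_phase(amplitude, clean_phase):
    """Combine an amplitude estimate with (an estimate of) the clean speech
    phase (the reconstruction used in PAC)."""
    return cmath.rect(amplitude, clean_phase)


# Hypothetical values for one time-frequency point.
noisy_coeff = 1.0 + 1.0j     # noisy STFT coefficient
amplitude_estimate = 0.8     # output of some amplitude estimator
clean_phase = 0.3            # clean speech phase in radians

s_noisy_phase = with_noisy_phase(amplitude_estimate, noisy_coeff)
s_clean_phase = with_clean_phase(amplitude_estimate, clean_phase)
```

Both results have the same magnitude and differ only in phase, which isolates the reconstruction step from the amplitude estimation step.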
3. Experimental setup
In the experiments we considered three different noise types: pink noise and babble noise, both taken from the NOISEX-92 database, and a recording of a stroll in a park, which also included some wind noise. We further introduced highly impulsive disturbances by adding hammer blows to the babble noise signal. Thus, while the pink noise was stationary, the other two noise types were highly non-stationary. The noises were mixed with clean speech utterances taken from the TIMIT database at SNRs of −5 and 5 dB. A 1 s long noise-only section was inserted at the beginning of each utterance to ensure a reliable initialization of the noise PSD estimator, but was removed in the final test material. The signals used in the experiments can be found in Ref. 11.
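Mixing speech and noise at a prescribed SNR amounts to scaling the noise before adding it. The following sketch assumes that the SNR is defined on the mean power over the whole signal, which the text does not specify; the toy signals and the names `mean_power` and `mix_at_snr` are hypothetical.

```python
import math


def mean_power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db,
    then add it to the speech. Returns the mixture and the scaled noise."""
    target_ratio = 10.0 ** (snr_db / 10.0)
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * target_ratio))
    scaled = [gain * v for v in noise]
    return [s + n for s, n in zip(speech, scaled)], scaled


# Toy signals standing in for a speech utterance and a noise recording.
speech = [math.sin(0.1 * n) for n in range(1000)]
noise = [0.3 * (-1) ** n for n in range(1000)]

noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=-5.0)
achieved_snr_db = 10.0 * math.log10(mean_power(speech) / mean_power(scaled_noise))
```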
The different algorithms were compared to each other by means of a pairwise preference test. For each of the six acoustic scenarios, the participants were asked to judge which of the two presented signals (A) contains less noise, (B) offers the higher speech quality, and (C) they overall prefer. Each criterion was tested for all acoustic scenarios before proceeding to the next criterion. The order in which the different signal pairs were presented, as well as the order of the acoustic scenarios, was randomized. The participants were allowed to listen to the signals as often as they liked. There was no option to rate both signals as equal. The clean speech utterance was always provided as a reference.
To investigate the potential benefit of phase-aware speech enhancement, we conducted two experiments, which differ in the way the clean speech phase needed for the two phase-aware approaches PAA and PAC was obtained. In the first experiment, the clean speech phase was perfectly known and taken from the original clean speech utterance to investigate phase-aware speech enhancement under idealized conditions, detached from the shortcomings of existing phase estimation schemes. In speech pauses, where using the spectral phase of the clean speech recording is not sensible, for PAA and PAC we employed the phase-blind amplitude estimator10 in combination with the noisy phase. We evaluated the unprocessed noisy signal, PBA, and the two phase-aware estimators PAA and PAC, leading to six pairwise comparisons per acoustic scenario and evaluation criterion.
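The six pairwise comparisons follow from the four conditions of the first experiment. As a sketch, the pairs can be enumerated and one listener's choices tallied into the percentage of comparisons each condition won. The votes below are made up for illustration and the function name `preference_scores` is hypothetical.

```python
from itertools import combinations

conditions = ["noisy", "PBA", "PAA", "PAC"]
pairs = list(combinations(conditions, 2))   # all unordered pairs


def preference_scores(winners):
    """Percentage of comparisons won by each condition.

    winners maps each pair to the condition preferred in that comparison.
    """
    wins = {c: 0 for c in conditions}
    played = {c: 0 for c in conditions}
    for (a, b), winner in winners.items():
        played[a] += 1
        played[b] += 1
        wins[winner] += 1
    return {c: 100.0 * wins[c] / played[c] for c in conditions}


# Hypothetical choices of one listener for one scenario and one criterion.
votes = {
    ("noisy", "PBA"): "PBA",
    ("noisy", "PAA"): "PAA",
    ("noisy", "PAC"): "PAC",
    ("PBA", "PAA"): "PAA",
    ("PBA", "PAC"): "PBA",
    ("PAA", "PAC"): "PAA",
}
scores = preference_scores(votes)
```

A condition that wins every comparison it is involved in scores 100%, and one that never wins scores 0%.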
In the second, practically more relevant experiment, the clean speech phase was blindly estimated from the noisy signal based on a harmonic model,12 where we performed the estimation only across frequency bands, not along time.12 Based on the outcomes of the first experiment, in the second experiment we dropped PAC and only compared the unprocessed noisy signal, PBA, and the phase-aware amplitude estimator PAA. Since the phase estimation approach12 provides phase estimates only for voiced speech sounds, in signal segments for which no phase estimate was available we again used the phase-blind amplitude estimator10 together with the noisy phase for PAA. Segments that contain voiced speech were detected using the fundamental frequency estimator PEFAC,13 which was also used to estimate the corresponding fundamental frequency needed for the phase estimation.12 Due to this limitation of the phase estimation scheme,12 improvements over PBA can only be expected for voiced speech sounds. Furthermore, in the second experiment the conventional estimator10 was also used for frequencies above 3 kHz to limit the impact of errors in the phase estimate, which accumulate towards higher frequencies.12
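The fallback logic of the second experiment can be summarized per time-frequency point: the phase-aware estimator is used only in voiced segments and only up to 3 kHz. The sketch below assumes a 512-point transform at 16 kHz (32 ms segments without zero-padding) and treats the bin at exactly 3 kHz as still phase-aware, which is an assumption; `use_phase_aware` is a hypothetical helper, and the voicing decision is taken as given.

```python
FS = 16_000           # sampling rate in Hz
N_FFT = 512           # 32 ms segments at 16 kHz, no zero-padding
CUTOFF_HZ = 3_000.0   # above this frequency the conventional estimator is used


def bin_frequency(bin_index):
    """Center frequency in Hz of an STFT bin."""
    return bin_index * FS / N_FFT


def use_phase_aware(bin_index, voiced):
    """True if the phase-aware estimator applies to this time-frequency point.

    voiced is the voicing decision for the segment (e.g. from a fundamental
    frequency estimator); unvoiced segments fall back to the phase-blind one.
    """
    return voiced and bin_frequency(bin_index) <= CUTOFF_HZ
```

With these parameters the bins are spaced 31.25 Hz apart, so the cutoff falls on bin 96.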
The signals were presented diotically to the participants over closed Sennheiser HDA 200 headphones using an RME Fireface UCX sound card in a quiet office. The clean speech signal was calibrated to a sound pressure level of 65 dB, measured using a G.R.A.S. artificial ear KB1065, a G.R.A.S. 40AG 1/2 in. microphone, a G.R.A.S. 26AC preamplifier, and a Norsonic Nor140 sound analyzer. Twenty self-reported normal-hearing listeners, aged between 20 and 36, participated in the tests.
For head-to-head comparisons, we used a binomial test to check for statistical significance. When more than two algorithms were compared, we used the non-parametric Friedman test followed by Wilcoxon signed-rank tests with Holm-Bonferroni correction. Unless stated otherwise, we assume statistical significance for p < 0.05.
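The Holm-Bonferroni step-down correction applied to the post hoc tests can be sketched as follows. The p-values in the usage example are made up, and the function name is hypothetical; the Friedman and Wilcoxon tests themselves are not reproduced here.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans in the original order of p_values,
    True where the corresponding null hypothesis is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with
        # alpha/(m-1), and so on; stop at the first non-rejection.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


# Hypothetical p-values from three pairwise post hoc tests.
decisions = holm_bonferroni([0.01, 0.04, 0.03])
```

In this example only the first comparison remains significant: 0.01 is below 0.05/3, but 0.03 exceeds 0.05/2, so the procedure stops there.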
4. Results and discussion
4.1 Perfectly known clean speech phase
In Fig. 1, we present the percentage of times each algorithm was preferred by the participants, e.g., a score of 100% is achieved if the respective algorithm won every pairwise comparison it was involved in. In all six acoustic scenarios, shown in the first three columns, utilizing the information about the clean speech phase in PAA and PAC led to an increased noise reduction (NR) with respect to the conventional phase-blind approach PBA. The preference in terms of NR was statistically significant in all scenarios. At the same time, the speech quality (SQ) of PAA was always rated similar to or better than that of its phase-blind counterpart PBA. Using the complex estimator PAC, however, led to significant degradations in speech quality in the pink noise case (third column), even though the clean speech spectral phase was perfectly known. Due to the lower limit on the a priori SNR and the resulting limited attenuation of the estimators, noise-dominated regions were not completely suppressed, leading to a residual noise floor. Combining the spectral phase of a weak underlying speech sound with the amplitude of this residual noise floor can lead to audible artifacts in the enhanced signal.12 Of the three considered noise types, the residual noise floor was most prominent for pink noise, and so were these artifacts. The degradation in speech quality was also reflected in the overall preference for pink noise. The phase-aware amplitude estimator PAA, on the other hand, was predominantly preferred over its phase-blind counterpart PBA in all of the tested scenarios, with significant improvements for all scenarios but pink noise at 5 dB SNR. This indicates that using the clean speech phase to improve the amplitude estimate as in PAA is more robust than straightforwardly using it to also replace the noisy phase for signal reconstruction as in PAC.
Fig. 1. (Color online) Preference ratings with the corresponding standard errors when the true clean speech phase is known. In (a), each acoustic scenario is evaluated separately. In (b), all scenarios are combined and only head-to-head comparisons of PBA and PAA are considered. Criteria: speech quality (SQ), noise reduction (NR), overall preference (OP).
In general, the improvements of phase-aware speech enhancement were more prominent for the non-stationary noises (first and second column), where speech enhancement is more challenging. In contrast to the stationary pink noise, in non-stationary noises the conventional phase-blind approaches suffer from erroneous estimates of the PSDs of the speech and the noise, resulting in a suboptimal suppression of the non-stationary noise components. With phase-aware estimators, however, information about the clean speech spectral phase still allows the undesired non-stationary noise to be identified and suppressed.5
On the right of Fig. 1, we only evaluated head-to-head comparisons between PAA and its phase-blind counterpart PBA, taking all six acoustic scenarios into account. For the head-to-head comparisons we assumed a binomial distribution for the ratings and thus performed a two-tailed binomial test to check for statistical significance. Although the preference in terms of speech quality was not significant (p > 0.05), the preference for PAA over PBA in terms of noise reduction and overall preference was significant with p < 0.0001. The results highlight the potential of phase-aware amplitude estimation, which can achieve improvements even for this practical setup with 50% overlap and without zero-padding.
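A two-tailed binomial test under the null hypothesis of no preference (probability 0.5 for either signal) can be sketched with the standard library. The two-sided p-value is computed here as the total probability of all outcomes that are no more likely than the observed one, which is one common convention and an assumption of this sketch. The counts in the usage example are invented and are not the results of the listening test.

```python
from math import comb


def binomial_two_tailed(k, n, p=0.5):
    """Two-tailed binomial test p-value for k successes in n trials.

    Sums the probabilities of all outcomes that are at most as likely as
    the observed outcome under the null hypothesis.
    """
    pmf = [comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1.0 + 1e-9)   # small tolerance for rounding
    return min(1.0, sum(q for q in pmf if q <= threshold))


# Hypothetical example: one signal preferred in 85 of 120 comparisons.
p_value = binomial_two_tailed(85, 120)
significant = p_value < 0.05
```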
Finally, we also measured the consistency of the ratings of each participant by means of the coefficient of consistency14 0 ≤ ζ ≤ 1. While a high consistency ζ indicates that the differences between the presented signals are perceptually distinct with respect to the tested criterion, here, for random ratings the expected value of ζ is 0.5.14 The average consistency values and corresponding standard errors were , and , respectively.
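The expected consistency for random ratings can be checked numerically. The sketch below assumes that ζ is Kendall's coefficient of consistency based on the number of circular triads d, i.e., ζ = 1 − 24d/(n³ − 4n) for an even number of items n and ζ = 1 − 24d/(n³ − n) for odd n. It then averages ζ over all possible outcomes of a complete pairwise comparison. The function names are hypothetical.

```python
from itertools import combinations, product


def consistency(winner, items):
    """Coefficient of consistency for one complete set of pairwise choices.

    winner maps each pair (a, b) to the preferred item.
    """
    n = len(items)

    def beats(x, y):
        key = (x, y) if (x, y) in winner else (y, x)
        return winner[key] == x

    # A triad is circular if a beats b, b beats c, and c beats a,
    # or if all three relations are reversed.
    d = sum(1 for a, b, c in combinations(items, 3)
            if beats(a, b) == beats(b, c) == beats(c, a))
    d_max = (n ** 3 - n) / 24 if n % 2 else (n ** 3 - 4 * n) / 24
    return 1.0 - d / d_max


def expected_random_consistency(n):
    """Average consistency over all equally likely outcomes for n items."""
    items = list(range(n))
    pairs = list(combinations(items, 2))
    total = 0.0
    for outcome in product([0, 1], repeat=len(pairs)):
        winner = {pair: pair[bit] for pair, bit in zip(pairs, outcome)}
        total += consistency(winner, items)
    return total / 2 ** len(pairs)


expected_four = expected_random_consistency(4)    # four compared signals
expected_three = expected_random_consistency(3)   # three compared signals
```

Under this assumption the average is 0.5 for four compared signals and 0.75 for three.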
4.2 Estimated clean speech phase
In the second and practically more relevant experiment, the clean speech phase was blindly estimated given only the noisy signal. Based on the outcome for a perfectly known clean speech phase in the first experiment, we decided to concentrate on PAA and drop the complex estimator PAC. The results are presented in Fig. 2. For stationary pink noise (third column), the phase-aware amplitude estimator PAA and the conventional phase-blind approach PBA performed virtually the same. For the non-stationary noise types (first and second column), PAA was preferred by the majority of listeners over PBA in terms of noise reduction. In contrast to the first experiment reported in Fig. 1, PBA tended to be preferred over PAA with respect to speech quality when the estimated phase was employed. However, the differences were not statistically significant. Except for the scenario of babble noise at 5 dB SNR on the bottom left, these deteriorations were outweighed by the increased noise reduction and PAA was preferred by the majority of listeners, with statistically significant differences at low SNRs. These trends are also reflected on the right of Fig. 2, where head-to-head comparisons between PBA and PAA for all six scenarios have been evaluated. As for the perfectly known clean speech phase in Fig. 1, the preference for PAA over PBA with the estimated clean speech phase was statistically significant in terms of noise reduction (p < 0.0001) and overall preference (p < 0.001), while there was no significant difference in terms of speech quality. We again used a two-tailed binomial test for the head-to-head comparisons. Considering that the phase-aware enhancement is limited to voiced sounds due to the limitations of the model-based phase estimator,12 these improvements are remarkable.
Fig. 2. (Color online) Preference ratings with standard errors when the clean speech phase is blindly estimated. In (a), each acoustic scenario is evaluated separately. In (b), all scenarios are combined and only head-to-head comparisons of PBA and PAA are considered.
The average consistency values were , and , respectively, where here the expected value for random ratings was 0.75.14
5. Concluding remarks
In this study we investigated the importance and potential of phase-aware speech enhancement by means of listening experiments. Overall, the phase-aware amplitude estimator PAA5 achieved a higher noise reduction and was preferred by the majority of listeners over its conventional phase-blind counterpart PBA,10 even when the clean speech phase was blindly estimated. The results further indicate that utilizing the clean spectral phase only to improve the amplitude estimation as in PAA is more robust than using the phase also for signal reconstruction as in PAC. The improvements were most prominent in acoustically challenging scenarios, i.e., for highly non-stationary noise, where conventional clean speech estimators suffer from erroneous estimates of the speech and the noise PSDs. In such situations the phase information is most valuable, as it gives additional means to distinguish speech from noise outliers. In contrast to other studies that used a modified spectral phase only for signal reconstruction, the improvements with phase-aware amplitude estimation were achieved within a computationally efficient STFT setup without additional redundancies, i.e., without zero-padding and with consecutive windowed segments overlapping by only 50%.
Acknowledgments
This work was supported by the Project GE2538/2-1 and the Cluster of Excellence 1077 “Hearing4All,” both funded by the German Research Foundation (DFG).