Acoustic measurements using sine sweeps are prone to background noise and non-stationary disturbances. Repeated measurements can be averaged to improve the resulting signal-to-noise ratio. However, averaging leads to poor rejection of non-stationary high-energy disturbances and, in the case of a time-variant environment, causes attenuation at high frequencies. This paper proposes a robust method to combine repeated sweep measurements using across-measurement median filtering in the time-frequency domain. The method, called Mosaic, successfully rejects non-stationary noise, suppresses background noise, and is more robust toward time variation than averaging. The proposed method allows high-quality measurement of impulse responses in a noisy environment.
1. Introduction
The exponentially swept sine (ESS) is an excitation signal that has been used in the measurement of room impulse responses (RIRs) for many decades.1,2 It is praised for its ability to achieve a high signal-to-noise ratio (SNR) in the resulting RIRs,3 robustness and consistency of measurements,3 and ease of rejection of harmonic distortion.2–5 The ESS is, however, sensitive to non-stationary disturbances, which translate to artifacts in the deconvolved RIRs.3,6,7 Additionally, if averaging of repeated ESS measurements is used to enhance the SNR, it often leads to time-variance-induced loss of energy in the resulting signal, especially at high frequencies.7,8 This work focuses on non-stationary noise removal from repeated sweep measurements while avoiding energy loss. Such methods are particularly useful when the sound environment cannot be fully controlled, such as in public spaces, urban environments, outdoor places, or factories with machinery.9
Currently, there is a lack of methods to remove non-stationary noise from acoustic measurements. Only a few studies known to the authors have developed techniques for this specific problem,6,10,11 and none of them achieves complete removal of non-stationary noise. The literature regarding the removal of clicks from audio signals, such as speech or music,12,13 is more comprehensive, with median filtering being a popular signal-processing technique.12 Those approaches, though, are unsuitable for acoustic measurements, as the transients are only moderately suppressed and not fully rejected.13 Additionally, clicks are just one of several types of non-stationary noise, which can also include fairly long or tonal noise events.6,7
In a previous study, we faced the problem of non-stationary noise corrupting background noise energy estimates in a detection algorithm.7 We found that using a median instead of a mean as an energy estimator significantly improved the robustness of energy evaluation and, in consequence, the non-stationary noise detection process.7 Similarly, the median is used in background subtraction techniques in image processing, where the across-frame temporal median filter allows the separation of moving (non-stationary) objects, such as cars and people, from the stationary background image.14 Median filtering was also used to remove unwanted reflections in semianechoic RIRs.15
This letter introduces a method that applies across-measurement median filtering to recorded sweeps. As the method assembles a clean sweep from fragments of several ESS measurements, we call it Mosaic. In analogy with across-frame video filtering, we treat the ESS and sound decay as a “stationary background image,” while the non-stationary noise is regarded as a “moving object” and filtered out. We show the advantage of time-frequency filtering over time-domain filtering in achieving a measurement with highly diminished non-stationary disturbances. The background noise energy suppression in the filtered sweeps is estimated. We compare the proposed method with commonly used averaging and with state-of-the-art methods of removing transients from acoustic measurements,6,10 showing the superiority of this approach.
2. Methodology
This section presents the proposed method for non-stationary noise removal. We discuss both time-domain (Mosaic-T) and time-frequency-domain (Mosaic-TF) variants of median filtering across repeated measurements.
2.1 Median filtering across repeated measurements
Given the possibility of recording an erroneous or corrupted signal, it is common practice to capture more than one sweep during an acoustic measurement session, especially when the environment is noisy. In the previous method, called the Rule of Two (Ro2), we recommended repeating the measurement until at least a pair of clean (i.e., free from non-stationary noise) sweeps is captured.7 Performing sufficiently many repetitions, however, is not always feasible, and all captured sweeps may be contaminated. Here, we take advantage of having multiple repetitions of the same measurement to remove non-stationary noise and recover a clean sweep from several contaminated ones.
Median filtering acts as a robust estimator that is especially stable against out-of-distribution values, as its breakdown point is 0.5. This means that if fewer than 50% of the data points are contaminated, the estimator still yields the correct value.16 Therefore, to recover a clean sweep from a series of K measurements based on Eq. (1), at each time n, at least $\lceil (K+1)/2 \rceil$ samples need to be free from non-stationary noise, where $\lceil \cdot \rceil$ is the rounding-up operator. In other words, there is no requirement for any of the K sweeps to be completely free of non-stationary noise as long as the disturbances do not occur at the same time. It is crucial, however, that all sweeps are time aligned before Mosaic is applied. Therefore, in this work, we pre-process all ESSs by upsampling by a factor of ten,17 time-aligning based on the lag of the maximum of the cross correlation between two sweeps, and downsampling to the original sampling rate.
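The alignment step described above could be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper names `upsample_fft` and `align_to_reference`, the FFT-based interpolation, and the circular shift are all assumptions made for this example.

```python
import numpy as np

def upsample_fft(x, factor):
    """Band-limited upsampling of a real signal by zero-padding its spectrum."""
    n = len(x)
    spectrum = np.fft.rfft(x)
    if n % 2 == 0:
        spectrum[-1] *= 0.5  # share the Nyquist bin between +/- frequencies
    # irfft zero-pads the spectrum to the longer length; rescale the amplitude.
    return np.fft.irfft(spectrum, n * factor) * factor

def align_to_reference(reference, sweep, factor=10):
    """Align `sweep` to `reference` with a resolution of 1/factor samples.

    Returns the aligned sweep at the original rate and the estimated delay
    in (possibly fractional) samples.
    """
    ref_up = upsample_fft(reference, factor)
    swp_up = upsample_fft(sweep, factor)
    # Lag of the cross-correlation maximum (positive: `sweep` is late).
    xcorr = np.correlate(swp_up, ref_up, mode="full")
    lag = int(np.argmax(xcorr)) - (len(ref_up) - 1)
    # Assumption: a circular shift is acceptable because the recording
    # starts and ends in silence; otherwise pad or trim instead.
    aligned_up = np.roll(swp_up, -lag)
    # Keep every `factor`-th sample; shifting adds no high-frequency content,
    # so no extra low-pass filter is applied before decimation.
    return aligned_up[::factor], lag / factor

# Toy check: a windowed chirp delayed by 3 samples.
t = np.arange(224)
burst = np.sin(2 * np.pi * (0.01 * t + 0.0005 * t**2)) * np.hanning(224)
reference = np.concatenate([np.zeros(16), burst, np.zeros(16)])
delayed = np.roll(reference, 3)
aligned, delay = align_to_reference(reference, delayed)
```

For sweeps of realistic length, the direct cross correlation used here would be slow, and an FFT-based correlation would likely be preferable.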
An example of non-stationary noise removal from a series of K = 3 ESS signals is presented in Figs. 1(a) and 1(b). Every signal was contaminated with a synthesized disturbance: a broadband click of 0.1 s duration (sweeps 1 and 3) or a sine tone of 1 kHz and lasting for 0.6 s (sweep 2). The noise events are marked in red in Fig. 1(a). Mosaic-T following Eq. (1) removes all the disturbances, recovering a clean sweep, as shown in Fig. 1(b).
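A noise-free toy version of this experiment can illustrate the idea. The sketch below assumes that Mosaic-T amounts to a sample-wise median across the K time-aligned sweeps, as the text describes; the function names, the sampling rate, and the timing and level of the synthetic disturbances are hypothetical choices for this example only.

```python
import numpy as np

def exponential_sweep(f1, f2, duration, fs):
    """Exponentially swept sine from f1 to f2 Hz (standard ESS phase law)."""
    t = np.arange(int(duration * fs)) / fs
    rate = np.log(f2 / f1)
    phase = 2 * np.pi * f1 * duration / rate * (np.exp(t * rate / duration) - 1)
    return np.sin(phase)

def mosaic_t(sweeps):
    """Sample-wise median across K time-aligned sweeps: (K, N) -> (N,)."""
    return np.median(np.asarray(sweeps, dtype=float), axis=0)

fs = 8000
clean = exponential_sweep(50.0, 3000.0, 1.0, fs)
t = np.arange(len(clean)) / fs
rng = np.random.default_rng(0)

# Three copies of the sweep, each with one disturbance at a different time.
sweeps = np.tile(clean, (3, 1))
sweeps[0, 800:1200] += rng.standard_normal(400)                         # click, sweep 1
sweeps[1, 3200:5600] += 0.5 * np.sin(2 * np.pi * 1000 * t[3200:5600])   # tone, sweep 2
sweeps[2, 6400:6800] += rng.standard_normal(400)                        # click, sweep 3

median_error = np.max(np.abs(mosaic_t(sweeps) - clean))
mean_error = np.max(np.abs(sweeps.mean(axis=0) - clean))
```

Because at every sample at least two of the three values equal the clean one, the median returns the clean sweep, whereas the mean retains one third of each disturbance.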
Figure 1(c) shows a histogram of the sample-value distribution for the ESSs from Fig. 1(a) between 2.3 and 3 s from the start of each sweep, i.e., at the time of the tonal disturbance in sweep 2. The distribution obtained from sweeps 1 and 3 alone, which are clean in this interval, differs from the one obtained when all sweeps are analyzed: including the contaminated sweep visibly broadens the distribution. However, the median values are unaffected, showing that, regardless of the non-stationary noise presence, less than 50% of the samples are contaminated.
2.2 Median filtering across time and frequency
Figure 1(d) shows the STFTs of a series of sweeps contaminated with synthetic tonal disturbances: a sine of 2 kHz lasting 3 s in sweep 1, a 0.6-s-long 1-kHz tone in sweep 2, and a 2-s-long 5.6-kHz tone in sweep 3. Again, the occurrences of non-stationary noise events are marked in red. The disturbance removal was performed according to Eq. (2). The result of Mosaic-TF is depicted in Fig. 1(e), showing the STFT of a perfectly cleaned ESS.
Figure 1(f) illustrates the distributions of magnitude values in the 1-kHz frequency bin, where the tonal disturbance appears in sweep 2. As in Fig. 1(c), when only the non-contaminated bins from sweeps 1 and 3 are considered, the distribution is different from when the bins of all sweeps are analyzed. Notably, the number of high-magnitude bins increases with the presence of contamination. The medians of both distributions, however, remain unchanged, showing that less than half of all bins contain non-stationary noise. For this reason, the noise removal is perfect in Fig. 1(e).
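A time-frequency variant of the across-measurement median can be sketched in a similar way. The text does not specify here how the median of complex STFT bins is taken, so the code below assumes separate medians of the real and imaginary parts; the STFT parameters (`N_FFT`, `HOP`), the helper names, and the synthetic tones are likewise hypothetical.

```python
import numpy as np

N_FFT, HOP = 256, 64
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window

def stft(x):
    """Hann-windowed STFT of a real signal; returns (frames, bins)."""
    x = np.pad(x, (N_FFT, N_FFT))  # padding keeps the signal edges fully covered
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length):
    """Weighted overlap-add inverse of `stft`, trimmed to `length` samples."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    out = np.zeros(N_FFT + HOP * (len(frames) - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WINDOW**2
    out /= np.maximum(norm, 1e-12)
    return out[N_FFT:N_FFT + length]

def mosaic_tf(sweeps):
    """Across-measurement median per time-frequency bin, then resynthesis."""
    specs = np.stack([stft(s) for s in sweeps])  # (K, frames, bins)
    # Assumption: medians of real and imaginary parts are taken separately.
    median_spec = np.median(specs.real, axis=0) + 1j * np.median(specs.imag, axis=0)
    return istft(median_spec, len(sweeps[0]))

def add_tone(x, freq, start, stop, fs, amp=0.5):
    """Return a copy of x with a sine tone added between start and stop (s)."""
    y = x.copy()
    n = np.arange(int(start * fs), int(stop * fs))
    y[n] += amp * np.sin(2 * np.pi * freq * n / fs)
    return y

fs = 8000
t = np.arange(fs) / fs
rate = np.log(3000.0 / 50.0)
clean = np.sin(2 * np.pi * 50.0 / rate * (np.exp(t * rate) - 1))  # 1-s ESS

# Tones overlap in time but sit at different frequencies in each sweep.
sweeps = np.stack([
    add_tone(clean, 500.0, 0.1, 0.9, fs),
    add_tone(clean, 1500.0, 0.2, 0.8, fs),
    add_tone(clean, 2500.0, 0.3, 0.7, fs),
])

def rms(e):
    return float(np.sqrt(np.mean(e**2)))

rms_tf = rms(mosaic_tf(sweeps) - clean)
rms_t = rms(np.median(sweeps, axis=0) - clean)   # time-domain median
rms_mean = rms(sweeps.mean(axis=0) - clean)      # plain averaging
```

In this toy case the tones overlap in time, so the time-domain median cannot reject them, while they remain separated in the time-frequency plane.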
3. Validation on measurements
In the following, we evaluate the proposed non-stationary noise removal method on two sets of measured ESS signals. We compare Mosaic with commonly used averaging in terms of contamination rejection, noise energy suppression, and robustness toward time variance.
3.1 Evaluation data
The validation of the proposed method was performed on two datasets of sweep measurements: Arni and MRTD. Arni contains sweeps recorded in the variable acoustic laboratory of the Acoustics Lab of Aalto University in Espoo, Finland. The ESSs were 3 s long and spanned frequencies from 20 Hz to 20 kHz. Each measurement was repeated five times (K = 5). A more detailed description of the measurement procedure and equipment can be found in the literature.7,19 We used the Ro2 on Arni to identify the sweeps corrupted by non-stationary noise.7
MRTD comprises sweeps measured in several spaces of the Otakaari 5 building in Espoo, Finland. The details of the measurement setup and locations are presented elsewhere.20 The excitation signals were 2-s-long ESSs repeated three times for each configuration (K = 3). The noisy environment caused non-stationary noise to appear in many of the measured sweeps, with impulsive noise and speech being the most common disturbances.
3.2 Non-stationary noise removal using Mosaic
The performance of Mosaic-T, Mosaic-TF, and averaging is compared in Fig. 2, using three sweeps from the MRTD dataset as an example. Each ESS is contaminated with either speech [cf. Fig. 2(a)] or impulsive noise [cf. Figs. 2(b) and 2(c)].
The results of the averaging process are depicted in Fig. 2(d), showing remains of the speech signal (between 0.4 and 1 s) and transients (around 1.1 s) still present. If Mosaic-T is used, the non-stationary noise events are removed, but harmonic distortion is enhanced, as illustrated in Fig. 2(e). As all the disturbances happen at different times and different frequencies in the measured sweeps, they are successfully removed by Mosaic-TF, as shown in Fig. 2(f).
Based on the results in Fig. 2, we recommend that when the measurement environment is noisy, contamination removal is carried out on signal STFTs, according to Eq. (2). Time-domain median filtering might also be used, as harmonic distortion is easily rejected in the deconvolution process. Averaging, however, proved unsuitable for post-processing ESSs containing non-stationary noise. The optimal time-frequency resolution highly depends on the nature of the disturbance. The filtering domain should be chosen such that the sparsity of the disturbances is increased so that repeated occurrence becomes unlikely.
3.3 Background-noise energy suppression
The aim of averaging repeated sweep measurements is to enhance the SNR of the resulting signal by suppressing the background-noise energy. A similar effect can be obtained using across-measurement median filtering. However, median filtering is less efficient than averaging for stationary noise with a Gaussian distribution, resulting in background-noise energy suppression that is lower by up to $20\log_{10}(\pi/2) \approx 3.92$ dB.21
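Table 1. Background-noise energy suppression for each method and number of repetitions K.

Method | Mosaic-T, K = 3 | Mosaic-T, K = 5 | Mosaic-TF, K = 3 | Mosaic-TF, K = 5 | Time-domain averaging, K = 3 | Time-domain averaging, K = 5
---|---|---|---|---|---|---
Mean (dB) | 6.9 | 11.1 | 7.5 | 11.7 | 9.8 | 14.6
SD (dB) | 6.6 | 7.5 | 7.0 | 7.9 | 8.9 | 9.8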
The results show that all algorithms perform well in line with the theory, both when K = 3 (expected background-noise energy suppression is 5.62 dB for median and 9.54 dB for mean) and K = 5 (expected result is 10.06 dB and 13.98 dB for median and mean, respectively). Mosaic-TF consistently outperforms Mosaic-T. However, the standard deviations, also given in Table 1, show considerable variability in the performance of all three methods.
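The expected values quoted above can be reproduced with a short calculation. The closed forms used below, $20\log_{10}K$ for the mean and $20\log_{10}(2K/\pi)$ for the median, are not spelled out in the text; they are an assumption that happens to match the quoted figures and the large-sample efficiency of the median for Gaussian noise.

```python
import math

def expected_suppression_db(k):
    """Return (median_db, mean_db) for K repetitions.

    Assumed closed forms, chosen because they reproduce the values
    quoted in the text; they are not taken from the original equations.
    """
    mean_db = 20 * math.log10(k)
    median_db = 20 * math.log10(2 * k / math.pi)
    return median_db, mean_db

for k in (3, 5):
    median_db, mean_db = expected_suppression_db(k)
    print(f"K = {k}: median {median_db:.2f} dB, mean {mean_db:.2f} dB")
# K = 3: median 5.62 dB, mean 9.54 dB
# K = 5: median 10.06 dB, mean 13.98 dB
```

Under these assumptions, the gap between the two estimators is the same for every K, about 3.92 dB.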
3.4 Energy loss due to time variance
In room acoustics, the measurement environment is never strictly time-invariant due to small changes in temperature, humidity, and air movement.8 Therefore, a common drawback of across-measurement averaging is the loss of signal energy (either in the sweep or in the deconvolved RIR), which is most prominent at high frequencies.8
The time-variance-induced energy loss is illustrated in Fig. 3, using ESSs from the MRTD dataset as an example. The energy of median-filtered and averaged signals is visibly lower than the single-sweep energy toward the end of the ESS, i.e., while the high frequencies are played. This difference cannot be attributed to background-noise energy suppression, which should influence the entire sweep in the same way and not only the high frequencies.
Figure 3 suggests that the time-variance-induced energy loss is more prominent when averaging is used than when median filtering is used. To test this hypothesis, we compared the energy loss between single sweeps and either averaged or median-filtered sweeps from the Arni (900 conditions, K = 5) and MRTD (324 conditions, K = 3) datasets.
The relatively stable environment during data collection for Arni results in energy losses of around 0.4 dB, with no significant difference between the averaged and median-filtered signals. The much more time-variant MRTD measurements show that median filtering attenuates the high frequencies by 1.5 dB on average, whereas averaging leads to energy losses of 2.3 dB on average. This shows that Mosaic-TF is more robust toward time-variance-induced energy losses in sweeps than averaging.
4. Comparison with non-stationary noise removal methods
In the previous section, we showed that the proposed method is superior to frequently used averaging when it comes to recovering a clean ESS signal in the presence of non-stationary noise. In the following, we compare Mosaic with the non-stationary noise removal techniques proposed by Ćirić et al.6 and Telle and Vary.10
Ćirić et al.6 refer to their procedures as A, B, and C. In procedure A, the contaminated part of the measured sweep is replaced with the corresponding normalized fragment of the excitation ESS. In procedure B, the magnitude of the contaminated part is multiplied by the magnitude of the excitation sweep, while the phase stays the same. Procedure C requires prior knowledge of the background noise and the transient itself, as they are used in inverse filtering when deconvolving sweeps into RIRs. Telle and Vary10 use weighted averaging, where weights for each of K signals are determined based on signal power. The signal with the highest power is assumed contaminated, and thus receives the lowest weight.
In this work, we used a series of five sweeps from Arni, where one of the signals is contaminated with transient noise. We carried out procedures A, B, and C on a single sweep, and weighted averaging and proposed Mosaic-TF using K = 5 measurements. For procedure C, we estimated the background noise from the parts of measurement without sweep playing, and an RIR obtained from one of the clean sweeps was used as an equivalent of the transient noise.
The comparison results are depicted in Fig. 4. The sweeps in Fig. 4(a) represent the measured signals contaminated with a transient (marked by a blue rectangle). The proposed method, shown in Fig. 4(b), exhibits complete suppression of the impulsive artifact and the added advantage of enhanced SNR. Weighted averaging is illustrated in Fig. 4(c), showing remains of the transient still present. Procedure A, depicted in Fig. 4(d), removes the transient, but the information about the sound decay in the contaminated region is lost. Moreover, the lack of fade-in and fade-out in the replaced part of the sweep leads to the appearance of clicks. Procedure B, shown in Fig. 4(e), rejects most of the contaminating impulse but also produces clicks. Procedure C completely failed the test, as the non-stationary noise in Fig. 4(f) is identical to the original transient from Fig. 4(a).
To quantify the noise removal performance of the compared methods, we calculated the Pearson correlation coefficients (PCCs) between the sweeps shown in Fig. 4 and one of the clean sweeps from the measurement. PCC is an indicator of contamination in a measurement,7 and thus evaluates the success of non-stationary noise removal methods. The PCC values for the compared techniques are displayed in the top-left part of each pane in Fig. 4. Mosaic-TF shows the highest PCC value and the biggest improvement from the initial PCC of 0.9983 in Fig. 4(a). Additionally, Mosaic is applied to the sweeps without prior knowledge of any signal parameters (power,10 contamination location6). The proposed method appears to be superior to all four procedures and thus should be used when the removal of non-stationary noise is necessary.
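For illustration, the PCC between a processed sweep and a clean reference can be computed as in the following sketch. The toy signals, the transient level, and the name `pearson_cc` are assumptions for this example and do not reproduce the values in Fig. 4.

```python
import numpy as np

def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x - x.mean()
    y = y - y.mean()
    return float(np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y)))

rng = np.random.default_rng(1)
n = np.arange(8000)
reference = np.sin(2 * np.pi * 0.01 * n * (1 + n / 8000))  # toy chirp, not a measured ESS
contaminated = reference.copy()
contaminated[3000:3200] += rng.standard_normal(200)        # short synthetic transient

pcc_clean = pearson_cc(reference, reference)
pcc_contaminated = pearson_cc(reference, contaminated)
```

A PCC closer to 1 indicates that less contamination remains relative to the clean reference.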
5. Conclusion
This letter proposes a method to remove non-stationary noise from repeated acoustic swept-sine measurements. The proposed across-measurement median filtering, called Mosaic, rejects disturbances and recovers a clean sweep suitable for further processing, provided that fewer than half of the measurements are contaminated at each sample time or frequency bin. We show that performing the median filtering in the time-frequency domain gives the best results.
Comparison with commonly used across-measurement averaging shows the advantages of the proposed method in terms of non-stationary noise rejection and robustness against the time-variance-induced energy loss. In the test against other techniques for non-stationary noise removal, the proposed method offered significant improvements without further corrupting the resulting sweep and with no prior knowledge of the signal or contamination.
The results of this letter suggest that, when measuring in an uncontrolled environment, multiple shorter ESSs should be preferred over one very long ESS. The ESSs should still be long enough to obtain a sufficient SNR and harmonic distortion rejection. Given sufficiently many repetitions, the proposed technique can remove virtually any non-stationary noise that is sparse in time and frequency.
Author Declarations
Conflict of Interest
The authors have no conflict of interest to disclose.
Data Availability
The raw sweep data are available from the authors upon request. The sound examples that support claims made in this work and related code are available online (https://github.com/KPrawda/mosaic_noise_removal).