Determining the absolute onset time of reflections in an acoustic impulse response (IR) has applications for both subjective and physical acoustics problems. Although computationally simple, a first-order energetic analysis of the IR can lead to false-positive identification of reflections. This letter reports on a method to determine reflection onset timings using a modified running local kurtosis analysis to identify regions in the IR where the distribution is non-normal. IRs from both real and virtual rooms are used to validate the method and to find optimum analysis window sizes.
I. Introduction
The absolute timings of early reflections in a room impulse response (RIR) can be used to solve a variety of problems relating to both subjective and physical acoustical phenomena in the corresponding environment (be it real or virtual). Although slight imprecision in the timing detection will not dramatically change coloration nor spatial impression, since both are evaluated at a more global level, the initial time-delay gap (ITDG) can affect perceptual integration of the direct and reflected sound (i.e., the precedence effect) and may affect speech intelligibility or musical timbre. Physical characteristics of a space can also be determined from precise timings of reflections in an acoustic IR; a process commonly used in SONAR, seismic exploration, and medical diagnosis.
The most computationally simple method to identify the reflection onset timings (ROTs) in a RIR is from an analysis of the energy envelope. There are a variety of approaches to this first-order energy profile analysis for onset detection within an audio signal, such as using tuned edge-detection wavelet filters or spectral-domain energy fluctuations (for instance, this has been applied to finding the onset of transient events in music audio signals1). It is the goal of this short letter to first describe the short-comings of using an energy profile analysis to determine ROTs in a RIR, and to second introduce an improved method based on a local running statistical analysis of the RIR. The analysis is particularly directed toward identifying low-order early reflections after the direct-sound.
II. Characterizing a room impulse response by ROTs
A. Energy profile analysis vs statistical energy analysis
To demonstrate the short-coming of using a first-order energetic analysis to determine ROTs, we refer to the RIR described in Fig. 1 that was measured in an open-ended room of dimensions . By analysis of an image-source model constructed using the same acoustic measurement parameters, we can verify that the “true” first reflection actually corresponds to the peak marked with the box symbol in Fig. 1. However, using a simple analysis of the energy envelope of the RIR, the “false” reflection peak marked with the cross symbol shortly after the direct-sound peak would be mistaken as the first reflection. Such false reflections in the energy profile may be a result of distortion from anti-aliasing filters in the analog to digital converter system or from nonlinearities in the excitation and measurement transducer systems. Precise reflection onset detection in a RIR is a problem further compounded by scattering effects from room surfaces, which can introduce a Poisson-like distribution for early reflections from averaged acquisition methods (such as maximum length sequence or swept-sine techniques).2
Statistical approaches to identifying reflection timings in an acoustic impulse response can overcome the problems with an energy analysis by utilizing the second-order statistics of a RIR. Reverberation is generally defined as a physical phenomenon where the sound reflections impinging on a point are such that the pressure can be modeled as a stochastic function with a normal distribution and a mean of zero, and that part of a RIR that has a normal distribution is called the reverberant part.3 The degree to which a part of a RIR has a normal distribution (the “degree of normality”) has been used before to determine the onset of the reverberation. Two such measures of normality are “average rms response fluctuation” and kurtosis. Schroeder’s frequency-domain stochastic model3 predicts that 68.26% of the response curve will lie within about 10 dB of the average level. Similarly, the Abel and Huang criterion4 for the reverberant part of a RIR is based on the well-known phenomenon that 68.26% of the observations of a normal population will be found within 1 standard deviation of its mean.5
Another well established and understood measure of the degree of normality is kurtosis. Kurtosis can be considered as a measure of the degree of peakedness of a distribution, or more correctly as the degree of bi-modality. The kurtosis of a data series is defined as the ratio of its fourth central moment to its standard deviation raised to the power of 4 (hence kurtosis is always positive) as follows:
where is the statistical expectation operator, is the mean, and is the standard deviation of the data series .
To develop a running kurtosis analysis of data , consider a block of samples centered about discrete time sample as follows:
The running kurtosis of is now defined as
where and are the mean and standard deviation of the -length vector . If sample is large compared with the local samples then the kurtosis will be high (leptokurtic); a flat distribution in —a low peakedness—will have a low kurtosis (platykurtic); and a normal distribution has a kurtosis of 3. However, a bi-modal distribution will also give a low kurtosis; so therefore low kurtosis values are not necessarily an indication of low peakedness.6 Kurtosis has been used before for acoustic IR analysis: for determining the optimum length of adaptive filters for applications such as blind audio upmixing7 and for determining the onset of musical instruments in an audio signal.1 In a RIR, spurious large-value samples from strong individual reflections will clearly have a leptokurtic distribution, giving a kurtosis greater than 3. It is this principle that forms the basis of the proposed method for identifying the onset of early reflections in a RIR.
B. Modified kurtosis method
Similar to the “echogram” or “reflection diagram” of Kuttruff,8 which shows the absolute strength (i.e., acoustic power) of reflections impinging on a point over time, we can show a plot of the RIR kurtosis as a function of time (though the magnitude of the kurtosis is not necessarily related directly to the strength of the reflections). Such a “kurtosis echogram” for a RIR measured in a concert hall is shown in Fig. 2, where the onset of the strong first-order reflection from the rear of the concert hall can be discerned about 0.4 s after the direct-sound. However, fine-structure detail of the early impulse response is unclear and individual early reflections cannot be discerned.
The temporal resolution of the running kurtosis echogram is affected by two parameters: the range over which the local kurtosis is calculated (i.e., the window size ) and the overlap between successive kurtosis frames. As we saw in Fig. 2, a long temporal window smears the fine-structure of the early impulse response. This can be understood analytically by considering the reflection density of the RIR. The reflection density (reflections/s) can be approximated by
where is the speed of sound, is the room volume, and is the time since the RIR onset. It has been reported that this relationship holds not just for rectangular rooms but also for any room shape.8 Using Eq. (3), we can determine that for the first 6 ms of our test room the echo-density is approximately 1 arrival/ms, corresponding to 1 reflection arriving every 5 samples for a 48 kHz sample-rate system. If the kurtosis analysis window is less than the inter-reflection time then the expected difference between the instant RIR sample value and the local mean may be lower than the RIR mean (or the RIR mean calculated with a larger local window) and likewise the kurtosis will be low even when a reflection is within the analysis window. Alternatively, if the kurtosis window is too large then the temporal resolution for kurtosis will be smeared, and a bi-modal distribution from double peaks within the analysis window can also give a low kurtosis.6 We therefore introduce a modified kurtosis metric that is a function of two analysis windows: a first small window to estimate a “fast” local estimate of the RIR level [this replaces the single sample value in Eq. (1)], and a second longer local window over which a “slower” mean and standard deviation are calculated. The modified kurtosis estimate is now defined as
where and are the lengths of a first and second data window, both centered about sample of the data series ; and are the means of the -length and -length local data windows ; and is the standard deviation of the -length window. An empirical investigation was conducted to find optimum window lengths and .
III. Results and discussion
The short window size for the estimate of the fast running-average, i.e., the length , was four samples. This gave sufficient temporal resolution to capture a single reflection; and yet the time of arrival between two consecutive reflections in the early part of a RIR is generally at least four samples for small rooms, as discussed previously with reference to the Kuttruff echo-density trend in Eq. (3). The window size for the longer local IR window, i.e., the length , was varied from being 2 to 16 times the length of the four-sample window. Figure 3 show that as the ratio of the long to small window (i.e., ) is increased, then the resulting “kurtosis echogram” becomes less noisy and the kurtosis peaks can be more easily discerned. However, for ratios greater than 16, the width of these peaks increases, making it difficult to precisely identify the timing of peak maxima.
The criterion for selecting the optimum ratio was by a difference between the ROTs of the measured RIR using the new two-window running kurtosis method, compared with the true ROTs as determined using the analytic image-source room model. The empirical ROTs were determined by analysis of the kurtosis echograms shown in Fig. 3 using a peak-picking inspection of the kurtosis data for peaks greater than a value of 3. The measurement was repeated with six different source-receiver locations (in the same room), and the real-world ROTs were compared with the virtual ROTs determined from the corresponding room model. For the first six ROTs, there was a 100% correspondence between the real and virtual ROTs for six different source-receiver test configurations when the ratio was equal to 16. However, these exploratory findings used source-receiver locations that were chosen to be non-symmetrical about the room axes. For source-receiver locations centered near the central room axes, we would find that ROTs nearly coincided and would therefore decrease the inter-reflection time to a value less than the smallest analysis window yielding a bi-modal distribution with a low kurtosis; hence the reflection onsets may not be detected. Accordingly, we conclude that the kurtosis analysis method is effective for RIR echo-densities less than approximately 1-per-ms, which corresponds to approximately the first 80 ms of the RIR for a concert-hall sized room [predicted using Eq. (3)]. Although the analysis was conducted on full-band audio signals, the method for identifying ROTs is applicable for filtered IRs, such as octave of third-octave filtered time signals.
IV. Conclusion
A new method has been proposed to determine the onset of individual reflections in the early part of a RIR. The method is based on a running local kurtosis analysis with two analysis windows to identify regions in the RIR where the distribution is non-normal. RIRs determined from both real and virtual rooms were used to validate the new method and to find optimum analysis window sizes, where a first window size of 0.8 ms and a second of 12.8 ms were found to accurately identify reflection onsets for the early part of the RIR. The effectiveness of identifying reflection onsets was found to be related to the relative length of the smaller window and the reflection density: The new method was reliable for densities less than 1-per-ms.
Acknowledgments
This work was partly supported by grants from iMP, Call No. FP7-ICT-2007-3 and Objective ICT-2007.4.4. The author is grateful for the comments from three anonymous reviewers.