Several applications in spatial audio signal processing benefit from knowledge of the diffuseness of the sound field. In this paper, several experiments are performed to determine the response of a tetrahedral microphone array under a spherically isotropic sound field. The data were gathered with numerical simulations and with real recordings using a spherical loudspeaker array. The signal analysis, performed in the microphone-signal and spherical harmonic domains, reveals the characteristic coherence curves of spherically isotropic noise as a function of frequency.
1. Introduction
A number of practical applications benefit from knowledge about the diffuseness of a sound field, including speech enhancement and dereverberation (Habets et al., 2006), noise suppression (Ito et al., 2010), source separation (Duong et al., 2010), and background estimation (Stefanakis and Mouchtaris, 2015). In the field of spatial audio, diffuseness estimation is often used for parametrization (Politis et al., 2018; Pulkki, 2006), Direction-of-Arrival estimation (Thiergart et al., 2009), and source separation (Motlicek et al., 2013).
In this paper, we study diffuseness estimation by subjecting a tetrahedral microphone array to spherically isotropic noise fields. The motivation for this work is twofold. First, tetrahedral arrays are a well-known type of microphone array that has become popular for applications related to Virtual and Augmented Reality. Second, the spherically isotropic sound field is known to be a good approximation of the reverberant part of the sound field in a room (Elko, 2001; McCowan and Bourlard, 2003), and it is therefore interesting to investigate how different microphone arrays behave under such conditions.
1.1 Coherence analysis
Diffuseness is commonly estimated through the Magnitude Squared Coherence (MSC) (Elko, 2001) between two frequency-domain signals S1 and S2, expressed as a function of the wavenumber k and the microphone distance r in Eq. (1),
where the operator E{·} represents the temporal expected value, and * denotes the complex conjugate. In the case of spherically isotropic noise fields, Eq. (1) can be expressed in terms of the microphone directivity patterns T(ϕ, θ, kr), as given by Eq. (2) (Elko, 2001).
Moreover, the general expression of the directivity of a first-order differential microphone is given by Eq. (3),
where i ∈ [1, 2] is the microphone index, ψi is the angle between the wave incidence direction and the microphone orientation axis, and αi ∈ [0, 1] is the directivity parameter of microphone i, which ranges from bidirectional (αi = 0) to omnidirectional (αi = 1). For first-order differential microphones, there is a closed-form expression, Eq. (4), for the numerator and denominator of Eq. (2),
where the wave incidence angle ψi is expressed in spherical coordinates (with azimuth ϕ and inclination θ).
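For reference, a sketch of the conventional forms that the MSC of Eq. (1) and the first-order directivity of Eq. (3) take in the literature (Elko, 2001); the exact notation is assumed here rather than quoted:

\[
\gamma_{12}^{2}(kr) = \frac{\bigl|\operatorname{E}\{S_{1}(k)\,S_{2}^{*}(k)\}\bigr|^{2}}{\operatorname{E}\{|S_{1}(k)|^{2}\}\,\operatorname{E}\{|S_{2}(k)|^{2}\}},
\qquad
T_{i}(\psi_{i}) = \alpha_{i} + (1-\alpha_{i})\cos\psi_{i}.
\]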
1.2 Diffuseness estimation in ambisonics
Let us consider a sound field captured with a spherical microphone array containing Q microphones distributed on a spherical surface of radius R at the angular positions given by the azimuth-inclination pairs Ωq = (ϕq, θq). The captured frequency-domain signals Xq(k) can be represented as the spherical harmonic domain signals Xmn(k) through the spherical harmonic transform of order L, Eq. (5) (Moreau et al., 2006),
where Ymn(Ωq) are the real-valued spherical harmonics, and Γm(kR) are the radial filters or equalization terms of order m, with m ∈ [0, L] and n ∈ [−m, m].
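As a sketch, one common discretization of Eq. (5) for a uniform capsule layout is shown below; the uniform quadrature weight 4π/Q is an assumption, and practical encoders often use the pseudo-inverse of the spherical harmonic matrix instead:

\[
X_{mn}(k) \approx \Gamma_{m}(kR)\,\frac{4\pi}{Q}\sum_{q=1}^{Q} X_{q}(k)\,Y_{mn}(\Omega_{q}).
\]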
For a number of practical reasons, it is desirable to distribute the microphone capsules uniformly over the sphere, with the regular tetrahedron being the simplest possible configuration (Gerzon, 1975). Capsule signals recorded with such a topology are referred to as A-Format signals. In turn, the term B-Format (ambisonics) describes the result of applying Eq. (5) (ambisonic encoding) to the A-Format signals. One of the most common coherence estimators for first-order ambisonic frequency-domain signals Xmn(k) is the diffuseness Ψ as defined in DirAC, Eq. (6) (Pulkki, 2006),
where X0(k) = X00(k) and the first-order signals X1−1(k), X10(k), and X11(k) are SN3D-normalized. For the sake of clarity, we further define the B-Format coherence estimator Δ in Eq. (7).
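As a sketch, in a common SN3D formulation of DirAC (Pulkki, 2006), Eq. (6) takes the form below; reading Eq. (7) as Δ = 1 − Ψ is an assumption, consistent with Δ ≈ 0 under ideal diffuse conditions (Sec. 3.2):

\[
\Psi(k) = 1 - \frac{2\,\bigl\|\operatorname{E}\{\operatorname{Re}\{X_{0}^{*}(k)\,\mathbf{x}_{1}(k)\}\}\bigr\|}{\operatorname{E}\{|X_{0}(k)|^{2} + \|\mathbf{x}_{1}(k)\|^{2}\}},
\qquad
\Delta(k) = 1 - \Psi(k),
\]

where x1(k) = [X1−1(k), X10(k), X11(k)]ᵀ and the constant 2 follows from the SN3D convention.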
Under spherically isotropic noise, the theoretical coherence between any pair of zeroth- and first-order ambisonic virtual microphones is equal to 0 at all frequencies, due to the orthogonality and symmetry of the spherical harmonics (Elko, 2001). This result can also be verified with Eq. (4). However, several practical factors might corrupt the coherence estimation, such as the approximation of the temporal expectation by time averaging (Thiergart et al., 2011) in Eq. (6), or the non-ideal implementation of the radial filters Γm(kR) (Schörkhuber and Höldrich, 2017). In Secs. 2.1–2.3, we present several experiments that illustrate the behavior of different coherence estimators applied to the signals captured with a tetrahedral microphone subjected to spherically isotropic noise, using both simulated and real sound recordings.
2. Methods
2.1 Simulation
Spherically isotropic noise has been generated following the geometrical method (Habets and Gannot, 2007, 2010), using N = 1024 plane waves. The resulting A-Format signals correspond to a virtual tetrahedral microphone array mimicking the Ambeo1 characteristics (R = 0.015 m, α = 0.5). The generated audio has a duration of 60 s.
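A minimal sketch of this kind of simulation, assuming an open (scattering-free) capsule model, the standard FLU/FRD/BLD/BRU tetrahedral orientations, and a direct frequency-domain synthesis; it is not the implementation used for the reported results:

```python
import numpy as np

fs, dur, N = 48000, 60.0, 1024                 # sample rate, duration, plane waves
c, R, alpha = 343.0, 0.015, 0.5                # speed of sound, array radius, directivity
n_samples = int(fs * dur)                      # reduce dur for a quick test
rng = np.random.default_rng(0)

# Uniformly distributed plane-wave directions on the sphere
az = rng.uniform(0.0, 2.0 * np.pi, N)
incl = np.arccos(rng.uniform(-1.0, 1.0, N))
dirs = np.stack([np.sin(incl) * np.cos(az),
                 np.sin(incl) * np.sin(az),
                 np.cos(incl)], axis=1)        # (N, 3) unit vectors

# Assumed tetrahedral capsule orientations (FLU, FRD, BLD, BRU)
caps = np.array([[1.0, 1.0, 1.0], [1.0, -1.0, -1.0],
                 [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]]) / np.sqrt(3.0)

freqs = np.fft.rfftfreq(n_samples, 1.0 / fs)
A = np.zeros((4, freqs.size), dtype=complex)   # A-format spectra

for n in range(N):                             # accumulate one plane wave at a time
    S = np.fft.rfft(rng.standard_normal(n_samples))   # uncorrelated noise signal
    cos_psi = caps @ dirs[n]                          # incidence cosine per capsule
    gain = alpha + (1.0 - alpha) * cos_psi            # first-order directivity
    # Relative arrival time at each capsule (open-array model, no scattering)
    phase = np.exp(1j * 2.0 * np.pi * freqs[None, :] * (R * cos_psi[:, None]) / c)
    A += gain[:, None] * phase * S[None, :]

a_format = np.fft.irfft(A, n=n_samples, axis=1) / np.sqrt(N)   # (4, n_samples)
```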
2.2 Recording
Spherically isotropic noise has been rendered to a spherical loudspeaker layout consisting of 25 Genelec 8040 loudspeakers (Genelec, Iisalmi, Finland). The loudspeakers are arranged into three azimuth-equidistant 8-speaker rings at inclinations θ = [π/4, π/2, 3π/4], plus one speaker at the zenith (θ = 0). The different speaker distances to the center are delay- and gain-corrected, and the signal feeds are equalized to compensate for speaker coloration. The room has an approximate T60 of 300 ms, measured in the third-octave band centered at 1 kHz. The spherically isotropic noise has again been created following the geometrical method, encoding a number of uncorrelated noise plane waves in ambisonics with varying orders L ∈ [1, 5]. Due to practical limitations of the software, the minimum number of sources N = 256 required for an accurate sound field reconstruction (Habets and Gannot, 2010) could not be reached; instead, the analysis has been performed parametrically with N = [8, 16, 32, 64]. For each value of L and N, approximately 15 s of audio have been recorded with an Ambeo microphone located at the center of the speaker array. Ambisonics decoding uses the AllRAD method (Zotter and Frank, 2012), passing through a spherical 64-point 10-design virtual speaker layout, and includes an imaginary speaker at the nadir (θ = π). The decoding matrix uses in-phase weights.
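For reference, a small sketch of the loudspeaker directions implied by this description; the azimuth offsets within each ring are an assumption:

```python
import numpy as np

# Three azimuth-equidistant 8-speaker rings at inclinations pi/4, pi/2 and
# 3*pi/4, plus one speaker at the zenith (theta = 0): 25 loudspeakers in total.
layout = [(0.0, 0.0)]                                      # (azimuth, inclination): zenith
for incl in (np.pi / 4, np.pi / 2, 3 * np.pi / 4):
    layout += [(k * 2.0 * np.pi / 8, incl) for k in range(8)]
assert len(layout) == 25
```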
2.3 Data processing and metrics
The sampling rate of all signals is 48 kHz. All frequency-domain results have been obtained by averaging their time-frequency representations over time. Ambisonics conversion is performed using the Ambeo A-B converter AU plugin, version 1.2.1. Two error metrics are considered: the frequency-dependent squared error ε(k) and its average over frequency, the mean squared error ε̄.
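A sketch of how such metrics are typically defined, consistent with their use in Sec. 3 (the exact definitions are assumed):

\[
\varepsilon(k) = \bigl[\hat{\gamma}_{\mathrm{rec}}(k) - \hat{\gamma}_{\mathrm{sim}}(k)\bigr]^{2},
\qquad
\bar{\varepsilon} = \frac{1}{K}\sum_{k}\varepsilon(k),
\]

where γ̂_rec and γ̂_sim denote the recorded and simulated coherence estimates under comparison (the MSC in Sec. 3.1, Δ in Sec. 3.2) and K is the number of frequency bins.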
3. Results and discussion
3.1 A-Format
The coherence of the generated A-Format signals is exemplified in Fig. 1 (left), which shows the MSC between the capsule pair (BLD, BRU) for the theoretical, simulated, and recorded cases. The theoretical coherence is derived from Eq. (4), while the simulated and recorded MSCs have been computed with Welch's method, using a Hann window of 256 samples and 50% overlap. The difference between the theoretical and simulated coherence is negligible for practical applications. However, there is a noticeable difference when compared to the recorded coherence. In general, the recorded MSC follows the tendency of the simulated curve up to around 5 kHz. Above this frequency, the recorded MSC presents several spectral peaks, which might be partially explained by the interference of the microphone itself in the recorded sound field, and by the non-ideal directivity of the capsules. The squared error ε(k) with respect to the simulated curve is shown in Fig. 1 (left), while Fig. 1 (right) represents the same error averaged over frequency for different spatial resolution values of the diffuse field reproduction algorithm. As expected, ε̄ decreases with increasing values of L and N.
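A minimal sketch of this MSC estimate, using Welch's method with the stated parameters; x_bld and x_bru are placeholders for the (BLD, BRU) capsule signals, and dummy noise keeps the snippet self-contained:

```python
import numpy as np
from scipy.signal import coherence

fs = 48000
rng = np.random.default_rng(0)
x_bld = rng.standard_normal(fs * 60)   # placeholder for the BLD capsule signal
x_bru = rng.standard_normal(fs * 60)   # placeholder for the BRU capsule signal

# Welch MSC: 256-sample Hann window, 50% overlap
f, msc = coherence(x_bld, x_bru, fs=fs, window='hann', nperseg=256, noverlap=128)
```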
Fig. 1. (Color online) A-Format coherence between microphone signals. Left: MSC as a function of frequency for the theoretical, simulated, and recorded [(BLD, BRU), L = 5, N = 64] signals. Right: mean squared error ε̄ of the recorded signals' MSC (BLD, BRU) with respect to the simulated values, for all values of L and N.
3.2 B-Format
In order to evaluate the dependence of Δ on the number of time frames used for averaging, the following procedure is applied. The simulated A-Format sound field has been transformed into the spherical harmonic domain, with and without the application of the radial filters Γm(kR). Then, Δ has been computed with Eq. (7) for exponentially growing values of r between 1 (8 ms) and 2048 (10.92 s), where r is the vicinity radius used for time averaging, and the number of time windows is given by T = 2r + 1. The time-frequency representation is derived by applying the short-time Fourier transform (STFT) with the same window parameters as in Sec. 3.1.
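A minimal sketch of such a Δ estimate, assuming Δ = 1 − Ψ with Ψ computed as in the SN3D formulation sketched in Sec. 1.2 and a symmetric moving average over T = 2r + 1 STFT frames; the channel ordering, the boundary handling, and the constant 2 are assumptions tied to that convention:

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import uniform_filter1d

def bformat_coherence(b, fs=48000, r=1024):
    """Sketch of Delta(k, t) from SN3D first-order signals b of shape
    (4, n_samples), ordered [X0, X1-1, X10, X11] (assumed ordering)."""
    _, _, B = stft(b, fs=fs, window='hann', nperseg=256, noverlap=128)  # (4, F, T)
    w, v = B[0], B[1:]                                   # zeroth- and first-order channels
    intensity = np.real(np.conj(w)[None, :, :] * v)      # Re{X0* x1}, shape (3, F, T)
    energy = np.abs(w) ** 2 + np.sum(np.abs(v) ** 2, axis=0)            # (F, T)
    size = 2 * r + 1                                     # temporal averaging vicinity
    i_avg = uniform_filter1d(intensity, size=size, axis=-1, mode='nearest')
    e_avg = uniform_filter1d(energy, size=size, axis=-1, mode='nearest')
    psi = 1.0 - 2.0 * np.linalg.norm(i_avg, axis=0) / np.maximum(e_avg, 1e-12)
    return 1.0 - psi                                     # Delta = 1 - Psi (assumed)
```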
Figure 2 (left) shows the strong dependence of Δ on r. The estimated coherence tends to the theoretical values with increasing values of r. This tendency is better appreciated in Fig. 2 (right): the curve asymptotically decreases to a value Δmin ≈ 0. Another interesting observation concerns the frequency response of the curves. For all values of r, the coherence of the compensated B-Format signal [with Γm(kR)] is roughly flat up to around 7 kHz, which approximately corresponds to the operational spatial frequency range of the microphone (Gerzon, 1975). Above this value, the coherence response loses its flatness due to spatial aliasing. The response above this maximum frequency could be stabilized, if needed, by alternative diffuseness estimation methods (Politis et al., 2015). The coherence level differences along frequency are inversely proportional to r; this effect is better depicted by the standard deviation values (right). The effect of the radial filters on the coherence measurement is also shown: for a given r, the shape of the coherence is always less flat if no filters are applied. However, in this case, the coherence values are always smaller for the same r. This effect might be explained by the inter-channel coherence introduced by microphone and encoder imperfections in real scenarios (Schörkhuber and Höldrich, 2017). As a remark, the comparison between Figs. 1 and 2 provides evidence that the application of the spherical harmonic transform might yield more accurate diffuseness estimations, due to better signal conditioning (Epain et al., 2016).
Fig. 2. (Color online) Estimated B-Format coherence (Δ) of a simulated diffuse sound field, as a function of the temporal averaging vicinity radius r. Left: Δ(k) for different values of r, with (coarse) and without (fine) application of the radial filters. Right: mean and standard deviation of Δ(k) as a function of r.
Figure 3 (left) shows the estimated coherence for the recorded sound field with L = 5 and N = 64, using a vicinity radius of r = 1024 (≈5 s). The curve is centered around Δ = 0.25 and presents several spectral peaks, as in the A-Format case. It is important to note that the deviations between the coherence of the simulated and recorded sound fields are much stronger than those in Fig. 1. This effect can also be appreciated in Fig. 3 (right): the mean squared error ε̄ is around two orders of magnitude higher in B-Format. Nevertheless, similar to Fig. 1 (right), ε̄ decreases with increasing values of L and N. This behavior suggests that the deviations between the recorded and simulated coherence can to a large degree be explained by the low spatial resolution of the reproduction system; given a higher number of loudspeakers, we expect the reproduced diffuseness to tend toward the theoretical value.
Fig. 3. (Color online) B-Format coherence between microphone signals. Left: Δ of the simulated and recorded (L = 5, N = 64) signals. Right: mean squared error ε̄ of the recorded signals' coherence across all values of L and N.
4. Conclusions
The diffuseness of a sound field is an important parameter for several applications. In this work, two different diffuseness metrics have been defined and measured with a tetrahedral microphone subjected to spherically isotropic noise. The analysis shows, first, the impact of the time-averaging window length on the B-Format diffuseness estimator. This result might be useful for designing coherence estimators that are parameterized with respect to the length of the analysis window (Thiergart et al., 2011). Second, the feasibility of diffuse sound field reproduction with a spherical loudspeaker array, using ambisonic plane-wave encoding and the geometrical method, has been studied. The results suggest that this approach is viable given sufficient spatial resolution; a quantification of the impact of the number of loudspeakers remains for future work.
Sennheiser Ambeo VR Mic. https://en-us.sennheiser.com/microphone-3d-audio-ambeo-vr-mic.