Several applications in spatial audio signal processing benefit from the knowledge of the diffuseness of the sound field. In this paper, several experiments are performed to determine the response of a tetrahedral microphone array under a spherically isotropic sound field. The data were gathered with numerical simulations and real recordings using a spherical loudspeaker array. The signal analysis, performed in the microphone signal and spherical harmonic domains, reveals the characteristic coherence curves of spherical isotropic noise as a function of the frequency.

## 1. Introduction

A number of practical applications benefit from knowledge of the diffuseness of a sound field, including speech enhancement and dereverberation (Habets *et al.*, 2006), noise suppression (Ito *et al.*, 2010), source separation (Duong *et al.*, 2010), or background estimation (Stefanakis and Mouchtaris, 2015). In the field of spatial audio, diffuseness estimation is often used for parametrization (Politis *et al.*, 2018; Pulkki, 2006), direction-of-arrival estimation (Thiergart *et al.*, 2009), or source separation (Motlicek *et al.*, 2013).

In this paper, we study diffuseness estimation by subjecting a tetrahedral microphone array to spherically isotropic noise fields. The motivation for this work is twofold. First, tetrahedral arrays are a well-known type of microphone array, which has become popular for applications related to Virtual and Augmented Reality. Second, the spherical isotropic sound field is known to be a good approximation to the reverberant part of the sound field in a room (Elko, 2001; McCowan and Bourlard, 2003), and it is therefore of interest to investigate how different microphone arrays behave under such conditions.

### 1.1 Coherence analysis

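Diffuseness is commonly estimated through the *Magnitude Squared Coherence* (MSC) (Elko, 2001) between two frequency-domain signals *S*_{1} and *S*_{2}, as a function of the *wavenumber k* and the microphone distance *r*,

$$\mathrm{MSC}(kr) = \frac{\left|\langle S_1 S_2^{*} \rangle\right|^{2}}{\langle |S_1|^{2} \rangle \, \langle |S_2|^{2} \rangle}, \tag{1}$$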

where the $\langle \cdot \rangle$ operator represents the temporal expected value, and $*$ denotes the complex conjugate operator. In the case of spherical isotropic noise fields, Eq. (1) can be expressed in terms of microphone directivity patterns *T*(*ϕ*, *θ*, *kr*) as (Elko, 2001)

Moreover, the general expression of the directivity of a first-order differential microphone is given by the following relationship:

$$T_i(\psi_i) = \alpha_i + (1 - \alpha_i)\cos\psi_i, \tag{3}$$

where *i* ∈ [1, 2] is the microphone index, *ψ*_{i} is the angle between wave incidence and microphone orientation axis, and *α*_{i} ∈ [0, 1] is the directivity parameter of microphone *i*, which ranges from bidirectional (*α*_{i} = 0) to omnidirectional (*α*_{i} = 1). For first-order differential microphones, there is a closed-form expression for the numerator and denominator of Eq. (2),

where $x_i = \cos(\phi_i)\sin(\theta_i)$; $y_i = \sin(\phi_i)\sin(\theta_i)$; $z_i = \cos(\theta_i)$ refers to the wave incidence angle *ψ*_{i} expressed in spherical coordinates (with azimuth *ϕ* and inclination *θ*).
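As a numerical cross-check of the coherence expressions discussed in this section, the following sketch estimates the MSC of two first-order microphones in a spherically isotropic field by averaging over random plane-wave directions. It assumes that the directivity takes the standard first-order form *α* + (1 − *α*) cos *ψ*, and that the inter-capsule phase is exp(−j *kr* **u**·**d**) for a unit baseline vector **d**. The function names, the Monte Carlo sampling, and the number of directions are illustrative choices, not taken from the paper.

```python
import numpy as np


def first_order_pattern(alpha, cos_psi):
    """First-order directivity (assumed form): alpha=1 omni, alpha=0 bidirectional."""
    return alpha + (1.0 - alpha) * cos_psi


def isotropic_msc(kr, alpha1, alpha2, axis1, axis2, baseline,
                  n_dirs=200_000, seed=0):
    """Monte Carlo estimate of the MSC between two first-order microphones.

    axis1, axis2: unit vectors of the microphone orientation axes.
    baseline: unit vector joining the two microphones.
    kr: wavenumber times microphone distance.
    """
    rng = np.random.default_rng(seed)
    # Uniformly distributed directions on the sphere (normalized Gaussian vectors).
    u = rng.normal(size=(n_dirs, 3))
    u /= np.linalg.norm(u, axis=1, keepdims=True)

    t1 = first_order_pattern(alpha1, u @ np.asarray(axis1, dtype=float))
    t2 = first_order_pattern(alpha2, u @ np.asarray(axis2, dtype=float))
    phase = np.exp(-1j * kr * (u @ np.asarray(baseline, dtype=float)))

    numerator = np.abs(np.mean(t1 * t2 * phase)) ** 2
    denominator = np.mean(t1 ** 2) * np.mean(t2 ** 2)
    return numerator / denominator


# Two omnidirectional capsules: the estimate should approach (sin(kr)/kr)^2.
kr = 2.0
estimate = isotropic_msc(kr, 1.0, 1.0, [1, 0, 0], [1, 0, 0], [0, 0, 1])
reference = (np.sin(kr) / kr) ** 2
```

For omnidirectional capsules the estimate can be compared against the well-known (sin *kr* / *kr*)² curve, which is a convenient sanity check before trying other values of *α*.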

### 1.2 Diffuseness estimation in ambisonics

Let us consider a sound field captured with a spherical microphone array, which contains *Q* microphones distributed around a spherical surface of radius *R* at the angular positions given by the azimuth-inclination pairs Ω_{q} = (*ϕ*_{q}, *θ*_{q}). The captured frequency-domain signals *X*_{q}(*k*) can be represented as the spherical harmonic domain signals *X*_{mn}(*k*) through the spherical harmonic transform of order *L* (Moreau *et al.*, 2006),

where *Y*_{mn}(Ω_{q}) are the *real-valued spherical harmonics*, and Γ_{m}(*kR*) are the *radial filters* or equalization terms of order *m*, with *m* ∈ [0, *L*] and *n* ∈ [−*m*, *m*].
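For a first-order tetrahedral array, a spherical harmonic transform of this kind can be sketched as a matrix product between the capsule spectra and the spherical harmonics sampled at the capsule directions. The sketch below is an illustration only. It assumes SN3D-normalized real harmonics in the order (0,0), (1,−1), (1,0), (1,1), equal quadrature weights 1/*Q*, and ideal capsule directions labeled FLU, FRD, BLD, BRU; it leaves out the radial filters Γ_{m}(*kR*) altogether. All names are hypothetical.

```python
import numpy as np

# Assumed ideal tetrahedral capsule directions, ordered FLU, FRD, BLD, BRU.
ELEVATION = np.arctan(1.0 / np.sqrt(2.0))  # about 35.26 degrees
AZIMUTH = np.radians([45.0, -45.0, 135.0, -135.0])
INCLINATION = np.pi / 2 - np.array([ELEVATION, -ELEVATION, -ELEVATION, ELEVATION])


def real_sh_first_order(azimuth, inclination):
    """SN3D real spherical harmonics up to order 1, shape (Q, 4).

    Column order: (m, n) = (0, 0), (1, -1), (1, 0), (1, 1).
    """
    azimuth = np.asarray(azimuth, dtype=float)
    inclination = np.asarray(inclination, dtype=float)
    return np.stack(
        [
            np.ones_like(azimuth),
            np.sin(azimuth) * np.sin(inclination),
            np.cos(inclination),
            np.cos(azimuth) * np.sin(inclination),
        ],
        axis=-1,
    )


def encode_first_order(a_format):
    """Map capsule spectra of shape (..., Q) to harmonic signals of shape (..., 4).

    Uses equal weights 1/Q and no radial filters (both are assumptions).
    """
    y = real_sh_first_order(AZIMUTH, INCLINATION)
    return np.asarray(a_format) @ y / y.shape[0]


# Four identical capsule signals only excite the zeroth-order component.
b_format = encode_first_order(np.ones(4))
```

Because the tetrahedral directions sample the first-order harmonics orthogonally, the equal-weight sum is enough at this order; higher orders or irregular layouts would need proper quadrature weights or a least-squares fit.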

For a number of practical reasons, it is desirable to distribute the microphone capsules uniformly over the sphere, with the regular tetrahedron being the simplest possible configuration (Gerzon, 1975). Capsule signals recorded with such a topology are known as *A-Format* signals. Conversely, the term *B-Format* (*ambisonics*) describes the result of applying Eq. (5) (*ambisonic encoding*) to the *A-Format* signals. One of the most common coherence estimators for first-order ambisonic frequency-domain signals *X*_{mn}(*k*) is the *diffuseness* Ψ as defined in *DirAC* (Pulkki, 2006),

where *X*_{0}(*k*) = *X*_{00}(*k*) and $\mathbf{X}_1(k) = [X_{1,-1}(k), X_{1,0}(k), X_{1,1}(k)]^{\intercal}$ are *SN3D*-normalized. For the sake of clarity, we will further define the *B-Format coherence* estimator Δ as,

Under spherical isotropic noise, the theoretical coherence between any pair of zeroth and first order ambisonic virtual microphones is equal to 0 for all frequencies, due to the orthogonality and symmetry of the spherical harmonics (Elko, 2001). This result can also be verified with Eq. (4). However, there are several practical factors that might corrupt the coherence estimation, such as the approximation of the temporal expectation by time averaging (Thiergart *et al.*, 2011) in Eq. (6), or the non-ideal implementation of the radial filters Γ_{m}(*kR*) (Schörkhuber and Höldrich, 2017). In Secs. 2.1–2.3, we present several experiments that illustrate the behavior of different coherence estimators applied to the signals captured with a tetrahedral microphone subjected to spherical isotropic noise, using both simulated and real sound recordings.
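To make the role of time averaging concrete, the following sketch computes a DirAC-style diffuseness from first-order STFT signals, replacing the temporal expectation with a moving average over a fixed number of frames. The normalization used here, Ψ = 1 − 2‖⟨Re{*X*_{0}\* **X**_{1}}⟩‖ / ⟨|*X*_{0}|² + ‖**X**_{1}‖²⟩, is an assumption chosen so that a single SN3D-encoded plane wave gives Ψ = 0; it may differ from the exact form of Eq. (6). The function names and the `n_frames` parameter are hypothetical.

```python
import numpy as np


def moving_average(a, n_frames):
    """Average n_frames consecutive STFT frames along axis 0 (valid part only)."""
    kernel = np.ones(n_frames) / n_frames
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 0, a
    )


def dirac_diffuseness(x0, x1, n_frames):
    """DirAC-style diffuseness under the normalization assumed in the lead-in.

    x0: complex array (frames, bins), zeroth-order signal.
    x1: complex array (3, frames, bins), first-order signals.
    Returns an array of shape (frames - n_frames + 1, bins).
    """
    intensity = np.real(np.conj(x0)[None, :, :] * x1)         # (3, frames, bins)
    energy = np.abs(x0) ** 2 + np.sum(np.abs(x1) ** 2, axis=0)  # (frames, bins)

    mean_intensity = np.stack([moving_average(c, n_frames) for c in intensity])
    mean_energy = moving_average(energy, n_frames)

    # Small constant avoids division by zero in silent bins.
    return 1.0 - 2.0 * np.linalg.norm(mean_intensity, axis=0) / (mean_energy + 1e-12)


# A single plane wave from direction u should give a diffuseness close to 0.
rng = np.random.default_rng(0)
s = rng.normal(size=(200, 5)) + 1j * rng.normal(size=(200, 5))
u = np.array([0.6, 0.0, 0.8])
psi_plane_wave = dirac_diffuseness(s, u[:, None, None] * s, n_frames=9)
```

With mutually uncorrelated channels the same function returns values that approach 1 only as `n_frames` grows, which is the finite-averaging effect discussed above.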

## 2. Methods

### 2.1 Simulation

Spherical isotropic noise has been generated following the *geometrical method* (Habets and Gannot, 2007, 2010), using *N* = 1024 plane waves. The resulting *A-Format* signals correspond to a virtual tetrahedral microphone array mimicking the Ambeo^{1} characteristics (*R* = 0.015 m, *α* = 0.5). The generated audio has a duration of 60 s.
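The simulation setup can be sketched in the frequency domain as a sum of uncorrelated plane waves observed by four first-order capsules. This is a simplified illustration rather than the cited geometrical method itself. It assumes a Fibonacci spiral for the plane-wave directions, a speed of sound of 343 m/s, capsules that point radially outward from the array center, the first-order directivity *α* + (1 − *α*) cos *ψ*, and independent complex Gaussian amplitudes per frame and frequency. All function names are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed

# Assumed capsule unit vectors, ordered FLU, FRD, BLD, BRU.
CAPSULES = np.array(
    [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float
) / np.sqrt(3.0)


def fibonacci_sphere(n):
    """Nearly uniform set of n unit vectors on the sphere."""
    i = np.arange(n) + 0.5
    z = 1.0 - 2.0 * i / n
    phi = np.pi * (1.0 + np.sqrt(5.0)) * i
    s = np.sqrt(1.0 - z ** 2)
    return np.stack([s * np.cos(phi), s * np.sin(phi), z], axis=1)


def simulate_a_format(freqs, n_frames, n_waves=1024, radius=0.015,
                      alpha=0.5, seed=0):
    """Return complex A-Format spectra of shape (n_frames, len(freqs), 4)."""
    rng = np.random.default_rng(seed)
    u = fibonacci_sphere(n_waves)                    # (N, 3) arrival directions
    cos_psi = u @ CAPSULES.T                         # (N, 4) incidence cosines
    gain = alpha + (1.0 - alpha) * cos_psi           # capsule directivity
    k = 2.0 * np.pi * np.asarray(freqs, dtype=float) / SPEED_OF_SOUND
    # Phase of each plane wave at each capsule position (radius * capsule axis).
    steering = gain[None] * np.exp(1j * k[:, None, None] * radius * cos_psi[None])

    shape = (n_frames, len(k), n_waves)
    amplitudes = (rng.normal(size=shape) + 1j * rng.normal(size=shape))
    amplitudes /= np.sqrt(2.0 * n_waves)             # unit total power per bin
    return np.einsum("tfn,fnq->tfq", amplitudes, steering)


a_format = simulate_a_format(freqs=[50.0, 1000.0], n_frames=500, n_waves=256)
```

The returned array plays the role of an STFT of the four capsule signals, so coherence between any capsule pair can be estimated by averaging cross- and auto-spectra over the frame axis.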

### 2.2 Recording

Spherical isotropic noise has been rendered to a spherical loudspeaker layout with 25 *Genelec 8040* loudspeakers (Iisalmi, Finland). The loudspeakers are arranged into three azimuth-equidistant 8-speaker rings at inclinations *θ* = [*π*/4, *π*/2, 3*π*/4], plus one speaker at the zenith (*θ* = 0). The different speaker distances to the center are delay- and gain-corrected, and the signal feeds are equalized to compensate for speaker coloration. The room has an approximate *T*_{60} of 300 ms measured at the 1 kHz third-octave band. The spherical isotropic noise has again been created following the *geometrical method*, encoding a number of uncorrelated noise plane waves in ambisonics with varying orders *L* ∈ [1, 5]. Due to practical limitations related to the software, the minimum number of sources *N* = 256 for an accurate sound field reconstruction (Habets and Gannot, 2010) could not be reached; instead, the analysis has been performed parametrically with *N* = [8, 16, 32, 64]. For each value of *L* and *N*, approximately 15 s of audio have been recorded with an Ambeo microphone located at the center of the speaker array. Ambisonics decoding uses the AllRAD method (Zotter and Frank, 2012), passing through a spherical 64-point 10-design virtual speaker layout, and includes an imaginary speaker at the nadir (*θ* = *π*). The decoding matrix uses *in-phase* weights.

### 2.3 Data processing and metrics

The sampling rate of all signals is 48 kHz. All frequency-domain results have been obtained by averaging their time-frequency representations over time. Ambisonics conversion is performed using the *Ambeo A-B converter* AU plugin, version 1.2.1. Two error metrics are considered: the frequency-dependent squared error *ε*(*k*), and the mean squared error $\bar{\epsilon}$,

## 3. Results and discussion

### 3.1 A-Format

The coherence of the generated *A-Format* signals is exemplified in Fig. 1 (left), which shows the *MSC* between the capsule pair (*BLD*, *BRU*) for the theoretical, simulated, and recorded cases. The theoretical coherence is derived from Eq. (4), while the simulated and recorded MSCs have been computed by Welch's method, using a *Hann* window of 256 samples and 50% overlap. The difference between theoretical and simulated coherence is negligible for practical applications. However, there is a noticeable difference when compared to the recorded coherence. In general, the recorded MSC follows the tendency of the simulated curve up to around 5 kHz. Above this frequency, the recorded *MSC* presents several spectral peaks, which might be partially explained by the interference of the microphone itself in the recorded sound field, and by the non-ideal directivity of the capsules. The squared error *ε*(*k*) with respect to the simulated curve is shown in Fig. 1 (left), while Fig. 1 (right) represents the same error averaged over frequency, $\bar{\epsilon}$, for different spatial resolution values of the diffuse field reproduction algorithm. As expected, $\bar{\epsilon}$ decreases with increasing values of *L* and *N*.
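The Welch-based MSC estimate described above can be reproduced along the following lines with SciPy's `scipy.signal.coherence`, which returns the magnitude squared coherence directly. The window length, overlap, and sampling rate follow the text; the helper names are hypothetical, and the error function assumes that the squared error is the squared difference between two coherence curves and that the mean squared error is its average over frequency bins.

```python
import numpy as np
from scipy.signal import coherence


def welch_msc(x, y, fs=48000):
    """MSC between two time-domain signals via Welch's method.

    Hann window of 256 samples with 50% overlap (128 samples).
    Returns (frequencies in Hz, MSC per frequency bin).
    """
    return coherence(x, y, fs=fs, window="hann", nperseg=256, noverlap=128)


def squared_error(estimate, reference):
    """Per-bin squared error and its mean over frequency (assumed definitions)."""
    err = (np.asarray(estimate) - np.asarray(reference)) ** 2
    return err, float(np.mean(err))


# Two independent noise signals: the estimated MSC should stay close to 0.
rng = np.random.default_rng(0)
x = rng.normal(size=48000)
y = rng.normal(size=48000)
freqs, msc_xy = welch_msc(x, y)
err, mse = squared_error(msc_xy, np.zeros_like(msc_xy))
```

In practice `x` and `y` would be two capsule signals such as *BLD* and *BRU*, and `reference` the simulated or theoretical curve sampled at the same frequency bins.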

### 3.2 B-Format

In order to evaluate the dependency of Δ on the number of time frames used for averaging, the following procedure is used. The simulated *A-Format* sound field has been transformed into the spherical harmonic domain, with and without the application of radial filters Γ_{m}(*kR*). Then, Δ has been computed with Eq. (7) for exponentially growing values of *r* between 1 (8 ms) and 2048 (10.92 s), where *r* is the vicinity radius used for time averaging, and the number of time windows is given by *T* = 2*r* + 1. The time-frequency representation is derived by applying the short-time Fourier transform (STFT) with the same window parameters as in Sec. 3.1.
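The durations quoted for the vicinity radius can be related to the STFT parameters with a short calculation. The sketch below assumes that the averaging span is counted as *T* = 2*r* + 1 hops of 128 samples (a 256-sample window with 50% overlap) at 48 kHz; under that assumption it reproduces the 8 ms figure exactly and the 10.92 s figure to within rounding. The function name is hypothetical.

```python
def averaging_span_seconds(r, hop=128, fs=48000):
    """Time span covered by T = 2r + 1 STFT frames, counted in hops."""
    n_windows = 2 * r + 1
    return n_windows * hop / fs


# Exponentially growing vicinity radii, from 1 to 2048.
radii = [2 ** p for p in range(12)]
spans = {r: averaging_span_seconds(r) for r in radii}
```

The same relation gives roughly 5.5 s for *r* = 1024, which is useful when choosing an averaging length for recordings of limited duration.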

Figure 2 (left) shows the strong dependence of Δ on *r*. The estimated coherence tends to the theoretical values with increasing values of *r*. This tendency is more clearly visible in Fig. 2 (right): the curve asymptotically decreases to a value Δ_{min} ≈ 0. Another interesting observation comes from the frequency response of the curves. For all values of *r*, the coherence of the compensated *B-Format* signal [with Γ_{m}(*kR*)] is roughly flat up to around 7 kHz, which approximately corresponds to the operational spatial frequency range of the microphone (Gerzon, 1975). Above this value, the coherence response loses its flatness due to spatial aliasing. The response above the maximum frequency could be stabilized, if needed, by alternative diffuseness estimation methods (Politis *et al.*, 2015). The coherence level differences along frequency are inversely proportional to *r*; the effect is better depicted by the standard deviation values (right). The effect of the radial filters on the coherence measurement is also shown: for a given *r*, the shape of the coherence is always less flat if no filters are applied. Conversely, in this case, coherence values are always smaller for the same *r*. This effect might be explained by taking into account the inter-channel coherence introduced by microphone and encoder imperfections in real scenarios (Schörkhuber and Höldrich, 2017). As a remark, the comparison between Figs. 1 and 2 provides evidence that the application of the spherical harmonic transform might be able to yield more accurate diffuseness estimations, due to better signal conditioning (Epain *et al.*, 2016).

Figure 3 (left) shows the estimated coherence for the recorded sound field with *L* = 5 and *N* = 64, using a vicinity radius of *r* = 1024 (≈5 s). The curve is centered around Δ = 0.25 and presents several spectral peaks, as in the *A-Format* case. It is important to note that the deviations between the coherence of the simulated and the recorded sound fields are much stronger than those of Fig. 1. This effect can also be seen in Fig. 3 (right): the mean squared error is around 2 orders of magnitude higher in *B-Format*. Nevertheless, as in Fig. 1 (right), $\bar{\epsilon}$ decreases with increasing values of *L* and *N*. This behavior suggests that the deviations between the recorded and the simulated coherence can be explained to a large degree by the low spatial resolution of the reproduction system; given a higher number of loudspeakers, we expect that the reproduced diffuseness will tend to the theoretical expression.

## 4. Conclusions

The diffuseness of a sound field is an important parameter for several applications. In this work, two different metrics of diffuseness have been defined and measured with a tetrahedral microphone subjected to spherical isotropic noise. The analysis shows, first, the impact of the time-averaging window length on the *B-Format* diffuseness estimator. This result might be useful for designing coherence estimators that are parameterized with respect to the length of the analysis window (Thiergart *et al.*, 2011). Second, the feasibility of diffuse sound field reproduction by a spherical loudspeaker array using ambisonics plane-wave encoding and the *geometrical method* is studied. Results suggest that this approach is viable, given a sufficient spatial resolution; a quantification of the impact of the number of loudspeakers remains for future work.

^{1} Sennheiser Ambeo VR Mic. https://en-us.sennheiser.com/microphone-3d-audio-ambeo-vr-mic.

## References and links

*Audio Engineering Society Convention 50*, Audio Engineering Society.

*Audio Engineering Society Convention 127*, Audio Engineering Society.