Several applications in spatial audio signal processing benefit from knowledge of the diffuseness of the sound field. In this paper, several experiments are performed to determine the response of a tetrahedral microphone array under a spherically isotropic sound field. The data were gathered with numerical simulations and real recordings using a spherical loudspeaker array. The signal analysis, performed in both the microphone signal and spherical harmonic domains, reveals the characteristic coherence curves of spherical isotropic noise as a function of frequency.

A number of practical applications benefit from knowledge of the diffuseness of a sound field, including speech enhancement and dereverberation (Habets et al., 2006), noise suppression (Ito et al., 2010), source separation (Duong et al., 2010), and background estimation (Stefanakis and Mouchtaris, 2015). In the field of spatial audio, diffuseness estimation is often used for parametrization (Politis et al., 2018; Pulkki, 2006), Direction-of-Arrival estimation (Thiergart et al., 2009), and source separation (Motlicek et al., 2013).

In this paper, we study diffuseness estimation by subjecting a tetrahedral microphone array to spherically isotropic noise fields. The motivation for this work is, first, that tetrahedral arrays are a well-known type of microphone array that has become popular for applications related to Virtual and Augmented Reality. Second, the spherically isotropic sound field is known to be a good approximation of the reverberant part of the sound field in a room (Elko, 2001; McCowan and Bourlard, 2003), and it is therefore interesting to investigate how different microphone arrays behave under such conditions.

Diffuseness is commonly estimated through the Magnitude Squared Coherence (MSC) (Elko, 2001) between two frequency-domain signals S1 and S2, as a function of the wavenumber k and the microphone distance r,

\[
\mathrm{MSC}_{12}(kr) = \frac{\left|\left\langle S_1(kr)\, S_2^{*}(kr)\right\rangle\right|^2}{\left\langle |S_1(kr)|^2\right\rangle \left\langle |S_2(kr)|^2\right\rangle},
\]
(1)

where the ⟨·⟩ operator represents the temporal expected value, and (·)* denotes the complex conjugate. In the case of spherically isotropic noise fields, Eq. (1) can be expressed in terms of the microphone directivity patterns T(ϕ, θ, kr) as (Elko, 2001)

\[
\mathrm{MSC}_{12}(kr) = \frac{|N_{12}(kr)|^2}{|D_{12}(kr)|^2}
= \frac{\left| \int_0^{2\pi}\!\!\int_0^{\pi} T_1(\phi,\theta,kr)\, T_2^{*}(\phi,\theta,kr)\, e^{jkr\cos\theta} \sin\theta \, d\theta \, d\phi \right|^2}
{\int_0^{2\pi}\!\!\int_0^{\pi} |T_1(\phi,\theta,kr)|^2 \sin\theta \, d\theta \, d\phi \;\; \int_0^{2\pi}\!\!\int_0^{\pi} |T_2(\phi,\theta,kr)|^2 \sin\theta \, d\theta \, d\phi}.
\]
(2)

Moreover, the general expression of the directivity of a first-order differential microphone is given by the following relationship:

\[
T_i(\psi_i) = \alpha_i + (1 - \alpha_i)\cos\psi_i,
\]
(3)

where i ∈ [1, 2] is the microphone index, ψi is the angle between wave incidence and microphone orientation axis, and αi ∈ [0, 1] is the directivity parameter of the microphone i, which ranges from bidirectional (αi = 0) to omnidirectional (αi = 1). For first-order differential microphones, there is a closed-form expression for the numerator and denominator of Eq. (2),

\[
\begin{aligned}
N_{12}(kr) ={}& \alpha_1\alpha_2\,\frac{\sin(kr)}{kr}
+ (1-\alpha_1)(1-\alpha_2)\,\frac{x_1 x_2 + y_1 y_2}{(kr)^3}\,\bigl(\sin(kr) - kr\cos(kr)\bigr) \\
&+ \frac{z_1 z_2}{(kr)^3}\Bigl[\bigl((kr)^2\sin(kr) + 2kr\cos(kr)\bigr)(1-\alpha_1)(1-\alpha_2) - 2\sin(kr)(1-\alpha_1)(1-\alpha_2)\Bigr] \\
&+ \frac{z_1}{(kr)^3}\Bigl[\, j(kr)^2\alpha_2\cos(kr)(\alpha_1 - 1) + jkr\,\alpha_2\sin(kr)(1-\alpha_1)\Bigr] \\
&+ \frac{z_2}{(kr)^3}\Bigl[\, j(kr)^2\alpha_1\cos(kr)(\alpha_2 - 1) + jkr\,\alpha_1\sin(kr)(1-\alpha_2)\Bigr], \\
D_{12}(kr) ={}& \frac{\sqrt{3\alpha_1^2 + (1-\alpha_1)^2}\,\sqrt{3\alpha_2^2 + (1-\alpha_2)^2}}{3},
\end{aligned}
\]
(4)

where xi = cos(ϕi)sin(θi), yi = sin(ϕi)sin(θi), and zi = cos(θi) express the orientation axis of microphone i in Cartesian coordinates (with azimuth ϕi and inclination θi), which determines the wave incidence angle ψi.
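As a sanity check, the closed-form expression above can be transcribed directly into code. The sketch below is our own (the function name and the orientation-vector convention are assumptions, not part of the original text); it evaluates the theoretical MSC of Eq. (4) for a pair of first-order capsules:

```python
import numpy as np

def msc_first_order(kr, a1, a2, n1, n2):
    """Theoretical MSC of two first-order microphones under spherically
    isotropic noise, transcribed from Eq. (4).
    a1, a2: directivity parameters in [0, 1];
    n1, n2: (x, y, z) orientation axes of the two microphones."""
    x1, y1, z1 = n1
    x2, y2, z2 = n2
    s, c = np.sin(kr), np.cos(kr)
    N = (a1 * a2 * s / kr
         + (1 - a1) * (1 - a2) * (x1 * x2 + y1 * y2) / kr**3 * (s - kr * c)
         + z1 * z2 / kr**3 * ((kr**2 * s + 2 * kr * c) * (1 - a1) * (1 - a2)
                              - 2 * s * (1 - a1) * (1 - a2))
         + z1 / kr**3 * (1j * kr**2 * a2 * c * (a1 - 1)
                         + 1j * kr * a2 * s * (1 - a1))
         + z2 / kr**3 * (1j * kr**2 * a1 * c * (a2 - 1)
                         + 1j * kr * a1 * s * (1 - a2)))
    D = (np.sqrt(3 * a1**2 + (1 - a1)**2)
         * np.sqrt(3 * a2**2 + (1 - a2)**2) / 3)
    return np.abs(N)**2 / np.abs(D)**2

# Two omnidirectional capsules recover the classical sinc^2 coherence,
# and two nearly coincident parallel dipoles are fully coherent.
print(msc_first_order(0.1, 1.0, 1.0, (0, 0, 1), (0, 0, 1)))   # ~ (sin(kr)/kr)^2
print(msc_first_order(0.05, 0.0, 0.0, (0, 0, 1), (0, 0, 1)))  # ~ 1
```

For α1 = α2 = 1, all terms but the first vanish and the expression reduces to the familiar sinc² curve of two omnidirectional sensors in a diffuse field, which provides a quick consistency check of the transcription.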

Let us consider a sound field captured with a spherical microphone array, which contains Q microphones distributed on a spherical surface of radius R at the angular positions given by the azimuth-inclination pairs Ωq = (ϕq, θq). The captured frequency-domain signals Xq(k) can be represented as the spherical harmonic domain signals Xmn(k) through the spherical harmonic transform of order L (Moreau et al., 2006),

\[
X_{mn}(k) = \sum_{q=1}^{Q} X_q(k)\, Y_{mn}(\Omega_q)\, \Gamma_m(kR),
\]
(5)

where Ymn(Ωq) are the real-valued spherical harmonics, and Γm(kR) are the radial filters or equalization terms of order m, with m ∈ [0, L] and n ∈ [−m, m].
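For a first-order array, Eq. (5) reduces to a small matrix multiplication. The sketch below is our own illustration, assuming a hypothetical regular-tetrahedron capsule layout and setting the radial filters Γm to unity for simplicity:

```python
import numpy as np

def real_sh_first_order(azi, incl):
    """Real-valued SN3D spherical harmonics up to order 1 at the given
    (azimuth, inclination) angles, in ACN order [Y00, Y1-1, Y10, Y11]."""
    return np.array([np.ones_like(azi),
                     np.sin(azi) * np.sin(incl),
                     np.cos(incl),
                     np.cos(azi) * np.sin(incl)])

# Hypothetical regular-tetrahedron capsule directions (unit vectors).
dirs = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]) / np.sqrt(3)
azi = np.arctan2(dirs[:, 1], dirs[:, 0])
incl = np.arccos(dirs[:, 2])

# Encoding matrix of Eq. (5), with the radial filters set to unity.
Y = real_sh_first_order(azi, incl)   # shape: (4 harmonics, Q = 4 capsules)

def encode(x_aformat):
    """x_aformat: (Q, n_bins) capsule spectra -> (4, n_bins) B-Format."""
    return Y @ x_aformat
```

As a check of the layout symmetry, a signal that is identical on all four capsules encodes entirely into the zeroth-order component, with the first-order components summing to zero.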

For a number of practical reasons, it is desirable to distribute the microphone capsules uniformly on the sphere, with the regular tetrahedron being the simplest possible configuration (Gerzon, 1975). Capsule signals recorded with such a topology are referred to as A-Format signals. In turn, the term B-Format (ambisonics) describes the result of applying Eq. (5) (ambisonic encoding) to the A-Format signals. One of the most common coherence estimators for first-order ambisonic frequency-domain signals Xmn(k) is the diffuseness Ψ as defined in DirAC (Pulkki, 2006),

\[
\Psi(k) = 1 - \frac{2\,\bigl\lVert \langle X_1(k)\, X_0^{*}(k) \rangle \bigr\rVert}{\bigl\langle \lVert X_1(k) \rVert^2 + |X_0(k)|^2 \bigr\rangle},
\]
(6)

where X0(k) = X00(k) and X1(k) = [X1−1(k), X10(k), X11(k)] are SN3D-normalized. For the sake of clarity, we further define the B-Format coherence estimator Δ as

\[
\Delta(k) = 1 - \Psi(k).
\]
(7)
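The two estimators above can be sketched as follows, with the temporal expectation replaced by an average over STFT frames. The function names and the test signal are ours; the normalization follows the SN3D convention stated in the text:

```python
import numpy as np

def diffuseness(X0, X1):
    """DirAC diffuseness, Eq. (6). X0: (n_frames,) zeroth-order signal at a
    single frequency bin; X1: (n_frames, 3) first-order SN3D signals.
    The expectation is approximated by averaging over the frames."""
    num = np.linalg.norm(np.mean(X1 * np.conj(X0)[:, None], axis=0))
    den = np.mean(np.sum(np.abs(X1)**2, axis=1) + np.abs(X0)**2)
    return 1.0 - 2.0 * num / den

def bformat_coherence(X0, X1):
    """B-Format coherence estimator, Eq. (7)."""
    return 1.0 - diffuseness(X0, X1)

# A single plane wave is fully coherent: Psi ~ 0, hence Delta ~ 1.
rng = np.random.default_rng(0)
s = rng.standard_normal(4096) + 1j * rng.standard_normal(4096)
u = np.array([0.6, 0.8, 0.0])          # unit incidence direction
print(diffuseness(s, np.outer(s, u)))  # close to 0
```

Conversely, mutually uncorrelated channels (the ideal diffuse case) drive the numerator toward zero and Ψ toward 1.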

Under spherical isotropic noise, the theoretical coherence between any pair of zeroth- and first-order ambisonic virtual microphones is equal to 0 for all frequencies, due to the orthogonality and symmetry of the spherical harmonics (Elko, 2001). This result can also be verified from Eq. (4). However, several practical factors might corrupt the coherence estimation, such as the approximation of the temporal expectation by time averaging (Thiergart et al., 2011) in Eq. (6), or the non-ideal implementation of the radial filters Γm(kR) (Schörkhuber and Höldrich, 2017). In Secs. 2.1–2.3, we present several experiments that illustrate the behavior of different coherence estimators applied to the signals captured with a tetrahedral microphone array subjected to spherical isotropic noise, using both simulated and real sound recordings.

Spherical isotropic noise has been generated following the geometrical method (Habets and Gannot, 2007, 2010), using N = 1024 plane waves. The resulting A-Format signals correspond to a virtual tetrahedral microphone array mimicking the Ambeo characteristics (R = 0.015 m, α = 0.5). The generated audio has a duration of 60 s.
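The core idea of the geometrical method, superposing N uncorrelated, equal-power plane waves from quasi-uniformly distributed directions, can be illustrated by checking that such a direction set reproduces the ideal diffuse-field coherence between two omnidirectional capsules. The Fibonacci grid below is our stand-in for the method's actual direction sampling:

```python
import numpy as np

def fibonacci_sphere(n):
    """Quasi-uniform unit vectors on the sphere (a stand-in for the
    direction grid used by the geometrical method)."""
    i = np.arange(n)
    z = 1 - (2 * i + 1) / n
    phi = np.pi * (1 + 5**0.5) * i
    r = np.sqrt(1 - z**2)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

# For N uncorrelated, equal-power plane waves, the expected normalized
# cross-spectrum of two omni capsules separated by d along z is
# (1/N) * sum_n exp(j*k*d*z_n), which should approach the ideal
# diffuse-field value sinc(kd) = sin(kd)/(kd).
N = 1024
u = fibonacci_sphere(N)
k, d = 100.0, 0.02                           # wavenumber (rad/m), spacing (m)
coh = np.mean(np.exp(1j * k * d * u[:, 2]))  # separation along z
print(abs(coh - np.sin(k * d) / (k * d)))    # small
```

With N = 1024 directions the deviation from the analytic sinc curve is negligible, which is consistent with the accuracy requirements discussed by Habets and Gannot (2010).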

Spherical isotropic noise has been rendered to a spherical loudspeaker layout of 25 Genelec 8040 loudspeakers (Genelec, Iisalmi, Finland). The loudspeakers are arranged into three azimuth-equidistant 8-speaker rings at inclinations θ = [π/4, π/2, 3π/4], plus one speaker at the zenith (θ = 0). The different speaker distances to the center are delay- and gain-corrected, and the signal feeds are equalized to compensate for speaker coloration. The room has an approximate T60 of 300 ms, measured at the 1 kHz third-octave band. The spherical isotropic noise has again been created following the geometrical method, encoding a number of uncorrelated noise plane waves in ambisonics with varying orders L ∈ [1, 5]. Due to practical limitations related to the software, the minimum number of sources N = 256 for an accurate sound field reconstruction (Habets and Gannot, 2010) could not be reached; instead, the analysis has been performed parametrically with N = [8, 16, 32, 64]. For each value of L and N, approximately 15 s of audio have been recorded with an Ambeo microphone located at the center of the speaker array. Ambisonics decoding uses the AllRAD method (Zotter and Frank, 2012), passing through a spherical 64-point 10-design virtual speaker layout, and includes an imaginary speaker at the nadir (θ = π). The decoding matrix uses in-phase weights.

The sampling rate of all signals is 48 kHz. All frequency-domain results have been obtained by averaging their time-frequency representations over time. Ambisonics conversion is performed using the Ambeo A-B converter AU plugin, version 1.2.1. Two error metrics are considered: the frequency-dependent squared error ε(k) and the mean squared error ε̄,

\[
\varepsilon(k) = |X_1(k) - X_2(k)|^2; \qquad \bar{\varepsilon} = \frac{1}{K}\sum_{k=1}^{K} |X_1(k) - X_2(k)|^2.
\]
(8)

The coherence of the generated A-Format signals is exemplified in Fig. 1 (left), which shows the MSC between the capsule pair (BLD, BRU) for the theoretical, simulated, and recorded cases. The theoretical coherence is derived from Eq. (4), while simulated and recorded MSCs have been computed by Welch's method, using a Hanning window of 256 samples and 1/2 overlap. The difference between theoretical and simulated coherence is negligible for practical applications. However, there is a noticeable difference when compared to the recorded coherence. In general, the recorded MSC follows the tendency of the simulated curve up to around 5 kHz. Above this frequency, the recorded MSC presents several spectral peaks, which might be partially explained by the interference of the microphone itself in the recorded sound field, and by the non-ideal directivity of the capsules. The squared error ε(k) with respect to the simulated curve is shown in Fig. 1 (left), while Fig. 1 (right) represents the same error averaged over frequency ε¯ for different spatial resolution values of the diffuse field reproduction algorithm. As expected, ε¯ decreases with increasing values of L and N.
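The MSC computation described above can be sketched with SciPy's coherence routine, using the same Welch parameters (256-sample Hann window, 1/2 overlap). The capsule signals below are synthetic stand-ins for the recorded (BLD, BRU) pair, built as a shared component plus independent noise:

```python
import numpy as np
from scipy.signal import coherence

fs = 48000
rng = np.random.default_rng(1)

# Stand-in capsule signals: a common component plus independent noise,
# in place of the recorded (BLD, BRU) pair.
common = rng.standard_normal(fs)            # 1 s of noise
x1 = common + 0.5 * rng.standard_normal(fs)
x2 = common + 0.5 * rng.standard_normal(fs)

# Welch's method with the parameters used in the text:
# 256-sample Hann window, 1/2 overlap.
f, msc = coherence(x1, x2, fs=fs, window='hann',
                   nperseg=256, noverlap=128)
print(f.shape, float(msc.mean()))
```

For this construction the true MSC is flat at (1/1.25)² = 0.64 for all frequencies, so the averaged estimate should land near that value; the same routine applied to the A-Format capsule pair yields the curves of Fig. 1.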

Fig. 1.

(Color online) A-Format coherence between microphone signals. Left: MSC as a function of the frequency of theoretical, simulated and recorded [(BLD, BRU), L = 5, N = 64] signals. Right: mean error ε¯ of the recorded signals' MSC (BLD, BRU) compared to the simulated values, for all values of L and N.


In order to evaluate the dependency of Δ on the number of time frames used for averaging, the following procedure is followed. The simulated A-Format sound field has been transformed into the spherical harmonic domain, with and without the application of the radial filters Γm(kR). Then, Δ has been computed with Eq. (7) for exponentially growing values of r between 1 (8 ms) and 2048 (10.92 s), where r is the vicinity radius used for time averaging, and the number of averaged time windows is T = 2r + 1. The time-frequency representation is derived by applying the short-time Fourier transform (STFT) with the same window parameters as in Sec. 3.1.
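The averaging procedure can be sketched on synthetic, mutually uncorrelated (i.e., ideally diffuse) B-Format signals at a single frequency bin. The snippet below, a self-contained sketch with names of our choosing, re-estimates Δ per Eqs. (6) and (7) over growing vicinity radii:

```python
import numpy as np

def delta_estimate(X0, X1):
    """B-Format coherence of Eq. (7), Delta = 1 - Psi, with the temporal
    expectation replaced by an average over the supplied frames.
    X0: (n_frames,) zeroth order; X1: (n_frames, 3) first order (SN3D)."""
    num = np.linalg.norm(np.mean(X1 * np.conj(X0)[:, None], axis=0))
    den = np.mean(np.sum(np.abs(X1)**2, axis=1) + np.abs(X0)**2)
    return 2.0 * num / den

rng = np.random.default_rng(2)
frames = 2 * 2048 + 1        # enough frames for the largest radius r = 2048
X0 = rng.standard_normal(frames) + 1j * rng.standard_normal(frames)
X1 = rng.standard_normal((frames, 3)) + 1j * rng.standard_normal((frames, 3))

# Estimate Delta around the central frame for growing vicinity radii:
# T = 2r + 1 frames enter each average, and Delta should shrink toward 0.
mid = frames // 2
for r in [1, 8, 64, 512, 2048]:
    sl = slice(mid - r, mid + r + 1)
    print(r, delta_estimate(X0[sl], X1[sl]))
```

Since the channels are uncorrelated by construction, the residual Δ is pure estimation noise and decays roughly as the inverse square root of the number of averaged frames, mirroring the trend reported in Fig. 2 (right).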

Figure 2 (left) shows the strong dependence of Δ on r. The estimated coherence tends to the theoretical values with increasing values of r. This tendency is better appreciated in Fig. 2 (right): the curve asymptotically decreases to a value Δmin ≈ 0. Another interesting observation concerns the frequency response of the curves. For all values of r, the coherence of the compensated B-Format signal [with Γm(kR)] is roughly flat up to around 7 kHz, which approximately corresponds to the operational spatial frequency range of the microphone (Gerzon, 1975). Above this value, the coherence response loses its flatness due to spatial aliasing. The response above the maximum frequency could be stabilized, if needed, by alternative diffuseness estimation methods (Politis et al., 2015). The coherence level differences along frequency are inversely proportional to r; the effect is better depicted by the standard deviation values (right). The effect of the radial filters on the coherence measurement is also shown: for a given r, the shape of the coherence is always less flat if no filters are applied. At the same time, the coherence values without filters are always smaller for the same r. This effect might be explained by the inter-channel coherence introduced by microphone and encoder imperfections in real scenarios (Schörkhuber and Höldrich, 2017). As a final remark, the comparison between Figs. 1 and 2 provides evidence that the application of the spherical harmonic transform might yield more accurate diffuseness estimations, due to better signal conditioning (Epain et al., 2016).

Fig. 2.

(Color online) Estimated B-Format coherence (Δ) of a simulated diffuse sound field, as a function of the temporal averaging vicinity radius r. Left: Δ(k) for different values of r, with (coarse) and without (fine) application of radial filters. Right: mean and standard deviation of Δ(k) as a function of r.


Figure 3 (left) shows the estimated coherence for the recorded sound field with L = 5 and N = 64, using a vicinity radius of r = 1024 (≈5 s). The curve is centered around Δ = 0.25 and presents several spectral peaks, as in the A-Format case. It is important to notice that the deviations between the coherence of the simulated and the recorded sound fields are much stronger than those of Fig. 1. This effect can also be appreciated in Fig. 3 (right): the mean squared error is around two orders of magnitude higher in B-Format. Nevertheless, similarly to Fig. 1 (right), ε̄ decreases with increasing values of L and N. This behavior suggests that the deviations between the recorded and the simulated coherence can be explained to a large degree by the low spatial resolution of the reproduction system; given a higher number of loudspeakers, we expect the reproduced diffuseness to tend to the theoretical expression.

Fig. 3.

(Color online) B-Format coherence between microphone signals. Left: Δ of simulated and recorded (L = 5, N = 64) signals. Right: ε¯ of the recorded signals coherence across all values of L and N.


The diffuseness of a sound field is an important parameter for several applications. In this work, two different metrics of diffuseness have been defined and measured with a tetrahedral microphone subjected to spherical isotropic noise. The analysis shows, first, the impact of the time-averaging window length on the B-Format diffuseness estimator. This result might be useful for designing coherence estimators that are parameterized with respect to the length of the analysis window (Thiergart et al., 2011). Second, the feasibility of diffuse sound field reproduction by a spherical loudspeaker array using ambisonics plane-wave encoding and the geometrical method is studied. Results suggest that this approach is viable, given a sufficient spatial resolution; a quantification of the impact of the number of loudspeakers remains for future work.

1. Duong, N. Q., Vincent, E., and Gribonval, R. (2010). "Under-determined reverberant audio source separation using a full-rank spatial covariance model," IEEE Trans. Audio, Speech, Lang. Process. 18(7), 1830–1840.
2. Elko, G. W. (2001). "Spatial coherence functions for differential microphones in isotropic noise fields," in Microphone Arrays (Springer, New York), pp. 61–85.
3. Epain, N., and Jin, C. T. (2016). "Spherical harmonic signal covariance and sound field diffuseness," IEEE/ACM Trans. Audio, Speech, Lang. Process. 24(10), 1796–1807.
4. Gerzon, M. A. (1975). "The design of precisely coincident microphone arrays for stereo and surround sound," in Audio Engineering Society Convention 50, Audio Engineering Society.
5. Habets, E. A., and Gannot, S. (2007). "Generating sensor signals in isotropic noise fields," J. Acoust. Soc. Am. 122(6), 3464–3470.
6. Habets, E. A., and Gannot, S. (2010). "Comments on 'Generating sensor signals in isotropic noise fields,'" Technical Report, available at https://www.audiolabs-erlangen.com/content/05-fau/professor/00-habets/05-software/04-noise-generators/Comments_on_Habets2007b.pdf.
7. Habets, E. A., Gannot, S., and Cohen, I. (2006). "Dual-microphone speech dereverberation in a noisy environment," in 2006 IEEE International Symposium on Signal Processing and Information Technology, pp. 651–655.
8. Ito, N., Ono, N., Vincent, E., and Sagayama, S. (2010). "Designing the Wiener post-filter for diffuse noise suppression using imaginary parts of inter-channel cross-spectra," in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2818–2821.
9. McCowan, I. A., and Bourlard, H. (2003). "Microphone array post-filter based on noise field coherence," IEEE Trans. Speech Audio Process. 11(6), 709–716.
10. Moreau, S., Daniel, J., and Bertet, S. (2006). "3D sound field recording with higher order ambisonics—objective measurements and validation of a 4th order spherical microphone," in 120th Convention of the AES, pp. 20–23.
11. Motlicek, P., Duffner, S., Korchagin, D., Bourlard, H., Scheffler, C., Odobez, J.-M., Galdo, G. D., Kallinger, M., and Thiergart, O. (2013). "Real-time audio-visual analysis for multiperson videoconferencing," Advances Multimedia 2013, 175745.
12. Politis, A., Delikaris-Manias, S., and Pulkki, V. (2015). "Direction-of-arrival and diffuseness estimation above spatial aliasing for symmetrical directional microphone arrays," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
13. Politis, A., Tervo, S., and Pulkki, V. (2018). "COMPASS: Coding and multidirectional parameterization of ambisonic sound scenes," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6802–6806.
14. Pulkki, V. (2006). "Directional audio coding in spatial sound reproduction and stereo upmixing," in Audio Engineering Society Conference: 28th International Conference: The Future of Audio Technology—Surround and Beyond, Audio Engineering Society.
15. Schörkhuber, C., and Höldrich, R. (2017). "Ambisonic microphone encoding with covariance constraint," in Proceedings of the International Conference on Spatial Audio, pp. 7–10.
16. Stefanakis, N., and Mouchtaris, A. (2015). "Foreground suppression for capturing and reproduction of crowded acoustic environments," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 51–55.
17. Thiergart, O., Del Galdo, G., and Habets, E. A. (2011). "Diffuseness estimation with high temporal resolution via spatial coherence between virtual first-order microphones," in 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 217–220.
18. Thiergart, O., Schultz-Amling, R., Del Galdo, G., Mahne, D., and Kuech, F. (2009). "Localization of sound sources in reverberant environments based on directional audio coding parameters," in Audio Engineering Society Convention 127, Audio Engineering Society.
19. Zotter, F., and Frank, M. (2012). "All-round ambisonic panning and decoding," J. Audio Eng. Soc. 60(10), 807–820.