Binaural rendering of Ambisonic signals is of great interest in the fields of virtual reality, immersive media, and virtual acoustics. Typically, the spatial order of head-related impulse responses (HRIRs) is considerably higher than the order of the Ambisonic signals. The resulting order reduction of the HRIRs has a detrimental effect on the binaurally rendered signals, and perceptual evaluations indicate limited externalization, reduced localization accuracy, and altered timbre. In this contribution, a binaural renderer, which is computed using a frequency-dependent time alignment of HRIRs followed by a minimization of the squared error subject to a diffuse-field covariance matrix constraint, is presented. The frequency-dependent time alignment retains the interaural time difference (at low frequencies) and results in an HRIR set with lower spatial complexity, while the constrained optimization controls the diffuse-field behavior. Technical evaluations in terms of sound coloration, interaural level differences, diffuse-field response, and interaural coherence, as well as findings from formal listening experiments, show a significant improvement of the proposed method over state-of-the-art methods.

Binaural rendering (synthesis) of acoustic scenes aims at evoking an immersive experience for listeners, which is desirable in virtual reality and 360-degree multimedia productions (Begault, 2000), and can also improve speech intelligibility in teleconferencing applications (Evans, 1997) by exploiting the effect of spatial unmasking (Freyman et al., 1999; Litovsky, 2012).

Typically, binaural rendering involves a convolution of source signals with measured or modeled head-related impulse responses (HRIRs) or binaural room impulse responses (BRIRs) and playback via headphones (Møller, 1992; Wightman and Kistler, 1989). Both HRIRs and BRIRs implicitly contain the cues that are evaluated by the human auditory system to perceive sound from a certain direction and distance, with a certain source width and spaciousness. A first theory of binaural perception was introduced by Rayleigh (1907) and is known today as the Duplex theory; it states that the lateralization of sound sources is due to interaural time differences (ITDs) at low frequencies and due to interaural level differences (ILDs) at higher frequencies, although the frequency ranges of ITD and ILD localization overlap significantly. However, ILD and ITD information cannot be mapped directly to a certain direction, as sources that originate from the same cone of confusion evoke similar ILDs and ITDs; thus, spectral cues are needed to resolve the ambiguities (a review is presented in Carlile et al., 2005). Besides the localization cues, the interaural coherence (IC) is related to the apparent source width and to parameters such as envelopment or spaciousness (Lindau, 2014; Okano et al., 1998; Pollack and Trittipoe, 1959).

It has been previously shown that localization ambiguities can be reduced and externalization can be improved by using dynamic binaural rendering, where the natural head rotation of listeners is accounted for (Begault et al., 2001; Brungart et al., 2006; Wallach, 1940). Dynamic binaural rendering of object-based audio typically involves fading or switching of filters (HRIRs or BRIRs) (Engdegard et al., 2008), while for dynamic binaural rendering of Ambisonic signals (scene-based audio), no filter switching is needed as the entire sound scene can be rotated by a simple frequency-independent matrix multiplication (Jot et al., 1999; Pinchon and Hoggan, 2007).

Ambisonics (Daniel, 2000; Gerzon, 1973; Malham and Myatt, 1995) is based on the representation of a three-dimensional sound field with an orthonormal basis, the spherical harmonics (SH). Typically, the maximum SH order N used for representation determines the spatial resolution, the number of channels (N+1)², and the minimum number of loudspeakers required for playback. Binaural rendering of Ambisonic signals typically consists of decoding to virtual loudspeakers using a state-of-the-art method, e.g., Zotter and Frank (2012), followed by a summation of the virtual loudspeaker signals convolved with the HRIRs for the corresponding directions; see also Jot et al. (1998) and Noisternig et al. (2003). However, more recent methods (Bernschütz et al., 2014; Sheaffer et al., 2014) employ a rendering in the SH domain without the intermediate step of decoding to a virtual loudspeaker setup.

For direct rendering in the SH domain, the spatial order of the Ambisonic signals and HRIR description must match (Bernschütz et al., 2014). It has been shown that HRIRs represented in the SH domain contain a significant amount of energy in orders up to N = 30. However, recording, transmitting, and processing of 961 (N = 30) channels is usually not feasible. Thus, a low-order HRIR representation has to be used for rendering of practical Ambisonic orders N < 5. This low-order representation typically leads to impairments of localization cues (ILD, ITD), reduced spaciousness (IC), and a severe roll-off at high frequencies (Avni et al., 2013; Bernschütz et al., 2014; Sheaffer and Rafaely, 2014) (see also Sec. II A). Strategies which aim at improving the perceptual aspects of binaurally rendered order-limited Ambisonic signals include (i) a static equalization filter which is derived from the diffuse-field response of an HRIR set or spherical head model (Ben-Hur et al., 2017; Sheaffer and Rafaely, 2014) and (ii) using a composite grid (spatial resampling) that matches the SH order N (Bernschütz et al., 2014).

Further methods for reducing the SH order of the HRIR representation are known in the field of SH-based HRIR interpolation. The order reduction is achieved by an independent description of the minimum-phase response and the linear phase of the HRIRs; see Evans (1997), Evans et al. (1998), Jot et al. (1999), and Romigh et al. (2015). In order to retain the ITD information in the rendered signals, the direction of sources must be known in advance; thus, these methods are well suited for rendering of object-based audio or of a parameterized sound scene (Laitinen and Pulkki, 2009; Pulkki, 2007), but not for direct binaural rendering of Ambisonic signals.

We suggest the computation of a binaural renderer for Ambisonic signals which consists of an HRIR preprocessing stage and an optimization stage. In the preprocessing stage, a frequency-dependent time alignment is applied to the original HRIRs; in the optimization stage, the binaural renderer is obtained by a frequency-domain minimization of the squared approximation error with respect to the high-frequency time-aligned HRIRs, subject to quadratic constraints such that the diffuse-field properties of the rendered signals match those of a model. This article is structured as follows. In Sec. II, we address the challenges of binaural rendering of low-order Ambisonic signals and summarize state-of-the-art binaural rendering methods based on suggestions given in Ben-Hur et al. (2017), Bernschütz et al. (2014), and Sheaffer and Rafaely (2014). The proposed binaural renderer is presented in detail in Sec. III. The evaluation of the proposed renderer and a comparison to existing methods via technical measures are presented in Sec. IV. Finally, the findings from formal listening experiments are discussed in Sec. V.

Let us consider a continuous distribution of sources in the far-field s(ω,Ω), where ω is the frequency, Ω ≐ (ϕ,θ) ∈ S², ϕ ∈ (0,2π] is the azimuth angle, which is measured counterclockwise from the Cartesian x axis in the horizontal plane, and θ ∈ (−π/2,π/2) is the elevation angle, which increases upwards from the Cartesian x–y plane. With a continuous description of the far-field head-related transfer functions (HRTFs) h(ω,Ω) = [h^l(ω,Ω), h^r(ω,Ω)]^T, the ear signals are obtained by

\[ \mathbf{x}(\omega) = \int_{\mathcal{S}^2} s(\omega,\Omega)\, \mathbf{h}(\omega,\Omega)\, \mathrm{d}\Omega, \tag{1} \]

where ∫_{S²}(·) dΩ ≐ ∫_0^{2π} ∫_{−π/2}^{π/2} (·) cosθ dθ dϕ, the superscripts (·)^{l,r} indicate the left and right ear, respectively, and (·)^T is the transpose operator. The corresponding acoustic scene in Ambisonics of order NA is given by the (NA+1)²×1 vector

\[ \mathbf{a}(\omega) = \int_{\mathcal{S}^2} s(\omega,\Omega)\, \mathbf{y}_{N_A}(\Omega)\, \mathrm{d}\Omega, \tag{2} \]
\[ \mathbf{y}_{N_A}(\Omega) = \left[ Y_0^0(\Omega),\, Y_1^{-1}(\Omega),\, Y_1^0(\Omega),\, \ldots,\, Y_{N_A}^{N_A}(\Omega) \right]^{\mathrm{T}}, \tag{3} \]

where Y_n^m(Ω) are the real-valued SHs (Williams, 1999)

\[ Y_n^m(\Omega) = \sqrt{\frac{2n+1}{4\pi}\, \frac{(n-|m|)!}{(n+|m|)!}}\; P_n^{|m|}(\sin\theta) \begin{cases} \sqrt{2}\, \cos(m\phi), & m > 0 \\ 1, & m = 0 \\ \sqrt{2}\, \sin(|m|\phi), & m < 0, \end{cases} \tag{4} \]

where P_n^{|m|}(·) is the associated Legendre function, and 0 ≤ n ≤ NA and −n ≤ m ≤ n are the order and degree, respectively.
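To make the SH machinery concrete, the following minimal Python sketch (numpy/scipy are assumed here and in the later code examples) evaluates the real-valued SHs of Eq. (4) for a set of directions via scipy's complex spherical harmonics. The ACN channel ordering, the orthonormal normalization, and the removal of the Condon-Shortley phase are our assumptions, as Eq. (4) can be written under several conventions.

```python
import numpy as np
from scipy.special import sph_harm

def real_sh_matrix(N, azi, elev):
    """Real-valued SH matrix of size (N+1)^2 x P for directions given by
    azimuth/elevation in radians (a sketch; ACN ordering, orthonormal
    normalization, no Condon-Shortley phase)."""
    azi = np.atleast_1d(azi)
    colat = np.pi / 2 - np.atleast_1d(elev)        # scipy expects colatitude
    Y = np.zeros(((N + 1) ** 2, azi.size))
    for n in range(N + 1):
        for m in range(-n, n + 1):
            Ynm = sph_harm(abs(m), n, azi, colat)  # complex SH of degree |m|
            if m < 0:
                y = np.sqrt(2.0) * (-1) ** m * Ynm.imag
            elif m == 0:
                y = Ynm.real
            else:
                y = np.sqrt(2.0) * (-1) ** m * Ynm.real
            Y[n * n + n + m, :] = y                # ACN index n^2 + n + m
    return Y
```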

As Ambisonics is a scene-based rather than an object-based format, the source signals and directions are typically not known (a parametric description of the scene can be obtained by the directional audio coding method of Laitinen and Pulkki, 2009, and Pulkki, 2007, but this is beyond the scope of this article); thus, a desired renderer is independent of the actual source signal s(ω,Ω). The binaural rendering matrix B_NA(ω) yields the ear signals x̂(ω) = B^H_NA(ω) a(ω) from an Ambisonic signal of order NA and is obtained by solving a least-squares (LS) problem of the form

\[ \mathbf{B}_{N_A}(\omega) = \operatorname*{arg\,min}_{\mathbf{B}} \int_{\mathcal{S}^2} \left\lVert \mathbf{h}(\omega,\Omega) - \mathbf{B}^{\mathrm{H}} \mathbf{y}_{N_A}(\Omega) \right\rVert_{\mathrm{F}}^2 \, \mathrm{d}\Omega, \tag{5} \]

where ||·||F is the Frobenius norm. In most practical situations, only samples of the underlying continuous HRTFs are available and Eq. (5) is approximated by numerical integration

\[ \mathbf{B}_{N_A}(\omega) = \operatorname*{arg\,min}_{\mathbf{B}} \left\lVert \mathbf{W}^{1/2} \left( \mathbf{H}^{\mathrm{H}} - \mathbf{Y}_{N_A,P}^{\mathrm{H}} \mathbf{B} \right) \right\rVert_{\mathrm{F}}^2, \tag{6} \]

where

\[ \mathbf{H}(\omega) = \left[ \mathbf{h}_1(\omega),\, \mathbf{h}_2(\omega),\, \ldots,\, \mathbf{h}_P(\omega) \right], \tag{7} \]
\[ \mathbf{h}_p(\omega) = \left[ h^l(\omega,\Omega_p),\, h^r(\omega,\Omega_p) \right]^{\mathrm{T}} \tag{8} \]

is an arbitrary set of HRTFs defined at the discrete directions Ωp ≐ (ϕp,θp) ∈ S², with p ∈ {1,…,P} indexing the available grid points,

\[ \mathbf{Y}_{N_A,P} = \left[ \mathbf{y}_1,\, \mathbf{y}_2,\, \ldots,\, \mathbf{y}_P \right], \tag{9} \]
\[ \mathbf{y}_p = \mathbf{y}_{N_A}(\Omega_p), \tag{10} \]

and W is a real-valued frequency-independent diagonal weighting matrix containing the quadrature weights. We note that selection of an optimal sampling grid for the sphere is an open research question as the sampling must approximate the integral well and also be practical (Duraiswami et al., 2005; Fliege and Maier, 1999; Maday et al., 2008; Zotkin et al., 2009).

For the sake of readability the frequency dependency is not indicated in the remainder of this section. From Eq. (6), the optimal solution of the LS formulation is found as

\[ \mathbf{B}_{N_A}^{\mathrm{LS}} = \left( \mathbf{Y}_{N_A,P} \mathbf{W} \mathbf{Y}_{N_A,P}^{\mathrm{H}} \right)^{-1} \mathbf{Y}_{N_A,P} \mathbf{W} \mathbf{H}^{\mathrm{H}}, \tag{11} \]

where B^LS_NA contains the approximated SH expansion coefficients of the HRTFs (Ahrens et al., 2012; Pollow et al., 2012). Here, we assume a high-resolution closed sampling grid with P > (NA+1)² that achieves Y_{NA,P} W Y^H_{NA,P} ≈ I (orthogonality property of the SHs), where I is the identity matrix; thus, the LS solution can be further simplified to

\[ \mathbf{B}_{N_A}^{\mathrm{LS}} \approx \mathbf{Y}_{N_A,P} \mathbf{W} \mathbf{H}^{\mathrm{H}}, \tag{12} \]

and the binaurally rendered ear signals are

\[ \hat{\mathbf{x}}(\omega) = \left( \mathbf{B}_{N_A}^{\mathrm{LS}} \right)^{\mathrm{H}} \mathbf{a}(\omega). \tag{13} \]

For a unity-gain plane wave impinging from direction Ωq, the Ambisonic signals are defined as a=yq, and by substituting Eq. (12) in Eq. (13), the rendered ear signals are

\[ \hat{\mathbf{x}} = \left( \mathbf{B}_{N_A}^{\mathrm{LS}} \right)^{\mathrm{H}} \mathbf{y}_q = \mathbf{H} \mathbf{W} \mathbf{Y}_{N_A,P}^{\mathrm{H}} \mathbf{y}_q = \mathbf{h}_{N_A,q}, \tag{14} \]

where h_{NA,q} are the reconstructed HRTFs at direction Ωq, which are obtained by a weighted summation of the original HRTFs. The directional weighting of that summation is defined by

\[ \mathbf{g}_{N_A,q} = \mathbf{W} \mathbf{Y}_{N_A,P}^{\mathrm{H}} \mathbf{y}_q. \tag{15} \]

Note that Eq. (14) can be interpreted as an approximation method for reconstructing HRTFs corresponding to any direction. However, we do not intend to present a method for HRTF interpolation, but solely use the differences between the reconstructed and the original HRTFs as performance measures for binaural rendering of Ambisonic signals. For SH-based HRTF approximation and modeling, the reader is referred to Evans et al. (1998), Romigh (2012), and Zotkin et al. (2009).
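The following sketch illustrates Eqs. (12)-(15) for a single frequency bin. The random directions, the equal quadrature weights, and the random placeholder HRTFs are stand-ins for a measured set and a proper quadrature; real_sh_matrix refers to the sketch after Eq. (4).

```python
import numpy as np
# assumes real_sh_matrix() from the sketch after Eq. (4)

rng = np.random.default_rng(1)
P, NA = 2702, 3
azi = rng.uniform(0.0, 2.0 * np.pi, P)
elev = np.arcsin(rng.uniform(-1.0, 1.0, P))   # ~uniformly distributed directions
w = np.full(P, 4.0 * np.pi / P)               # stand-in quadrature weights
# placeholder HRTFs for ONE frequency bin (2 x P); in practice a bin of a
# measured set such as the KU-100 compilation would be used
H = rng.standard_normal((2, P)) + 1j * rng.standard_normal((2, P))

Y = real_sh_matrix(NA, azi, elev)             # Y_{NA,P}, ((NA+1)^2 x P)
B_ls = (Y * w) @ H.conj().T                   # Eq. (12), assumes Y W Y^H ~ I

y_q = real_sh_matrix(NA, 0.0, 0.0)            # frontal plane wave, a = y_q
h_q = B_ls.conj().T @ y_q                     # rendered ear signals, Eqs. (13)-(14)
g_q = w[:, None] * (Y.T @ y_q)                # directional weighting, Eq. (15)
assert np.allclose(h_q, H @ g_q)              # H g_q equals B^H y_q by construction
```

In practice, the same per-bin computation is repeated over all frequency bins of the HRTF set.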

Now let us consider an HRTF set (KU-100) measured at P = 2702 discrete directions arranged according to a Lebedev grid (Bernschütz, 2013). It can be seen in Fig. 1 that the SH order NH required for near-perfect binaural rendering increases with frequency, and that for the chosen HRTF set NH ≈ 30 is necessary (Bernschütz et al., 2014). In practice, however, the maximum order of the HRTFs represented in the SH domain must match the order of the Ambisonic signals to be rendered for playback, cf. Eq. (13). Typically, the signals obtained from spherical microphone arrays are encoded in orders NA < 5, and thus the order reduction leads to a severe loss of spatial detail, especially at higher frequencies. Moreover, a reduction of the order to NA < NH leads to a broader main-lobe of the directional weighting function and thus, more HRTF directions around the source direction Ωq contribute substantially to the rendered ear signals. The directional weighting functions for orders NA = 1 and NA = 5 and Ωq = (0°, 0°) are depicted in Fig. 2. Although the shape of the directional weighting is independent of the source direction Ωq, the effect on the rendered ear signals is strongly direction-dependent due to the off-center position of the ears. For a head radius rH = 8.5 cm, we calculate the time offsets τ_p^{l,r} (time differences between the head center and the ears) for each grid point p via a simple geometric model

\[ \tau_p^{l,r} = \mp \frac{r_H}{c} \sin\phi_p \cos\theta_p, \tag{16} \]

where c = 343 m/s is the speed of sound. Utilizing Eq. (15), the IR between all P grid nodes and an omnidirectional microphone at the position of the right ear is

\[ x^r(t) = \sum_{p=1}^{P} \left[ \mathbf{g}_{N_A,q} \right]_p\, \delta\!\left( t - \tau_p^r \right), \tag{17} \]

where t is the time and δ(·) is the Dirac delta function. The obtained IRs and the corresponding transfer functions for directions Ωq = (ϕq, 0°) on the horizontal plane and NA = 5 are depicted in Figs. 3(a) and 3(b), respectively. For directions from the front and back, we observe strong colorations (low-pass behavior), as the variation of the time offsets across the (almost equally weighted) grid nodes comprising the main-lobe is largest there. On the contrary, less coloration is expected for lateral directions, as the variation of the time offsets for nodes within the main-lobe is smaller. Similar findings are outlined in Solvang (2008), where an overview of the relation between the number of loudspeakers, reproduction error, and coloration of two-dimensional Ambisonic systems is given.
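The coloration mechanism can be reproduced numerically with the directional weighting from the previous sketch. The sketch below evaluates Eqs. (16) and (17) in the frequency domain; the placement of the ears on the ±y axis, and hence the sign convention of the delays, is our assumption.

```python
import numpy as np
# assumes (azi, elev, g_q) from the previous sketch

r_H, c = 0.085, 343.0
# Eq. (16) for the right ear (assumed at (0, -r_H, 0)): projection of the
# ear offset onto the plane-wave direction
tau_r = (r_H / c) * np.sin(azi) * np.cos(elev)

f = np.linspace(20.0, 20e3, 512)
omega = 2.0 * np.pi * f
# Eq. (17) in the frequency domain: weighted superposition of delayed pulses
X_r = np.exp(-1j * np.outer(omega, tau_r)) @ g_q.ravel()
# for the frontal weighting g_q, |X_r| rolls off towards high frequencies,
# i.e., the low-pass coloration discussed above
```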

FIG. 1.

Distribution of the normalized energy contained in all degrees −n ≤ m ≤ n for each SH order n over frequency for the left ear, with the SH expansion coefficients calculated according to Eq. (11) for NA = 35. ITD and head-shadowing effects (sharp dips and phase discontinuities at the contralateral ear) lead to the involvement of higher SH orders at high frequencies.

FIG. 2.

Magnitude of the directional weighting function g_{NA,q} = W Y^H_{NA,P} y_q in dB for the rendering of a plane wave impinging from Ωq = (0°, 0°) using NA = 1 (a) and NA = 5 (b).

FIG. 3.

(a) IRs between all grid nodes P and an omnidirectional microphone at the position of the right ear as defined in Eq. (17). The azimuth angle (ordinate) indicates the source direction ϕq with θq = 0°. The ideal IRs correspond to single pulses δ(t − τ_p^r) with a time offset τ_p^r as defined in Eq. (16). (b) Transfer functions at the position of the right ear.


Furthermore, the direction-dependent coloration for binaural rendering of low-order Ambisonic signals is discussed in Avni et al. (2013), Ben-Hur et al. (2017), Bernschütz et al. (2014), and Sheaffer and Rafaely (2014), and improvements are obtained by (i) using a spectral (diffuse-field) equalization filter, which is based on a spherical head model (Ben-Hur et al., 2017), or (ii) spatial resampling of the HRTFs on a reduced grid (Bernschütz et al., 2014). Both suggested methods are reviewed in the following paragraphs.

1. Spectral equalization

With the model of a perfectly diffuse sound field s_d, consisting of an infinite number of plane waves impinging from every direction in space with random, mutually uncorrelated phases and a total power ρ (see Epain and Jin, 2016), the covariance matrix of the ear signals x_d(t) is defined as R_H = E_t{x_d(t) x_d^H(t)}, where E_t{·} denotes the statistical expectation operator over the time t. With x_d(t) = ∫_{S²} s_d(t,Ω) h(Ω) dΩ, we get R_H = ρ ∫_{S²} h(Ω) h^H(Ω) dΩ, and by numerical integration and assuming ρ = 1, the estimated diffuse-field covariance matrix is obtained as

\[ \mathbf{R}_H = \mathbf{H} \mathbf{W} \mathbf{H}^{\mathrm{H}} = \begin{bmatrix} r^{ll} & r^{lr} \\ r^{rl} & r^{rr} \end{bmatrix}, \tag{18} \]

where the main diagonal entries contain the diffuse-field energies of the left and right ear, respectively. A similar formulation is used in Pulkki et al. (2017). With the diffuse-field energies r^qq_N(ω) of the signals rendered at order N, where q ∈ {l,r}, the transfer function of the static diffuse-field equalization (timbre correction) filter (Sheaffer and Rafaely, 2014), which achieves r^qq_NA(ω) = r^qq_NH(ω) with NA < NH, is given by

\[ G^{l,r}(\omega)\big|_{N_A}^{N_H} = \sqrt{ \frac{ r_{N_H}^{qq}(\omega) }{ r_{N_A}^{qq}(\omega) } }. \tag{19} \]

If instead of the measured HRTFs a rigid sphere head model is used, the diffuse-field energy according to Williams (1999) is

\[ r^{qq}(\omega)\big|_{N} = \frac{1}{4\pi} \sum_{n=0}^{N} (2n+1) \left| b_n(\omega) \right|^2, \tag{20} \]

where

\[ b_n(\omega) = 4\pi\, \mathrm{i}^n \left( j_n(k r_H) - \frac{ j_n'(k r_H) }{ h_n'(k r_H) }\, h_n(k r_H) \right), \tag{21} \]

and j_n(·) is the spherical Bessel function, h_n(·) is the spherical Hankel function of the second kind, j_n′(·) and h_n′(·) are their first derivatives, and k = ω/c.

In practice, we set G(ω)|_{NA}^{NH} = (G^l(ω)|_{NA}^{NH} + G^r(ω)|_{NA}^{NH})/2, and the diffuse-field equalized (DEQ) binaural renderer is obtained by

\[ \mathbf{B}_{N_A}^{\mathrm{DEQ}} = G(\omega)\big|_{N_A}^{N_H}\, \mathbf{B}_{N_A}^{\mathrm{LS}}. \tag{22} \]
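A per-frequency-bin sketch of the diffuse-field covariance, Eq. (18), and the DEQ renderer, Eqs. (19) and (22), continuing the variables of the LS sketch above; using the diffuse-field energies of the original (full-order) set as the high-order reference r^qq_NH is our simplification.

```python
import numpy as np
# assumes (H, w, Y, B_ls) from the LS sketch; one frequency bin

R_H = (H * w) @ H.conj().T            # Eq. (18): diffuse-field covariance (2 x 2)
R_Y = (Y * w) @ Y.T                   # SH spatial covariance, ~ I for a good grid
R_ren = B_ls.conj().T @ R_Y @ B_ls    # diffuse-field covariance after rendering
# Eq. (19) per ear, with the original set standing in for the r_NH reference
g_lr = np.sqrt(np.real(np.diag(R_H)) / np.real(np.diag(R_ren)))
B_deq = g_lr.mean() * B_ls            # averaged gain applied as in Eq. (22)
```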

2. Spatial resampling

It has been found in Bernschütz et al. (2014) that selecting a feasible (sparse) set of grid points can reduce colorations. The HRTFs at the chosen composite grid nodes Ωg, with g ∈ {1,…,G} and G < P, are either selected from the original HRTFs (nearest neighbor) or interpolated according to Eqs. (14) and (12) as

\[ \mathbf{H}_G = \mathbf{H} \mathbf{W} \mathbf{Y}_{N_A,P}^{\mathrm{H}} \mathbf{Y}_{N_A,G}. \tag{23} \]

In Bernschütz et al. (2014), an equiangular Gaussian quadrature (Stroud, 1966) with G = 2(NA+1)² grid nodes and a Lebedev quadrature (Lebedev, 1977) were compared. In accordance with the results of their listening experiments, we use the composite Gaussian grid (CGG) binaural renderer for the comparison of rendering methods; it is obtained by

\[ \mathbf{B}_{N_A}^{\mathrm{CGG}} = \mathbf{Y}_{N_A,G} \mathbf{W}_G \mathbf{H}_G^{\mathrm{H}}, \tag{24} \]

where W_G contains the quadrature weights of the Gaussian sampling.
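A sketch of Eqs. (23) and (24); the random composite directions and equal weights are placeholders for the equiangular Gaussian grid and its quadrature weights.

```python
import numpy as np
# assumes real_sh_matrix() and (H, w, Y, NA) from the earlier sketches

G = 2 * (NA + 1) ** 2                         # grid size of Bernschuetz et al.
rng = np.random.default_rng(2)
azi_g = rng.uniform(0.0, 2.0 * np.pi, G)      # placeholder composite directions
elev_g = np.arcsin(rng.uniform(-1.0, 1.0, G))
w_g = np.full(G, 4.0 * np.pi / G)             # placeholder quadrature weights

Y_g = real_sh_matrix(NA, azi_g, elev_g)       # ((NA+1)^2 x G)
H_g = H @ (w[:, None] * Y.T) @ Y_g            # Eq. (23): interpolated HRTFs
B_cgg = (Y_g * w_g) @ H_g.conj().T            # Eq. (24)
```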

The computation of the proposed binaural renderer consists of a preprocessing and an optimization stage. In the preprocessing stage, we apply a high-frequency time alignment to the original HRTF set. It has been shown in the context of SH-based HRTF interpolation that removing the linear phase (ITD equalization) reduces the order of the HRTF representation. In contrast to Evans et al. (1998), Rasumow et al. (2014), and Romigh et al. (2015), we suggest a frequency-dependent ITD equalization that retains the ITD at low frequencies and removes it at high frequencies. Furthermore, the high-frequency ITD is not re-synthesized after rendering.

Here, the time-aligned HRTF set is computed as

\[ \tilde{h}_p^{l,r}(\omega) = A_p^{l,r}(\omega)\, h_p^{l,r}(\omega), \tag{25} \]

where the frequency response of the allpass filter A_p^{l,r}(ω) is defined as

\[ A_p^{l,r}(\omega) = \begin{cases} 1, & \omega < \omega_c \\ \mathrm{e}^{\, \mathrm{i} \tau_p^{l,r} (\omega - \omega_c)}, & \omega \geq \omega_c, \end{cases} \tag{26} \]

where ωc = 2πfc, i = √−1, and the time offset τ_p^{l,r} is calculated according to Eq. (16). Note that the time offsets could also be estimated from the HRTF set (see Katz and Noisternig, 2014 for a comparison of different methods); however, in pre-tests we found no significant improvement and therefore used the simple geometric model. Due to the time alignment, the energy contained in higher SH orders is significantly reduced, cf. Figs. 1 and 4. Thus, lower orders ÑH < NH are sufficient to represent the HRTFs at higher frequencies. As, according to the Duplex theory (Hartmann et al., 2016; Macpherson and Middlebrooks, 2002; Rayleigh, 1907; Wightman and Kistler, 1992), the ITD cue becomes less relevant as frequency increases, we expect that a high-frequency time alignment of the HRIRs with a cut-on frequency of fc = 1.5 kHz (chosen empirically) allows for efficient order reduction while retaining the perceptually relevant localization cues.
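A sketch of the preprocessing stage, Eqs. (25) and (26), operating on one-sided HRTF spectra; the exact form of the phase transition at fc follows our reconstruction of Eq. (26).

```python
import numpy as np

def time_align(H_spec, f, tau, fc=1500.0):
    """Frequency-dependent time alignment, Eqs. (25)-(26) (a sketch).
    H_spec: (2, P, F) one-sided HRTF spectra (ears x directions x bins)
    f:      (F,) bin frequencies in Hz
    tau:    (2, P) time offsets from Eq. (16), per ear and direction
    The phase is untouched below fc (the ITD is retained), and the geometric
    delay tau is removed above fc with a continuous phase at fc."""
    omega, omega_c = 2.0 * np.pi * f, 2.0 * np.pi * fc
    A = np.ones((2, tau.shape[1], f.size), dtype=complex)
    hi = omega >= omega_c
    # allpass: unity below fc, exp(+j tau (w - w_c)) above, undoing the delay
    A[..., hi] = np.exp(1j * tau[..., None] * (omega[hi] - omega_c))
    return H_spec * A
```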

In order to achieve the diffuse-field response and IC behavior of the original HRTF set H [see Eq. (18)], we cast the computation of the binaural renderer as a constrained optimization problem of the form

\[ \mathbf{B}_{N_A}^{\mathrm{TAC}} = \operatorname*{arg\,min}_{\mathbf{B}} \left\lVert \mathbf{W}^{1/2} \left( \tilde{\mathbf{H}}^{\mathrm{H}} - \mathbf{Y}_{N_A,P}^{\mathrm{H}} \mathbf{B} \right) \right\rVert_{\mathrm{F}}^2 \tag{27} \]
\[ \text{subject to} \quad \mathbf{B}^{\mathrm{H}} \mathbf{R}_Y \mathbf{B} = \mathbf{R}_H, \tag{28} \]

where H̃ is the high-frequency time-aligned HRTF set, R_Y = Y_{NA,P} W Y^H_{NA,P} is the SH spatial covariance matrix, and R_H is defined in Eq. (18). A similar formulation is used in Schörkhuber and Höldrich (2017) and Vilkamo and Pulkki (2013). The set of solutions which satisfy the covariance constraint in Eq. (28) is given by

\[ \mathbf{B} = \mathbf{R}_Y^{-1/2}\, \mathbf{Q}\, \mathbf{C}, \tag{29} \]

where Q is an arbitrary (NA+1)²×2 unitary matrix such that Q^H Q = I, and C is obtained by a suitable matrix decomposition of R_H such that C^H C = R_H (e.g., a Cholesky factorization). With the properties ‖M‖_F² = tr{M^H M} and tr{M1 + M2} = tr{M1} + tr{M2}, where tr{·} is the trace of a matrix, and by inserting Eq. (29) into Eq. (27), we restate the minimization problem as

\[ \mathbf{Q}_{\mathrm{opt}} = \operatorname*{arg\,min}_{\mathbf{Q}} \left( T_1 - 2\, T_2 + T_3 \right) \tag{30} \]
\[ \text{subject to} \quad \mathbf{Q}^{\mathrm{H}} \mathbf{Q} = \mathbf{I}, \tag{31} \]

with

\[ T_1 = \operatorname{tr} \left\{ \mathbf{C}^{\mathrm{H}} \mathbf{Q}^{\mathrm{H}} \mathbf{R}_Y^{-1/2} \mathbf{Y}_{N_A,P} \mathbf{W} \mathbf{Y}_{N_A,P}^{\mathrm{H}} \mathbf{R}_Y^{-1/2} \mathbf{Q} \mathbf{C} \right\}, \tag{32} \]
\[ T_2 = \Re \operatorname{tr} \left\{ \mathbf{C}^{\mathrm{H}} \mathbf{Q}^{\mathrm{H}} \mathbf{R}_Y^{-1/2} \mathbf{Y}_{N_A,P} \mathbf{W} \tilde{\mathbf{H}}^{\mathrm{H}} \right\}, \tag{33} \]
\[ T_3 = \operatorname{tr} \left\{ \tilde{\mathbf{H}} \mathbf{W} \tilde{\mathbf{H}}^{\mathrm{H}} \right\}, \tag{34} \]

where ℜ(·) denotes the real part of a complex number. Since T3 is independent of Q, we drop it in the sequel. By assuming that Y_{NA,P} W Y^H_{NA,P} = R_Y = I, we can also drop the first term T1, which then equals the constant tr{R_H}. Hence, the problem of determining Q is reduced to

\[ \mathbf{Q}_{\mathrm{opt}} = \operatorname*{arg\,max}_{\mathbf{Q}} \Re \operatorname{tr} \left\{ \mathbf{Q}^{\mathrm{H}} \mathbf{A}^{\mathrm{H}} \right\} \tag{35} \]
\[ \text{subject to} \quad \mathbf{Q}^{\mathrm{H}} \mathbf{Q} = \mathbf{I}, \tag{36} \]

where A = C H̃ W Y^H_{NA,P}. Using the singular value decomposition UΣV^H = A, the solution is given by

\[ \mathbf{Q}_{\mathrm{opt}} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{U}^{\mathrm{H}}, \tag{37} \]

where Λ = [I₂ 0]^T is of size (NA+1)²×2, i.e., the 2×2 identity matrix stacked on a ((NA+1)²−2)×2 zero block. The final form of the time-aligned and diffuse-field covariance-constrained (TAC) binaural renderer of order NA is thus given by

\[ \mathbf{B}_{N_A}^{\mathrm{TAC}} = \mathbf{R}_Y^{-1/2}\, \mathbf{V} \boldsymbol{\Lambda} \mathbf{U}^{\mathrm{H}}\, \mathbf{C}. \tag{38} \]
FIG. 4.

Distribution of the normalized energy contained in all degrees −n ≤ m ≤ n for each SH order n over frequency for the left ear and the time-aligned HRIRs H̃ (using fc = 1.5 kHz). The SH expansion coefficients are calculated according to Eq. (11). A lower order ÑH ≈ 15 is sufficient to represent the time-aligned HRIRs compared to the original set, cf. Fig. 1.

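Putting the optimization stage together, the following sketch computes the TAC renderer for one frequency bin under the R_Y = I assumption used in the text. The hypothetical arrays H_orig and H_ta denote one bin of the original and the time-aligned HRTF sets; the Cholesky factorization is one admissible choice of C.

```python
import numpy as np
# assumes (Y, w) from the earlier sketches and hypothetical H_orig, H_ta:
# one frequency bin (2 x P) of the original and the time-aligned HRTF sets

R_H = (H_orig * w) @ H_orig.conj().T   # Eq. (18): target diffuse-field covariance
C = np.linalg.cholesky(R_H).conj().T   # C^H C = R_H
A = C @ (H_ta * w) @ Y.T               # A = C H~ W Y^H
U, s, Vh = np.linalg.svd(A)            # A = U Sigma V^H
Lam = np.zeros((Vh.shape[0], 2))
Lam[:2, :2] = np.eye(2)                # Lambda = [I_2 0]^T
Q = Vh.conj().T @ Lam @ U.conj().T     # optimal unitary factor, Eq. (37)
B_tac = Q @ C                          # Eq. (38), with R_Y = I

# the covariance constraint of Eq. (28) then holds by construction:
assert np.allclose(B_tac.conj().T @ B_tac, R_H)
```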

In this section, the rendered signals, which are obtained by the proposed TAC method are analyzed and compared with state-of-the-art methods presented in Sec. II. The quality criteria include (i) the direction-dependent coloration (presented for directions at the horizontal plane), (ii) the ILD errors in octave bands, and (iii) the diffuse-field behavior, i.e., the diffuse-field response and the interaural coherence.

The composite loudness level (CLL) (Frank, 2013; Ono et al., 2001, 2002) is a measure to describe the perceived timbre. We use the simplified definition

\[ \mathrm{CLL}_p(\omega) = 10 \log_{10} \left( \left| x^l(\Omega_p,\omega) \right|^2 + \left| x^r(\Omega_p,\omega) \right|^2 \right), \tag{39} \]

where x^{l,r}(Ωp,ω) are the reference ear signals [see Eq. (1)] due to a single unity-gain plane wave impinging from direction Ωp. The CLL error between the reference and the rendered Ambisonic signals [see Eq. (13)] is defined as

\[ \Delta \mathrm{CLL}_p^{N_A}(\omega) = \mathrm{CLL}_p(\omega) - \mathrm{CLL}_p^{N_A}(\omega), \tag{40} \]

where CLL_p^{NA}(ω) is the CLL of the signals rendered with order NA. The resulting CLL errors obtained for all discussed binaural rendering methods using NA = 3 are depicted in Fig. 5 for directions on the horizontal plane (θp = 0°). CLL errors for directions on the median plane show similar trends and are therefore not depicted here.
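A sketch of the CLL measure, Eqs. (39) and (40), for one frequency bin and a single grid direction, continuing the variables of the earlier sketches.

```python
import numpy as np
# assumes (H, azi, elev, B_ls, NA) and real_sh_matrix() from earlier sketches

def cll(x):
    """Simplified composite loudness level, Eq. (39): summed ear energies in dB."""
    return 10.0 * np.log10(np.sum(np.abs(x) ** 2, axis=0))

p = 0                                          # any grid direction index
y_p = real_sh_matrix(NA, azi[p], elev[p])
delta_cll = cll(H[:, [p]]) - cll(B_ls.conj().T @ y_p)   # Eq. (40), in dB
```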

FIG. 5.

CLL error according to Eq. (40) between the reference and binaurally rendered Ambisonic signals in dB, evaluated for sources on the horizontal plane (θq = 0°) and an order of NA = 3. (a) LS as defined in Eq. (11). (b) DEQ as defined in Eq. (22); see Ben-Hur et al. (2017). (c) CGG as defined in Eq. (24); see Bernschütz et al. (2014). (d) TAC as defined in Eq. (38).


Below the aliasing frequency f_NA ≈ NA c/(2π rH) ≈ 1.9 kHz (Rafaely, 2005), the CLL errors are negligible for all tested methods. Above the aliasing frequency, we observe a severe low-pass behavior for frontal and dorsal directions using the LS method, Eq. (11) [see Fig. 5(a)]. As the diffuse-field equalization [DEQ, see Eq. (22)] filter is basically a direction-independent high-shelving filter, the coloration error is shifted from frontal to lateral directions, cf. Fig. 5(b). The spatial resampling approach using a CGG as defined in Eq. (24) reduces the coloration for most directions, see Fig. 5(c). However, the best performance in terms of minimal CLL error is observed for the proposed method (TAC), see Fig. 5(d).

The obtained ILD errors between the reference and rendered signals in octave-bands are defined as

\[ \mathrm{ILD}_p(\omega_o) = 10 \log_{10} \frac{ e_p^l(\omega_o) }{ e_p^r(\omega_o) }, \tag{41} \]
\[ \Delta \mathrm{ILD}_p^{N_A}(\omega_o) = \mathrm{ILD}_p(\omega_o) - \mathrm{ILD}_p^{N_A}(\omega_o), \tag{42} \]

where ωo indicates an octave-band center frequency, and e^{l,r} are the energies contained in the octave band of the left and right ear signals, respectively. The absolute values of the ILD errors are calculated for five octave bands with center frequencies at 1, 2, 4, 8, and 16 kHz for all grid directions (P = 2702) and are analyzed with a histogram between 0 and 15 dB with 30 equally spaced bins. Figure 6 depicts the resulting cumulative distribution function (CDF), which is defined for each band as

\[ v_i = \frac{1}{P} \sum_{j=1}^{i} c_j, \tag{43} \]

where cj is the number of elements in bin j, and i is the histogram bin index.
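The histogram-based analysis can be summarized in a short helper; a sketch of Eqs. (41)-(43) that takes precomputed octave-band energies as input (the band filtering itself is omitted here).

```python
import numpy as np

def ild_error_cdf(E_ref, E_ren, n_bins=30, max_db=15.0):
    """Absolute octave-band ILD errors, Eqs. (41)-(42), and their CDF, Eq. (43).
    E_ref, E_ren: (2, P) band energies (rows: left/right ear, P directions)."""
    ild = lambda E: 10.0 * np.log10(E[0] / E[1])        # Eq. (41)
    err = np.abs(ild(E_ref) - ild(E_ren))               # Eq. (42)
    counts, edges = np.histogram(err, bins=n_bins, range=(0.0, max_db))
    return edges, np.cumsum(counts) / err.size          # Eq. (43): v_i
```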

FIG. 6.

Absolute ILD error between the original and the approximated HRIRs (NA = 3) for all P directions, analyzed using a histogram between 0 and 15 dB with 30 equally spaced bins. Shown is the cumulative distribution function v_i = (1/P) Σ_{j=1}^{i} c_j, where P = 2702 is the total number of data points, c_j the number of elements in bin j, and i the histogram bin index. The black solid lines indicate the 90% threshold of the CDF.


Comparing the subfigures of Fig. 6 shows that above the aliasing frequency f_NA ≈ 1.9 kHz, the ILD errors increase with frequency. While the distributions of absolute ILD errors for LS and DEQ are similar, rendering using the CGG approach shows the highest and rendering using the TAC approach the lowest overall ILD errors.

In order to compare the algorithms for rendering of diffuse sound fields, the main- and off-diagonal elements of the diffuse-field covariance matrix as defined in Eq. (18) are compared in Figs. 7(a) and 7(b), respectively. While the proposed TAC approach yields the same diffuse-field behavior as the reference set (due to the constraint), all other approaches show deviations. The diffuse-field response of the LS renderer clearly exhibits the discussed low-pass behavior. Results for DEQ and CGG show improvements, but above the aliasing frequency we still observe colorations, cf. Fig. 7(a).

FIG. 7.

Diffuse-field energy (left ear) in dB (a) and interaural coherence (b) for a rendering order NA = 3.


According to Menzer (2010), the interaural coherence (IC) is defined as

\[ \mathrm{IC}(\omega) = \frac{ \left| r^{lr}(\omega) \right| }{ \sqrt{ r^{ll}(\omega)\, r^{rr}(\omega) } }, \tag{44} \]

where r^{lr}(ω), r^{ll}(ω), and r^{rr}(ω) are defined in Eq. (18). The IC of the tested rendering methods is depicted in Fig. 7(b). Again, TAC yields the same behavior as the reference, while the IC for LS and DEQ (which share the same IC) and for CGG shows significant deviations from the reference.
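For completeness, a one-liner evaluating Eq. (44) from the 2×2 covariance of Eq. (18); taking the magnitude of the complex cross-term is our assumption.

```python
import numpy as np

def interaural_coherence(R_H):
    """IC per frequency bin from the diffuse-field covariance, Eq. (44)."""
    r_ll, r_rr = np.real(R_H[0, 0]), np.real(R_H[1, 1])
    return np.abs(R_H[0, 1]) / np.sqrt(r_ll * r_rr)
```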

According to Gabriel and Colburn (1981), Pollack and Trittipoe (1959), and Stern et al. (2006), just noticeable differences (JNDs) of interaural correlation values change depending on the source frequency, bandwidth, and reference condition; findings indicate a JND of 0.08 for a reference condition with an interaural correlation of 1 and a JND of 0.35 for a reference condition with an interaural correlation of 0. The JNDs for intermediate reference conditions lie between 0.08 and 0.35 (Kim et al., 2008).1 As the IC deviations exceed the JNDs in certain frequency ranges, an altered spaciousness or envelopment is expected for the LS, DEQ, and CGG rendering methods, especially for orders NA ≤ 3.

In order to study and compare the perceptual aspects of binaural rendering using the TAC and state-of-the-art methods, formal listening experiments were conducted.

Test participants were asked to rate the overall difference between a reference [rendered according to Eq. (11) with NA=NH=30] and the test signals on a scale from no audible difference to severe difference. A hidden reference was used for screening of ratings, and thus the test procedure can be described as MUSHRA-like [multi stimulus with hidden reference and anchor (ITU-R, 1997)]. The presented test signals were continuously looped and participants were allowed to seamlessly switch between signals in real-time as often as desired.

The Ambisonic signals are obtained by a convolution of a monophonic source signal with a room impulse response (RIR) in the SH domain. For the simulation of the RIRs, we used the multichannel room acoustics simulation toolbox McRoomSim (Wabnitz et al., 2010) for a shoebox room of dimensions 9.5 × 12 × 4.2 m, with a mean absorption coefficient ᾱ = 0.236, a mean T60 = 0.8 s, and the source/listener setup depicted in Fig. 8. The listener at position [3.5, 3, 1.7] m faces the positive x axis, and the omnidirectional source is positioned relative to the listener as defined by the evaluation angle Ωq on a radius rq = 1.5 rc ≈ 2.09 m, where rc = 1.39 m is the critical distance. The tested discrete source directions are Ωq = (0°, 0°), (90°, 0°), (35°, 45°), and (45°, 0°). The perceptual evaluation is segmented into three experiments.

FIG. 8.

Room layout and source/listener positions used for simulating the room impulse responses via McRoomSim. The listener is placed at [3.5, 3, 1.7] m and the source position is varied according to the evaluation angle Ωq on a radius rq = 1.5 rc ≈ 2.09 m.

a. Experiment I.

A speech signal and the direct part of the RIR were used for computing the test signals at all four test directions Ωq. The tested methods were (i) LS, (ii) DEQ, (iii) CGG, and (iv) TAC for orders NA = 1 and NA = 5.

b. Experiment II.

The test signals were a speech signal and a drum loop (kick drum, snare drum, cymbals). In order to evaluate the performance of the algorithms in reflective environments, the entire simulated RIR (direct, early reflections, and diffuse part) for all four test directions Ωq was used. The tested algorithms include (i) DEQ, (ii) CGG, and (iii) TAC for orders NA = 1, NA = 3, and NA = 5.

c. Experiment III.

The dependence of the overall quality on the order NA = [1, 2, 3, 4, 5, 6, 9, 12, 15] was tested for the TAC method only. We used the entire RIR, the drum signal (as it is more complex), and the directions Ωq = (0°, 0°) and Ωq = (90°, 0°) for testing.

Overall, 14 test pages were presented and the order of test signals within one page as well as the order of test pages were randomized. Depending on the experiment, the nine participants (expert listeners, no hearing impairments) were asked to rate the perceived overall difference on a continuous scale from no difference to severe difference for 9–10 test signals per page.

In order to ensure equal listening conditions for all participants and test signals, no head-tracking was used. This is valid as participants rated the difference to a reference and not the localization or externalization of stimuli. The test signals were played back via AKG-702 (half-open) headphones, equalized according to Schärer and Lindau (2009).

The median and 95% confidence interval of all ratings (for all four test directions Ωq) from Experiment I are depicted in Fig. 9. The results indicate a clear perceptual improvement for higher orders and show that the proposed method (TAC) overall outperforms the other tested methods. The p-values of a Kruskal-Wallis test (Kruskal and Wallis, 1952) presented in Table I indicate that there are five groups that are significantly different from each other.

FIG. 9.

Results of pooled data obtained in Experiment I showing the median (markers) and 95% confidence interval (solid lines) of ratings for all four tested directions using a speech signal and the direct-path only.

TABLE I.

p-values (Kruskal-Wallis) for all tested methods of Experiment I. The numbers after the / indicate the tested order NA. Methods which are not significantly different (p-values > 0.05) are highlighted with gray rectangles.

method  LS/1   LS/5   DEQ/1  DEQ/5  CGG/1  CGG/5  TAC/1  TAC/5
LS/1    1.000
LS/5    0.000  1.000
DEQ/1   0.000  0.333  1.000
DEQ/5   0.000  0.171  0.006  1.000
CGG/1   0.000  0.800  0.123  0.096  1.000
CGG/5   0.000  0.001  0.000  0.042  0.000  1.000
TAC/1   0.000  0.053  0.000  0.857  0.005  0.009  1.000
TAC/5   0.000  0.000  0.000  0.000  0.000  0.000  0.000  1.000

The groups in ascending order of quality are (i) LS/1; (ii) DEQ/1, CGG/1, and LS/5; (iii) TAC/1 and DEQ/5; (iv) CGG/5; and (v) TAC/5. Overall, the TAC method yields the smallest perceptual difference to the reference for all tested orders, and results for NA = 1 are comparable to those for LS and DEQ using NA = 5.

The detailed results (all test directions Ωq separately) for Experiment I are shown in Fig. 10, where it can be seen that the performance of most methods varies with the source direction Ωq. As the most pronounced coloration of the LS approach is observed for frontal directions [see Fig. 10(a)], it is rated as having a severe difference to the reference. The DEQ approach, on the other hand, shifts the coloration from frontal to lateral directions, and thus its performance is worst for Ωq = (90°, 0°) [see Fig. 10(b)]. The CGG approach can give an improvement, but its performance is still highly dependent on the source direction [compare Figs. 10(a)-10(c)]. Results for the TAC method not only show an overall improvement, but also the least variation across the different test directions.

FIG. 10.

Detailed results of Experiment I showing the median (markers) and 95% confidence interval (solid lines) of ratings from all participants for testing the perceived overall difference to the reference. The reference was a speech signal rendered for a far-field scenario and a source direction indicated by Ωq.


The overall results for Experiment II are depicted in Fig. 11. Note that the results for the two source signals (speech and drum loop) and all source directions Ωq are pooled. We observe a similar behavior as in Experiment I, namely an improvement with increasing order NA and the best performance for the TAC method. The groups in ascending order of quality are (i) DEQ/1 and CGG/1; (ii) TAC/1, DEQ/3, CGG/3, DEQ/5, and CGG/5; (iii) TAC/3; and (iv) TAC/5. While ratings for the TAC method are significantly different across the tested orders NA = [1, 3, 5], there is no distinct difference between DEQ and CGG for orders NA = [3, 5]; see Table II. Moreover, ratings for TAC with NA = 1 are similar to those for DEQ and CGG with NA = 3 and NA = 5. The results per source signal are depicted in Figs. 12 and 13. Due to the transient and broadband nature of the drum signal (strong components above the aliasing frequency), the overall quality ratings are worse than for the speech signal. However, the TAC method shows a smaller dependency on the source signal than the other tested methods.

FIG. 11.

Results of pooled data obtained in Experiment II showing the median (markers) and 95% confidence interval (solid lines) of ratings for all four tested directions using speech and drums as source signal and the entire simulated RIR.

TABLE II.

p-values (Kruskal-Wallis) for all tested methods and pooled ratings of Experiment II. The numbers after the / indicate the tested order NA. Methods which are not significantly different (p-values > 0.05) are highlighted with gray rectangles.

method  DEQ/1  DEQ/3  DEQ/5  CGG/1  CGG/3  CGG/5  TAC/1  TAC/3  TAC/5
DEQ/1   1.000
DEQ/3   0.000  1.000
DEQ/5   0.000  0.994  1.000
CGG/1   0.645  0.000  0.000  1.000
CGG/3   0.000  0.084  0.138  0.000  1.000
CGG/5   0.000  0.002  0.004  0.000  0.114  1.000
TAC/1   0.000  0.084  0.143  0.000  0.952  0.105  1.000
TAC/3   0.000  0.000  0.000  0.000  0.000  0.000  0.000  1.000
TAC/5   0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.008  1.000
FIG. 12.

Results of Experiment II showing the median (markers) and 95% confidence interval (solid lines) of ratings for all four tested directions using a speech signal and the entire RIR.

FIG. 13.

Results of Experiment II showing the median (markers) and 95% confidence interval (solid lines) of ratings for all four tested directions using a drum signal and the entire RIR.


The overall results for Experiment III are depicted in Fig. 14 for the two tested directions Ωq = (0°, 0°) and Ωq = (90°, 0°). As expected, the order NA required to achieve near-transparent rendering changes with the source direction. The p-values listed in Table III indicate that for frontal sources an order of NA = 4 is sufficient (no significant difference to higher orders), whereas for lateral directions an order of NA = 9 is required, see Fig. 14. As the time alignment of the HRIRs reduces the SH order required for their representation to ÑH ≈ 15 (see Fig. 4), testing of even higher orders is not necessary.

FIG. 14.

Results of Experiment III showing the median (markers) and 95% confidence interval (solid lines) of ratings for all tested orders TAC/NA using a drum loop as source signal and the entire RIR for rendering. Test directions Ωq = (0°, 0°) and Ωq = (90°, 0°) are depicted separately.

TABLE III.

p-values (Kruskal-Wallis) for all tested orders of Experiment III. The upper triangle shows the p-values for Ωq = (0°, 0°), the lower triangle those for Ωq = (90°, 0°). The numbers after the / indicate the tested order NA. Orders which are not significantly different (p-values > 0.05) are highlighted with gray rectangles.

method  TAC/1  TAC/2  TAC/3  TAC/4  TAC/5  TAC/6  TAC/9  TAC/12 TAC/15
TAC/1   1.000  0.125  0.011  0.003  0.004  0.017  0.004  0.004  0.003
TAC/2   0.013  1.000  0.142  0.047  0.096  0.116  0.031  0.039  0.024
TAC/3   0.009  0.848  1.000  0.141  0.277  0.333  0.061  0.109  0.040
TAC/4   0.018  0.655  0.655  1.000  0.479  0.947  0.465  0.647  0.327
TAC/5   0.002  0.225  0.306  0.142  1.000  0.647  0.267  0.245  0.174
TAC/6   0.002  0.025  0.073  0.035  0.482  1.000  0.620  0.946  0.838
TAC/9   0.002  0.013  0.018  0.013  0.125  0.179  1.000  0.733  0.838
TAC/12  0.002  0.002  0.003  0.003  0.006  0.009  0.305  1.000  0.894
TAC/15  0.002  0.002  0.002  0.002  0.004  0.004  0.089  0.077  1.000

In this paper, we presented an improved method for binaural rendering of low-order Ambisonic signals (NA ≤ 5). The proposed binaural renderer is computed using a frequency-dependent time alignment of HRIRs followed by a minimization of the squared error subject to a diffuse-field covariance matrix constraint (TAC). Due to the time alignment, lower SH orders are sufficient to represent the directivity patterns of the ears at higher frequencies, while the covariance constraint ensures that sound scenes rendered with the TAC method exhibit the same diffuse-field behavior as scenes rendered with the original high-order HRIRs.

Technical evaluations and comparisons to state-of-the-art methods indicate that the proposed TAC method reduces the direction-dependent colorations, and the ILD errors, and improves the diffuse-field behavior.

In the perceptual evaluation, we tested the overall difference to a reference (rendered with order NA = 30) for four source directions in a free-field condition and in a simulated room. The results of the TAC method show a significant improvement of the overall quality for all tested directions as well as the smallest direction-dependent quality variation among the tested methods. Furthermore, we found that the rendering order NA can be reduced significantly for the TAC method while still achieving quality ratings similar to those of the other binaural rendering methods for Ambisonic signals. Ratings of auralizations in a simulated room indicate that the proposed method using NA = 1 achieves results comparable to the other tested methods using NA = 5.

As the TAC method shows little direction-dependent quality changes, we assume an improved externalization and localization performance. Thus, future work includes testing of the proposed method in a dynamic binaural rendering setup, where localization accuracy, externalization, and spaciousness are evaluated separately.

1. The JNDs are defined for the single-valued interaural correlation, and thus for broadband signals. We assume that the frequency-dependent IC is an indicator of the interaural correlation for narrow-band signals.

1. Ahrens, J., Thomas, M. R., and Tashev, I. (2012). "HRTF magnitude modeling using a non-regularized least-squares fit of spherical harmonics coefficients on incomplete data," in Proceedings of the Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), December 3–6, Hollywood, CA, pp. 1–5.
2. Avni, A., Ahrens, J., Geier, M., Spors, S., Wierstorf, H., and Rafaely, B. (2013). "Spatial perception of sound fields recorded by spherical microphone arrays with varying spatial resolution," J. Acoust. Soc. Am. 133(5), 2711–2721.
3. Begault, D. R. (2000). 3D Sound for Virtual Reality and Multimedia (Academic Press, New York).
4. Begault, D. R., Wenzel, E. M., and Anderson, M. R. (2001). "Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual speech source," J. Audio Eng. Soc. 49(10), 904–916.
5. Ben-Hur, Z., Brinkmann, F., Sheaffer, J., Weinzierl, S., and Rafaely, B. (2017). "Spectral equalization in binaural signals represented by order-truncated spherical harmonics," J. Acoust. Soc. Am. 141(6), 4087–4096.
6. Bernschütz, B. (2013). "A spherical far field HRIR/HRTF compilation of the Neumann KU 100," in Proceedings of the AIA-DAGA 2013, March 18–21, Merano, Italy, pp. 592–595.
7. Bernschütz, B., Vazquez Giner, A., Pörschmann, C., and Arend, J. (2014). "Binaural reproduction of plane waves with reduced modal order," Acta Acust. united Acust. 100(5), 972–983.
8. Brungart, D. S., Kordik, A. J., and Simpson, B. D. (2006). "Effects of headtracker latency in virtual audio displays," J. Audio Eng. Soc. 54(1–2), 32–44.
9. Carlile, S., Martin, R., and McAnally, K. (2005). "Spectral information in sound localization," Int. Rev. Neurobiol. 70, 399–434.
10. Daniel, J. (2000). "Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia" ("Representation of acoustic fields, application to the transmission and reproduction of complex soundscapes in a multimedia context"), Ph.D. thesis, University of Paris 6, Paris, France.
11. Duraiswami, R., Li, Z., and Zotkin, D. (2005). "Plane-wave decomposition analysis for spherical microphone arrays," Appl. Signal Process. Audio Acoust. 1(5), 150–153.
12. Engdegard, J., Resch, B., Falch, C., Hellmuth, O., Hilpert, J., Hoelzer, A., Breebaart, J., Koppens, J., Schuijers, E., and Oomen, W. (2008). "Spatial audio object coding (SAOC): The upcoming MPEG standard on parametric object based audio coding," in Proceedings of the 124th AES Convention, May 17–20, Amsterdam, the Netherlands, pp. 1–15.
13. Epain, N., and Jin, C. T. (2016). "Spherical harmonic signal covariance and sound field diffuseness," IEEE Trans. Audio Speech Lang. Process. 24(10), 1796–1807.
14. Evans, M. J. (1997). "The perceived performance of spatial audio for teleconferencing," Ph.D. thesis, University of York, York, UK.
15. Evans, M. J., Angus, J. A. S., and Tew, A. I. (1998). "Analyzing head-related transfer function measurements using surface spherical harmonics," J. Acoust. Soc. Am. 104(4), 2400–2411.
16. Fliege, J., and Maier, U. (1999). "The distribution of points on the sphere and corresponding cubature formulae," IMA J. Numer. Anal. 19(2), 317–334.
17. Frank, M. (2013). "Phantom sources using multiple loudspeakers in the horizontal plane," Ph.D. thesis, University of Music and Performing Arts, Graz, Austria.
18. Freyman, R. L., Helfer, K. S., McCall, D. D., and Clifton, R. K. (1999). "The role of perceived spatial separation in the unmasking of speech," J. Acoust. Soc. Am. 106(6), 3578–3588.
19. Gabriel, K. J., and Colburn, H. S. (1981). "Interaural correlation discrimination: I. Bandwidth and level dependence," J. Acoust. Soc. Am. 69, 1394–1401.
20. Gerzon, M. A. (1973). "Periphony: With-height sound reproduction," J. Audio Eng. Soc. 21(1), 2–10.
21. Hartmann, W. M., Rakerd, B., Crawford, Z. D., and Zhang, P. X. (2016). "Transaural experiments and a revised duplex theory for the localization of low-frequency tones," J. Acoust. Soc. Am. 139(2), 968–985.
22. ITU-R (1997). 1116-1: Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems (International Telecommunication Union, Geneva, Switzerland), pp. 1–26.
23. Jot, J.-M., Larcher, V., and Pernaux, J.-M. (1999). "A comparative study of 3-D audio encoding and rendering techniques," in Proceedings of the AES 16th International Conference on Spatial Sound Reproduction, April 10–12, Arktikum, Finland, pp. 281–300.
24. Jot, J.-M., Wardle, S., and Larcher, V. (1998). "Approaches to binaural synthesis," in Proceedings of the 105th Convention of the Audio Engineering Society, September 26–29, San Francisco, CA, pp. 1–8.
25. Katz, B. F. G., and Noisternig, M. (2014). "A comparative study of interaural time delay estimation methods," J. Acoust. Soc. Am. 135(6), 3530–3540.
26. Kim, C., Mason, R., and Brookes, T. (2008). "Initial investigation of signal capture techniques for objective measurement of spatial impression considering head movement," in Proceedings of the 124th AES Convention, May 17–20, Amsterdam, the Netherlands, pp. 1–17.
27. Kruskal, W. H., and Wallis, W. A. (1952). "Use of ranks in one-criterion variance analysis," J. Am. Stat. Assoc. 47(260), 583–621.
28. Laitinen, M. V., and Pulkki, V. (2009). "Binaural reproduction for directional audio coding," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 18–21, New Paltz, NY, pp. 337–340.
29. Lebedev, V. I. (1977). "Spherical quadrature formulas exact to orders 25–29," Sib. Math. J. 18(1), 99–107.
30. Lindau, A. (2014). "Binaural resynthesis of acoustical environments—Technology and perceptual evaluation," Ph.D. thesis, University of Berlin, Berlin, Germany.
31. Litovsky, R. Y. (2012). "Spatial release from masking," Acoust. Today 8(2), 18–25.
32. Macpherson, E. A., and Middlebrooks, J. C. (2002). "Listener weighting of cues for lateral angle: The duplex theory of sound localization revisited," J. Acoust. Soc. Am. 111(5), 2219–2236.
33. Maday, Y., Nguyen, N., Patera, A., and Pau, S. (2008). "A general multipurpose interpolation procedure: The magic points," Commun. Pure Appl. Anal. 8(1), 383–404.
34. Malham, D. G., and Myatt, A. (1995). "3-D sound spatialization using Ambisonic techniques," Comput. Music J. 19(4), 58–70.
35. Menzer, F. (2010). "Binaural audio signal processing using interaural coherence matching," Ph.D. thesis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
36. Møller, H. (1992). "Fundamentals of binaural technology," Appl. Acoust. 36(3–4), 171–218.
37. Noisternig, M., Musil, T., Sontacchi, A., and Höldrich, R. (2003). "3D binaural sound reproduction using a virtual ambisonic approach," in Proceedings of the VECIMS 2003, July 27–29, Lugano, Switzerland, pp. 174–178.
38. Okano, T., Beranek, L. L., and Hidaka, T. (1998). "Relations among interaural cross-correlation coefficient (IACC_E), lateral fraction (LF_E), and apparent source width (ASW) in concert halls," J. Acoust. Soc. Am. 104(1), 255–265.
39. Ono, K., Pulkki, V., and Karjalainen, M. (2001). "Binaural modeling of multiple sound source perception: Methodology and coloration experiments," in Proceedings of the AES 111th Convention, November 30–December 3, New York, NY, pp. 1–12.
40. Ono, K., Pulkki, V., and Karjalainen, M. (2002). "Binaural modeling of multiple sound source perception: Coloration of wideband sound," in Proceedings of the AES 112th Convention, May 10–12, Munich, Germany, pp. 1–8.
41. Pinchon, D., and Hoggan, P. E. (2007). "Rotation matrices for real spherical harmonics: General rotations of atomic orbitals in space-fixed axes," J. Phys. A 40(7), 1597–1610.
42. Pollack, I., and Trittipoe, W. (1959). "Binaural listening and interaural noise cross correlation," J. Acoust. Soc. Am. 31(9), 1250–1252.
43. Pollow, M., Nguyen, K. V., Warusfel, O., Carpentier, T., Mueller-Trapet, M., Vorlaender, M., and Noisternig, M. (2012). "Calculation of head-related transfer functions for arbitrary field points using spherical harmonics decomposition," Acta Acust. united Acust. 98(1), 72–82.
44. Pulkki, V. (2007). "Spatial sound reproduction with directional audio coding," J. Audio Eng. Soc. 55(6), 503–516.
45. Pulkki, V., Delikaris-Manias, S., and Politis, A. (2017). Parametric Time-Frequency Domain Spatial Audio (John Wiley & Sons, New York).
46. Rafaely, B. (2005). "Analysis and design of spherical microphone arrays," IEEE Trans. Speech Audio Process. 13(1), 135–143.
47. Rasumow, E., Blau, M., Hansen, M., van de Par, S., Doclo, S., Mellert, V., and Püschel, D. (2014). "Smoothing individual head-related transfer functions in the frequency and spatial domains," J. Acoust. Soc. Am. 135(4), 2012–2025.
48. Rayleigh, L. (1907). "On our perception of sound direction," Philos. Mag. Ser. 6 13(74), 214–232.
49. Romigh, G. D. (2012). "Individualized head-related transfer functions: Efficient modeling and estimation from small sets of spatial samples," Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA.
50. Romigh, G. D., Brungart, D. S., Stern, R. M., and Simpson, B. D. (2015). "Efficient real spherical harmonic representation of head-related transfer functions," IEEE J. Sel. Topics Signal Process. 9(5), 921–930.
51. Schärer, Z., and Lindau, A. (2009). "Evaluation of equalization methods for binaural signals," in Proceedings of the AES 126th Convention, May 7–10, Munich, Germany, pp. 1–17.
52. Schörkhuber, C., and Höldrich, R. (2017). "Ambisonic microphone encoding with covariance constraint," in Proceedings of the International Conference on Spatial Audio, September 7–10, Graz, Austria, pp. 70–74.
53. Sheaffer, J., and Rafaely, B. (2014). "Equalization strategies for binaural room impulse response rendering using spherical arrays," in Proceedings of the 28th Convention of Electrical and Electronics Engineers in Israel, December 3–5, Eilat, Israel, pp. 1–5.
54. Sheaffer, J., Villeval, S., and Rafaely, B. (2014). "Rendering binaural room impulse responses from spherical microphone array recordings using timbre correction," in Proceedings of the EAA Joint Symposium on Auralization and Ambisonics, April 3–5, Berlin, Germany, pp. 81–85.
55. Solvang, A. (2008). "Spectral impairment for two-dimensional higher order Ambisonics," J. Audio Eng. Soc. 56(4), 267–279.
56. Stern, R., Brown, G., and Wang, D. (2006). "Binaural sound localization," in Computational Auditory Scene Analysis: Principles, Algorithms and Applications (Wiley, New York), Chap. 5, pp. 147–185.
57. Stroud, A. H. (1966). Gaussian Quadrature Formulas (Prentice-Hall, Englewood Cliffs, NJ).
58. Vilkamo, J., and Pulkki, V. (2013). "Minimization of decorrelator artifacts in directional audio coding by covariance domain rendering," J. Audio Eng. Soc. 61(9), 637–646.
59. Wabnitz, A., Epain, N., Jin, C. T., and van Schaik, A. (2010). "Room acoustics simulation for multichannel microphone arrays," in Proceedings of the International Symposium on Room Acoustics, August 29–31, Melbourne, Australia, pp. 1–6.
60. Wallach, H. (1940). "The role of head movement and vestibular and visual cues in sound localization," J. Exp. Psychol. 27(4), 339–368.
61. Wightman, F. L., and Kistler, D. J. (1989). "Headphone simulation of free field listening I: Stimulus synthesis," J. Acoust. Soc. Am. 85, 858–867.
62. Wightman, F. L., and Kistler, D. J. (1992). "The dominant role of low frequency interaural time differences in sound localization," J. Acoust. Soc. Am. 91(3), 1648–1661.
63. Williams, E. G. (1999). Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography (Academic Press, New York).
64. Zotkin, D. N., Duraiswami, R., and Gumerov, N. A. (2009). "Regularized HRTF fitting using spherical harmonics," in Proceedings of Applications of Signal Processing to Audio and Acoustics 2009, October 18–21, New Paltz, NY, pp. 257–260.
65. Zotter, F., and Frank, M. (2012). "All-round Ambisonic panning and decoding," J. Audio Eng. Soc. 60(10), 807–820.