Binaural rendering of Ambisonic signals is of great interest in the fields of virtual reality, immersive media, and virtual acoustics. Typically, the spatial order of head-related impulse responses (HRIRs) is considerably higher than the order of the Ambisonic signals. The resulting order reduction of the HRIRs has a detrimental effect on the binaurally rendered signals, and perceptual evaluations indicate limited externalization, localization accuracy, and altered timbre. In this contribution, a binaural renderer, which is computed using a frequency-dependent time alignment of HRIRs followed by a minimization of the squared error subject to a diffuse-field covariance matrix constraint, is presented. The frequency-dependent time alignment retains the interaural time difference (at low frequencies) and results in a HRIR set with lower spatial complexity, while the constrained optimization controls the diffuse-field behavior. Technical evaluations in terms of sound coloration, interaural level differences, diffuse-field response, and interaural coherence, as well as findings from formal listening experiments show a significant improvement of the proposed method compared to state-of-the-art methods.

## I. INTRODUCTION

Binaural rendering (synthesis) of acoustic scenes aims at evoking an immersive experience for listeners, which is desirable in virtual reality and 360-degree multimedia productions (Begault, 2000), and can also improve speech intelligibility in teleconferencing applications (Evans, 1997) by exploiting the effect of spatial unmasking (Freyman *et al.*, 1999; Litovsky, 2012).

Typically, binaural rendering involves a convolution of source signals with measured or modeled head-related impulse responses (HRIRs) or binaural room impulse responses (BRIRs) and playback via headphones (Møller, 1992; Wightman and Kistler, 1989). Both HRIRs and BRIRs implicitly contain the cues that are evaluated by the human auditory system to perceive sound from a certain direction and distance, with a certain source width, and spaciousness. A first theory on binaural perception was introduced by Rayleigh (1907) and is known today as the *Duplex theory*, which states that lateralization of sound sources is due to interaural time differences (ITD) for low frequencies, and due to the interaural level difference (ILD) at higher frequencies, although the frequency ranges of ITD and ILD localization overlap significantly. However, ILD and ITD information cannot be mapped directly to a certain direction, as sources that originate from the same *cone of confusion* evoke similar ILDs and ITDs, and thus, spectral cues are needed to resolve the ambiguities; a review is presented in Carlile *et al.* (2005). Besides the localization cues, the interaural coherence (IC) is related to the apparent source width, or parameters such as envelopment or spaciousness (Lindau, 2014; Okano *et al.*, 1998; Pollack and Trittipoe, 1959).

It has been previously shown that localization ambiguities can be reduced and externalization can be improved by using dynamic binaural rendering, where the natural head rotation of listeners is accounted for (Begault *et al.*, 2001; Brungart *et al.*, 2006; Wallach, 1940). Dynamic binaural rendering of object-based audio typically involves fading or switching of filters (HRIRs or BRIRs) (Engdegard *et al.*, 2008), while for dynamic binaural rendering of Ambisonic signals (scene-based audio), no filter switching is needed as the entire sound scene can be rotated by a simple frequency-independent matrix multiplication (Jot *et al.*, 1999; Pinchon and Hoggan, 2007).

Ambisonics (Daniel, 2000; Gerzon, 1973; Malham and Myatt, 1995) is based on the representation of a three-dimensional sound field with an orthonormal basis, the *spherical harmonics* (SH). Typically, the maximum SH order *N* used for representation determines the spatial resolution, the number of channels $(N+1)^2$, and the minimum number of loudspeakers required for playback. Binaural rendering of Ambisonic signals typically consists of decoding to virtual loudspeakers using a state-of-the-art method, e.g., Zotter and Frank (2012), followed by a summation of the virtual loudspeaker signals convolved with the HRIRs for the corresponding directions, see also Jot *et al.* (1998) and Noisternig *et al.* (2003). However, more recent methods (Bernschütz *et al.*, 2014; Sheaffer *et al.*, 2014) employ a rendering in the SH domain without the intermediate step of decoding to a virtual loudspeaker setup.

For direct rendering in the SH domain, the spatial order of the Ambisonic signals and HRIR description must match (Bernschütz *et al.*, 2014). It has been shown that HRIRs represented in the SH domain contain a significant amount of energy in orders up to *N* = 30. However, recording, transmitting, and processing of 961 (*N* = 30) channels is usually not feasible. Thus, a low-order HRIR representation has to be used for rendering of practical Ambisonic orders *N* < 5. This low-order representation typically leads to impairments of localization cues (ILD, ITD), reduced spaciousness (IC), and a severe roll-off at high frequencies (Avni *et al.*, 2013; Bernschütz *et al.*, 2014; Sheaffer and Rafaely, 2014) (see also Sec. II A). Strategies which aim at improving the perceptual aspects of binaurally rendered order-limited Ambisonic signals include (i) a static equalization filter which is derived from the diffuse-field response of an HRIR set or spherical head model (Ben-Hur *et al.*, 2017; Sheaffer and Rafaely, 2014) and (ii) using a composite grid (spatial resampling) that matches the SH order *N* (Bernschütz *et al.*, 2014).

Further methods for reducing the SH order of the HRIR representation are known from the field of SH-based HRIR interpolation. The order reduction is achieved by describing the minimum-phase response and the linear phase (ITD) of the HRIRs independently; see Evans (1997), Evans *et al.* (1998), Jot *et al.* (1999), and Romigh *et al.* (2015). In order to retain the ITD information in the rendered signals, the direction of the sources must be known in advance, and thus these methods are well-suited for rendering of object-based audio or of a parameterized sound scene (Laitinen and Pulkki, 2009; Pulkki, 2007), but not for direct binaural rendering of Ambisonic signals.

We suggest the computation of a binaural renderer for Ambisonic signals which consists of a HRIR preprocessing stage and an optimization stage. In the preprocessing stage, a frequency-dependent time alignment is applied to the original HRIRs; in the optimization stage, the binaural renderer is obtained by frequency domain minimization of the squared approximation error with respect to the high-frequency time-aligned HRIRs subject to quadratic constraints such that the diffuse-field properties of the rendered signals match the diffuse-field properties of a model. This article is structured as follows. In Sec. II, we address the challenges of binaural rendering of low-order Ambisonic signals and summarize state-of-the-art binaural rendering methods based on suggestions given in Ben-Hur *et al.* (2017), Bernschütz *et al.* (2014), and Sheaffer and Rafaely (2014). The proposed binaural renderer is presented in detail in Sec. III. The evaluation of the proposed renderer and a comparison to existing methods via technical measures is presented in Sec. IV. Finally, the findings from formal listening experiments are discussed in Sec. V.

## II. BINAURAL RENDERING OF AMBISONIC SIGNALS

Let us consider a continuous distribution of sources in the far-field $s(\omega, \Omega)$, where $\omega$ is the frequency, $\Omega \equiv (\varphi, \theta) \in \mathcal{S}^2$, $\varphi \in (0, 2\pi]$ is the azimuth angle, which is measured counterclockwise from the Cartesian *x*-axis in the horizontal plane, and $\theta \in [-\pi/2, \pi/2]$ is the elevation angle, which increases upwards from the Cartesian *x–y* plane. With a continuous description of the far-field head-related transfer functions (HRTFs) $h(\omega, \Omega) = [h^l(\Omega, \omega), h^r(\Omega, \omega)]^T$, the ear signals are obtained by

where $\int_{\mathcal{S}^2} (\cdot)\, d\Omega \equiv \int_0^{2\pi} \int_{-\pi/2}^{\pi/2} (\cdot) \cos\theta \, d\theta \, d\varphi$, $(\cdot)^{l,r}$ indicates the left and right ear, respectively, and $(\cdot)^T$ is the transpose operator. The corresponding acoustic scene in Ambisonics of order $N_A$ is given by the $(N_A+1)^2 \times 1$ vector

where $Y_n^m(\Omega)$ are the real-valued SHs (Williams, 1999)

where $P_n^m(\cdot)$ is the associated Legendre function, and $0 \le n \le N_A$, $-n \le m \le n$ are the order and degree, respectively.
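For illustration, the real-valued SHs and the plane-wave (Ambisonic) encoding $a = y_q$ can be sketched in a few lines of Python. This is our sketch, not part of the article: the function name, the ACN channel index $n^2+n+m$, and the construction of one common real-valued SH convention from SciPy's complex `sph_harm` are all assumptions.

```python
import numpy as np
from scipy.special import sph_harm

def real_sh_matrix(N, azi, elev):
    """Real-valued SH matrix of size ((N+1)^2, P) for azimuth/elevation in rad.

    Sketch only: real SHs are built from SciPy's complex sph_harm (which
    includes the Condon-Shortley phase); channels are ordered by the ACN
    index n^2 + n + m (our assumption).
    """
    azi = np.atleast_1d(azi)
    colat = np.pi / 2 - np.atleast_1d(elev)   # SciPy expects colatitude
    Y = np.zeros(((N + 1) ** 2, azi.size))
    for n in range(N + 1):
        for m in range(-n, n + 1):
            Ynm = sph_harm(abs(m), n, azi, colat)   # complex SH
            if m < 0:
                y = np.sqrt(2) * (-1) ** m * Ynm.imag
            elif m > 0:
                y = np.sqrt(2) * (-1) ** m * Ynm.real
            else:
                y = Ynm.real
            Y[n ** 2 + n + m, :] = y
    return Y

# A unity-gain plane wave from direction Omega_q encoded at order N_A = 3:
# the Ambisonic vector a equals the SH coefficient vector y_q.
a = real_sh_matrix(3, np.deg2rad(30.0), np.deg2rad(0.0))   # shape (16, 1)
```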

As Ambisonics is a scene-based and not an object-based format, the source signals and directions are typically not known (a parametric description of the scene can be obtained by the directional audio coding method of Laitinen and Pulkki, 2009, and Pulkki, 2007, but this is beyond the scope of this article), and thus a desired renderer is independent of the actual source signal $s(\omega, \Omega)$. The binaural rendering matrix $B_{N_A}(\omega)$ yields the ear signals $\hat{x}(\omega) = B_{N_A}^H(\omega)\, a(\omega)$ from an Ambisonic signal of order $N_A$; it is obtained by solving a least-squares (LS) problem of the form

where $\|\cdot\|_F$ is the Frobenius norm. In most practical situations, only samples of the underlying continuous HRTFs are available and Eq. (5) is approximated by numerical integration

where

is an arbitrary set of HRTFs defined at the discrete directions $\Omega_p \equiv (\varphi_p, \theta_p) \in \mathcal{S}^2$, with $p \in \{1, \dots, P\}$ as the index of the available grid points,

and **W** is a real-valued, frequency-independent diagonal weighting matrix containing the quadrature weights. We note that the selection of an optimal sampling grid for the sphere is an open research question, as the sampling must approximate the integral well and also be practical (Duraiswami *et al.*, 2005; Fliege and Maier, 1999; Maday *et al.*, 2008; Zotkin *et al.*, 2009).

### A. Binaural renderer

For the sake of readability the frequency dependency is not indicated in the remainder of this section. From Eq. (6), the optimal solution of the LS formulation is found as

where $B_{N_A}^{\mathrm{LS}}$ contains the approximated SH expansion coefficients of the HRTFs (Ahrens *et al.*, 2012; Pollow *et al.*, 2012). Here we assume a high-resolution closed sampling grid with $P > (N_A+1)^2$ that achieves $(Y_{N_A,P} W Y_{N_A,P}^H) \approx I$ (orthogonality property of the SHs), where **I** is the identity matrix, and thus the LS solution can be further simplified to

and the binaurally rendered ear signals are

For a unity-gain plane wave impinging from direction $\Omega_q$, the Ambisonic signals are defined as $a = y_q$, and by substituting Eq. (12) in Eq. (13), the rendered ear signals are

where $h_{N_A,q}$ are the reconstructed HRTFs at direction $\Omega_q$, which are obtained by a weighted summation of the original HRTFs. The directional weighting of that summation is defined by

Please note that Eq. (14) can be interpreted as an approximation method for reconstructing HRTFs corresponding to any direction. However, we do not intend to present a method for HRTF interpolation, but solely use the differences between the reconstructed and the original HRTFs as performance measures for binaural rendering of Ambisonic signals. For SH-based HRTF approximation and modeling, the readers are referred to Evans *et al.* (1998), Romigh (2012), and Zotkin *et al.* (2009).
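As a numerical illustration of the pipeline of Eqs. (11)–(14), the following sketch builds the simplified LS renderer $B = Y W H^H$ and reconstructs an order-limited HRTF as $\hat{h}_q = B^H y_q$. The toy grid (rows of $Y$ orthonormal under $W$), the band-limited toy HRTFs, and all function names are our assumptions, not the article's.

```python
import numpy as np

def ls_renderer(Y, W, H):
    """LS binaural renderer B = Y W H^H for one frequency bin [cf. Eq. (11)],
    valid under the SH orthogonality assumption Y W Y^H ~= I.
    Y: ((N_A+1)^2, P) real SH matrix, W: (P,) quadrature weights,
    H: (2, P) complex HRTFs (left/right rows). Sketch only."""
    return Y @ (W[:, None] * H.conj().T)      # ((N_A+1)^2, 2)

def reconstruct_hrtf(B, y_q):
    """Order-limited HRTF at direction Omega_q: h_hat = B^H y_q [cf. Eq. (14)],
    i.e., a weighted sum of the original HRTFs."""
    return B.conj().T @ y_q                   # (2,)

# Toy check: rows of Y are orthonormal under W, and the toy HRTFs are
# band-limited (rows of H lie in the row space of Y), so reconstruction
# at a grid direction is exact.
Y = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0, -1.0]]) / 2.0
W = np.ones(4)
M = np.array([[2.0, 0.5], [0.1 + 0.3j, 1.0]])
H = M @ Y
B = ls_renderer(Y, W, H)
h_hat = reconstruct_hrtf(B, Y[:, 0])
```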

Now let us consider an HRTF set (KU-100) which is measured at $P = 2702$ discrete directions arranged according to a Lebedev grid (Bernschütz, 2013). It can be seen in Fig. 1 that the SH order $N_H$ required to obtain near-perfect binaural rendering increases with frequency, and that for the chosen HRTF set $N_H \approx 30$ is necessary (Bernschütz *et al.*, 2014). However, in practice, the maximum order of the HRTFs represented in the SH domain must match the order of the Ambisonic signals to be rendered for playback, cf. Eq. (13). Typically, the signals obtained from spherical microphone arrays are encoded in orders $N_A < 5$, and thus the reduction of order leads to a severe loss of spatial detail, especially at higher frequencies. Moreover, a reduction of the order $N_A < N_H$ leads to a broader main-lobe of the directional weighting function and thus, more HRTF directions around the source direction $\Omega_q$ contribute substantially to the rendered ear signals. The directional weighting functions for orders $N_A = 1$ and $N_A = 5$ and $\Omega_q = (0°, 0°)$ are depicted in Fig. 2. Although the shape of the directional weighting is independent of the source direction $\Omega_q$, the effect on the rendered ear signals is strongly direction-dependent due to the off-center position of the ears. For a head radius $r_H = 8.5$ cm, we calculate the time offsets $\tau_p^{l,r}$ (time difference between the center and the ears) for each grid point $p$ via a simple geometric model

where $c = 343$ m/s is the speed of sound. Utilizing Eq. (16), the IR between all grid nodes $P$ and an omnidirectional microphone at the position of the right ear is

where $t$ is the time, and $\delta(\cdot)$ is the delta function. The obtained IRs and corresponding transfer functions for directions $\Omega_q = (\varphi_q, 0°)$ on the horizontal plane and $N_A = 5$ are depicted in Figs. 3(a) and 3(b), respectively. For directions from the front and back, we observe strong colorations (low-pass behavior), as the time offsets for the neighboring grid nodes which comprise the (almost equally weighted) main-lobe are highest. On the contrary, less coloration is expected for lateral directions, as the variation of time offsets for nodes within the main-lobe is smaller. Similar findings are outlined in Solvang (2008), where an overview of the relation between the number of loudspeakers, reproduction error, and coloration of two-dimensional Ambisonic systems is given.
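The geometric model of Eq. (16) is not reproduced above. As a minimal sketch we assume a plane-wave projection of the ear positions onto the propagation direction; the article's exact model may differ in detail (e.g., by path-around-the-head terms), and all names and the ear-position convention are our assumptions.

```python
import numpy as np

C_SOUND = 343.0   # speed of sound in m/s
R_HEAD = 0.085    # head radius r_H in m

def time_offsets(azi, elev, r_h=R_HEAD, c=C_SOUND):
    """Time offsets tau_p^{l,r} between head center and the two ears for a
    far-field source at (azi, elev) in rad.

    Sketch of *a* simple geometric model: project the ear positions onto the
    propagation direction of the incoming plane wave. Negative values mean
    the wavefront reaches that ear before the head center.
    """
    n = np.array([np.cos(elev) * np.cos(azi),
                  np.cos(elev) * np.sin(azi),
                  np.sin(elev)])               # unit vector towards the source
    ear_l = np.array([0.0,  r_h, 0.0])          # left ear on the +y axis
    ear_r = np.array([0.0, -r_h, 0.0])
    tau_l = -np.dot(n, ear_l) / c
    tau_r = -np.dot(n, ear_r) / c
    return tau_l, tau_r
```

For a lateral source at $(90°, 0°)$ this model gives an ITD of $2 r_H / c \approx 0.5$ ms, which is in the expected range.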

Furthermore, the direction-dependent coloration for binaural rendering of low order Ambisonic signals is discussed in Avni *et al.* (2013), Ben-Hur *et al.* (2017), Bernschütz *et al.* (2014), and Sheaffer and Rafaely (2014) and improvements are obtained by (i) using a spectral (diffuse-field) equalization filter, which is based on a spherical head model (Ben-Hur *et al.*, 2017), or (ii) spatial resampling of HRTFs at a reduced grid (Bernschütz *et al.*, 2014). Both suggested methods are reviewed in the following paragraphs.

#### 1. Spectral equalization

With the model of a perfectly diffuse sound field $s_d$ consisting of an infinite number of plane waves impinging from every direction in space, with random mutually uncorrelated phases and a total power $\rho$ (see Epain and Jin, 2016), the covariance matrix of the ear signals $x_d(t)$ is defined as $R_H = E_t\{x_d(t)\, x_d^H(t)\}$, where $E_t\{\cdot\}$ denotes the statistical expectation operator over the time $t$. With $x_d(t) = \int_{\mathcal{S}^2} s_d(t, \Omega)\, h(\Omega)\, d\Omega$, we get $R_H = \rho \int_{\mathcal{S}^2} h(\Omega)\, h^H(\Omega)\, d\Omega$, and by numerical integration and assuming $\rho = 1$, the estimated diffuse-field covariance matrix is obtained as

where the main diagonal entries contain the diffuse-field energies of the left and right ear, respectively. A similar formulation is used in Pulkki *et al.* (2017). With the diffuse-field energies of the order-$N$ rendered signals $r_N^{qq}(\omega)$, where $q \in \{l, r\}$, the transfer function of the static diffuse-field equalization (timbre correction) filter (Sheaffer and Rafaely, 2014), which achieves $r_{N_A}^{qq}(\omega) = r_{N_H}^{qq}(\omega)$ with $N_A < N_H$, is given by

If instead of the measured HRTFs a rigid-sphere head model is used, the diffuse-field energy according to Williams (1999) is

where

and $j_n(\cdot)$ is the spherical Bessel function, $h_n(\cdot)$ is the spherical Hankel function of the second kind, $j_n'(\cdot)$, $h_n'(\cdot)$ are their first derivatives, and $k = \omega/c$.

In practice, we set $(G^l(\omega)|_{N_A \to N_H} + G^r(\omega)|_{N_A \to N_H})/2 = G(\omega)|_{N_A \to N_H}$ and the diffuse-field equalized (DEQ) binaural renderer is obtained by
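The diffuse-field covariance of Eq. (18) and the resulting per-ear equalization gain can be sketched as follows. Function names are ours; the gain follows the energy-matching idea $G = \sqrt{r_{N_H}^{qq} / r_{N_A}^{qq}}$ described above, not a verbatim transcription of the article's filter.

```python
import numpy as np

def diffuse_covariance(H, W):
    """Diffuse-field covariance R_H = sum_p w_p h(Omega_p) h(Omega_p)^H,
    cf. Eq. (18). H: (2, P) HRTFs, W: (P,) quadrature weights."""
    return (H * W[None, :]) @ H.conj().T    # (2, 2)

def deq_gain(r_full, r_low):
    """Per-ear DEQ gain so the equalized low-order rendering matches the
    full-order diffuse-field energy: G = sqrt(r_full / r_low). Sketch of the
    timbre-correction filter of Sheaffer and Rafaely (2014)."""
    return np.sqrt(r_full / r_low)
```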

#### 2. Spatial resampling

It has been found in Bernschütz *et al.* (2014) that selecting a feasible (sparse) set of grid points can reduce colorations. The HRTFs at the chosen composite grid nodes $\Omega_g$, with $g \in \{1, \dots, G\}$ and $G < P$, are either selected from the original HRTFs (nearest neighbor) or are interpolated according to Eqs. (14) and (12) as

In Bernschütz *et al.* (2014), an equiangular Gaussian quadrature (Stroud, 1966) with $G = 2(N_A+1)^2$ grid nodes and a Lebedev (Lebedev, 1977) quadrature were compared. In accordance with the results from listening experiments, we use a composite Gaussian grid (CGG) binaural renderer for the comparison of rendering methods, which is obtained by

where $W_G$ contains the quadrature weights of the Gaussian sampling.

## III. PROPOSED BINAURAL RENDERER

The computation of the proposed binaural renderer consists of a preprocessing and an optimization stage. In the preprocessing stage, we apply a high-frequency time alignment to the original HRTF set. It has been shown in the context of SH-based HRTF interpolation that removing the linear phase (ITD equalization) leads to an order reduction of the HRTF representation. In contrast to Evans *et al.* (1998), Rasumow *et al.* (2014), and Romigh *et al.* (2015), we suggest a frequency-dependent ITD equalization that retains the ITD at low frequencies and removes it at high frequencies. Furthermore, the high-frequency ITD is not re-synthesized after rendering.

Here, the time-aligned HRTF set is computed as

where the frequency response of the allpass filter $A_p^{l,r}(\omega)$ is defined as

where $\omega_c = 2\pi f_c$, $i = \sqrt{-1}$, and the time offset $\tau_p^{l,r}$ is calculated according to Eq. (16). Note that the time offsets could also be estimated from the HRTF set (see Katz and Noisternig, 2014 for a comparison of different methods). However, in pre-tests we found no significant improvement and therefore used the simple geometric model. Due to the time alignment, the energy contained in higher SH orders is significantly reduced, cf. Figs. 1 and 4. Thus, lower orders $N_H^\circ < N_H$ are sufficient to represent the HRTFs at higher frequencies. As, according to the *Duplex Theory* (Hartmann *et al.*, 2016; Macpherson and Middlebrooks, 2002; Rayleigh, 1907; Wightman and Kistler, 1992), the ITD cue becomes less relevant as frequency increases, we expect that high-frequency time alignment of the HRIRs with a cut-on frequency of $f_c = 1.5$ kHz (empirically chosen) allows for efficient order reduction while retaining the perceptually relevant localization cues.
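A sketch of such a frequency-dependent alignment allpass follows. We assume a phase-continuous transition at $f_c$ (unity response below the cut-on, delay-compensating linear phase above); the exact closed form used in the article may differ, and the function name is ours.

```python
import numpy as np

def alignment_allpass(freqs, tau, f_c=1500.0):
    """Frequency response of an alignment allpass A_p(omega): unity below the
    cut-on f_c (ITD preserved), and a linear phase compensating the time
    offset tau above f_c, made phase-continuous at f_c. Sketch only."""
    w = 2 * np.pi * np.asarray(freqs, dtype=float)
    w_c = 2 * np.pi * f_c
    phase = np.where(w < w_c, 0.0, (w - w_c) * tau)
    return np.exp(1j * phase)    # |A| = 1 everywhere (allpass)
```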

In order to achieve the diffuse-field response and IC behavior of the original HRTF set **H** [see Eq. (18)], we cast the computation of the binaural renderer as a constrained optimization problem of the form

where $H^\circ$ is the high-frequency time-aligned HRTF set, $R_Y \equiv I$ is the SH spatial covariance matrix, and $R_H$ is defined in Eq. (18). A similar formulation is used in Schörkhuber and Höldrich (2017) and Vilkamo and Pulkki (2013). The set of solutions which satisfies the covariance constraint Eq. (28) is given by

where **Q** is an arbitrary $(N_A+1)^2 \times 2$ unitary matrix such that $Q^H Q = I$, and **C** is obtained by some suitable matrix decomposition of $R_H$ such that $C^H C = R_H$. With the properties $\|M\|_F^2 = \mathrm{tr}\{M^H M\}$ and $\mathrm{tr}\{M_1 + M_2\} = \mathrm{tr}\{M_1\} + \mathrm{tr}\{M_2\}$, where $\mathrm{tr}\{\cdot\}$ is the trace of a matrix, and by inserting Eq. (29) into Eq. (27), we restate the minimization problem

with

where $\Re(\cdot)$ denotes the real part of a complex number. Since $T_3$ is independent of **Q**, we drop it in the sequel. By assuming that $Y_{N_A,P} W Y_{N_A,P}^H = R_Y = I$, we can also drop the first term. Hence, the problem of determining **Q** is reduced to

where $A = C H^\circ W Y_{N_A,P}^H$. Using the singular value decomposition $U \Sigma V^H = A$, the solution is given by

where $\Lambda = [\,I_2 \;\; 0_{2 \times ((N_A+1)^2 - 2)}\,]^T$. The final form of the time-aligned and diffuse-field covariance-constrained (TAC) binaural renderer for order $N_A$ is thus given by
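Per frequency bin, the optimization stage can be sketched as follows. The function name is ours, a Cholesky factor is taken as one of the "suitable" decompositions for $C$, and the orthogonal-Procrustes step follows the SVD solution described above; the toy example at the end uses band-limited HRTFs so the constrained optimum also attains zero approximation error.

```python
import numpy as np

def tac_renderer(Y, W, H_aligned, R_H):
    """One frequency bin of the covariance-constrained LS renderer (sketch).

    Solves  min_B || B^H Y - H_aligned ||_W^2  s.t.  B^H B = R_H
    (assuming R_Y = I): all feasible B = Q C with C^H C = R_H, and the
    optimal Q follows from an orthogonal-Procrustes step on the SVD of
    A = C H_aligned W Y^H, cf. Eqs. (27)-(31).
    """
    C = np.linalg.cholesky(R_H).conj().T               # upper factor, C^H C = R_H
    A = C @ (H_aligned * W[None, :]) @ Y.conj().T      # (2, (N_A+1)^2)
    U, _, Vh = np.linalg.svd(A)                        # A = U S Vh
    V = Vh.conj().T
    Lam = np.zeros((V.shape[1], 2))
    Lam[:2, :2] = np.eye(2)                            # Lambda = [I_2  0]^T
    Q = V @ Lam @ U.conj().T                           # ((N_A+1)^2, 2), Q^H Q = I
    return Q @ C                                       # B^H B = R_H by construction

# Toy example with orthonormal 'SH' rows and band-limited HRTFs:
Y = np.array([[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]) / 2.0
W = np.ones(4)
M = np.array([[2.0, 0.5], [0.1 + 0.3j, 1.0]])
H = M @ Y
R_H = (H * W[None, :]) @ H.conj().T                    # diffuse covariance, Eq. (18)
B = tac_renderer(Y, W, H, R_H)
```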

## IV. EVALUATION

In this section, the rendered signals obtained by the proposed TAC method are analyzed and compared with the state-of-the-art methods presented in Sec. II. The quality criteria include (i) the direction-dependent coloration (presented for directions on the horizontal plane), (ii) the ILD errors in octave bands, and (iii) the diffuse-field behavior, i.e., the diffuse-field response and the interaural coherence.

### A. Coloration

The composite loudness level (CLL) (Frank, 2013; Ono *et al.*, 2001, 2002) is a measure to describe the perceived timbre. We use the simplified definition

where $x^{l,r}(\Omega_p, \omega)$ are the reference ear signals [see Eq. (1)] due to a single unity-gain plane wave impinging from direction $\Omega_p$. The CLL error between the reference and the rendered Ambisonic signals [see Eq. (13)] is defined as

where $\mathrm{CLL}_p^{N_A}(\omega)$ is the CLL of the rendered signals using a rendering order $N_A$. The resulting CLL errors obtained for all discussed binaural rendering methods using $N_A = 3$ are depicted in Fig. 5 for directions on the horizontal plane ($\theta_p = 0°$). CLL errors for directions on the median plane show similar trends and are therefore not depicted here.

Below the aliasing frequency $f_{N_A} \approx N_A c / (2\pi r_H) \approx 1.9$ kHz (Rafaely, 2005), the CLL errors are negligible for all tested methods. Above the aliasing frequency, we observe a severe low-pass behavior for frontal and dorsal directions using the LS method, Eq. (11) [see Fig. 5(a)]. As the diffuse-field equalization [DEQ, see Eq. (22)] filter is basically a direction-independent high-shelving filter, the coloration error is shifted from frontal to lateral directions, cf. Fig. 5(b). The spatial resampling approach using a CGG as defined in Eq. (24) reduces the coloration for most directions, see Fig. 5(c). However, the best performance in terms of minimal CLL error is observed for the proposed method (TAC), see Fig. 5(d).

### B. ILD errors

The ILD errors between the reference and rendered signals in octave bands are defined as

where $\omega_o$ indicates an octave-band center frequency, and $e^{l,r}$ are the energies contained in the octave bands of the left and right ear signal, respectively. The absolute values of the ILD errors are calculated for five octave bands with center frequencies at 1, 2, 4, 8, and 16 kHz for all grid directions ($P = 2702$) and are analyzed with a histogram between 0 and 15 dB with 30 equally spaced bins. Figure 6 depicts the resulting cumulative density function (CDF), which is defined for each band as

where $c_j$ is the number of elements in bin $j$, and $i$ is the histogram bin index.

When comparing the subfigures depicted in Fig. 6, it can be observed that above the aliasing frequency $f_{N_A} \approx 1.9$ kHz, the ILD errors increase with frequency. Whereas for LS and DEQ the distribution of absolute ILD errors is similar, rendering using the CGG approach shows the highest, and rendering using the TAC approach the lowest, overall ILD errors.
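The histogram-based CDF described above can be sketched in a few lines; the function name and the return convention (upper bin edges plus cumulative fractions) are our assumptions.

```python
import numpy as np

def ild_error_cdf(abs_errors_db, n_bins=30, max_db=15.0):
    """Cumulative density of absolute ILD errors: a histogram between 0 and
    max_db with n_bins equal bins, then a normalized cumulative sum
    (CDF_i = sum_{j<=i} c_j / sum_j c_j), as used for Fig. 6. Sketch only."""
    counts, edges = np.histogram(abs_errors_db, bins=n_bins, range=(0.0, max_db))
    cdf = np.cumsum(counts) / counts.sum()
    return edges[1:], cdf
```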

### C. Diffuse-field response and interaural coherence

In order to compare the algorithms for rendering of diffuse sound fields, the main- and off-diagonal elements of the diffuse-field covariance matrix as defined in Eq. (18) are compared in Figs. 7(a) and 7(b), respectively. While the proposed TAC approach yields the same diffuse-field behavior as the reference set (due to the constraint), all other approaches show deviations. The diffuse-field response of the LS renderer clearly indicates the discussed low-pass behavior. Results for DEQ and CGG show improvements, but above the aliasing frequency we observe colorations, cf. Fig. 7(a).

According to Menzer (2010) the interaural coherence (IC) is defined as

where $r^{lr}(\omega)$, $r^{ll}(\omega)$, and $r^{rr}(\omega)$ are defined in Eq. (18). The IC of the tested rendering methods is depicted in Fig. 7(b). Again, TAC yields the same behavior as the reference, while the IC for LS and DEQ (identical IC) and for CGG shows significant deviations from the reference.
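The IC can be computed directly from the entries of the 2×2 diffuse-field covariance matrix. The sketch below assumes the usual normalized cross-correlation form $|r^{lr}| / \sqrt{r^{ll} r^{rr}}$; Menzer's (2010) exact definition may differ in detail (e.g., taking the real part rather than the magnitude), and the function name is ours.

```python
import numpy as np

def interaural_coherence(R):
    """IC from a 2x2 diffuse-field covariance matrix [cf. Eq. (18)]:
    IC = |r_lr| / sqrt(r_ll * r_rr). Sketch of the normalized
    cross-correlation form."""
    return np.abs(R[0, 1]) / np.sqrt(R[0, 0].real * R[1, 1].real)
```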

According to Gabriel and Colburn (1981), Pollack and Trittipoe (1959), and Stern *et al.* (2006), just noticeable differences (JNDs) of interaural correlation values change depending on the source frequency, bandwidth, and reference condition; findings indicate a JND of 0.08 for a reference condition with an interaural correlation of 1, and a JND of 0.35 for a reference condition with an interaural correlation of 0, respectively. The JNDs for intermediate reference conditions lie between 0.08 and 0.35 (Kim *et al.*, 2008).^{1} As the IC deviations exceed the JNDs in certain frequency ranges, an altered spaciousness or envelopment is expected for the LS, DEQ, and CGG rendering methods, especially for orders $N_A \le 3$.

## V. LISTENING EXPERIMENTS

In order to study and compare the perceptual aspects of binaural rendering using the TAC and state-of-the-art methods, formal listening experiments were conducted.

### A. Methodology

Test participants were asked to rate the overall difference between a reference [rendered according to Eq. (11) with $N_A = N_H = 30$] and the test signals on a scale from *no audible difference* to *severe difference*. A hidden reference was used for screening of ratings, and thus the test procedure can be described as MUSHRA-like [multi-stimulus test with hidden reference and anchor (ITU-R, 1997)]. The presented test signals were continuously looped, and participants were allowed to seamlessly switch between signals in real time as often as desired.

The Ambisonic signals are obtained by a convolution of a monophonic source signal with a room impulse response (RIR) in the SH domain. For the simulation of the RIRs, we used the multichannel room acoustics simulation toolbox (McRoomSIM) (Wabnitz *et al.*, 2010) for a shoebox room of dimensions $9.5 \times 12 \times 4.2$ m, with a mean absorption coefficient $\alpha = 0.2360$, a mean $T_{60} = 0.8$ s, and a source/listener setup as depicted in Fig. 8. The listener at position $[3.5, 3, 1.7]$ m is facing towards the positive *x*-axis, and the omnidirectional source is positioned relative to the listener as defined by the evaluation angle $\Omega_q$ on a radius $r_q = 1.5 r_c$, where $r_c = 1.39$ m is the critical distance. The tested discrete source directions are $\Omega_q = (0°, 0°)$, $\Omega_q = (90°, 0°)$, $\Omega_q = (35°, 45°)$, and $\Omega_q = (-45°, 0°)$. The perceptual evaluation is segmented into three experiments.
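As a plausibility check of the reported geometry, the common approximation $r_c \approx 0.057 \sqrt{V / T_{60}}$ for an omnidirectional source reproduces the quoted critical distance; note that the article does not state which formula was used, so this cross-check is our assumption.

```python
import math

# Room parameters as given in the text.
V = 9.5 * 12 * 4.2          # room volume in m^3
T60 = 0.8                   # reverberation time in s

# Common critical-distance approximation (omnidirectional source).
r_c = 0.057 * math.sqrt(V / T60)   # expected to be close to the quoted 1.39 m
r_q = 1.5 * r_c                    # source radius used in the experiments
```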

##### a. Experiment I.

A speech signal and the direct part of the RIR were used for computing the test signals at all four test directions $\Omega_q$. The tested methods were (i) LS, (ii) DEQ, (iii) CGG, and (iv) TAC for orders $N_A = 1$ and $N_A = 5$.

##### b. Experiment II.

The test signals were a speech signal and a drum loop (kick drum, snare drum, cymbals). In order to evaluate the performance of the algorithms in reflective environments, the entire simulated RIR (direct, early reflections, and diffuse part) for all four test directions $\Omega_q$ was used. The tested algorithms include (i) DEQ, (ii) CGG, and (iii) TAC for orders $N_A = 1$, $N_A = 3$, and $N_A = 5$.

##### c. Experiment III.

The dependence of the overall quality on the order $N_A = [1, 2, 3, 4, 5, 6, 9, 12, 15]$ was tested for the TAC method only. We used the entire RIR, the drum signal (as it is more complex), and $\Omega_q = (0°, 0°)$ and $\Omega_q = (90°, 0°)$ for testing.

Overall, 14 test pages were presented and the order of test signals within one page as well as the order of test pages were randomized. Depending on the experiment, the nine participants (expert listeners, no hearing impairments) were asked to rate the perceived overall difference on a continuous scale from no difference to severe difference for 9–10 test signals per page.

In order to ensure equal listening conditions for all participants and test signals, no head-tracking was used. This is valid as participants rated the difference to a reference and not the localization or externalization of stimuli. The test signals were played back via AKG-702 (half-open) headphones, and equalization according to Schärer and Lindau (2009) was used.

### B. Results and discussion

The medians and 95% confidence intervals of all ratings (for all four test directions $\Omega_q$) from Experiment I are depicted in Fig. 9. The results indicate a clear perceptual improvement for higher orders and that the proposed method (TAC) overall outperforms the other tested methods. The *p*-values of a Kruskal-Wallis test (Kruskal and Wallis, 1952) presented in Table I indicate that there are five groups that are significantly different from each other.

| method | LS/1 | LS/5 | DEQ/1 | DEQ/5 | CGG/1 | CGG/5 | TAC/1 | TAC/5 |
|---|---|---|---|---|---|---|---|---|
| LS/1 | 1.000 | | | | | | | |
| LS/5 | 0.000 | 1.000 | | | | | | |
| DEQ/1 | 0.000 | 0.333 | 1.000 | | | | | |
| DEQ/5 | 0.000 | 0.171 | 0.006 | 1.000 | | | | |
| CGG/1 | 0.000 | 0.800 | 0.123 | 0.096 | 1.000 | | | |
| CGG/5 | 0.000 | 0.001 | 0.000 | 0.042 | 0.000 | 1.000 | | |
| TAC/1 | 0.000 | 0.053 | 0.000 | 0.857 | 0.005 | 0.009 | 1.000 | |
| TAC/5 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |


The groups in ascending order of quality are (i) LS/1; (ii) DEQ/1, CGG/1, and LS/5; (iii) TAC/1 and DEQ/5; (iv) CGG/5; and (v) TAC/5. Overall, the TAC method yields the least perceptual differences to the reference for all tested orders, and results for $N_A = 1$ are comparable to results for LS and DEQ using $N_A = 5$.

The detailed results (all test directions $\Omega_q$ separately) for Experiment I are shown in Fig. 10, where it can be seen that the performance of most methods varies with the source direction $\Omega_q$. As the most pronounced coloration of the LS approach is observed for frontal directions [see Fig. 10(a)], it is rated as having a severe difference to the reference. On the other hand, the DEQ approach shifts the coloration from frontal to lateral directions and thus performance is worst for $\Omega_q = (90°, 0°)$ [see Fig. 10(b)]. The CGG approach can give an improvement, but the performance is still highly dependent on the source direction [compare Figs. 10(a)–10(c)]. Results for the TAC method not only show an overall improvement, but also the least variation across the different test directions.

The overall results for Experiment II are depicted in Fig. 11. Note that the results for the two different source signals (speech and drum loop) and all source directions $\Omega_q$ are pooled. We observe a similar behavior as for Experiment I, namely an improvement with increasing order $N_A$, and the best performance for the TAC method. The groups in ascending order of quality are (i) DEQ/1 and CGG/1; (ii) TAC/1, DEQ/3, CGG/3, DEQ/5, and CGG/5; (iii) TAC/3; and (iv) TAC/5. While ratings for the TAC method are significantly different across the tested orders $N_A = [1, 3, 5]$, there is no distinct difference between DEQ and CGG for orders $N_A = [3, 5]$; see Table II. Moreover, ratings for TAC with $N_A = 1$ are similar to those for DEQ and CGG with orders $N_A = 3$ and $N_A = 5$. The results per source signal are depicted in Figs. 12 and 13. Due to the transient and broadband nature of the drum signal (strong components above the aliasing frequency), the overall quality ratings are worse than for the speech signal. However, the TAC method shows a smaller dependency on the source signal than the other tested methods.

method | DEQ/1 | DEQ/3 | DEQ/5 | CGG/1 | CGG/3 | CGG/5 | TAC/1 | TAC/3 | TAC/5
---|---|---|---|---|---|---|---|---|---
DEQ/1 | 1.000 | | | | | | | |
DEQ/3 | 0.000 | 1.000 | | | | | | |
DEQ/5 | 0.000 | 0.994 | 1.000 | | | | | |
CGG/1 | 0.645 | 0.000 | 0.000 | 1.000 | | | | |
CGG/3 | 0.000 | 0.084 | 0.138 | 0.000 | 1.000 | | | |
CGG/5 | 0.000 | 0.002 | 0.004 | 0.000 | 0.114 | 1.000 | | |
TAC/1 | 0.000 | 0.084 | 0.143 | 0.000 | 0.952 | 0.105 | 1.000 | |
TAC/3 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
TAC/5 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.008 | 1.000
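The paper does not state which statistical test produced the pairwise *p*-values in Tables II and III; a common choice for such paired listener ratings is the two-sided Wilcoxon signed-rank test. The following sketch, using hypothetical ratings for three conditions, shows how such a *p*-value matrix could be assembled (the ratings themselves are made up for illustration):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Hypothetical quality ratings: 20 listeners x 3 conditions (0..100 scale)
ratings = {
    "TAC/1": rng.normal(70, 10, 20),
    "TAC/3": rng.normal(80, 10, 20),
    "TAC/5": rng.normal(90, 8, 20),
}
names = list(ratings)
n = len(names)

# Lower-triangular matrix of pairwise p-values (diagonal fixed at 1.0,
# mirroring the layout of Tables II and III)
p = np.ones((n, n))
for i in range(n):
    for j in range(i):
        # Paired two-sided Wilcoxon signed-rank test between two conditions
        p[i, j] = wilcoxon(ratings[names[i]], ratings[names[j]]).pvalue
```

In practice one would correct the resulting *p*-values for multiple comparisons (e.g., Holm-Bonferroni) before declaring significance at a fixed threshold.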

The overall results for Experiment III are depicted in Fig. 14 for the two tested directions $\Omega_q=(0°,0°)$ and $\Omega_q=(90°,0°)$. As expected, the maximum order *N_{A}* required to achieve near-transparent rendering changes with the source direction. The *p*-values listed in Table III indicate that for frontal sources an order of *N_{A}* = 4 is sufficient (no significant difference to higher orders), but for lateral directions an order of *N_{A}* = 9 is required; see Fig. 14. As time alignment of HRIRs reduces the SH order required for their representation (see Fig. 4) to $N_{H}^{\circ}=15$, testing of higher orders is not necessary.

_{A}method . | TAC/1 . | TAC/2 . | TAC/3 . | TAC/4 . | TAC/5 . | TAC/6 . | TAC/9 . | TAC/12 . | TAC/15 . |
---|---|---|---|---|---|---|---|---|---|

TAC/1 | 1.000 | 0.125 | 0.011 | 0.003 | 0.004 | 0.017 | 0.004 | 0.004 | 0.003 |

TAC/2 | 0.013 | 1.000 | 0.142 | 0.047 | 0.096 | 0.116 | 0.031 | 0.039 | 0.024 |

TAC/3 | 0.009 | 0.848 | 1.000 | 0.141 | 0.277 | 0.333 | 0.061 | 0.109 | 0.040 |

TAC/4 | 0.018 | 0.655 | 0.655 | 1.000 | 0.479 | 0.947 | 0.465 | 0.647 | 0.327 |

TAC/5 | 0.002 | 0.225 | 0.3062 | 0.142 | 1.000 | 0.647 | 0.267 | 0.245 | 0.174 |

TAC/6 | 0.002 | 0.025 | 0.073 | 0.035 | 0.482 | 1.000 | 0.620 | 0.946 | 0.838 |

TAC/9 | 0.002 | 0.013 | 0.018 | 0.013 | 0.125 | 0.179 | 1.000 | 0.733 | 0.838 |

TAC/12 | 0.002 | 0.002 | 0.003 | 0.003 | 0.006 | 0.009 | 0.305 | 1.000 | 0.894 |

TAC/15 | 0.002 | 0.002 | 0.002 | 0.002 | 0.004 | 0.004 | 0.089 | 0.077 | 1.000 |

method . | TAC/1 . | TAC/2 . | TAC/3 . | TAC/4 . | TAC/5 . | TAC/6 . | TAC/9 . | TAC/12 . | TAC/15 . |
---|---|---|---|---|---|---|---|---|---|

TAC/1 | 1.000 | 0.125 | 0.011 | 0.003 | 0.004 | 0.017 | 0.004 | 0.004 | 0.003 |

TAC/2 | 0.013 | 1.000 | 0.142 | 0.047 | 0.096 | 0.116 | 0.031 | 0.039 | 0.024 |

TAC/3 | 0.009 | 0.848 | 1.000 | 0.141 | 0.277 | 0.333 | 0.061 | 0.109 | 0.040 |

TAC/4 | 0.018 | 0.655 | 0.655 | 1.000 | 0.479 | 0.947 | 0.465 | 0.647 | 0.327 |

TAC/5 | 0.002 | 0.225 | 0.3062 | 0.142 | 1.000 | 0.647 | 0.267 | 0.245 | 0.174 |

TAC/6 | 0.002 | 0.025 | 0.073 | 0.035 | 0.482 | 1.000 | 0.620 | 0.946 | 0.838 |

TAC/9 | 0.002 | 0.013 | 0.018 | 0.013 | 0.125 | 0.179 | 1.000 | 0.733 | 0.838 |

TAC/12 | 0.002 | 0.002 | 0.003 | 0.003 | 0.006 | 0.009 | 0.305 | 1.000 | 0.894 |

TAC/15 | 0.002 | 0.002 | 0.002 | 0.002 | 0.004 | 0.004 | 0.089 | 0.077 | 1.000 |

## VI. CONCLUSION

In this paper, we presented an improved method for binaural rendering of low-order Ambisonic signals ($N_A \le 5$). The proposed binaural renderer is computed using a frequency-dependent time alignment of HRIRs, followed by a minimization of the squared error subject to a diffuse-field covariance matrix constraint (TAC). Due to the time alignment, lower SH orders suffice to represent the directivity patterns of the ears at higher frequencies, while the covariance constraint ensures that sound scenes rendered with the TAC method exhibit the same diffuse-field behavior as scenes rendered with the original high-order HRIRs.
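The core idea of the frequency-dependent time alignment can be illustrated with a minimal sketch: remove each HRIR's broadband time of arrival only above a crossover frequency, so that the low-frequency phase (and hence the ITD) is retained while the high-frequency part of the response becomes less direction-dependent. This is a simplified illustration, not the paper's exact procedure; the onset-threshold time-of-arrival estimate, the hard crossover, and the default cutoff of 1.5 kHz are assumptions:

```python
import numpy as np

def time_align(hrir, fs, f_c=1500.0, toa=None):
    """Remove the broadband time of arrival from an HRIR above f_c only,
    keeping the original phase (and thus the ITD) at low frequencies.

    Simplified sketch: hard crossover at f_c, crude onset-based TOA estimate.
    """
    n = len(hrir)
    H = np.fft.rfft(hrir)
    f = np.fft.rfftfreq(n, 1.0 / fs)
    if toa is None:
        # Crude onset estimate: first sample exceeding 10% of the peak magnitude
        toa = np.argmax(np.abs(hrir) > 0.1 * np.max(np.abs(hrir)))
    # Linear-phase term that advances the response by toa samples
    shift = np.exp(2j * np.pi * f * toa / fs)
    # Apply the advance only above the crossover frequency
    H_aligned = H * np.where(f >= f_c, shift, 1.0)
    return np.fft.irfft(H_aligned, n)
```

Applied to every HRIR of a set, this lowers the effective spatial order needed at high frequencies, which is the property the TAC renderer exploits.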

Technical evaluations and comparisons to state-of-the-art methods indicate that the proposed TAC method reduces the direction-dependent coloration and the ILD errors, and improves the diffuse-field behavior.

In the perceptual evaluation, we tested the overall difference to a reference (rendered with order *N_{A}* = 30) for four source directions in a free-field condition and in a simulated room. The results of the TAC method show a significant improvement of the overall quality for all tested directions, as well as the smallest direction-dependent quality variation among the tested methods. Furthermore, we found that the rendering order *N_{A}* can be reduced significantly for the TAC method while achieving quality ratings similar to those of other binaural rendering methods for Ambisonic signals. Ratings of the auralization in a simulated room indicate that the proposed method using *N_{A}* = 1 achieves results comparable to the other tested methods using *N_{A}* = 5.

As the TAC method shows little direction-dependent quality variation, we expect improved externalization and localization performance. Future work therefore includes testing the proposed method in a dynamic binaural rendering setup, where localization accuracy, externalization, and spaciousness are evaluated separately.

^{1}The JNDs are defined for the single-valued interaural correlation, and thus for broadband signals. We assume that the frequency-dependent IC is an indicator of the interaural correlation for narrow-band signals.
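The frequency-dependent IC referred to here can be estimated from the cross- and auto-power spectra of the two ear signals, IC(f) = |S_LR(f)| / sqrt(S_LL(f) S_RR(f)). A minimal sketch using Welch estimates on synthetic, partially correlated ear signals (the signal model and window length are illustrative assumptions, not taken from the paper):

```python
import numpy as np
from scipy.signal import csd, welch

fs = 48000
rng = np.random.default_rng(1)

# Hypothetical binaural signals: a shared component plus independent noise
# at each ear, giving a theoretical coherence of 1 / 1.25 = 0.8 per bin
common = rng.standard_normal(fs)
left = common + 0.5 * rng.standard_normal(fs)
right = common + 0.5 * rng.standard_normal(fs)

# Welch estimates of the cross- and auto-power spectral densities
f, S_lr = csd(left, right, fs=fs, nperseg=1024)
_, S_ll = welch(left, fs=fs, nperseg=1024)
_, S_rr = welch(right, fs=fs, nperseg=1024)

# Frequency-dependent interaural coherence
ic = np.abs(S_lr) / np.sqrt(S_ll * S_rr)
```

Averaging this quantity within auditory bands would yield the narrow-band indicator mentioned in the footnote.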