Binaural room responses are normally measured on a listening subject in a room. The measurements, however, rapidly change with the source and receiver position. In addition, the measurements taken in a room can only be used to simulate scenes of that environment. In this work, an efficient parameterization of the binaural room transfer function is proposed, which has separable representations for the direct-path component and the reverberation component, thus providing a flexible way to generate binaural room responses for different environments and listeners. In addition, this parameterization uses wave equation solutions as basis functions and is continuous in space.
1. Introduction
Binaural recording and rendering refer specifically to recording and reproducing sounds at two ears, which has become one of the key technologies in virtual reality, augmented reality, and 3D rendering (Bimber and Raskar, 2005; Burdea and Coiffet, 2003). In such systems, not only the directional information, or the cues for localization, but also the acoustic environment, have to be accurately reproduced for a realistic perception of the sound in 3D space. The former is characterized by the individualized head-related transfer function (HRTF) while the latter is described by the room impulse response (RIR) or the room transfer function (RTF).
It is widely adopted to measure the binaural room impulse response (BRIR) on a listening subject or a human mannequin for the most natural auditory rendering. However, there are two particular issues associated with this method: (1) the measurements rapidly change with the source/receiver position and are discrete in space; (2) the measurements taken in a particular room can only be used to simulate scenes of that environment.
It is desirable to characterize the HRTF and room responses separately so that a more flexible method can be developed to form BRIR [or binaural RTF (BRTF)] for different environments and listeners. In the work of Menzer et al. (2011), the B-format impulse responses are used to generate the BRIR, where the HRTF corresponding to each direct sound part of the B-format RIR is applied with a linear combination of the reflections part. This method still requires a DOA (direction-of-arrival) estimation of the direct path. In the work of Kearney et al. (2009b), the higher-order Ambisonic scheme is adopted and the HRIRs corresponding to the virtual speaker positions are used to generate binaural signals. Interpolation of RIR is implemented to synthesize the room responses, where the dynamic time warping (DTW) method was developed to temporally align only the direct path and sparse early reflection parts from two adjacent measurements (Kearney et al., 2009a) with its tail (the diffuse decay part) synthesized using statistical time-frequency room models. This method requires the early reflections to be sparse, and thus cannot be used for BRIR synthesis in small rooms. In addition, this method is computationally expensive when more measurements are involved, such as measurements on 2D plane or 3D sphere. A time-frequency domain implementation of DTW was recently proposed to improve the quality and computational efficiency (Garcia-Gomez and Lopez, 2018).
In this work, we introduce an efficient parameterization of the binaural room transfer function, which has separable HRTF/RTF representations and is continuous in space. Following Zhang et al. (2010) and Samarasinghe et al. (2015), this parameterization uses the eigen-solutions of the acoustic wave equation, i.e., spatial harmonics, to decompose soundfields. We demonstrate the accuracy of the proposed parameterization by comparing it with synthetic and measured HRTFs in simulated rooms.
2. Parametrization of the binaural room transfer function
The binaural room transfer function is decomposed into the direct-path part and reverberation part. We then develop parameterization for each term as a weighted sum of 3D spatial basis functions. The model is continuous in space and in particular the direct-path part and reverberation part are separated thus provides a flexible way to generate the BRTF.
2.1 HRTF decomposition
The HRTF, i.e., the direct path of the BRTF, is normally measured by placing the source at a variety of positions on a sphere of certain radius (Cheng and Wakefield, 2001). It is regarded as a function of the source position and is decomposed using spherical harmonics as follows (Evans et al., 1998):
where represents the source position in 3D space with the distance, elevation, and azimuth, respectively. k = 2πf/c is the wave number, where f is the frequency and c is the speed of sound. represents the spherical harmonics of order n and degree m. The notation represents the left ear or right ear, which is omitted in the following equations for simplicity.
This model can be further extended based on wave propagation theory and using wave-equation solutions (Williams, 1999; Zhang et al., 2010), i.e.,
where is the spherical Hankel function of the first kind to represent the radial part, and are the decomposed coefficients corresponding to an individualized HRTF. The original derivation in Zhang et al. (2010) is based on acoustic reciprocity principle, where the source position and receiver position (i.e., the listener's ear) are interchanged, and the scattering effects due to human head and body are modelled as secondary level sources. The primary source at the listener's ear and secondary level sources together constitute an effective source field that is included in a sphere and based on which the HRTF modal decomposition was developed.
In this work, we propose a new approach to interpret the HRTF model (2) so that its integration with the modal-domain RTF is straightforward. Here, we view human head and torso as a continuous aperture that feeds sound waves impinging on its surface to the two ears. Instead of modelling the exact shape of the head and torso we consider an equivalent spherical aperture of an appropriate radius with a continuous aperture function at wavenumber k where . Note that the equivalent (theoretical) aperture function encapsulates the scattering and diffraction by the head and torso. The response of this theoretical aperture to a source located at position x can be written as
i.e., the surface integral (sum) of the aperture function multiplied by the signal received at point y on the aperture from the source at x. Note that, the radius of the equivalent spherical aperture is chosen as . An appropriate choice of s is given in the work of Zhang et al. (2010), which determines the bound on the spherical harmonics expansion order of the HRTF.
The aperture response can be decomposed by spherical harmonics as
The radiation pattern of the sound source is given by the Green's function, i.e.,
where is the spherical Bessel function of order n and represents the complex conjugate.
By substituting Eqs. (4) and (5) into Eq. (3) and based on the orthogonality property of spherical harmonics on a sphere, we have
This is equivalent to Eq. (2) with
and the infinite summation in Eq. (6) truncated to a finite order Nh. The only difference is that Eq. (2) interprets the HRTF as an outgoing field due to a directional source, and Eq. (7) views the HRTF as a directional receiver with an aperture function given by . Both approaches are valid due to the acoustic reciprocity principle for directional transducers (Samarasinghe et al., 2017). Note that Eq. (7) is only for theoretical derivation not used in implementation. The decomposition coefficients are obtained from HRTF measurements decomposed using Eq. (2).
2.2 Parameterization of RTF using modal decomposition
In this subsection, we use the modal decomposition method for region-to-region RTF parameterization to describe the reverberation part of the BRTF, i.e., the BRTF without the direct path.
Two spatial configurations of the source region and receiver region are considered in this work as shown in Fig. 1. In the source region ζ, an outgoing field due to a point source is written as
where denotes the outgoing mode, , and denotes the source position and receiver position, respectively. Both and z are with respect to the origin of the source region .
Using the region-to-region RTF concept (Samarasinghe et al., 2015), the reflected field in the receiver region η for each order (n, m) is
where represents room reverberation in the modal domain, i.e., room mode coupling between the source region ζ and a spatial point within the receiver region η where the human head lies.
We then use the continuous receiving aperture response to obtain the received binaural signals, that is,
where the second line is obtained by substituting Eqs. (9) and (4) into Eq. (10). The integral of Eq. (10) at the fixed radius means that to apply the region-to-region RTF with HRTF, the receiver region is also bounded by s. The number of modes to represent the receiver region should be the same as that of the HRTF data, i.e., Nh. In total, coefficients are required to describe the RTF between the Nsth order source region and the Nhth order receiver region.
Given an omni-directional point source and referring to Eq. (7), we have
By including all the modes (n, m) of the receiver region, the reflected part of the BRTF is
2.3 BRTF model
The binaural room transfer function model is parameterized as
where x denotes the source position with respect to the origin of the receiver region , and denote the same source position with respect to the origin of the source region . For the two scenarios considered as shown in Fig. 1, the source region and receiver region are concentrically positioned or non-overlapping spaced. In the first case while in the second case they are different.
The above representation requires at least coefficients of the form and coefficients of the form to describe the BRTF between the Nsth order source region and the Nhth order receiver region. While Nh is determined by the HRTF data, the order given the modeling frequency and the source region radius rs (Kennedy et al., 2007). The advantage of this formulation is that the modal coefficients of the HRTF and RTF, i.e., βnm(k) and , are separated, thus providing a flexible way to generate BRTF for different environments and listeners. In addition, this parameterization uses wave equation solutions as basis functions and is continuous in space.
In terms of calculating these coefficients, can be analytically simulated when the head is modeled as a rigid sphere (Duda and Martens, 1998) or obtained by decomposing HRTF measurements on a partial sphere according to Eq. (2). , which are used to characterize the RTF and thus obtained without human listener present, can also be measured in practice (Samarasinghe et al., 2015) or simulated for a rectangular room using the spherical harmonics based generalized image source method (Samarasinghe et al., 2018).
3. Simulations
3.1 Simulation setup
The proposed BRTF model is validated through simulation based experiments. We use two HRTF datasets. The first is the synthetic HRTF, which is generated from an ideal rigid sphere of radius 0.09 m, i.e., the spherical head model (Duda and Martens, 1998). The second is the HRTF measured on a KEMAR from the MIT media laboratory, which is a 512-tap impulse response with a sampling frequency of 44.1 kHz and has 750 measurement positions on a sphere of 1.4 m (Gardner and Martin, 1995). The room size is 5 × 7 × 2.5 m and the center of the room is defined as the origin. In our simulations, the spherical head or the KEMAR head is located at the center of the room. Based on the recent study on the perceptually-relevant features of an HRTF, a fourth-order spherical harmonics expansion is adopted in this work (Romigh et al., 2015).1 The source region is simulated as a spherical region of radius 0.25 m with two configurations investigated, i.e., concentrically (config1) or non-overlapping (config2) placed with the receiver region (or the human head) as shown in Fig. 1. The latter is centered at the point (2, 1, 0) with respect to the origin.
We evaluate the accuracy of the proposed model for BRTF reconstruction at 50 points uniformly distributed in the source region over a broadband frequency range of 10 kHz in different reverberation conditions. The performance measure is the mean square error (MSE) between the true BRTF and the reconstructed BRTF, i.e.,
The true BRTF is simulated using the image source method (Allen and Berkley, 1979) with each image source modelled as a point source and multiplied with the corresponding HRTF data [original or interpolated according to Eq. (2)]; while the reconstructed BRTF is generated using the proposed model (13), where is obtained from model decomposition of synthetic/measured HRTFs, and is simulated using the spherical harmonics based generalized image source method (Samarasinghe et al., 2018).
3.2 Simulation results
Figure 2 shows the MSE of the reconstructed BRTF over 10 kHz frequency range using the synthetic HRTF and the measured MIT data. The room is simulated with an image order of 7 and the reverberation time T60 of 144.5 ms. It can be seen that the performances are roughly the same using these two data sets and for two spatial configurations of the source region. The average MSE for the measured HRTF is slightly larger than that of the synthetic HRTF.
We further evaluate the model accuracy for BRTF reconstruction in different reverberation conditions. In this experiment, the reflection coefficient increases from 0.1 to 0.9, which corresponds to the image order increasing from 5 to 27 and the reverberation time T60 increasing from 100 to 570 ms. The source operates in a bandwidth of 10 kHz. Figure 3 shows the reconstruction error as a function of T60. The MSE increases with the reverberation time. This is due to the fact that as reverberation increases, image sources appear from a larger number of directions. The effect of higher-order spatial modes becomes more significant. Currently, we use a fourth-order spherical harmonics expansion to model the HRTF and receiver region. Higher order terms can be included to further improve model accuracy for highly reverberant environments. As given in the works of Kennedy et al. (2007) and Zhang et al. (2010), a theoretical bound is , where ku is the upper limit of the wavenumber and s is the effective receiver region radius. However, using high expansion orders increases the simulation time especially for the spherical harmonics based generalized image source method to generate the BRTF in highly reverberant environment. In practice, using high expansion orders means that a larger number of loudspeakers and microphones are required to measure the RTF coefficients, which inevitably poses implementation difficulties. This is the limitation of the proposed research.
4. Conclusion
An efficient parameterization of the binaural room transfer function has been proposed in this work. We decomposed the binaural room responses into the direct-path and reverberation components, which can be described separately by HRTF and RTF. We then developed parameterization of both terms as a weighted sum of 3D basis functions. The accuracy of the proposed parameterization was verified through simulation-based experiments by comparing with synthetic HRTF for a spherical head and measured HRTF on a KEMAR in simulated room environments.
Acknowledgments
This research was supported by the National Natural Science Foundation of China (NSFC) funding scheme under Project No. 61671380.
The higher order terms in spherical harmonics (SH) expansion of the HRTF are mainly caused by rapid phase changes at high frequencies while altering the interaural phase difference (IPD) at high frequencies is perceptually irrelevant. Pre-processing of HRTFs with phase modification has been shown to be capable to reduce the SH order for binaural rendering of Abmisonic signals (Schörkhuber et al., 2018)