In this article, the application of spatial covariance matching is investigated for the task of producing spatially enhanced binaural signals using head-worn microphone arrays. A two-step processing paradigm is followed, whereby an initial estimate of the binaural signals is first produced using one of three suggested binaural rendering approaches. The proposed spatial covariance matching enhancement is then applied to these estimated binaural signals with the intention of producing refined binaural signals that more closely exhibit the correct spatial cues as dictated by the employed sound-field model and associated spatial parameters. It is demonstrated, through objective and subjective evaluations, that the proposed enhancements in the majority of cases produce binaural signals that more closely resemble the spatial characteristics of simulated reference signals when the enhancement is applied to and compared against the three suggested starting binaural rendering approaches. Furthermore, it is shown that the enhancement produces spatially similar output binaural signals when using these three different approaches, thus indicating that the enhancement is general in nature and could, therefore, be employed to enhance the outputs of other similar binaural rendering algorithms.
I. INTRODUCTION
The binaural reproduction of sound scenes captured using wearable microphone arrays has gained renewed interest in recent years with such arrays now being integrated into head-worn devices and used for augmented and virtual reality (AR/VR) applications.1–5 In the context of hearing assistive devices, such as hearing aids, the relatively recent trend of including a data-link between devices has also prompted new proposals that take advantage of this freedom to share signals.5–7 While there are similarities between the binaural rendering algorithms intended for AR/VR devices and those intended for binaural hearing aids, it should be acknowledged that there are some differing requirements. However, it should be emphasized that one important design criterion, which is relevant to all modern head-worn devices and considered in recent related research,4–9 is the preservation of sound source localization cues. Furthermore, although such wearable devices have historically been limited in terms of hardware, it may be argued that with the introduction of a data-link in binaural hearing aids and as more sensors are integrated into future models, such devices are converging toward the high-sensor-count microphone arrays used for high-resolution spatial audio applications.
Traditionally, spherical microphone arrays (SMAs) with uniform sensor distributions have been popular for spatial audio capturing and reproduction due to their consistent spatial resolution for all directions. SMAs also allow for convenient conversions of the microphone array signals into spherical harmonic signals with numerous signal-independent proposals available for mapping these signals to the binaural channels.10–13 Other linear methods include binaural beamforming approaches.2,14 As a result of the linear mapping of signals, these methods retain high signal fidelity. However, the spatial accuracy of the reproduction is inherently limited by the number of microphones in the array. Signal-dependent binaural rendering alternatives, on the other hand, have been demonstrated to surpass linear rendering methods in terms of the perceived spatial accuracy15–18 when using the same number or fewer input channels. These methods are often built on perceptually motivated sound-field models and estimate the spatial parameters over time and frequency, subsequently using this information to map the input signals to the binaural channels in an adaptive and more informed manner. However, due to the nature of time-frequency processing, the signal fidelity of the output signals may be degraded. Furthermore, in practice, such processing is not always guaranteed to produce output signals that have the intended interchannel relationships dictated by the employed sound-field model. Acknowledging these issues, the concept of employing spatial covariance matching was proposed by Vilkamo et al.,19 which may be considered as a general framework that can be used to enhance spatial audio algorithms by posing them as optimal mixing problems. This alternative approach relies on specifying the interchannel relationships that the output signals should exhibit and working backward to determine the suitable mixing matrices to apply to binaural signals that are produced by an existing binaural rendering method. Such spatial covariance matching based solutions have been shown to attain both high spatial accuracy and signal fidelity.20–22 However, these previous works only considered the use of SMAs as input, with the application of spatial covariance matching yet to be explored in the context of microphone arrays affixed to wearable devices.
With the integration of microphone arrays becoming increasingly common in AR/VR devices and the adoption of a data-link in binaural hearing aids, there is a growing need for robust and general algorithms that can render binaural signals of high spatial accuracy for application within future devices. Given the benefits of spatial covariance matching based enhancements, as demonstrated for the capture and reproduction of sound-fields using SMAs,20–22 it is postulated that similar processing may be used for head-worn arrays. Since such arrays have sensors nonuniformly arranged over irregular and comparatively larger geometries, existing spherical harmonic domain solutions would be limited to a narrow operating bandwidth.
Therefore, in this article, spatial covariance matching19 is explored for the task of enhancing the binaural rendering of head-worn microphone arrays. A sound-field model comprising a single source and an isotropic diffuse component per time-frequency tile, as used previously in Refs. 15, 18, and 23, serves as the foundation for this study. Three starting binaural rendering methods, which are inspired by hearing aid related literature,24–27 are formulated and used to produce initial estimates of the binaural signals. These methods are based solely on signal-domain operations and may, therefore, not be able to produce binaural signals that fully conform to the employed sound-field model, owing, for example, to frequency-dependent variations of the employed beamformers and/or their handling of diffuse components in the captured sound scene. It is then upon these initial estimates of the binaural signals that the proposed covariance domain enhancements are applied to obtain refined estimates of the binaural signals. These refined binaural signals aim to more closely match the employed sound-field model and should, therefore, better reproduce the intended spatial cues.
The evaluation of the proposed enhancement involved the construction of an eight-sensor microphone array, which was affixed to the temples of a pair of eyeglasses. The array was then mounted on a dummy head and subsequently measured in a free-field environment. This permitted the simulation of reference binaural signals using the head-related transfer functions (HRTFs) of the dummy head, while the measured array transfer functions were used to simulate the corresponding array recordings, which were then passed through the rendering algorithms under test. Next, objective evaluations were performed based on a single source in a diffuse-field with varying ratios, followed by subjective listening tests of multisource scenarios with and without simulated room reflections. The results for both of the evaluations indicate that, when applied to and compared with the initial binaural renders, the proposed enhancements produce binaural signals that more closely resemble the reference binaural signals in the majority of cases.
This article is organized as follows. Section II provides background literature regarding binaural rendering methods intended for hearing assistive and AR/VR devices. Section III details the sound-field model employed for this study. Section IV describes the proposed spatial covariance matching based enhancements, which may be applied to the output signals of the three suggested rendering approaches detailed in Sec. V. Information pertaining to the constructed eight-sensor microphone array is provided in Sec. VI, which is used for the evaluations described in Sec. VII. The evaluation results and discussions are given in Sec. VIII, and the article is concluded in Sec. IX.
II. BACKGROUND
A. Binaural rendering in hearing assistive devices
Within the long-established and vast body of literature surrounding hearing aid processing,6,28 there are references to a number of proposals for rendering the signals of head-worn microphone arrays. Many of the approaches cited tend to focus only on enhancing the signal-to-noise ratio (SNR) with the primary requirements being to improve speech intelligibility29 and reduce cognitive listening effort.30 Blind source separation31 and multichannel Wiener filtering32 are examples of SNR enhancing algorithms that are well established in practice for monaural or bilateral hearing aid devices. These algorithms, however, have also been shown to lead to degradations in signal quality28 and often do not seek to preserve the spatial attributes of the original sound scene.33 In the context of monaural and bilateral hearing aids, however, the benefits arising due to the improved SNR are generally deemed to outweigh these drawbacks.
However, owing to the introduction of a data-link between a pair of modern hearing aid devices, collectively referred to as a binaural hearing aid, the primary design goals for newer devices have gravitated more toward enhancing the SNR and preserving the localization cues.7,8,24,34,35 Many of the algorithms employed are based on the use of relative transfer functions (RTFs),36 which, in the free-field case, refer to the array steering vectors aligned to two reference sensor positions located near the left and right ears. Spatial filters (also known as beamformers) may be steered toward sound sources from the perspective of each reference sensor and routed to the respective ear canals of the listener. The binaural minimum variance distortionless response (MVDR) algorithm25 is one example of an applicable beamformer for this task. Not only is the SNR enhanced by such processing, but the interaural time difference (ITD) cues are inherently preserved due to the physical location of the two reference sensors. Additionally, many of the interaural level difference (ILD) cues that arise as a result of head-shadowing effects are preserved. The application of binaural linearly constrained minimum variance (LCMV) beamformers may then extend this cue preservation to also encompass interfering sound sources,5,37 which can lead to improved speech intelligibility in multi-speaker scenarios due to the spatial release from masking.38–41 Other localization cue preserving proposals include those based on multichannel Wiener filtering.9
Many of these aforementioned approaches, however, do not preserve the monaural localization cues unless the reference sensors are located near the entrance of the listener's ear canals or HRTFs are included as gain constraints during beamforming.42 In practice, the preservation of monaural cues may be considered less important in the context of assistive hearing devices since meaningful pinna interactions occur above 6 kHz,43 which may be above the detection threshold of the hearing impaired listener. The spatial attributes of other components in the sound scene, such as reverberation and weakly directional sounds, are also rarely addressed as they directly conflict with the SNR enhancement requirement. Furthermore, when rendering the output binaural signals, retaining a high sound quality is still considered to be less important in the hearing aid processing context when compared to improving speech intelligibility. The addition of more microphones, and/or microphones of higher quality, can, however, help alleviate such issues. The importance of producing spatially accurate auralizations of sound scenes with hearing aid devices was highlighted by Best et al.,44 who drew specific attention to the fact that sound externalization has been overlooked in the hearing aid research literature. This study aims to contribute to this discussion by offering a formal computational framework for rendering sound scenes for hearing aid users, which is easily augmentable for future perceptual studies.
B. Binaural rendering in AR/VR devices
Another area that has received little scientific attention is the binaural rendering of sound scenes captured by microphone arrays integrated within AR/VR devices; this is likely because commercial devices45 have only become widely available in recent years. Nonetheless, the recent release of datasets intended for developing algorithms for such devices46 does highlight that there is growing interest in this area. Contrary to the requirements of binaural hearing aids, the retention of high signal quality is often an important requirement in the AR/VR context along with the preservation of localization cues. Additionally, appropriately reproducing the spatial attributes of reverberation present in the sound scene may be favored over increased SNR. Therefore, while binaural hearing aid algorithms could conceivably be used in the context of AR/VR systems, most proposals have relied on linear signal-independent processing2,3,14 to forgo the need for source separation and retain high signal quality. However, the spatial accuracy attained through purely linear processing is inherently limited by the number of microphones in the array.
Considering other options, one may look to parametric signal-dependent alternatives,15–18,47–49 which have been demonstrated to yield higher perceived spatial accuracy compared to their linear counterparts when using the same number of microphones or fewer. It is noted, however, that such time-varying processing can introduce signal fidelity degradations. Parametric methods based on the use of spatial covariance matching,20–22 on the other hand, have been shown to largely address such problems. These solutions rely on computing mixing matrices, which, when applied to an initial estimate of the binaural signals, aim to optimally produce output signals that conform to the specified spatial characteristics while constraining the solutions to retain high signal fidelity.19 Therefore, contrary to SNR enhancements, which are often sought after in the field of hearing processing, these spatial covariance matching solutions aim, instead, to spatially enhance the existing binaural signals. In Ref. 20, a modelless approach was proposed predicated on rendering loose approximations of binaural beamformers derived from SMAs such that they resemble, instead, the spatial selectivity of much sharper but also noisier binaural beamformers while retaining much of the original signal fidelity. The spherical harmonic domain proposals outlined in Refs. 21 and 22 instead used the approach to spatially enhance binaural Ambisonic decoders based on the use of parametric sound-field models. In Ref. 21, the model involved applying sector based processing to softly mix between multiple source estimates and an anisotropic diffuse-field. In Ref. 22, by contrast, the focus was on the application of post-filters to improve the spatial segregation achieved through source and ambient beamforming, which, without a constrained spatial covariance matching solution, would otherwise lead to a reduction in signal fidelity.
Spatial covariance matching based enhancements, however, have not yet been explored within the context of head-worn devices, where the microphones are typically mounted on nonspherical geometries with nonuniform sensor placements. Given additional practical limitations regarding the number of available sensors, which are also spaced more widely apart relative to compact SMA sensor arrangements, existing spherical harmonic domain solutions would be heavily bandwidth limited, and the patterns of space-domain beamformers may vary greatly with direction. Therefore, this article is differentiated from the aforementioned past works through the formulation of a spatial enhancement that is specifically intended for head-worn devices. Three binaural rendering methods are devised, which are inspired by hearing aid related literature,24–27 and used to acquire initial estimates of binaural signals based on head-worn microphone array signals. It is demonstrated how space-domain spatial parameter analysis may be conducted to construct target binaural covariance matrices corresponding to a sound-field model comprising a single source mixed with an isotropic diffuse-field. The study also involves an in-depth objective and subjective evaluation of the approach in the context of using a makeshift head-mounted microphone array comprising eight sensors, which represents a potential configuration for future devices.
III. SOUND-FIELD MODEL
It is assumed that the sound-field is captured via a head-mounted array of $M$ microphones worn by the listener. The array signals are then transformed into the time-frequency domain, $\mathbf{x}(t,f) \in \mathbb{C}^{M}$, where $f$ denotes the frequency and $t$ is the downsampled time index. In practice, a short-time Fourier transform (STFT) or a perfect/near-perfect reconstruction filterbank may be employed for this task. For each time-frequency tile, it is assumed that the sound-field may comprise a single dominant source component, $s(t,f)$, an ambient component, $\mathbf{d}(t,f)$, encapsulating isotropic diffuse noise and reverberation, or a combination of the two. The array signal vector may, therefore, be expressed as

$$\mathbf{x}(t,f) = \mathbf{a}(f,\Omega_s)\,s(t,f) + \mathbf{d}(t,f), \quad (1)$$

where $\mathbf{a}(f,\Omega_s) \in \mathbb{C}^{M}$ is the array steering vector for a sound source incident from the direction $\Omega_s$. Note that the array steering vectors may be obtained through free-field measurements or simulations of the array while it is worn by the listener/manikin or modeled analytically by approximating the listener's head as a sphere.50,51 It is, henceforth, assumed that array steering vectors, $\mathbf{A}(f) = [\mathbf{a}(f,\Omega_1),\dots,\mathbf{a}(f,\Omega_K)]$, are available for a dense grid of $K$ directions $\{\Omega_1,\dots,\Omega_K\}$.
Assuming that the source signals are uncorrelated with the diffuse noise and reverberation, the array signal statistics may be expressed via their spatial covariance matrix (SCM) as

$$\mathbf{C}_{\mathbf{x}}(t,f) = \mathrm{E}[\mathbf{x}\mathbf{x}^{\mathrm{H}}] = \sigma_s^2\,\mathbf{a}(\Omega_s)\mathbf{a}^{\mathrm{H}}(\Omega_s) + \mathbf{C}_{\mathbf{d}}, \quad (2)$$

where $\mathrm{E}[\cdot]$ denotes the expectation operator, $\sigma_s^2$ is the source signal power, and $\mathbf{C}_{\mathbf{d}}$ is the SCM of the ambient component.
Note that this assumption of a single source mixed with diffuse sound, although simplistic, is often met during practical scenarios provided that the frequency resolution of the transform is sufficiently high and the sound sources present in the scene are sufficiently sparse in frequency and/or over time.
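To make the preceding model concrete, the following minimal NumPy sketch (NumPy is also assumed for all later sketches) simulates one frequency bin of the assumed sound-field and forms a sample estimate of the SCM of Eq. (2). The array geometry, steering vectors, and signal statistics here are hypothetical stand-ins rather than the measured quantities used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, T = 8, 360, 1000   # microphones, grid directions, time frames

# Hypothetical steering vectors for one frequency bin; in this study
# they would come from free-field measurements of the worn array.
A = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, (M, K)))
a_s = A[:, 90]           # a(Omega_s) for an assumed source direction

# One source plus an ambient component, per Eq. (1): x = a(Omega_s) s + d
s = (rng.standard_normal(T) + 1j * rng.standard_normal(T)) / np.sqrt(2.0)
d = 0.5 * (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T)))
x = np.outer(a_s, s) + d

# Sample estimate of the SCM of Eq. (2): C_x = E[x x^H]
C_x = (x @ x.conj().T) / T
```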
IV. PROPOSED SPATIAL COVARIANCE MATCHING BASED ENHANCEMENT
In this section, the spatial covariance matching framework under consideration is formulated and applied for the task of enhancing an initial estimate of binaural signals, henceforth referred to as baseline binaural signals, which are obtained with

$$\hat{\mathbf{y}}(t,f) = \mathbf{Q}(f)\,\mathbf{x}(t,f), \quad (3)$$

where $\mathbf{Q}(f) \in \mathbb{C}^{2 \times M}$ is the baseline mixing matrix. Suitable candidates for this matrix will be introduced in Sec. V and evaluated later (both with and without the proposed enhancement) in Sec. VII. A block diagram of the overall method is also provided in Fig. 1.
The proposed enhancement is based on the idea that the narrow-band SCMs of the output binaural signals should ideally match those of the target SCMs, which are derived directly through the employed sound-field model. Continuing from the assumptions laid down thus far and by describing the balance between direct and diffuse components using a diffuseness term $\psi(t,f) \in [0,1]$, the target narrow-band binaural SCMs are given as

$$\mathbf{C}_{\mathbf{y}}(t,f) = P_{\mathrm{total}}\left[(1-\psi)\,\mathbf{h}(\Omega_s)\mathbf{h}^{\mathrm{H}}(\Omega_s) + \psi\,\boldsymbol{\Gamma}\right], \quad (4)$$

where $P_{\mathrm{total}}$ is the total input signal power; $\mathbf{h}(\Omega_s) \in \mathbb{C}^{2}$ is the HRTF corresponding to the source direction; $\boldsymbol{\Gamma} = \mathbf{H}\mathbf{W}\mathbf{H}^{\mathrm{H}}$ is a binaural diffuse coherence matrix (DCM), which is derived from a dense grid of HRTF measurements $\mathbf{H} = [\mathbf{h}(\Omega_1),\dots,\mathbf{h}(\Omega_K)]$; and $\mathbf{W}$ is an optional diagonal matrix of integration weights to account for a nonuniform measurement grid. Note that the inclusion of the binaural DCM serves to enforce the diffuse isotropic properties of the nondirect sounds by imposing the appropriate interaural coherence (IC) cues that would be experienced by the listener while under such conditions.52 Furthermore, it is noted that the direct-to-diffuse ratio (DDR), which is more commonly used in the signal enhancement literature, is directly related to the employed diffuseness measure as $\mathrm{DDR} = (1-\psi)/\psi$. The time and frequency indices are also omitted henceforth for the brevity of notation.
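A sketch of how the target binaural SCM of Eq. (4) might be assembled is given below; the trace-based normalization of the DCM is an assumed convention (so that `Gamma` carries coherence information only, with the absolute level governed by `P_total`) rather than one specified above.

```python
import numpy as np

def target_binaural_scm(P_total, psi, h_s, H, w=None):
    """Eq. (4): C_y = P_total * [(1 - psi) h h^H + psi * Gamma].

    h_s: (2,) HRTF for the analyzed source direction.
    H:   (2, K) dense grid of HRTF measurements.
    w:   (K,) optional integration weights for a nonuniform grid.
    """
    K = H.shape[1]
    w = np.full(K, 1.0 / K) if w is None else w / w.sum()
    Gamma = (H * w) @ H.conj().T             # DCM: Gamma = H W H^H
    Gamma /= 0.5 * np.real(np.trace(Gamma))  # assumed unit-diagonal scaling
    direct = np.outer(h_s, h_s.conj())
    return P_total * ((1.0 - psi) * direct + psi * Gamma)
```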
Depending on the choice of the baseline mixing matrix, the narrow-band SCMs of the baseline binaural signals may deviate from their respective target narrow-band SCMs. For example, such scenarios may arise due to beamformers encapsulating not only direct sounds but also other sound components (such as reflections) or by rendering the residual components of the scene in a manner that deviates from the isotropic and diffuse characteristics dictated by the employed model. The proposed enhancement approach is, therefore, principally tasked with determining the mixing matrices $\mathbf{M}(t,f) \in \mathbb{C}^{2 \times 2}$ to apply to the baseline binaural signals such that the resulting signals directly match the target SCMs and, consequently, also match the employed model

$$\mathbf{y}(t,f) = \mathbf{M}(t,f)\,\hat{\mathbf{y}}(t,f), \quad (5)$$

where

$$\mathbf{M}\mathbf{C}_{\hat{\mathbf{y}}}\mathbf{M}^{\mathrm{H}} = \mathbf{C}_{\mathbf{y}}, \quad \text{with} \quad \mathbf{C}_{\hat{\mathbf{y}}} = \mathbf{Q}\mathbf{C}_{\mathbf{x}}\mathbf{Q}^{\mathrm{H}}. \quad (6)$$
One option for solving this problem is to first decompose the target and baseline covariance matrices as $\mathbf{C}_{\mathbf{y}} = \mathbf{K}_{\mathbf{y}}\mathbf{K}_{\mathbf{y}}^{\mathrm{H}}$ and $\mathbf{C}_{\hat{\mathbf{y}}} = \mathbf{K}_{\hat{\mathbf{y}}}\mathbf{K}_{\hat{\mathbf{y}}}^{\mathrm{H}}$ using, for example, the eigenvalue or Cholesky decomposition, and computing

$$\mathbf{M} = \mathbf{K}_{\mathbf{y}}\mathbf{K}_{\hat{\mathbf{y}}}^{-1}. \quad (7)$$
However, although the solution described by Eq. (7) will produce signals that conform to the employed sound-field model, it will not necessarily do so with any consistency across frequency. Therefore, the time-domain representation of this matrix of filters may be ill-conditioned, which would subsequently result in signal fidelity degradations. However, it is also highlighted that these decompositions are not unique, since

$$\mathbf{C}_{\mathbf{y}} = \mathbf{K}_{\mathbf{y}}\mathbf{P}_{\mathbf{y}}(\mathbf{K}_{\mathbf{y}}\mathbf{P}_{\mathbf{y}})^{\mathrm{H}} \quad \text{and} \quad \mathbf{C}_{\hat{\mathbf{y}}} = \mathbf{K}_{\hat{\mathbf{y}}}\mathbf{P}_{\hat{\mathbf{y}}}(\mathbf{K}_{\hat{\mathbf{y}}}\mathbf{P}_{\hat{\mathbf{y}}})^{\mathrm{H}} \quad (8)$$

hold true for any unitary matrices $\mathbf{P}_{\mathbf{y}}$ and $\mathbf{P}_{\hat{\mathbf{y}}}$. Therefore, it is clear that additional degrees of freedom exist, which may be used to optimize the solution, and it is upon this principle that the covariance domain framework proposed in Ref. 19 aims to fulfill the SCM matching task while also optimally constraining the solution to preserve high signal fidelity.
In this study, this optimized solution was employed as

$$\mathbf{M} = \mathbf{K}_{\mathbf{y}}\mathbf{V}\mathbf{U}^{\mathrm{H}}\mathbf{K}_{\hat{\mathbf{y}}}^{-1}, \quad (9)$$

where $\mathbf{U}$ and $\mathbf{V}$ are obtained from the singular value decomposition

$$\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathrm{H}} = \mathbf{K}_{\hat{\mathbf{y}}}^{\mathrm{H}}\mathbf{G}\mathbf{K}_{\mathbf{y}}, \quad (10)$$

where $\mathbf{G}$ is a nonnegative diagonal matrix, which is used to normalize the channel energies.
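The optimized solution of Eqs. (9) and (10) may be realized along the following lines; the eigenvalue-based factorizations, the small regularization constant, and the use of a pseudoinverse are implementation choices assumed here rather than prescribed by Ref. 19.

```python
import numpy as np

def covariance_match(C_y, C_yhat, reg=1e-9):
    """Mixing matrix M = K_y V U^H K_yhat^{-1} of Eq. (9), with U and V
    from the SVD of K_yhat^H G K_y [Eq. (10)], so that M C_yhat M^H = C_y."""
    def factor(C):
        s, U = np.linalg.eigh(C)                   # C = K K^H
        return U @ np.diag(np.sqrt(np.clip(s, 0.0, None)))
    K_y, K_yhat = factor(C_y), factor(C_yhat)

    # Diagonal energy normalization G (target vs. baseline channel powers)
    g = np.sqrt(np.real(np.diag(C_y)) /
                np.maximum(np.real(np.diag(C_yhat)), reg))
    U, _, Vh = np.linalg.svd(K_yhat.conj().T @ np.diag(g) @ K_y)

    K_yhat_inv = np.linalg.pinv(K_yhat, rcond=reg)  # regularized inverse
    return K_y @ Vh.conj().T @ U.conj().T @ K_yhat_inv
```

Applying the returned matrix to the baseline binaural signals in each tile reproduces the target SCM exactly whenever the baseline SCM is full rank, while the unitary factor from the SVD keeps the mixing as close to an energy-normalized identity mapping as possible.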
V. BASELINE BINAURAL RENDERING APPROACHES
With the target sound-field model and SCM matching framework now outlined, three suitable candidate approaches for the baseline mixing matrix, Q, which will be later employed for the evaluations in Sec. VII, are now described.
A. Using reference sensor signals as baseline signals
The simplest baseline approach applicable to this study is to select two reference microphone signals, which are ideally located nearest to the left, $m_{\mathrm{L}}$, and right, $m_{\mathrm{R}}$, ear canals of the listener, and route the signals directly as $\hat{\mathbf{y}} = [x_{m_{\mathrm{L}}}, x_{m_{\mathrm{R}}}]^{\mathrm{T}}$. Note that this represents bilateral or binaural hearing aids set to low-power/pass-through modes.24 Here, the elements of the baseline mixing matrix should be zero, except for the indices mapping the left and right reference sensors to the respective binaural channels, which should be one. For example, if M = 8 and the reference sensors are index one for the left ear and index five for the right ear, the baseline mixing matrix is expressed as

$$\mathbf{Q}_{\mathrm{basic}} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \end{bmatrix}. \quad (11)$$
Note that if these reference microphone sensors are located inside the ear canals of the listener, then their signals will capture both the binaural and monaural localization cues and, thus, the computed spatial enhancement mixing matrix will tend toward an identity matrix provided that the captured sound-field conforms to the assumed sound-field model. However, in practice, the reference sensors will likely be situated away from the ear canals. For example, binaural hearing aids will often indicate the top forwardmost sensors as the reference sensors; in which case, the ITD and much of the head-shadowing related ILD cues will be preserved by the baseline signals, and the SCM matching will mainly seek to introduce the missing monaural cues at higher frequencies where pinna interactions are more prevalent. Whereas for an augmented reality device, which may have the sensors located much further away from the listener's ears, the SCM matching solution may require more severe mapping of the input signals to fulfill the target interchannel dependencies.
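For the example above (M = 8 with reference sensors one and five), the selection matrix of Eq. (11) can be built as follows; note the zero-based indexing in code.

```python
import numpy as np

M = 8
left_ref, right_ref = 0, 4    # sensors nearest the left/right ears
Q_basic = np.zeros((2, M))
Q_basic[0, left_ref] = 1.0    # left binaural channel <- left reference
Q_basic[1, right_ref] = 1.0   # right binaural channel <- right reference
```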
B. Baseline signals using spatial analysis and beamforming
Alternative suitable baseline candidates include those based on beamformers informed by direction-of-arrival (DoA) estimates, which may provide a starting point that is closer to the assumed model. In this work, the filter-and-sum (FaS) beamformer53 and the binaural MVDR beamformer25,54 are explored for obtaining the source signal estimates. Then, by assuming that the reference microphones selected by Eq. (11) may be used to approximate diffuse binaural signals when the listener is under such conditions, the source signals and these assumed diffuse signals can be mixed based on the diffuseness parameter. Therefore, the baseline mixing matrix in the case of FaS beamformers is obtained as

$$\mathbf{Q}_{\mathrm{FaS}} = \sqrt{1-\psi}\,\mathbf{h}(\Omega_s)\,\mathbf{w}_{\mathrm{FaS}}^{\mathrm{H}}(\Omega_s) + \sqrt{\psi}\,\boldsymbol{\Lambda}\,\mathbf{Q}_{\mathrm{basic}}, \quad (12)$$

where $\mathbf{w}_{\mathrm{FaS}}(\Omega_s) = \mathbf{a}(\Omega_s)/[\mathbf{a}^{\mathrm{H}}(\Omega_s)\mathbf{a}(\Omega_s)]$ and, because it is assumed that the reference sensor signals correspond to diffuse components, an equalization term is also included:

$$\boldsymbol{\Lambda} = \mathrm{diag}\!\left[\sqrt{\frac{\mathrm{diag}(\boldsymbol{\Gamma})}{\mathrm{diag}(\mathbf{Q}_{\mathrm{basic}}\boldsymbol{\Gamma}_{\mathbf{x}}\mathbf{Q}_{\mathrm{basic}}^{\mathrm{H}})}}\right], \quad (13)$$

where $\boldsymbol{\Gamma}_{\mathbf{x}}$ is the DCM of the array, which serves to bring the diffuse-field spectral response of the microphone array to reflect, instead, that of the diffuse-field response of the employed HRTFs. Note that balancing between the binaural beamformer and reference signals has been formulated previously in Refs. 27 and 28 through a user-controllable parameter, as opposed to the time-frequency-dependent diffuseness term employed in this study.
The third baseline mixing approach explored in this study alternatively involves the use of binaural MVDR beamformers, which are popularly employed in binaural hearing aid device studies,5,8,24 and is given as

$$\mathbf{Q}_{\mathrm{MVDR}} = \sqrt{1-\psi}\,\mathbf{h}(\Omega_s)\,\mathbf{w}_{\mathrm{MVDR}}^{\mathrm{H}}(\Omega_s) + \sqrt{\psi}\,\boldsymbol{\Lambda}\,\mathbf{Q}_{\mathrm{basic}}, \quad \text{with} \quad \mathbf{w}_{\mathrm{MVDR}}(\Omega_s) = \frac{\mathbf{C}_{\mathbf{x}}^{-1}\mathbf{a}(\Omega_s)}{\mathbf{a}^{\mathrm{H}}(\Omega_s)\mathbf{C}_{\mathbf{x}}^{-1}\mathbf{a}(\Omega_s)}, \quad (14)$$

where it is noted that $\mathbf{C}_{\mathbf{x}}$ should, in theory, be replaced with an estimate of the array noise covariance matrix to achieve higher noise suppression. However, $\mathbf{C}_{\mathbf{x}}$ was selected for this study to eliminate problems that may arise due to erroneous source-activity detection.
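A sketch of this binaural MVDR baseline follows, in the form of Eq. (14); the diagonal loading of the SCM is an added safeguard assumed here, and `eq_gains` stands for the diagonal of the equalization term of Eq. (13).

```python
import numpy as np

def q_mvdr(C_x, a_s, h_s, psi, Q_basic, eq_gains, reg=1e-6):
    """Binaural MVDR baseline mixing matrix, cf. Eq. (14)."""
    M = C_x.shape[0]
    C_reg = C_x + reg * np.real(np.trace(C_x)) / M * np.eye(M)
    w = np.linalg.solve(C_reg, a_s)
    w /= a_s.conj() @ w                  # unity gain toward the DoA
    Q_src = np.outer(h_s, w.conj())      # impose HRTF directivity, (2, M)
    Q_dif = eq_gains[:, None] * Q_basic  # equalized reference sensors
    return np.sqrt(1.0 - psi) * Q_src + np.sqrt(psi) * Q_dif
```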
Note that FaS and binaural MVDR should be able to better capture and represent the source components, provided that the DoA estimates are correct, as both beamformers have the unity gain constraint, and the HRTF directivities are then imposed onto the signals they capture. The target of the SCM matching, therefore, is to bring the interaural cues delivered by the diffuse/reference signals to be more in line with the binaural DCM. However, overall, it is expected that these beamforming based baseline alternatives will produce signals that are closer to the assumed model than those in the basic case represented by Eq. (11) and, thus, will require less severe corrections to match the SCMs. To illustrate this, a metric describing the deviation of the calculated mixing matrix from an identity matrix was derived as $\epsilon = \lVert \mathbf{M} - \mathbf{I} \rVert_{\mathrm{F}}$, where $\lVert\cdot\rVert_{\mathrm{F}}$ denotes the Frobenius norm. This metric was then computed for two different source directions under diffuse conditions, averaged over time, and plotted over frequency for all three baseline cases, as depicted in Fig. 2. Here, it is evident that both of the beamforming based baselines require less drastic modification to produce output signals with the target interchannel dependencies, especially for high frequencies, although the effect of having a baseline that is closer to the assumed model is shown to be minimal in the later evaluations.
VI. IMPLEMENTATION
To investigate the performance of the proposed SCM matching based enhancement for the binaural rendering of wearable microphone arrays, eight DPA IMK4060 microphones (DPA Microphones, Denmark) were first affixed to a pair of safety glasses, as depicted in Fig. 3. The safety glasses were mounted onto a KEMAR 45 BC dummy head (GRAS, Denmark), which was placed in an anechoic chamber, with the array directional responses subsequently measured for every 1° on the horizontal plane using the swept-sine technique.55 An omnidirectional microphone in the same location as the dummy head was used to create a compensation equalization curve to mitigate colorations incurred by the measurement loudspeaker. For processing the signals, the alias-free STFT design, as described in Ref. 57, was selected and configured to employ a window size of 256 samples (sample rate 48 kHz) with 90% window overlap. The three baseline binaural rendering approaches were implemented as described in Sec. V and subjected to the SCM matching based enhancement as detailed in Sec. IV.
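The study employs the alias-free STFT design cited above; as a rough stand-in for experimentation, an equivalently configured standard STFT (256-sample window, 90% overlap, 48 kHz) could be set up with SciPy as sketched below. Note that this is not the alias-free design itself, only a conventional substitute.

```python
import numpy as np
from scipy.signal import stft

fs = 48000
nperseg = 256
noverlap = int(0.9 * nperseg)       # 90% window overlap

# x: (M, num_samples) microphone signals; placeholder one-second recording
x = np.random.randn(8, fs)
f, t, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
# X has shape (M, num_freqs, num_frames); processing is per (t, f) tile
```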
To offer further insights into the practical application of the proposed SCM matching solution, the objective evaluation described in Sec. VII A was conducted both with known/Oracle spatial parameters and estimated spatial parameters. Additionally, since the subjective evaluation described in Sec. VII B involved multiple simultaneous sound sources, and due to the single-source assumption of the employed sound-field model, processing based on known parameters would not be meaningful. Therefore, the spatial parameter estimators, which are used to inform the processing of the adaptive binaural rendering algorithms explored in this study, are now described.
A. Spatial parameter estimation
The input SCM is first spatially whitened to ensure that it exhibits an identity-like structure when the microphone array is placed under isotropic diffuse-field conditions. The whitened input SCM, thus, conforms to

$$\mathbf{C}_{\mathbf{x},\mathrm{w}} = \mathbf{T}\,\mathbf{C}_{\mathbf{x}}\,\mathbf{T}^{\mathrm{H}}, \quad (15)$$

where $\mathbf{T} = \mathbf{S}^{-1/2}\mathbf{U}^{\mathrm{H}}$ given the eigenvalue decomposition of the array DCM, $\boldsymbol{\Gamma}_{\mathbf{x}} = \mathbf{U}\mathbf{S}\mathbf{U}^{\mathrm{H}}$, with $\mathbf{S}$ holding the eigenvalues on its diagonal. The subspace decomposition with the employed single-source assumption is then applied as

$$\mathbf{C}_{\mathbf{x},\mathrm{w}} = \mathbf{V}\boldsymbol{\Sigma}\mathbf{V}^{\mathrm{H}} = \begin{bmatrix}\mathbf{v}_1 & \mathbf{V}_{\mathrm{n}}\end{bmatrix}\begin{bmatrix}\sigma_1 & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Sigma}_{\mathrm{n}}\end{bmatrix}\begin{bmatrix}\mathbf{v}_1 & \mathbf{V}_{\mathrm{n}}\end{bmatrix}^{\mathrm{H}}, \quad (16)$$

where the eigenvalues $\sigma_m$ are given in descending order and correspond to their respective eigenvectors $\mathbf{v}_m$. The eigenvectors corresponding to the $M-1$ smallest eigenvalues make up the noise subspace $\mathbf{V}_{\mathrm{n}} = [\mathbf{v}_2,\dots,\mathbf{v}_M]$.
The employed diffuseness parameter estimation, which is based on the COMEDIE algorithm,57 is then determined through observing the variance of the eigenvalues

$$\psi = 1 - \frac{\delta}{\delta_0}, \quad (17)$$

where the normalization $\delta_0 = 2(M-1)/M$, the deviation $\delta = \frac{1}{M\langle\sigma\rangle}\sum_{m=1}^{M}\lvert\sigma_m - \langle\sigma\rangle\rvert$, and the mean $\langle\sigma\rangle = \frac{1}{M}\sum_{m=1}^{M}\sigma_m$.
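The whitening of Eq. (15) and the COMEDIE-based diffuseness of Eq. (17) might be implemented as follows; the eigenvalue floor is an assumed numerical safeguard.

```python
import numpy as np

def diffuseness_comedie(C_x, Gamma_x, floor=1e-12):
    """psi = 1 - delta/delta0 from the eigenvalues of the whitened SCM."""
    s, U = np.linalg.eigh(Gamma_x)                 # array DCM: U S U^H
    T = np.diag(1.0 / np.sqrt(np.maximum(s, floor))) @ U.conj().T
    C_w = T @ C_x @ T.conj().T                     # Eq. (15)
    sig = np.linalg.eigvalsh(C_w)                  # eigenvalues of C_w
    M = sig.size
    mean = sig.mean()
    delta = np.abs(sig - mean).sum() / (M * mean)  # deviation
    delta0 = 2.0 * (M - 1) / M                     # normalization
    return float(np.clip(1.0 - delta / delta0, 0.0, 1.0))
```

A single plane-wave yields delta equal to delta0 (psi = 0), whereas equal eigenvalues yield delta of zero (psi = 1), matching the two limiting cases of the sound-field model.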
For the DoA estimation, the multiple-signal classification (MUSIC) approach58 was used, which, given the single-source assumption, is formulated as

$$P(\Omega_k) = \frac{1}{\mathbf{a}_{\mathrm{w}}^{\mathrm{H}}(\Omega_k)\,\mathbf{V}_{\mathrm{n}}\mathbf{V}_{\mathrm{n}}^{\mathrm{H}}\,\mathbf{a}_{\mathrm{w}}(\Omega_k)}, \quad (18)$$

where $\mathbf{a}_{\mathrm{w}}(\Omega_k) = \mathbf{T}\,\mathbf{a}(\Omega_k)$ denotes the whitened steering vectors.
A peak-finding exercise is then conducted to numerically extract the DoA estimate from the resulting pseudospectrum per frequency.
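A single-source MUSIC sketch consistent with the above is given below; operating on whitened steering vectors matches the whitened subspace, and a grid-based argmax stands in for the peak-finding exercise.

```python
import numpy as np

def music_doa_index(C_w, A, T_white, floor=1e-12):
    """Return the grid index maximizing the MUSIC pseudospectrum, Eq. (18).

    C_w:     whitened SCM; A: (M, K) steering-vector grid;
    T_white: whitening transform derived from the array DCM.
    """
    _, V = np.linalg.eigh(C_w)      # eigenvectors, ascending eigenvalues
    V_n = V[:, :-1]                 # noise subspace: M-1 smallest
    A_w = T_white @ A               # whitened steering vectors
    denom = np.sum(np.abs(V_n.conj().T @ A_w) ** 2, axis=0)
    P = 1.0 / np.maximum(denom, floor)
    return int(np.argmax(P))
```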
VII. EVALUATION
A. Objective evaluation
The proposed SCM solution was first evaluated objectively in the context of binaural cue preservation. Here, 360 single-source reference binaural signals (one for each degree on the azimuthal plane) were simulated using white-noise stimuli and then mixed with cylindrically isotropic diffuse noise to obtain the following DDRs: [−60, −6, 0, 6, 12, Inf] dB. Note that the gains required to attain these DDRs were determined based on an omnidirectional receiver, i.e., without the presence of the array. Next, the binaural reference signals and microphone array recordings of these simulated scenarios were obtained by convolving incident plane-waves with either HRTFs or the measured array steering vectors, respectively. The array recordings were then rendered to the binaural channels using the three baseline approaches formulated in Sec. V, $\mathbf{Q}_{\mathrm{basic}}$, $\mathbf{Q}_{\mathrm{FaS}}$, and $\mathbf{Q}_{\mathrm{MVDR}}$, with and without the proposed SCM matching (CM) enabled, as described in Sec. IV.
The binaural covariance matrix, based on the estimated binaural signals $\mathbf{y}(t,f)$, is then given by

$$\mathbf{C}_{\mathbf{y}} = \mathrm{E}[\mathbf{y}\mathbf{y}^{\mathrm{H}}] = \begin{bmatrix} p_{\mathrm{L}} & c_{\mathrm{LR}} \\ c_{\mathrm{LR}}^{*} & p_{\mathrm{R}} \end{bmatrix}, \quad (19)$$

and the ILD, interaural phase difference (IPD), IC, and binaural coloration metrics may be computed as11,56

$$\mathrm{ILD} = 10\log_{10}\frac{p_{\mathrm{L}}}{p_{\mathrm{R}}}, \quad \mathrm{IPD} = \angle c_{\mathrm{LR}}, \quad \mathrm{IC} = \frac{\lvert c_{\mathrm{LR}}\rvert}{\sqrt{p_{\mathrm{L}}\,p_{\mathrm{R}}}}, \quad \mathrm{COL} = 10\log_{10}\frac{p_{\mathrm{L}} + p_{\mathrm{R}}}{p_{\mathrm{L,ref}} + p_{\mathrm{R,ref}}}. \quad (20)$$
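The interaural metrics of Eq. (20) reduce to a few lines when computed from 2 × 2 binaural SCMs; the coloration definition below (summed channel powers relative to the reference) is an assumption consistent with the text.

```python
import numpy as np

def binaural_cues(C_y):
    """ILD (dB), IPD (rad), and IC from a 2x2 binaural SCM."""
    p_l, p_r = np.real(C_y[0, 0]), np.real(C_y[1, 1])
    c_lr = C_y[0, 1]
    ild = 10.0 * np.log10(p_l / p_r)
    ipd = np.angle(c_lr)
    ic = np.abs(c_lr) / np.sqrt(p_l * p_r)
    return ild, ipd, ic

def coloration_db(C_y, C_ref):
    """Assumed coloration metric: total power deviation from the reference."""
    return 10.0 * np.log10(np.real(np.trace(C_y)) /
                           np.real(np.trace(C_ref)))
```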
Since past studies have demonstrated that binaural hearing aid algorithms may perform differently with known or estimated DoAs,7,59 the above objective perceptual metrics were computed for all three baselines with and without CM using both known/Oracle parameters and those estimated through the parameter analysis described in Sec. VI A.
B. Subjective evaluation
A multiple stimulus test was conducted to evaluate the proposed SCM matching solution for the task of binaurally reproducing the microphone signals in more realistic multisource scenarios. To create the listening test scenes, three source stimuli were placed on the horizontal plane at positions directly to the left, in front, and to the right of the listener in the simulation. Three different sets of simultaneously played source stimuli were selected, which represent a diverse range of time-frequency content: (1) a shaker, bass guitar, and strings; (2) a male English speaker, a female English speaker, and a male Danish speaker; and (3) cicadas, a dog barking, and birds tweeting. The stimuli durations were between 13 and 16 s. Two different acoustic settings were selected: anechoic (dry) and a moderately reverberant medium-sized classroom (rev). The reference scenarios for the anechoic cases were created by directly convolving the source stimuli with head-related impulse responses (HRIRs) for the directions [−90, 0, 90]° on the horizontal plane. The eight-channel array responses for these same directions were also convolved with the same stimuli to create a synthetic microphone array recording of the same anechoic scene. To reduce the number of test cases for the listening test, and because it is later shown in Sec. VIII that the FaS and MVDR baselines produced similar results in terms of the objective evaluations, only the $\mathbf{Q}_{\mathrm{basic}}$ and $\mathbf{Q}_{\mathrm{MVDR}}$ baselines were selected. These array recordings were then rendered using these two baseline methods with the CM enhancement either enabled or disabled.
Reverberant counterparts for the above scenarios were then created using a shoebox room simulator based on the image-source method. The simulator60 was configured for room dimensions 10 × 7 × 4 m and had the wall absorption coefficients tuned to resemble a moderately reverberant environment (a broadband T60 of approximately 0.5 s). The listener position was situated directly in the center of the room with the source positions set to 1 m away from the listener position. The simulated direct path and image-source reflections were then convolved with the nearest HRTFs to create the reference reverberant test cases, whereas the nearest microphone array steering vectors were convolved to create synthetic microphone array recordings of the same reverberant scene. These were rendered using the same two baselines, with the CM enhancement either enabled or disabled, as with the anechoic cases.
In total, there were six test scenes, as summarized in Table I, and five test cases, as summarized in Table II. The listening test was conducted in three parts:
Spatial: In this part of the evaluation, all of the test cases were equalized to the reference case. This was conducted by passing the reference case through the same STFT that was used by the methods under test and determining the reference, $E_{\mathrm{ref}}(f)$, and test case, $E_{\mathrm{test}}(f)$, energies, which were averaged over the whole stimuli duration, followed by computing the equalization gains as $g(f) = \sqrt{E_{\mathrm{ref}}(f)/E_{\mathrm{test}}(f)}$ separately for each bin (a sketch of this equalization is given after this list). This served to mitigate the timbral differences while still retaining the spatial differences between the renderings. The participants were instructed to assess the test cases on a scale from 0 to 100 based on their spatial similarity to the reference (with respect to the source localization, externalization, and reverberation characteristics) and ignore any timbral differences that remained.
Timbre: Here, the reference case was instead duplicated and equalized by each test case, i.e., with gains $g(f) = \sqrt{E_{\mathrm{test}}(f)/E_{\mathrm{ref}}(f)}$, to obtain spatial equivalence across all of the test cases while retaining any timbral colorations that may be introduced by the processing operations associated with the methods under test. The listening subjects were instructed to rate the cases on a scale of 0 to 100 based on their timbral similarity with the reference. It was emphasized that any spatial differences that the listeners perceived should be ignored, as equalization can change one's perception of the spatial cues.43,61
Overall: For this part of the listening test, the test cases were simply normalized to the reference based on the broadband root mean squares of the signals, which were averaged across the whole stimuli duration and the binaural channels. The listening subjects were asked to rate the test cases based on their personal preference on a scale of 0 to 100.
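As referenced in the Spatial item above, the per-bin equalization might be realized as follows; the STFT layout (channels × bins × frames) is an assumption.

```python
import numpy as np

def equalize_to_reference(ref_stft, test_stft, eps=1e-12):
    """Apply g(f) = sqrt(E_ref(f) / E_test(f)) to the test rendering,
    with energies averaged over time frames and both binaural channels."""
    E_ref = np.mean(np.abs(ref_stft) ** 2, axis=(0, 2))
    E_test = np.mean(np.abs(test_stft) ** 2, axis=(0, 2))
    g = np.sqrt(E_ref / np.maximum(E_test, eps))
    return test_stft * g[None, :, None]
```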
TABLE I. Summary of the six test scenes.

| Name | Room | Source stimuli |
|---|---|---|
| Band_dry | Anechoic | Shaker, bass guitar, strings |
| Band_rev | Reverberant | Shaker, bass guitar, strings |
| Speech_dry | Anechoic | Two male and one female speakers |
| Speech_rev | Reverberant | Two male and one female speakers |
| Mix_dry | Anechoic | Cicadas, a dog barking, bird calls |
| Mix_rev | Reverberant | Cicadas, a dog barking, bird calls |
TABLE II. Summary of the five test cases.

| Name | Rendering |
|---|---|
| Hidden ref | Ideal binaural receiver |
| MVDR CM | $\mathbf{Q}_{\mathrm{MVDR}}$ baseline with SCM matching |
| Basic CM | $\mathbf{Q}_{\mathrm{basic}}$ baseline with SCM matching |
| MVDR | $\mathbf{Q}_{\mathrm{MVDR}}$ baseline without SCM matching |
| Basic | $\mathbf{Q}_{\mathrm{basic}}$ baseline without SCM matching |
In total, 15 test subjects participated in the study, all of whom reported having normal hearing and were naive as to the hypothesis of the study.
VIII. RESULTS AND DISCUSSION
The results of the objective evaluations are depicted in Fig. 4. The root mean square error (RMSE), computed with respect to the reference binaural cues and averaged over frequency following the perceptually motivated equivalent rectangular bandwidth (ERB) scale, is plotted along the y axis with the error bars denoting the standard deviations. The plots given in Fig. 4(a) were calculated based on the known spatial parameters, whereas those in Fig. 4(b) used the estimated parameters. It can be seen that the results of the IC and IPD evaluations show significant improvements in the RMSE when CM is enabled with both known and estimated spatial parameters. On the other hand, while the application of CM reduces the RMSE of the ILD cue when using a known DoA, this reduction in error is not as prevalent when using the estimated DoA. It follows that, in this respect, CM is sensitive to errors in the DoA estimation, but no more so than the FaS and MVDR baselines. Finally, the RMSE for the coloration does not show significant improvement for the FaS and MVDR baseline methods but does show some improvement for the basic baseline method, although it is noted that colorations of 2 dB are not easily perceptible. Furthermore, this metric does not appear to be affected by the DoA estimation errors.
The results of the subjective evaluation are provided in Fig. 5. It can be observed that, in the majority of cases, the test cases where CM was enabled were rated higher and closer to the reference than when CM was disabled. To provide further insight, statistical analyses were also performed on the data with the exception of the results for the reference stimuli. The analyses were performed using functions from MATLAB's Statistics and Machine Learning Toolbox, version 12.1 (The MathWorks, Natick, MA) with the alpha-error significance level set to 0.05 for all of the tests. The Friedman tests (MATLAB function friedman) were performed separately for each stimulus on both the spatial and timbral ratings. The χ² results are listed in Table III, wherein the analyses resulting in p-values lower than 0.01 are denoted by the symbol "**." It can be seen that the Friedman tests revealed statistically significant differences between the processing methods in terms of the spatial and timbral subjective evaluation outcomes for all of the test scenes. Subsequently, post hoc multiple comparison tests (MATLAB function multcompare) were performed with a Tukey honestly significant difference (HSD) criterion to establish which methods differed significantly from the others.
TABLE III. Friedman test χ² results for the spatial and timbral ratings; "**" denotes p < 0.01.

| Stimulus | Spatial | Timbre |
|---|---|---|
| Band_dry | 39.36** | 21.02** |
| Band_rev | 28.53** | 29.03** |
| Speech_dry | 18.81** | 11.56** |
| Speech_rev | 12.94** | 34.63** |
| Mix_dry | 33.50** | 29.30** |
| Mix_rev | 28.00** | 25.06** |
Statistically significant differences in the spatial ratings were found between the basic and basic CM conditions for the band_rev (p < 0.01), band_dry (p < 0.01), mix_rev (p < 0.01), and mix_dry test scenes (p < 0.01) but did not reach significance for the speech_rev (p > 0.05) and speech_dry test scenes (p > 0.05). Similarly, a statistically significant difference was found between the MVDR and MVDR CM methods for the band_rev (p = 0.05), band_dry (p < 0.01), mix_rev (p < 0.01), and mix_dry test scenes (p < 0.01) but did not reach significance for the speech_rev (p > 0.05) and speech_dry test scenes (p > 0.05). Between the basic CM and MVDR CM methods, only the speech_dry stimulus (p < 0.04) spatial ratings case achieved statistical significance. For the timbre section of the tests, the post hoc multiple comparison test with a Tukey HSD criterion applied showed significant differences in the comparison of the ratings for the basic and basic CM methods for all of the test scenes (i.e., band_rev, p < 0.01; band_dry, p < 0.01; mix_rev, p < 0.01; mix_dry, p < 0.01; speech_rev, p = 0.017; and speech_dry, p = 0.032). Meanwhile, the comparison of the MVDR and MVDR CM methods only reached statistical significance for the speech_rev scene (p < 0.01). There were no statistically significant differences between the ratings for the basic CM and MVDR CM methods.
It should be noted that statistical analyses were not applied to the overall section of the subjective evaluation as the results were considered to be highly subjective, with the ratings of the listener dependent on whether they valued spatial accuracy over timbral fidelity or vice versa. However, a positive trend can be seen in Fig. 5 wherein the majority of the subjects preferred spatial covariance matching solutions over the baseline methods.
The objective and subjective evaluations imply that the application of spatial covariance matching leads to a greater preservation of the spatial cues in comparison to the use of only the baseline techniques. The objective metrics are clearly improved with the application of spatial covariance matching, while the ratings in the subjective evaluation also indicate that the improvement is perceptually meaningful given more practical multisource input scenes. The participants in the listening test rated the spatial accuracy as improved with respect to the reference for both the basic and MVDR baseline conditions in the case of the band and mix stimuli. Additionally, whereas the improvement for the speech stimuli was not found to be statistically significant, it can be seen in Fig. 5 that the subjects in the listening test rated the basic and MVDR baseline methods to already be more spatially accurate with respect to the reference for the speech stimuli than for the band and mix stimuli. It follows that, while the application of spatial covariance matching may have improved the spatial accuracy, the improvement was not large enough to reach statistical significance.
The results of the timbral part of the subjective evaluation show that the use of spatial covariance matching together with the baseline method reproduces the spectral information of the scene more faithfully than using the baseline techniques alone. The timbral fidelity reported in the subjective evaluation also appears better than the objective coloration results would suggest. This may be because the objective metrics were calculated using single-source white-noise scenarios, whereas the subjective evaluations used three more spectrally diverse sound sources. In the objective evaluation, the beamformers used to generate the baseline prototype signals should have encapsulated the single source with minimal colorations as there were no interferers overlapped by the sidelobes of the beamformers. Additionally, the DoA error rate was presumably lower in the case of a single sound source scene. It follows, therefore, that the coloration for the baseline technique was not noticeably lower in comparison to the coloration for the proposed spatial covariance matching method. On the other hand, during the generation of the listening test stimuli, it is expected that the beamformers will encapsulate some of the signals of the interferers due to a combination of the beamformer sidelobes and DoA estimation errors during periods where the single-source assumption for each time-frequency tile was not met. The spatial covariance matching solution mitigates some of these timbral coloration issues as the target powers $P_{\mathrm{total}}$ are not affected by the DoA estimation errors. Hence, although it is not made apparent by the single-source objective evaluation, it is expected that the CM will introduce less timbral coloration, which is evident in the multisource subjective evaluation.
Aside from the improved spatial and timbral accuracy of the binaural rendering, it is highlighted that the proposed CM method is still based on the parameterization of the sound scene from the point of view of the listener. Therefore, sound-field modifications may be realized in a computationally efficient way by simply manipulating the spatial parameters prior to reproducing the scene. The sound-field modifications may include rotations, direction-dependent loudness manipulations, and exaggeration of the direct components located in front of the device wearer. Spatial audio effects may also be realized through simple parameter manipulations,62 which may be desirable for AR/VR applications. Potential avenues to explore in future work, therefore, include an investigation into the effect of manipulating the parameters involved, such as the diffuseness parameter (which has a directly proportional impact on the DDR), or the application of different signal manipulation techniques, such as dynamic range compression, independently applied to the direct and diffuse streams of audio.
IX. CONCLUSION
This article investigated the application of spatial covariance matching to head-worn microphone arrays as a means of enhancing the spatial accuracy when binaurally reproducing the sound scenes that they capture. The sound-field model employed for this study assumes a single sound source per time-frequency index accompanied by an isotropic diffuse component, with the proposed enhancements imposed via the spatial covariance matching framework established in Ref. 19. During the study, an eight-sensor microphone array was attached to the temples of a pair of eyeglasses, which was then placed on a dummy head to enable measurements to be taken in a free-field environment to obtain array steering vectors for many directions. These vectors, in conjunction with the HRTFs of the dummy head, were used by the rendering algorithms to produce output binaural signals that may be directly compared with the reference binaural signals. This provided a robust framework for the evaluation of different binaural rendering approaches both with and without the proposed spatial enhancements applied.
In the objective evaluation, the ILD, IPD, IC, and binaural coloration errors were calculated for renderings of the simulated array recordings in which three baseline techniques were used in isolation and in conjunction with the proposed spatial covariance matching technique. It was found that the application of spatial covariance matching greatly reduced the RMSE for the IC and IPD metrics. The ILD error was also minimized with the application of the proposed enhancement when using known spatial parameters, although this improvement diminished when using the estimated spatial parameters. In the subjective evaluation, a listening test was conducted wherein 15 participants rated multisource sound scenes based on the spatial and timbral similarity with a reference sound scene, as well as overall preference. The results for the listening test indicated that the spatial accuracy of the stimuli significantly improved with the application of spatial covariance matching for the majority of the sound scenes simulated for the test. The timbral attributes were found to be significantly improved over the basic baseline method for all of the stimuli, whereas the improvement for the baseline method that incorporated the MVDR beamforming was found to be statistically significant for only the reverberant speech scenario.
In conclusion, this study demonstrates that spatial covariance matching can be efficiently formulated for application in sound reproduction using head-worn microphone arrays and, on application, produces binaural signals that more closely match those that would, otherwise, have been captured at the ear canals of the listener. In addition, spatial covariance matching improved the spatial attributes and timbral quality of the resultant sound scenes for the basic baseline technique as well as for the more complex baseline techniques considered for this study. It is noted, however, that the processing does not seek to enhance the SNR or speech intelligibility; such enhancements would, instead, need to be applied by the baseline method prior to applying the spatial enhancements of the proposed approach. Finally, because the proposed spatial enhancements are still based on the parameterization of the captured sound scene, it is noted that aspects of the rendering may be easily augmented, such as manipulating the direct-to-diffuse balance or applying direction-dependent gains to only the sound sources in the scene.