Listener envelopment (LEV), the sense of being surrounded by the sound field, is a perception that has been found to be related to the overall impression of a concert hall. The purpose of this study was to investigate the relationship between the perception of LEV and the direction and arrival time of energy from spatial room impulse responses (IRs). IRs were obtained in a 2000-seat concert hall using a 32-channel spherical microphone array and analyzed using a third-order plane wave decomposition. Additionally, the IRs were convolved with anechoic music, processed for third-order Ambisonic reproduction, and presented to subjects over a 30-loudspeaker array. Instances were found in which the energy in the late sound field did not correlate with LEV ratings as well as energy in a 70–100 ms time window. Follow-up listening tests were conducted with hybrid IRs containing portions of an enveloping IR and an unenveloping IR with crossover times ranging from 40 to 140 ms. Additional hybrid IRs were studied wherein portions of the spatial IRs were collapsed into all frontal energy with crossover times ranging from 40 to 120 ms. The tests confirmed that much of the important LEV information exists in the early portion of these IRs.

An important aspect of the overall room impression of a concert hall is the spatial impression of the hall, which includes listener envelopment (LEV), the sense of being fully immersed in the sound field. The purpose of this study was to investigate the relationship between the perception of LEV and the direction and arrival time of energy in performing arts spaces utilizing measurements obtained with a compact spherical microphone array. The impulse response (IR) measurements made using such an array can be used for both an objective analysis of the sound field in full three dimensions (3D) via beamforming techniques1 and subjective listening tests using 3D reproductions of the sound fields over a loudspeaker array via Ambisonics.2,3

Research aimed at understanding the spatial perception in performing arts spaces initially focused on the directional dependence of early reflections. Originally, the sense of spaciousness was thought to be primarily associated with reverberation, but in the late 1960s, it was found that the spatial impression was heavily influenced by early reflections.4 It was proposed that the spaciousness depended on the arrival direction of early reflections, and that stronger early lateral reflections were related to a quality referred to as “spatial responsiveness.”5 Further work led to the development of the objective metric Early Lateral Energy Fraction JLF (prior notation LF), the ratio between the early lateral energy and the total early energy in the first 80 ms, which was found to be correlated with the subjective level of spatial impression.6 A second metric that has been found to correlate with spatial impression is the interaural cross correlation coefficient (IACC), which is obtained from the cross-correlation of the left and right ears of a binaural IR.7 

More recently, it has been proposed that the spatial impression of a hall contains two distinct perceptions: apparent source width (ASW), which is the sense of how wide or narrow the sound image appears to a listener, and LEV, the sense of being immersed in and surrounded by the sound field.8 ASW has since been shown to be related to early lateral reflections, which can be predicted using IACC and JLF, while LEV has been shown to be related to late lateral energy.9 

Seminal work on LEV was conducted by Bradley and Soulodre9,10 in the early 1990s. Using five loudspeakers distributed in the front half of the horizontal plane, they generated sound fields in which a small number of early reflections were kept constant while certain aspects of the late sound field were varied: the reverberation time (T30), the early-to-late sound energy ratio (C80), and the strength of the late sound field (GLate). The angular distribution of the late sound was also varied, which was accomplished by playing the late sound either out of a single frontal loudspeaker, three frontal loudspeakers spanning 70°, or five frontal loudspeakers spanning 180°. A subjective study showed that the parameters with the highest correlation to LEV were angular distribution and overall late level. These results were used to develop a metric to predict LEV called late lateral energy level, LJ (prior notation LG):9 

L_J = 10\log_{10}\!\left[\frac{\int_{80\,\mathrm{ms}}^{\infty} p_L^2(t)\,dt}{\int_{0}^{\infty} p_{10}^2(t)\,dt}\right]\ \mathrm{[dB]},
(1)

where pL(t) is the room IR measured with a figure-of-eight microphone, and p10(t) is the IR of the sound source measured at a distance of 10 m in a free field. While a strong correlation was found between this metric and LEV, it should be noted that this study used a small number of loudspeakers spanning a limited angular area and the simulated sound fields may not have been representative of a real concert hall. Additionally, the range of LJ values of the stimuli was greater than 20 dB, which is a much larger range than would be found in actual spaces—Bradley reports a range from approximately −7 to +6 dB,11 and Kuusinen et al.12 report a range of approximately 7 dB across a variety of halls.
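With sampled IRs, the integrals in Eq. (1) reduce to discrete energy sums. The sketch below is a minimal interpretation (the function name and the simple rectangular integration are our illustrative assumptions, not the paper's implementation), assuming both IRs are time-aligned and share the sample rate:

```python
import numpy as np

def late_lateral_energy_level(p_lateral, p_ref_10m, fs, t_late=0.080):
    """Approximate Eq. (1): late lateral energy level L_J in dB.

    p_lateral : figure-of-eight (lateral) room IR, time-aligned to the direct sound
    p_ref_10m : free-field IR of the same source at 10 m
    fs        : sample rate in Hz
    t_late    : start of the late window (80 ms in Eq. (1))
    """
    n_late = int(round(t_late * fs))
    late_energy = np.sum(p_lateral[n_late:] ** 2)  # integral from 80 ms onward
    ref_energy = np.sum(p_ref_10m ** 2)            # integral over the full reference IR
    return 10.0 * np.log10(late_energy / ref_energy)
```

Because LJ is a level difference, a lateral late-energy integral equal to the reference energy yields 0 dB; the roughly −7 to +6 dB range quoted above corresponds to late lateral energy between about one-fifth of and four times the reference energy.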

Although work by Bradley and Soulodre9,10 indicated that late lateral energy is the component of the sound field with the highest correlation with envelopment, a number of studies have shown correlation between listener envelopment and non-lateral sound and/or early reflections. One subjective study showed that adding front early reflections from above, i.e., ceiling reflections, increases LEV.13 A second subjective study found that increasing the energy behind a listener will increase LEV and that early reflections can also contribute to the perception of LEV.14 A third subjective study was conducted using a similar loudspeaker arrangement to Bradley and Soulodre with the addition of a loudspeaker placed directly behind the listener and a loudspeaker overhead.15 By varying the distribution of the late sound, findings showed that late sound both above and behind the listener significantly affect LEV,15 which directly contradicts a prior study's finding that reflections above and behind do not significantly impact envelopment.16 The effect of the early sound on LEV was also investigated in a study using binaural stimuli in which the left ear signal was fed into the right channel and vice versa in order to increase interaural cross correlation.17 Findings from that study showed that cross-mixing the channels to increase the interaural cross correlation in only the early part of the IR decreased LEV, in some cases more so than cross-mixing only the reverberant tail.

Currently, LJ is the most commonly used objective metric to predict LEV, and is included in Annex A of the room acoustics measurement standard ISO 3382.18 However, several other objective metrics have been proposed to predict LEV. Unlike ASW, it has been found that energy fractions and cross-correlation metrics of the late sound such as late lateral energy fraction (LLF)19 and late interaural cross correlation coefficient (IACCL,3)20 are, by themselves, poor predictors of LEV and do not significantly vary between halls or within halls, although they have been used as a component of other metrics. LJ, for example, can be calculated from LLF and GLate. Beranek has also proposed an empirical formula to predict LEV objectively that is based on IACCL,3, strength (G), and clarity index (C80).21 Other metrics that have been suggested include front/back energy ratio22,23 [FBR; Eq. (A1)], and spatially balanced center time [SBTS, Eq. (A5)], which is based on the center time of spatial IRs weighted by the arrival direction24 (see the  Appendix for additional information about these metrics). A recently proposed measure, Surround Sensation Index,25 is intended to improve the performance of IACC using differential short-time interaural cross-correlation functions. Little information exists on the performance of these metrics.

To summarize, many studies have found that the early sound field and non-lateral energy can have an influence on the perception of LEV, which is not accounted for in the metric LJ. Although several other metrics have been proposed to predict LEV, they have not been widely adopted in the architectural acoustics community, and there are only a limited number of studies that evaluate the performance of each of these metrics. Additionally, some of these metrics were designed using simulated sound fields that were reproduced over a limited number of loudspeakers, and may lack realism when compared to actual concert halls.

The purpose of this study was to investigate the relationship between the perception of LEV and the direction and arrival time of energy using measured spatial IRs. In this study, spatial room IRs were obtained in a 2000 seat dedicated concert hall with a volume of 24 000 m3 (850 000 ft3) and a maximum mid-frequency reverberation time of 2.8 s (Peter Kiewit Concert Hall in Omaha, NE). An 8.4-cm (3.3-in) diameter spherical microphone array containing 32 omnidirectional microphone capsules was used to capture the directional IRs. The measured sound fields were analyzed by beamforming directional IRs and plotting the spatial distribution of energy as a function of time in different octave bands. Additionally, the IRs were convolved with an orchestral anechoic music excerpt and reproduced over an array containing 30 loudspeakers located in an anechoic chamber to conduct two listening tests. The first listening test contained the original IRs, as measured. In order to study the time dependence of the energy in the IRs related to LEV, the second listening test contained IRs that were modified to (a) include time portions from two different measured IRs, one perceived with low LEV and one with high LEV, and (b) remove the spatial properties of different time portions of the IRs. The approach used in this study is an improvement over previous envelopment research due to the resolution of the beamformed directional IRs, and due to the accurate reproduction of measured sound fields.

Measurements obtained with a spherical microphone array can be beamformed in the spherical harmonics domain to generate directional IRs. The sound pressure on the surface of a sphere due to an incident plane wave can be represented in the spherical harmonics domain as26 

p(a,\theta,\phi,t) = P_0\, 4\pi \sum_{n=0}^{\infty} i^{n}\, b_n(ka) \sum_{m=-n}^{n} Y_n^m(\theta,\phi)\, Y_n^{m*}(\theta_w,\phi_w)\, e^{i\omega t},
(2)

where p is the total sound pressure, a is the radius of the sphere, θ is the zenith angle, ϕ is the azimuthal angle, P0 is the pressure amplitude, i is the imaginary unit √−1, and k is the wave number. The direction of the incident wave is (θw,ϕw), and * denotes the complex conjugate. The spherical harmonics, Ynm, of order n and degree m are defined as

Y_n^m(\theta,\phi) = \sqrt{\frac{(2n+1)}{4\pi}\,\frac{(n-m)!}{(n+m)!}}\; P_n^m(\cos\theta)\, e^{im\phi}.
(3)
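Eq. (3) can be evaluated directly from SciPy's associated Legendre routine. The helper below is a sketch assuming the physics convention in which P_n^m carries the Condon-Shortley phase, which scipy's lpmv also includes:

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def Ynm(n, m, theta, phi):
    """Evaluate the spherical harmonic of Eq. (3).

    theta : zenith angle, phi : azimuthal angle (the paper's convention).
    scipy's lpmv includes the Condon-Shortley phase, matching the physics
    convention assumed here for P_n^m.
    """
    norm = np.sqrt((2 * n + 1) / (4 * np.pi) * factorial(n - m) / factorial(n + m))
    return norm * lpmv(m, n, np.cos(theta)) * np.exp(1j * m * phi)
```

For example, Ynm(0, 0, θ, ϕ) is the constant 1/√(4π) ≈ 0.282 for any direction, which is the omnidirectional (zeroth-order) component.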

In Eq. (2), the coefficients bn are radial functions, which are often referred to as plane wave modal coefficients and are dependent on the geometry of the array used. For a rigid sphere, the coefficients are27 

b_n = j_n(ka) - \frac{j_n'(ka)}{h_n^{(2)\prime}(ka)}\, h_n^{(2)}(ka),
(4)

where jn are spherical Bessel functions of order n, hn(2) are spherical Hankel functions of the second kind of order n, and the prime (′) indicates a derivative with respect to the argument.
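Eq. (4) is straightforward to evaluate with SciPy's spherical Bessel routines; the following sketch (helper name is ours) forms the second-kind Hankel function as h_n^(2) = j_n − i y_n and uses the derivative=True flag for the primed terms:

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def bn_rigid(n, ka):
    """Plane wave modal coefficients of Eq. (4) for a rigid sphere."""
    jn = spherical_jn(n, ka)
    jn_p = spherical_jn(n, ka, derivative=True)
    hn2 = spherical_jn(n, ka) - 1j * spherical_yn(n, ka)
    hn2_p = jn_p - 1j * spherical_yn(n, ka, derivative=True)
    return jn - (jn_p / hn2_p) * hn2
```

At low ka the zeroth-order coefficient approaches unit magnitude while the higher orders roll off steeply, which is why inverting the bn coefficients (the radial filters discussed later) demands large low-frequency gain at high orders.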

The spherical harmonics form an orthonormal basis set satisfying the orthogonality relation

\int_{0}^{2\pi}\!\!\int_{0}^{\pi} Y_n^m(\theta,\phi)\, Y_{n'}^{m'*}(\theta,\phi)\, \sin\theta\, d\theta\, d\phi = \delta_{mm'}\,\delta_{nn'},
(5)

where δ is the Kronecker delta. Taking advantage of the orthogonality, the spatial Fourier transform can be applied to the microphone signals to transform them into the spherical harmonics domain. The spatial Fourier coefficients for the spherical harmonics, X̃nm, can be obtained by applying weights to each microphone signal and summing the signals together. For a nearly-uniformly sampled sphere with S microphones, the Fourier coefficients as a function of frequency (f = ck/2π, where c is the sound speed) become1 

\tilde{X}_{nm}(f) = \frac{1}{b_n(f)}\,\frac{4\pi}{S} \sum_{s=1}^{S} X_s(f)\, Y_n^{m*}(\theta_s,\phi_s),
(6)

where Xs is the complex pressure in the frequency domain measured at microphone s, obtained by performing a discrete Fourier transform (DFT) on each microphone signal, and (θs,ϕs) is the location of the microphone on the sphere. The spatial Fourier components can be weighted and combined to beamform a directional pattern of order N,

X(\theta_l,\phi_l) = \sum_{n=0}^{N} c_n \sum_{m=-n}^{n} \tilde{X}_{nm}(ka)\, Y_n^m(\theta_l,\phi_l),
(7)

where cn are weights that determine the shape of the beampattern, and (θl,ϕl) is the look direction of the beam in which the main lobe of the beampattern is oriented. For the specific case of plane wave decomposition (PWD), Eq. (7) can be used with the cn weights set to unity.28 The PWD beam pattern represents a plane wave that is order-limited, and for a given order will achieve the maximum directivity.1 Directional room IRs can be obtained by beamforming the IRs measured with a spherical array.
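Eqs. (6) and (7) amount to a weighted sum over microphone spectra followed by a weighted sum over spherical harmonics. The sketch below (function names and the single-frequency structure are our illustration) implements both for one frequency bin, with all c_n = 1 as in PWD:

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def ynm(n, m, theta, phi):
    """Complex spherical harmonic, Eq. (3) (physics convention; scipy's lpmv
    supplies the Condon-Shortley phase)."""
    norm = np.sqrt((2 * n + 1) / (4 * np.pi) * factorial(n - m) / factorial(n + m))
    return norm * lpmv(m, n, np.cos(theta)) * np.exp(1j * m * phi)

def sht_coefficients(X, mic_dirs, N, bn):
    """Eq. (6) at a single frequency bin.

    X        : complex mic pressures, shape (S,)
    mic_dirs : array of (zenith, azimuth) pairs, shape (S, 2)
    bn       : radial coefficients b_n for n = 0..N at this frequency
    """
    S = len(X)
    return {(n, m): (4 * np.pi / S)
            * np.sum(X * np.conj(ynm(n, m, mic_dirs[:, 0], mic_dirs[:, 1]))) / bn[n]
            for n in range(N + 1) for m in range(-n, n + 1)}

def pwd_beam(Xnm, N, theta_l, phi_l):
    """Eq. (7) with all c_n = 1: plane wave decomposition in the look direction."""
    return sum(Xnm[(n, m)] * ynm(n, m, theta_l, phi_l)
               for n in range(N + 1) for m in range(-n, n + 1))
```

A useful sanity check follows from the spherical harmonic addition theorem: beamforming an order-limited plane wave toward its own arrival direction yields (N + 1)²/(4π), independent of that direction.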

Third-order Ambisonics (more generally referred to here as Ambisonics) was utilized in this study to generate spatial reproductions of the measured data. Ambisonics is a spatial audio playback system originally developed by Gerzon2 in the 1970s as a method to reproduce sound fields represented in the spherical harmonics domain. Ambisonics initially only used the zeroth- and first-order spherical harmonic components and has since been extended to higher orders.3 Ambisonics offers a convenient method of reproducing recordings obtained with a compact spherical microphone array since the processing can be done in the spherical harmonics domain.29 

For the Ambisonic reproduction, this paper uses a form of the spherical harmonics that employs real-valued trigonometric functions, which are more convenient for audio signals, based on the ambiX format convention30 

\hat{Y}_n^m(\alpha,\phi) = \sqrt{\frac{(2-\delta_{m0})}{4\pi}\,\frac{(n-|m|)!}{(n+|m|)!}}\; P_n^{|m|}(\sin\alpha) \begin{cases} \sin(|m|\phi) & \text{if } m<0 \\ \cos(|m|\phi) & \text{if } m\ge 0, \end{cases}
(8)

where α = π/2 − θ is the elevation angle relative to the horizontal plane, and δ is the Kronecker delta. Although not standard notation, the hat is used here to differentiate between the spherical harmonics in Eqs. (8) and (3).
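A direct implementation of Eq. (8) is sketched below. It assumes the paper's P_n^|m| excludes the Condon-Shortley phase (as is typical for ambiX-style real harmonics); scipy's lpmv includes that phase, hence the (−1)^|m| correction:

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def real_sh(n, m, alpha, phi):
    """Real-valued spherical harmonics of Eq. (8) (ambiX-style).

    alpha : elevation, phi : azimuth. Assumes P_n^|m| without the
    Condon-Shortley phase; lpmv includes it, hence the (-1)**|m| factor.
    """
    am = abs(m)
    norm = np.sqrt((2.0 - (m == 0)) / (4 * np.pi)
                   * factorial(n - am) / factorial(n + am))
    legendre = (-1) ** am * lpmv(am, n, np.sin(alpha))
    return norm * legendre * (np.sin(am * phi) if m < 0 else np.cos(am * phi))
```

Note that, unlike Eq. (3), these harmonics are indexed by elevation rather than zenith angle, and negative degrees map to sine terms rather than negative-frequency exponentials.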

The signals to drive the individual loudspeakers for Ambisonic playback are calculated by

g=Dx̂,
(9)

where g = (g1, g2, …, gQ)T are the loudspeaker driving signals, x̂ = (x̂0,0, x̂1,−1, x̂1,0, …, x̂N,N)T are the Ambisonic signals to be reproduced, which are encoded in the spherical harmonics domain, and D is known as the Ambisonic decoder matrix. Several methods exist to design a decoder matrix for a given loudspeaker array, such as the traditional mode-matching or pseudoinverse method,31 all-round Ambisonic decoding,32 and energy-preserving decoding.33 The pseudoinverse method as implemented in the Ambisonics Decoder Toolbox34,35 was utilized in this study since it is well suited for nearly spherical loudspeaker distributions. Using this method, the basic decoder matrix is determined by taking the pseudoinverse of the following spherical harmonics matrix evaluated at the specific loudspeaker locations:

\mathbf{D} = \mathrm{pinv}\begin{bmatrix} \hat{Y}_0^0(\alpha_{\mathrm{spk}1},\phi_{\mathrm{spk}1}) & \hat{Y}_1^{-1}(\alpha_{\mathrm{spk}1},\phi_{\mathrm{spk}1}) & \cdots & \hat{Y}_N^N(\alpha_{\mathrm{spk}1},\phi_{\mathrm{spk}1}) \\ \hat{Y}_0^0(\alpha_{\mathrm{spk}2},\phi_{\mathrm{spk}2}) & \hat{Y}_1^{-1}(\alpha_{\mathrm{spk}2},\phi_{\mathrm{spk}2}) & \cdots & \hat{Y}_N^N(\alpha_{\mathrm{spk}2},\phi_{\mathrm{spk}2}) \\ \vdots & \vdots & \ddots & \vdots \\ \hat{Y}_0^0(\alpha_{\mathrm{spk}Q},\phi_{\mathrm{spk}Q}) & \hat{Y}_1^{-1}(\alpha_{\mathrm{spk}Q},\phi_{\mathrm{spk}Q}) & \cdots & \hat{Y}_N^N(\alpha_{\mathrm{spk}Q},\phi_{\mathrm{spk}Q}) \end{bmatrix}.
(10)
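The basic decoder is thus a single pseudoinverse. The sketch below assumes the row-per-loudspeaker matrix of Eq. (10) and ACN channel ordering; real_sh stands for any implementation of Eq. (8), and the transpose convention is chosen so that g = D x̂ has the shapes of Eq. (9):

```python
import numpy as np

def pinv_decoder(spk_dirs, N, real_sh):
    """Basic pseudoinverse decoder, Eq. (10).

    spk_dirs : list of (elevation, azimuth) pairs for the Q loudspeakers
    N        : Ambisonic order
    real_sh  : callable real_sh(n, m, elevation, azimuth), e.g., Eq. (8)
    Returns D of shape (Q, (N+1)^2) so that g = D @ x_hat, Eq. (9).
    """
    # Spherical harmonics matrix: one row per loudspeaker, columns in ACN order
    Y = np.array([[real_sh(n, m, alpha, phi)
                   for n in range(N + 1) for m in range(-n, n + 1)]
                  for (alpha, phi) in spk_dirs])
    # pinv of the transposed matrix gives the (Q x K) decoder for g = D @ x_hat
    return np.linalg.pinv(Y.T)
```

When the loudspeakers span the spherical harmonic space up to order N, re-encoding the speaker feeds recovers the Ambisonic signals exactly, which is the defining property of the mode-matching decoder.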

The high frequency localization performance in Ambisonic reproductions can be improved by applying order-dependent gains to the decoder matrix, which is called max-rE decoding.34 Applying these gains will effectively change the directivity pattern to reduce side lobes, which helps to preserve interaural level differences (ILDs). The implementation in Ref. 34 uses a phase-matched crossover to transition from basic decoding to max-rE decoding at 400 Hz.
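The max-rE gains themselves are not given in the text; a common 3D formulation (our assumption here, not necessarily the exact gains used in Ref. 34) sets g_n = P_n(r_E), where r_E is the largest zero of the Legendre polynomial of degree N + 1:

```python
import numpy as np
from numpy.polynomial import legendre

def max_re_gains(N):
    """Order-dependent max-rE gains for 3D Ambisonics (a common formulation):
    g_n = P_n(r_E), with r_E the largest zero of P_{N+1}. Tapering the higher
    orders this way shrinks the side lobes of the reproduced beam pattern.
    """
    c = np.zeros(N + 2)
    c[-1] = 1.0                       # coefficient vector selecting P_{N+1}
    r_e = np.max(legendre.legroots(c))
    one_hot = np.eye(N + 2)
    return np.array([legendre.legval(r_e, one_hot[n]) for n in range(N + 1)])
```

For N = 3 this gives gains of roughly [1, 0.861, 0.612, 0.305], which would be applied above the 400 Hz crossover mentioned in the text.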

Several pieces of hardware were utilized to capture the spatial room IRs. The primary hardware used consisted of an mh acoustics em32 Eigenmike spherical microphone array [Fig. 1(a)], a Brüel & Kjær (B&K) type 4100-D binaural mannequin [Fig. 1(b)], and a Brüel & Kjær type 4292 dodecahedron loudspeaker [Fig. 1(c)]. Additionally, a Crown XLS 2500 audio amplifier was used to drive the loudspeaker, and an RME Babyface was the audio interface used to send and receive signals.

FIG. 1.

(Color online) Measurement hardware used for the IR measurements: (a) Eigenmike spherical microphone array (mh acoustics em32), (b) B&K binaural mannequin (type 4100-D), and (c) B&K dodecahedron loudspeaker (type 4292).

The Eigenmike microphone array consists of thirty-two 1.3-cm omnidirectional microphones mounted on a rigid spherical baffle with a diameter of 8.4 cm. The microphones sample the sphere according to the center of the faces of a truncated icosahedron, which is a nearly-uniform sampling scheme that preserves the orthogonality of the spherical harmonics up to 3rd order. The em32 system contains the microphone array as well as the Eigenmike Interface Box (EMIB). The EMIB interfaces with a PC with a standard ASIO driver, which allows for synchronized IR measurements to be made with commercially available room acoustics software. For the excitation signal, the ADAT output of the EMIB was sent to the RME Babyface, which was used as a D/A converter. The output of the RME Babyface was connected to the amplifier to drive the dodecahedron loudspeaker.

In addition to spatial IRs obtained with the spherical microphone array, additional measurements were made using a Brüel & Kjær type 4100-D head and torso simulator (HATS). The purpose of obtaining the binaural IRs was to be able to directly compare the original measured binaural IR with a binaural IR reproduced from the spherical array measurements (described in Sec. III B). For these binaural measurements, the RME Babyface was used as an external sound card to both drive the amplifier and record the two microphone channels. The dodecahedron loudspeaker was utilized in the same configuration as the spherical array measurements for these binaural measurements.

IR measurements were obtained in the Peter Kiewit Concert Hall in Omaha, NE, a shoebox-shaped hall which opened in 2005. The hall has a volume of roughly 24 000 m3 (850 000 ft3) and 2000 seats. The hall features variable absorption in the form of absorptive panels on the ceiling and walls. Measurements were made using the spherical microphone array in three different hall absorption settings: the most absorptive setting (Setting 1), the most reverberant setting (Setting 2), and a moderately reverberant setting (Setting 3), and in 10 to 11 receiver positions in the hall (Fig. 2) depending on the hall setting. The mid-band reverberation times for these settings are 1.8, 2.8, and 2.4 s, respectively. For all of the measurements, the dodecahedron loudspeaker was placed in the center of the stage. Room IRs were obtained using the room acoustics software EASERA36 with the multi-channel module. The excitation used was a 6-s logarithmic sine sweep, with eight averages and one pre-sweep. The IR measurements can be found at http://sites.psu.edu/spral/lev_irs_2019.

FIG. 2.

(Color online) Receiver locations in the Peter Kiewit Concert Hall. Exact seat locations are indicated by red dots.

The Ambisonics reproduction in this study was conducted in the AURAS facility at Penn State University, shown in Fig. 3(a). The facility includes a loudspeaker array consisting of 30 two-way sealed-box loudspeakers. The loudspeakers feature a 4 in. (∼10 cm) mid-bass driver and 1 in. (∼2.5 cm) fabric dome tweeter, which are passively crossed over at 1.8 kHz. The loudspeakers are individually equalized from approximately 60 Hz to 20 kHz to account for magnitude and phase differences in the frequency response of each loudspeaker.37 

FIG. 3.

(Color online) The AURAS loudspeaker array (a), and the distribution of the 30 loudspeakers in the array (b).

The 30 loudspeakers in the AURAS array are arranged in a nearly-spherical distribution as shown in Fig. 3(b). The majority of the loudspeakers are distributed over three rings: 8 loudspeakers located at α = −30°, 12 loudspeakers located at α = 0°, and 8 loudspeakers located at α = +30°. In each ring, the loudspeakers are distributed equally azimuthally with a loudspeaker located at ϕ = 0° directly in front of the listener. The average distance from the loudspeakers to the center of the array is r = 1.3 m. The remaining two loudspeakers are placed overhead at α = +60°, ϕ = ±90°, and r = 0.57 m.

Auralizations were generated for reproduction in the AURAS facility using the measured IRs described in Sec. II. A block diagram summarizing the stimulus generation is shown in Fig. 4. Stimuli were rendered using the digital audio workstation software REAPER38 and VST plugins from the Ambisonic Decoder Toolbox,34 and the ambiX and mcfx plug-in suites.39 The processing was applied to anechoic music files that were convolved with the IRs measured with the spherical microphone array in matlab.

FIG. 4.

Block diagram for Ambisonic reproduction.

The first step in processing the microphone signals is to encode them into Ambisonic signals (i.e., spherical harmonic signals) in the ambiX format of Eq. (8), using the ambiX_decoder plug-in with the Eigenmike preset. After Ambisonic encoding, radial filters are applied using the mcfx_convolver plug-in. These radial filters invert the plane wave modal coefficients in Eq. (4) and are implemented as finite impulse response (FIR) filters. The radial filters, shown in Fig. 5, are crossed over with linear-phase filters at 50 Hz, 500 Hz, and 1.3 kHz for 1st-, 2nd-, and 3rd-order reproduction, respectively, and at nth order a (2n + 1) correction factor is applied to preserve the pressure amplitude of the main lobe.40 The crossover frequencies were chosen by measuring a plane wave with the spherical microphone array, performing a plane wave decomposition using Eq. (7) with the cn beamforming weights set to one, and determining the frequency at which the beam pattern begins to degrade for a given order. A random-incidence correction is included in these filters to equalize the frequency response of the spherical array's microphone capsules, which have a high-frequency roll-off characteristic.41 
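The radial filters must invert 1/b_n without applying unbounded low-frequency gain. The paper does this with per-order linear-phase crossovers; a simpler, commonly used alternative is to cap the inversion gain directly. The sketch below is a hypothetical stand-in illustrating that regularization idea, not the processing actually used in the study:

```python
import numpy as np

def limited_radial_inverse(bn_vals, max_gain_db=20.0):
    """Hypothetical sketch of radial-filter regularization: invert the modal
    coefficients (1/b_n) but cap the magnitude at max_gain_db, instead of the
    paper's crossover-based approach (50 Hz / 500 Hz / 1.3 kHz per order).
    """
    max_gain = 10.0 ** (max_gain_db / 20.0)
    inv = 1.0 / np.asarray(bn_vals)
    # Scale down any frequency bins whose inversion gain exceeds the cap
    scale = np.minimum(1.0, max_gain / np.maximum(np.abs(inv), 1e-12))
    return inv * scale
```

Either approach trades low-frequency spatial resolution for noise robustness: where b_n is tiny, exact inversion would amplify microphone self-noise by tens of decibels.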

FIG. 5.

(Color online) Radial filters convolved with microphone equalization that are applied after encoding the spherical array's individual microphone signals to Ambisonic signals.

After radial filtering, the Ambisonic signals were decoded into the loudspeaker signals. The Ambisonic decoder was designed using the Ambisonic Decoder Toolbox34 using the pseudoinverse method as described in Sec. I B. Finally, the decoded loudspeaker signals are equalized for the individual loudspeakers using the mcfx_convolver plug-in.

The performance of the Ambisonics reproduction in the AURAS facility was evaluated using both objective measurements and subjective listening tests. First, a plane wave oriented directly in front of the listener (α = 0°, ϕ = 0°) was encoded into Ambisonic signals, reproduced over the AURAS loudspeaker array, and measured with the spherical microphone array. A PWD was performed using the SOFiA toolbox42 and the measured plane wave was compared to a simulated plane wave by visual inspection. The measured plane wave matched the expected result above the appropriate crossover frequency for each of the 1st, 2nd, and 3rd orders, up to approximately 10 kHz. An example of the comparison at 2 kHz is shown in Fig. 6, in which the plane wave exhibits the correct 3rd-order max-rE beam pattern, which is expected for all frequencies above 1.3 kHz.

FIG. 6.

(Color online) Comparison of 3rd order simulated plane wave with max-rE decoding, representative of a plane wave produced in the AURAS facility above 1.3 kHz (left) to a plane wave produced in the AURAS facility measured at 2 kHz (right). Pressure magnitude is shown on a linear scale normalized to a maximum value of 1.

In addition to the measurements, informal listening tests were performed in which plane waves convolved with pink noise were encoded into 3rd order Ambisonics signals and panned in full 3D space. Listeners noted that the panning was smooth, even in between loudspeakers, and that they were able to localize sound in the correct directions except when sound was panned below α = −30° elevation, where no loudspeakers were present. The Ambisonic decoder and subsequent filtering do not introduce any artifacts such as phasing or pre-ringing, and the spatial image is stable with reasonable head movement at the central listening locations.

As a perceptual validation of the Ambisonics reproduction, the binaural IRs described in Sec. II B were compared with the reproduced spherical microphone array IRs. The binaural head was placed in the center of the AURAS array and the reproduced IRs were measured with the head. These IRs were convolved with anechoic music, and an ABX listening test was conducted to compare the reproduced binaural IR to the original measured IR. A monaural equalization was applied to the reproduced binaural IRs to match the average spectrum of the left plus right ears to the measured IRs, while still maintaining the reproduced binaural cues. The ABX listening test was taken by six musicians, all with hearing thresholds at or below 15 dBHL for the 250 Hz through 8 kHz octave bands. The subjects reported that they had a difficult time distinguishing between the original IR and the reproduced IR, and were better able to hear differences by listening to a single note, or a short segment containing a few notes, on repeat. Subjects who could hear differences while listening to entire passages needed a long time to make a decision (30 s per trial on average). Based on the difficulty of this test, the time required to distinguish the differences, and the similarities noted by the listeners, the authors have concluded that the recorded binaural Ambisonic reproductions in AURAS are perceptually nearly indistinguishable from the original binaural recordings.

A subjective study was carried out using the room IR measurements processed for Ambisonic reproduction in the AURAS facility in which participants were asked to rate stimuli in terms of perceived envelopment. The room IRs were convolved with a 64-second anechoic music excerpt, Bizet's L'Arlesienne Suite No. 2: Menuet.43 This piece contains a full orchestra including strings, winds, and timpani, and is played at a moderate tempo (72 beats per minute). The stimuli were presented in four sets of eight signals, which were presented to listeners in a randomized order. Set 1 contained a subset of eight IRs from the most absorptive setting, Set 2 contained a subset of eight IRs from the in-between setting, and Set 3 contained a subset of eight IRs from the most reverberant setting. Set 4 contained a mixture of IRs from the three aforementioned settings that all had similar LJ values.

During the subjective test, each listener was seated in the center of the loudspeaker array and was able to listen to the different stimuli via a graphical user interface (GUI) implemented in Max 7.44 The GUI presented the participants with a screen with individual buttons for all of the stimuli and the subjects were able to instantaneously switch between each of the stimuli without restarting the musical motif. The GUI also enabled listeners to limit the motif to a specific time segment of the passage. Each subject was asked to rate how enveloped they felt by the sound field on a scale from 0 (not at all enveloped) to 100 (completely enveloped).

Before beginning the test, participants completed a short training period. The training began with a tutorial explaining the GUI, with explicit instructions to focus only on LEV, the sense of being surrounded by or immersed in the sound, and to ignore all other aspects of the sound field, including apparent source width. Following the tutorial, participants performed a training session with a reduced set of four audio files to learn how the GUI worked. Participants were then given a practice set containing a full set of eight stimuli; they were told that this was the first set of the test, but in actuality these data were not used in the analysis.

The listening test was conducted with 15 participants (6 male and 9 female). All subjects were required to have measured hearing thresholds at or below 15 dBHL from 250 to 8000 Hz, since research has shown that in critical listening tests subjects with near-normal hearing thresholds provide more consistent ratings.45 Additionally, all subjects were required to have a minimum of five years of formal musical training and be musically active at the time of the study (i.e., performing in an ensemble and/or taking private music instruction). This requirement was imposed because musicians have been shown to learn listening test procedures more quickly and to give more consistent responses compared to non-musicians.46,47 The average age of the participants in the subjective test was 26 years old, with an average of 10 years of formal musical training.

Each set was analyzed separately using a one-way repeated-measures analysis of variance (ANOVA).48 The null hypothesis of this ANOVA test was that the mean LEV ratings of all of the stimuli within a set are identical; a significant p-value would indicate that at least one pair of stimuli within a set has statistically different mean LEV ratings. The statistical analysis yielded significant results for Sets 1 through 3 (p < 0.001, p < 0.001, and p = 0.014, respectively), which indicates that each of these sets contains at least one pair of stimuli with significantly different mean LEV ratings. The result for Set 4, however, was not significant at the 95% confidence level (p = 0.071); therefore, no pair of stimuli in Set 4 was found to have different mean LEV ratings. One possible explanation for this non-significant result is low statistical power: the number of subjects may have been too small to detect the small differences between the stimuli in this set, which all had similar LJ values. The mean LEV ratings from the four test sets are shown in Fig. 7.
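The per-set analysis can be reproduced from sums of squares alone. The sketch below is a generic textbook formulation of the one-way repeated-measures ANOVA (not necessarily the authors' exact software), returning the F statistic and p-value for the stimulus effect from a subjects-by-stimuli rating matrix:

```python
import numpy as np
from scipy.stats import f as f_dist

def rm_anova_oneway(X):
    """One-way repeated-measures ANOVA on a (subjects x stimuli) rating matrix.
    Returns (F, p) for the stimulus effect; subject variability is partitioned
    out of the error term, as appropriate for repeated measures.
    """
    n, k = X.shape
    grand = X.mean()
    ss_cond = n * np.sum((X.mean(axis=0) - grand) ** 2)    # between stimuli
    ss_subj = k * np.sum((X.mean(axis=1) - grand) ** 2)    # between subjects
    ss_err = np.sum((X - grand) ** 2) - ss_cond - ss_subj  # residual
    df_cond, df_err = k - 1, (n - 1) * (k - 1)
    F = (ss_cond / df_cond) / (ss_err / df_err)
    return F, f_dist.sf(F, df_cond, df_err)
```

With k = 2 stimuli this reduces to a paired t-test (F = t²), which provides a convenient sanity check.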

FIG. 7.

(Color online) Mean LEV ratings for the four test sets. Error bars depict standard errors. Note that the same stimulus between two different sets may have a different absolute LEV rating, since the stimuli are rated relative to other stimuli within the set.

For the three sets with significant p-values, pairwise t-tests were conducted to determine the individual pairs with significant differences. Within Sets 1 and 2, pairs were identified in which the LEV ratings were significantly different, but values of LJ were identical. Conversely, within the same sets, pairs were found in which the LJ values differed by up to 1.3 dB, but the LEV ratings were found to be similar. In order to examine the relationship between LEV ratings and the 3D late sound field from the stimuli in Sets 1 through 3 in more detail, the measurements were analyzed using beamformed IRs [Eq. (7)], which is described in Sec. V.

The LEV ratings obtained in each set were fit to one-way regression models using different room acoustics metrics as predictors for LEV: EDT, T30, C80, G, GLate, LJ, LLF, SBTS, and front/back ratio. (See the Appendix for details about how SBTS and front/back ratio were calculated from the spherical microphone array measurements.) Set 1 contained the largest differences in LEV ratings amongst the eight stimuli. The correlation coefficients (R) and associated p-values for each of the regression models are given for Set 1 in Table I. The metrics that show the highest correlation with LEV in multiple octave bands are those related to overall level: strength (G), late strength (GLate), and late lateral energy level (LJ). The metric with the highest correlation was GLate, with R = 0.88 (p < 0.01) in the 500 Hz octave band. It should also be noted that LJ was only significantly correlated with LEV at 500 Hz and above, with correlation coefficients ranging from R = 0.76 to R = 0.80, although this metric was originally defined to be related to LEV from 125 Hz to 1 kHz.9 
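
Each one-metric regression reduces to a Pearson correlation between the per-stimulus metric values and the mean LEV ratings; a sketch with hypothetical data (the numbers below are illustrative only, not values from the study):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-stimulus values for one metric (e.g., a late strength
# in dB in one octave band) and the corresponding mean LEV ratings.
metric = np.array([-2.0, -1.1, 0.3, 0.8, 1.5, 2.2, 2.9, 3.4])
lev = np.array([45.0, 52.0, 58.0, 57.0, 66.0, 70.0, 74.0, 78.0])

R, p = pearsonr(metric, lev)  # R is the correlation coefficient of the fit
```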

TABLE I.

Correlation coefficients for different metrics as LEV predictors for Set 1.

Metric    Significant octave bands (Hz)   Correlation coefficient, R (by octave band)   p-value (by octave band)
EDT       None                            N/A                                           >0.05
T30       125                             −0.80                                         0.017
C80       125                             −0.78                                         0.022
G         500, 1k, 4k                     0.73, 0.79, 0.76                              0.040, 0.020, 0.028
GLate     500, 2k                         0.88, 0.77                                    0.004, 0.024
LJ        500, 1k, 2k, 4k                 0.77, 0.76, 0.79, 0.80                        0.026, 0.027, 0.019, 0.017
LLF       None                            N/A                                           >0.05
SBTS      None                            N/A                                           >0.05
FBR       250                             −0.76                                         0.029
FBRLate   None                            N/A                                           >0.05

The spatial IRs measured with the spherical microphone array were analyzed objectively using beamforming techniques in order to relate the results of the subjective study to the physical sound field. Directional IRs were generated by applying a spatial Fourier transform as in Eq. (6), applying the radial filters shown in Fig. 5, and beamforming via Eq. (7) using look directions spaced every 3 degrees in azimuth and zenith angle. This process results in a 120 by 60 (azimuth by zenith angle) grid of IRs. Two sets of grids were generated: one using PWD coefficients in the frequency regions where the 2nd and 3rd order spherical harmonic components are usable, and one using a cardioid-type pattern at low frequencies, where only the 1st order components are usable. The cardioid-type pattern was used at low frequencies because the 1st order plane wave pattern has a large rear lobe, which can lead to misinterpretation of the energy plots, as energy appears to originate from the wrong direction.
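
A minimal sketch of the steering step over such a grid, assuming (as a simplification) that the radial filters have already been applied, so the IRs are available as spherical harmonic coefficients of shape ((N+1)², T); the function name and coefficient layout are illustrative, not from the study:

```python
import numpy as np
from scipy.special import sph_harm

def steer_grid(anm, order, step_deg=3.0):
    """Steer SH-domain IRs over an azimuth-by-zenith grid of look directions.
    anm: ((order+1)**2, T) coefficients; radial filtering is assumed done."""
    az = np.radians(np.arange(0.0, 360.0, step_deg))    # 120 azimuths
    zen = np.radians(np.arange(0.0, 180.0, step_deg))   # 60 zenith angles
    grid = np.zeros((len(az), len(zen), anm.shape[1]))
    for i, a in enumerate(az):
        for j, z in enumerate(zen):
            # conjugated spherical harmonics evaluated at the look direction
            y = np.array([sph_harm(m, n, a, z)
                          for n in range(order + 1) for m in range(-n, n + 1)])
            grid[i, j] = np.real(np.conj(y) @ anm)
    return grid
```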

The directional IRs were further processed to investigate the timing and direction of the energy in the IRs as a function of frequency. First, a 5 ms-wide rectangular time window was applied to the IRs every 5 ms (i.e., 0 to 5 ms, 5 to 10 ms, etc.). The time-windowed IRs were then filtered into octave bands with center frequencies from 125 Hz to 4 kHz, and the energy in each octave band was summed. Using this approach, the energy in different time windows can be plotted on a grid as a function of angle for different octave bands with a resolution of 5 ms. For example, to investigate the early sound field, the energy contained in the 5 ms time segments from 0 to 80 ms can be summed and plotted on a grid.
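
The windowing and band filtering for a single directional IR can be sketched as follows; this is a simplified illustration that uses a Butterworth band-pass in place of a standards-compliant octave filter:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def windowed_band_energy(ir, fs, fc, win_ms=5.0):
    """Energy of an IR in one octave band, summed in consecutive 5 ms windows.
    fc is the octave-band center frequency in Hz (e.g., 1000)."""
    lo, hi = fc / np.sqrt(2), fc * np.sqrt(2)           # octave band edges
    sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
    banded = sosfilt(sos, ir)
    n = int(round(fs * win_ms / 1000))                  # samples per window
    nwin = len(banded) // n
    # sum of squared samples in each rectangular 5 ms window
    return np.array([np.sum(banded[k * n:(k + 1) * n] ** 2) for k in range(nwin)])
```

Summing a contiguous range of these windows (e.g., the first sixteen for 0 to 80 ms) gives the per-direction energy used in the grid plots.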

For each of the IRs, the late energy (80 ms onward) was investigated, and instances were found where the energy in the late sound field did not correlate with the subjective envelopment ratings. For example, the LEV ratings in Set 2 of R3 (LJ = −1.4 dB) and R8 (LJ = −0.1 dB) were nearly identical (73 and 72, respectively), but the late sound fields are very different, as shown in Fig. 8(a). In Fig. 8 and all subsequent plots, the 1 kHz octave band results are shown, although the trends are observed broadband (500 Hz–4 kHz). The late energy in R3 is concentrated in front of and above the listener with lower side energy, whereas the energy in R8 appears more diffuse, with the most energy arriving from behind the listener. Differences in the PWD for the late-arriving lateral energy are on the order of 3 dB.

FIG. 8.

(Color online) (a) Comparison of the late sound field at 1 kHz (80 ms to ∞) between receiver positions with similar LEV ratings (R3 and R8 from Set 2), with 0° azimuth pointing toward the stage and 0° zenith angle pointing straight up. Energy at R3 is concentrated toward the front (R3 is underneath a balcony), whereas energy at R8 is more evenly distributed throughout the sphere (R8 is in the top balcony). (b) Comparison of the late sound field at 1 kHz between receiver positions with different LEV ratings (R9 and R10 from Set 1). The spatial distribution of late energy is similar between the two receivers, yet the LEV ratings of these stimuli were significantly different. Sound pressure level is shown ranging from −10 to 0 dB, where 0 dB is the maximum level of both sound fields (overall level differences are maintained between the top and bottom plots). The images on the left show the energy distributions over spheres, while the images on the right show the same information, but flattened onto a 2-D plot (similar to an unraveled map).


Conversely, instances were found in which stimuli had significantly different LEV ratings despite similar LJ values. The stimuli in one example pair from Set 1 (R9 and R10) have nearly identical values of LJ (−3.3 and −3.6 dB, respectively) but very different LEV ratings (66 and 48, respectively, p = 0.020), as seen in Fig. 8(b). At these receiver positions, the late energy distributions appear to be very similar, and the differences in lateral energy in the PWD are on the order of 1–2 dB, much smaller than the differences seen in the previous pair. The results from these pairs indicate that the directional distribution of late-arriving energy does not completely predict the sense of LEV.

Because of the anomalies found in the late sound fields, different time windows were explored to find trends in parts of the IRs that correlated with the LEV ratings. These additional time windows included the early sound (0 to 80 ms), late IRs with different crossover points (e.g., 40 ms onward or 200 ms onward), and portions in the middle of the IRs (e.g., 50–200 ms). Using this exploratory approach, a time window in the middle of the IR, from 70 to 100 ms, was found to agree with the LEV ratings. In this time window, R3 and R8 in Set 2 have a very similar distribution of lateral, back, and overhead energy, as seen in Fig. 9(a). The frontal energy in this time window is approximately 4 dB greater in R3, but frontal energy is assumed not to significantly impact LEV.9,14 The energy distribution in this time window could be related to the similarity in LEV ratings. For R9 and R10 in Set 1, differences can be seen in this time window: the energy behind the listener is roughly 3 dB higher in R9 than in R10, as shown in Fig. 9(b), which could be related to the differences in LEV ratings. To confirm the trends found using this time window, a second subjective study was conducted using modified versions of the IRs, described in Sec. VI.

FIG. 9.

(Color online) (a) Comparison of the sound field from 70 to 100 ms at 1 kHz between receiver positions with similar LEV ratings (R3 and R8 from Set 2). The energy at both receiver positions has a similar level and distribution in terms of lateral, behind, and overhead sound. The frontal energy does differ by 3 dB between the pair, but it is assumed that energy arriving from the front does not influence LEV. (b) Comparison of the sound field from 70 to 100 ms at 1 kHz between receiver positions with different LEV ratings (R9 and R10 from Set 1). R9 has a much stronger energy level from behind, which may be contributing to the perceived LEV. Sound pressure level is shown ranging from −10 to 0 dB, where 0 dB is the maximum level of both sound fields.


In order to investigate the influence of arrival time on the energy in the spatial IR, a follow-up subjective study was conducted using modified IRs created by mixing different portions of two IRs together with a variable crossover time. Two types of hybrid IRs were created for this study: (1) combinations of a pair of IRs that were rated to have the highest and lowest amounts of LEV from Set 1, and (2) combinations of the IR found to have the highest LEV with a modified version of that IR in which the spatial dependence was collapsed into a single direction directly in front of the listener. The objective of the listening tests using the first type of hybrid IRs was to determine at which point in time crossing over from a highly enveloping IR to an unenveloping IR or vice versa impacts the perception of envelopment. The listening test using the second type of hybrid IRs was used to study the same effect in the case where all spatial information is removed from portions of the IR.

For the first type of hybrid IRs, two stimuli were used from the previous listening test. The stimulus with the highest LEV rating from Set 1, which contained stimuli from the most absorptive hall setting (Setting 1), was at R3, while the stimulus with the lowest LEV rating was at R10. These IRs will be referred to as the high listener envelopment impulse response (HLEV IR) and the low listener envelopment impulse response (LLEV IR), respectively. Hybrid IRs were generated that contained the early part of the HLEV IR and the late part of the LLEV IR using crossover times ranging from 40 to 140 ms, denoted as HLEV/LLEV hybrid IRs. Similarly, hybrid IRs were created that contained the early part of the LLEV IR (Setting 1, R10) and the late part of the HLEV IR, denoted as LLEV/HLEV hybrids.

For the second set of hybrid IRs, a more extreme modification was utilized in which the spatial aspects of the sound field were completely removed from portions of the HLEV IR (Setting 1, R3). A monaural IR was generated by extracting the omnidirectional component of the HLEV IR and sending it to the loudspeaker located directly in front of the listener, collapsing the full 3D sound field into the single direction at α = ϕ = 0°. Hybrid IRs were then generated containing the early part of the HLEV IR and the late part of the monaural IR, denoted as 3D/mono hybrid IRs, and vice versa, denoted as mono/3D hybrid IRs, with crossover times ranging from 40 to 120 ms.

A 2.5 ms half-Hann window was used to cross over between the early part and the late part of the IRs, as depicted in Fig. 10. This window ensured that there was no audible click in the transition period while being short enough to avoid a noticeable transition. The stimuli were generated by convolving the hybrid IRs with the same anechoic music excerpt used in the first study, Bizet's L'Arlesienne Suite No. 2: Menuet.43 
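
The crossover can be sketched as follows, applied per Ambisonic channel; this is a simplified sketch, and the exact fade alignment used in the study may differ:

```python
import numpy as np

def hybrid_ir(early_ir, late_ir, fs, t_cross, fade_ms=2.5):
    """Mix two IRs at crossover time t_cross (s) using half-Hann fades."""
    n_fade = int(round(fs * fade_ms / 1000))
    n_cross = int(round(fs * t_cross))
    hann = np.hanning(2 * n_fade)
    fade_in, fade_out = hann[:n_fade], hann[n_fade:]   # rising and falling halves
    out_len = max(len(early_ir), len(late_ir))
    w_early = np.zeros(out_len)
    w_late = np.zeros(out_len)
    w_early[:n_cross] = 1.0                            # early IR up to the crossover
    w_early[n_cross:n_cross + n_fade] = fade_out       # fade the early IR out
    w_late[n_cross:n_cross + n_fade] = fade_in         # fade the late IR in
    w_late[n_cross + n_fade:] = 1.0                    # late IR from then on
    pad = lambda x: np.pad(x, (0, out_len - len(x)))
    return pad(early_ir) * w_early + pad(late_ir) * w_late
```

With `t_cross` swept from 0.04 to 0.14 s, this produces the family of hybrid stimuli described above.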

FIG. 10.

(Color online) Half Hann 2.5 ms time windows used to mix two IRs (left), and example resulting hybrid IR (right). The early window and corresponding IR are shown in solid blue, and the late window and corresponding IR are shown in dotted and solid green, respectively.


During the subjective test, each listener was seated in the center of the loudspeaker array and could switch instantaneously between the different stimuli using the GUI discussed in Sec. IV, modified to include a different number of stimuli for the test cases described below. All subjects were given eight sets of stimuli presented in a random order and were asked to rate how enveloped they felt by the sound field on a scale from 0 (not at all enveloped) to 100 (completely enveloped). The listening test was conducted with 19 participants (10 male, 9 female), two of whom had participated in the previous listening test. All subjects had measured hearing thresholds at or below 15 dB HL from 250 to 8000 Hz, and were required to have a minimum of five years of formal music training and to be musically active. The average age of the participants was 23 years, with an average of eight years of formal musical training.

Six of the test sets contained the HLEV/LLEV and LLEV/HLEV hybrid stimuli. Each of these sets contained four stimuli presented together in a random order: stimuli generated with the complete HLEV IR, the complete LLEV IR, the HLEV/LLEV hybrid IR, and the LLEV/HLEV hybrid IR. Each of the six sets used a different crossover time between the early and late sound for the hybrid IRs, which varied from 40 to 140 ms in 20 ms increments.

The remaining two sets contained the 3D/mono hybrid IRs, and the mono/3D hybrid IRs, respectively. A Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) style listening test was implemented for these two sets.49 The reference used in the sets was the full 3D IR, and the anchor was the full monaural IR. Listeners were presented with the reference, which was labeled, and seven stimuli, two of which were the hidden anchor and the hidden reference. The remaining five stimuli had crossover times ranging from 40 to 120 ms in 20 ms increments. Listeners were instructed to treat the reference as having a rating of 100 (completely enveloped).

The six sets containing the hybrid stimuli were combined into two data sets: one containing the HLEV/LLEV ratings and one containing the LLEV/HLEV ratings. These two sets were analyzed separately using one-way repeated-measures ANOVA tests,48 and both contained statistically significant differences in the LEV ratings (p < 0.0001). Pairwise t-tests were then run using Tukey family error-rate corrections.48 The results for the hybrids using the early part of R3 and the late part of R10, shown in Fig. 11, suggest that the 40 and 60 ms crossover times were nearly indistinguishable from the original R10 IR in terms of LEV. Additionally, the 140 ms crossover time was nearly indistinguishable from the R3 IR in terms of LEV. The in-between crossover times have LEV ratings in between those of R3 and R10. The difference in LEV ratings between R10 and the 80 ms crossover hybrid is statistically significant (p = 0.040). These results indicate that, for this set, most of the auditory cues related to the perception of LEV are contained between 80 and 120 ms.

FIG. 11.

(Color online) Hybrid LLEV/HLEV IRs using the early part of R10 (highly unenveloping) and the late part of R3 (highly enveloping) with crossover times ranging from 40 to 140 ms in solid blue, and hybrid HLEV/LLEV IRs using the early part of R3 and the late part of R10 with crossover times ranging from 40 to 140 ms in dashed orange.


The opposite trend was observed in the hybrid IRs containing the early part of R10 and the late part of R3, as seen in Fig. 11. Although the 40 ms crossover stimulus was very similar to the original R3 IR, its LEV rating was significantly lower than that of the original. Here, the 120 and 140 ms crossover times were nearly indistinguishable from the original R10 IR in terms of LEV. The 60–100 ms crossover times had LEV ratings in between those of R3 and R10, which indicates that this range contains information important to the perception of LEV. In this listening test, more than half of the total change in envelopment occurs within the first 80 ms of crossover times. The difference in LEV ratings between R3 and the 60 ms crossover hybrid is statistically significant (p = 0.021). These results suggest that the part of the IR before 80 ms influences the perception of LEV. Moreover, a prediction of LEV that only includes energy arriving later than 80 ms may not be sufficient for this pair of stimuli.

The two sets containing the 3D/mono and mono/3D hybrid stimuli were analyzed separately using one-way repeated-measures ANOVA tests,48 and both sets contained statistically significant differences in the LEV ratings (p < 0.0001). The results from the 3D/mono listening test can be seen in Fig. 12. The extreme contrast between the reference and the anchor resulted in the subjects utilizing much more of the entire scale from 0 to 100 than in the previous listening tests (mean range of 81 versus 25). By replacing just the first 40 ms of the mono IR with the spatial IR, there is already a large increase in the average LEV rating (from 13 to 47, p < 0.0001). Increasing the crossover time further increases the average LEV rating, and at the 120 ms crossover point the average LEV rating is still not as high as that of the original fully 3D IR (80 versus 94, p < 0.0001). These results again illustrate that the early part of the IR does indeed affect the perception of LEV.

FIG. 12.

(Color online) Results of the modified IR test in which the late part of the IR is presented in full 3D, and the early part of the IR is reproduced from a single loudspeaker in solid blue, and results of the modified IR test in which the early part of the IR is presented in full 3D and the late part of the IR is reproduced from a single loudspeaker in dashed orange.


Results from the mono/3D hybrid listening test show the opposite trend from the 3D/mono test set, as seen in Fig. 12. Collapsing just the first 40 ms of the IR into the front of the listener causes a large degradation in average LEV rating, from 92 to 73 (p = 0.0001). Increasing the crossover time (collapsing more and more of the IR in time) results in a further decrease in LEV ratings. The 120 ms crossover time did not have as low an LEV rating as the full mono IR (37 versus 8, p < 0.0001), which could be because, even with the 120 ms crossover time, it is still easy to distinguish the spatial part of the IR from the collapsed portion.

In this study, spatial room IRs were obtained with a 32-element spherical microphone array in a 2000-seat dedicated concert hall (Peter Kiewit Concert Hall in Omaha, NE) with a volume of 24 000 m3 (850 000 ft3). The hall features variable acoustics, with mid-frequency reverberation times ranging from 1.8 to 2.8 s. The IRs were analyzed using beamforming techniques and processed for third-order Ambisonic reproduction in order to conduct listening tests. The Ambisonics playback system was validated both objectively and perceptually. LEV scores from the subjective listening tests were compared to several objective measures and to the energy in the beamformed directional IRs in different time windows.

To determine whether or not existing objective metrics were correlated with the perception of LEV, linear regressions were conducted with the LEV ratings as the response variable and individual metrics as the predictors. The predictors found with the highest correlations were all functions of sound level: strength (G), late strength (GLate), and late lateral energy level (LJ). This finding indicates that throughout this hall, the sense of envelopment is highly dependent on level. Metrics having to do with spatial aspects of the room, but not overall level, including JLF, LLF, FBR, and spatially balanced center time (SBTS), were found to have little to no correlation with the LEV ratings. However, since all of the stimuli from this study were in the same hall, it could be the case that in halls of different sizes and shapes, LEV could be better correlated with these metrics.

In comparing the spatial distribution of the energy in the IRs to the LEV ratings, pairs of stimuli emerged with statistically different mean LEV ratings, but similar late sound fields and vice versa. An exploratory look into different time windows found that there were trends in the energy distribution in an earlier time window, specifically from 70 to 100 ms.

To investigate specific time windows in the IRs, hybrid IRs were generated using a mix of an enveloping (HLEV) IR and an unenveloping (LLEV) IR, as well as hybrid IRs using a mix of the original fully spatial IR (3D) and an IR in which the energy was collapsed to a single direction in front of the listener (mono). For the hybrid IR sets that contained the early part of the HLEV IR and the late part of the LLEV IR (HLEV/LLEV), as well as the sets that contained the early part of the LLEV IR and the late part of the HLEV IR (LLEV/HLEV), it was found that a crossover time after 120 ms had very little effect on the envelopment. It was also found that a crossover time between 80 and 120 ms in the HLEV/LLEV hybrids, or between 60 and 100 ms in the LLEV/HLEV hybrids, resulted in an LEV rating halfway between those of the original enveloping and unenveloping IRs. This finding was in agreement with the spatial energy distribution found in the beamforming analysis of the IRs, and it suggests that this time region contains information important to the perception of LEV and that the early part of the impulse response should be taken into account in metrics predicting LEV. Hybrid listening tests containing 3D/mono and mono/3D hybrids confirmed that changes in the early part of the IR impact the LEV ratings.

One important note is that these time crossovers are only appropriate for this hall and cannot yet be extrapolated to the general case. The late part of the IR was relatively constant throughout this hall, which could be a reason that modifications to the IRs after 120 ms had a smaller impact on LEV ratings. Similar tests should be conducted using measured spatial room IRs from a number of halls with a range of shapes and sizes to identify the relative importance of different time segments and arrival direction of the energy in the IR in creating the sense of LEV.

The findings in these experiments provide evidence that improvements can be made to metrics intended to predict LEV in performing arts spaces. These measures should include portions of the early sound field since manipulating the early sound field changes the perception of envelopment. Because spherical microphone arrays enable higher angular resolution in arrival energy than previously used methods, new metrics can also include directional energy components that are more selective than a dipole to extract lateral energy.

The authors wish to acknowledge Matthew Neal for his assistance with the project and the AURAS facility; Ed Hurd at the Peter Kiewit Concert Hall; and Dr. Lily Wang, Matthew Blevins, Laura Brill, Hyun Hong, Joonhee Lee, and Zhao Ellen Peng of University of Nebraska-Lincoln for their assistance in obtaining the IR measurements. Approval for human subjects testing was obtained from Penn State's Institutional Review Board (IRB No. 41733). This work was sponsored by the National Science Foundation (NSF) Award No. 1302741.

The room acoustics metrics were calculated from the 32-channel spherical microphone array IR measurements, which were taken using an Eigenmike. For the standard metrics outlined in ISO 3382,18 along with any metrics involving an omnidirectional IR and a figure-8 IR, the zeroth and first order components were extracted from the spatial IR measurements, as shown in Ref. 41. For the additional metrics evaluated, specifically SBTS24 and FBR,14 the methods to compute these metrics from IR measurements are not well defined in the literature. The ambiguity arises because the metrics were developed from room simulations in which all reflections arrive from the discrete angles at which loudspeakers were placed, not from physical measurements. In measurements obtained in rooms, however, reflections can arrive from any angle, and the measurement microphones have a directivity pattern with some finite beam width. The authors have made their best attempt at obtaining these metrics from measured IRs.

1. FBR

The front/back ratio, FBR, is defined as

$$\mathrm{FBR} = 10\log_{10}\!\left(\frac{E_f}{E_b}\right),$$
(A1)

where $E_f$ and $E_b$ are the energies in the IR in the front half and the back half of the horizontal plane, respectively.22 For this study, the FBR was calculated separately in each octave band from 125 Hz to 4 kHz. Additionally, the FBR was calculated for the total impulse response, as well as for only the late part from 80 ms onward.

To calculate this metric, a beamformed first-order cardioid pattern was used to obtain energy from the front (pointed directly at the loudspeaker at θ = π/2, ϕ = 0) and energy from the back (pointed directly in the back at θ = π/2, ϕ = π). These beampatterns are a reasonable approximation for the energy in the front half of the horizontal plane and the energy in the back half of the horizontal plane, respectively.
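
The calculation can be sketched with B-format-style components (w omni, x front-facing dipole); this is a simplified sketch that assumes a scaling convention in which a frontal plane wave gives x equal in amplitude to w, which may differ from the normalization used in the study:

```python
import numpy as np

def fbr(w_ir, x_ir, fs, t_start=0.0):
    """FBR from an omni IR (w) and a front-facing dipole IR (x).
    First-order cardioids approximate the front/back hemisphere energies.
    t_start in seconds (e.g., 0.08 for the late-only FBR)."""
    n0 = int(round(fs * t_start))
    front = 0.5 * (w_ir[n0:] + x_ir[n0:])   # cardioid steered toward the stage
    back = 0.5 * (w_ir[n0:] - x_ir[n0:])    # cardioid steered to the rear
    return 10 * np.log10(np.sum(front ** 2) / np.sum(back ** 2))  # Eq. (A1)
```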

2. SBTS

To calculate the spatially balanced center time, SBTS, a PWD was performed on the spatial IRs with the look direction oriented in 16 equally spaced angles in the horizontal plane, the angles used in the development of the metric.24 Three separate PWDs were performed, up to order N = 3, as a function of frequency: 1st order from 50 to 500 Hz, 2nd order from 500 Hz to 1.3 kHz, and 3rd order from 1.3 to 8 kHz. The SBTS was calculated separately for each order.

For each of the 16 directionally beamformed IRs, a center time $T_{S_i}$ is first calculated,

$$T_{S_i} = \frac{\int_0^\infty t\,p_i^2(t)\,dt}{\int_0^\infty p^2(t)\,dt},$$
(A2)

where $p_i(t)$ is the pressure beamformed in the specified direction, and $p(t)$ is the omnidirectional pressure. To account for the crosstalk between the beamformed IRs, the individual center times are scaled by the sum of the center times,

$$\hat{T}_{S_i} = \frac{T_{S_i}}{\sum_i T_{S_i}}.$$
(A3)

The center times are weighted by the arrival direction such that energy directly in front of or behind the listener is weighted by 0.5 and lateral energy is weighted by 1,

$$a_i = \hat{T}_{S_i}\,\frac{1 + |\sin\phi_i|}{2},$$
(A4)

where $\phi_i$ is the look direction of the beamformed directional impulse response. SBTS is then calculated by weighting each pair of the $a_i$ terms by the sine of the angle between the corresponding look directions,

$$\mathrm{SBT_S} = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j \sin\phi_{ij},$$
(A5)

where $\phi_{ij}$ is the angle between look direction i and look direction j.
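
A minimal numpy sketch of Eqs. (A2)-(A5), assuming the 16 beamformed IRs are stacked in a (16, T) array, the look directions are equally spaced in the horizontal plane, and phi = 0 points toward the stage; the function name and array layout are illustrative:

```python
import numpy as np

def sbt_s(p_dir, p_omni, fs):
    """SBTS from n horizontal beamformed IRs p_dir (n, T) and an omni IR p_omni."""
    n = p_dir.shape[0]
    phi = 2 * np.pi * np.arange(n) / n                      # look directions
    t = np.arange(p_dir.shape[1]) / fs                      # time axis in seconds
    denom = np.sum(p_omni ** 2)                             # omni energy (dt cancels)
    ts = np.array([np.sum(t * p ** 2) / denom for p in p_dir])   # Eq. (A2)
    ts_hat = ts / np.sum(ts)                                     # Eq. (A3)
    a = ts_hat * (1 + np.abs(np.sin(phi))) / 2                   # Eq. (A4)
    dphi = np.abs(phi[:, None] - phi[None, :])
    sep = np.minimum(dphi, 2 * np.pi - dphi)    # angle between look directions
    return float(np.sum(a[:, None] * a[None, :] * np.sin(sep)))  # Eq. (A5)
```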

The SBTS values obtained from the spatial IR measurements were validated using simulated data recreating Test 3 from Ref. 24. A simulated sound field was generated with a 2 s mid-band reverberation time and all energy arriving from the directions specified by the test case. While the absolute level of SBTS differed from the calculated results reported in the article, which was expected since the simulated sound fields were not identical, the shape of the curve matched, indicating that the metric obtained from the spatial IR measurements performs as expected.

1. B. Rafaely, Fundamentals of Spherical Array Processing (Springer-Verlag, Berlin, 2015).
2. M. Gerzon, "Periphony: With-height sound reproduction," J. Audio Eng. Soc. 21, 2–10 (1973), available at http://www.aes.org/e-lib/browse.cfm?elib=2012.
3. J. Daniel, "Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimedia" ("Representation of acoustic fields, application to the transmission and reproduction of complex sound scenes in a multimedia context"), Ph.D. thesis, University of Paris, Paris, France, 2001.
4. A. H. Marshall and M. Barron, "Spatial responsiveness in concert halls and the origins of spatial impression," Appl. Acoust. 62, 91–108 (2001).
5. A. Marshall, "A note on the importance of room cross-section in concert halls," J. Sound Vib. 5, 100–112 (1967).
6. M. Barron and A. Marshall, "Spatial impression due to early lateral reflections in concert halls: The derivation of a physical measure," J. Sound Vib. 77, 211–232 (1981).
7. T. Okano, L. L. Beranek, and T. Hidaka, "Relations among interaural cross-correlation coefficient (IACCE), lateral fraction (LFE), and apparent source width (ASW) in concert halls," J. Acoust. Soc. Am. 104, 255–265 (1998).
8. M. Morimoto, H. Fujimori, and Z. Maekawa, "Discrimination between auditory source width and envelopment," J. Acoust. Soc. Jpn. 46, 449–457 (1990) (in Japanese).
9. J. Bradley and G. Soulodre, "Objective measures of listener envelopment," J. Acoust. Soc. Am. 98, 2590–2597 (1995).
10. J. Bradley and G. Soulodre, "The influence of late arriving energy on spatial impression," J. Acoust. Soc. Am. 97, 2263–2271 (1995).
11. J. J. Sendra, "The sound field for listeners in concert halls and auditoria," in Computational Acoustics in Architecture, edited by J. S. Bradley (WIT Press, Southampton, UK, 1999), Chap. 5, pp. 101–134.
12. A. Kuusinen, J. Pätynen, S. Tervo, and T. Lokki, "Relationships between preference ratings, sensory profiles, and acoustical measurements in concert halls," J. Acoust. Soc. Am. 135, 239–250 (2014).
13. H. Furuya, K. Fujimoto, Y. Takeshima, and H. Nakamura, "Effect of early reflections from upside on auditory envelopment," J. Acoust. Soc. Jpn. 16, 97–104 (1995).
14. M. Morimoto, K. Iida, and K. Sakagami, "The role of reflections from behind the listener in spatial impression," Appl. Acoust. 62, 109–124 (2001).
15. A. Wakuda, H. Furuya, K. Fujimoto, K. Isogai, and K. Anai, "Effects of arrival direction of late sound on listener envelopment," Acoust. Sci. Technol. 24, 179–185 (2003).
16. P. Evjen, J. Bradley, and S. Norcoss, "The effect of late reflections from above and behind on listener envelopment," Appl. Acoust. 62, 137–153 (2001).
17. S. Klockgether and S. van de Par, "Model for the prediction of room acoustical perception based on the just noticeable differences of spatial perception," Acta Acust. united Acust. 100, 964–971 (2014).
18. ISO 3382-1:2009: Acoustics—Measurement of Room Acoustic Parameters—Part 1: Performance Spaces (International Standards Organization, Geneva, Switzerland, 2009).
19. M. Barron, "Late lateral energy fractions and the envelopment question in concert halls," Appl. Acoust. 62, 185–202 (2001).
20. L. Beranek, Concert Halls and Opera Houses: Music, Acoustics, and Architecture (Springer, New York, 2003).
21. L. L. Beranek, "Listener envelopment LEV, strength G, and reverberation time RT in concert halls," in Proceedings of the International Congress on Acoustics, Sydney, Australia (August 23–27, 2010).
22. M. Morimoto and K. Iida, "A new physical measure for psychological evaluation of a sound field: Front/back energy ratio as a measure for envelopment," J. Acoust. Soc. Am. 93, 2282 (1993).
23. M. Morimoto, K. Nakagawa, and K. Iida
, “
The relation between spatial impression and the law of the first wavefront
,”
Appl. Acoust.
69
,
132
140
(
2008
).
24.
T.
Hanyu
, “
A new objective measure for evaluation of listener envelopment focusing on the spatial balance of reflections
,”
Appl. Acoust.
62
,
155
184
(
2001
).
25.
M.
Nakayama
,
K.
Nakashashi
,
Y.
Wakabayahsi
, and
T.
Nishiura
, “
Surround sensation index based on differential S-IACF for listener envelopment with multiple sound sources
,”
J. Commun. Comput.
14
,
122
128
(
2017
).
26.
E. G.
Williams
,
Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography
(
Academic Press
,
New York
,
1999
).
27.
J.
Meyer
and
G.
Elko
, “
A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield
,” in
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing
, Orlando, FL (May 13–17,
2002
).
28.
B.
Rafaely
, “
Plane-wave decomposition of the sound field on a sphere by spherical convolution
,”
J. Acoust. Soc. Am.
116
,
2149
2157
(
2004
).
29.
S.
Moreau
,
J.
Daniel
, and
S.
Bertet
, “
3D sound field recording with higher order ambisonics—Objectrive measurements and validation of a 4th order spherical microphone
,” in
Proceedings of the 120th AES Convention
, Paris, France (May 20–23,
2006
).
30.
C.
Nachbar
,
F.
Zotter
,
E.
Deleflie
, and
A.
Sontacchi
, “
Ambix—A suggested ambisonics format
,” in
Ambisonics Symposium
, Lexington, KY (
2011
).
31.
A.
Heller
,
R.
Lee
, and
M.
Benjamin
, “
Is my decoder ambisonic?
,” in
Proceedings of the 125th AES Convention
, San Francisco, CA (October 2–5,
2008
).
32.
F.
Zotter
and
M.
Frank
, “
All-round ambisonic panning and decoding
,”
J. Aud. Eng. Soc.
60
,
807
720
(
2012
), available at http://www.aes.org/e-lib/online/browse.cfm?elib=16554.
33.
F.
Zotter
,
H.
Pomberger
, and
M.
Noisternig
, “
Energy-preserving ambisonic decoding
,”
Acta Acust. united Acust.
98
,
37
47
(
2012
).
34.
A. J.
Heller
,
E. M.
Benjamin
, and
R.
Lee
, “
A toolkit for the design of ambisonic decoders
,” in
Proceedings of the Linux Audio Conference
, Stanford, CA (April 12–15,
2012
).
35.
A. J.
Heller
and
E. M.
Benjamin
, “
The ambisonic decoder toolbox: Extensions for partial-coverage loudspeaker arrays
,” in
Proceedings of the Linux Audio Conference
, Karlsruhe, Germany (May 1–4,
2014
).
36.
AFMG TechnologiesGmbH, EASERA v 1.2, http://easera.afmg.eu/ (Last viewed 27 August
2015
).
37.
M.
Neal
, “
Investigating the sense of listener envelopment in concert halls using a third-order ambisonic reproduction over a loudspeaker array and a hybrid room acoustics simulation method
,” M.S. thesis,
Penn State University
, University Park, PA,
2015
.
38.
“Reaper Digital Audio Workstation,” www.reaper.fm (Last viewed 1 July
2015
).
39.
M.
Kronlachner
, “
Plug-in suite for mastering the production and playback in surround sound and ambisonics
,” in
AES Student Design Competition
, Berlin, Germany (
2014
).
40.
R.
Baumgartner
,
H.
Pomberger
, and
M.
Frank
, “
Practical implementation of radial filters for ambisonic recordings
,” in
Proceedings of the International Conference on Spatial Audio
, Detmold, Germany (November 10–13,
2011
).
41.
D. A.
Dick
and
M. C.
Vigeant
, “
A comparison of measured room acoustics metrics using a spherical microphone array and conventional methods
,”
Appl. Acoust.
107
,
34
45
(
2016
).
42.
B.
Bernschütz
,
C.
Pörschmann
,
S.
Spors
, and
S.
Weinzierl
, “
SOFiA sound field analysis toolbox
,” in
Proceedings of the ICSA International Conference on Spatial Audio
, Detmold, Germany (November 10–13,
2011
).
43.
Denon
,
Anechoic Orchestral Music Recordings
(
Denon Records
,
1995
).
44.
Max 7
, “
Cycling'74
,” https://cycling74.com/max7/ (Last viewed 30 March
2016
).
45.
F. E.
Toole
, “
Subjective measurements of loudspeaker sound quality and listener performance
,”
J. Audio Eng. Soc.
33
,
2
32
(
1985
), available at http://www.aes.org/e-lib/browse.cfm?elib=4465.
46.
C.
Micheyl
,
K.
Delhommeau
,
X.
Perrot
, and
A. J.
Oxenham
, “
Influence of musical and psychoacoustical training on pitch discrimination
,”
Hear. Res.
219
,
36
47
(
2006
).
47.
A. J.
Oxenham
,
B.
Fligor
,
C. R.
Mason
, and
G.
Kidd
, “
Informational masking and musical training
,”
J. Acoust. Soc. Am
114
,
1543
1549
(
2003
).
48.
A.
Field
,
Discovering Statistics Using SPSS,
3rd ed. (
SAGE Publications
,
Thousand Oaks, CA
,
2009
).
49.
ITU-R BS. 1534
:
Method for the Subjective Assessment of Intermediate Quality Levels of Coding Systems
(
International Telecommunication Union
,
Switzerland
,
2015
).