Listener envelopment has previously been studied in the fields of room acoustics and multichannel sound reproduction. However, the potentially detrimental effect of a directional imbalance remains uninvestigated. This paper presents a listening experiment under anechoic conditions using a ring of 24 loudspeakers. Participants rated perceived envelopment for various loudspeaker subsets fed by incoherent noise signals. Off-center listening positions were simulated for different acoustic source models: −6 dB (point source), −3 dB (line source), or 0 dB attenuation for every doubling of distance. Only the line-source model preserved envelopment off-center, providing a low interaural level difference and a low interaural coherence as perceptual cues.

## 1. Introduction

Listener envelopment (LEV) was introduced as an attribute to characterize the perceived “spatial impression” or “auditory spaciousness” due to reverberation in concert halls (Bradley and Soulodre, 1995a,b; Hidaka *et al.*, 1992; Okano *et al.*, 1998). Various definitions of envelopment were employed in research on architectural acoustics, and further definitions for envelopment were conceived in the fields of electroacoustic music and spatial sound reproduction (Berg, 2009). Berg (2009) states that the unifying concept across the definitions is “the sensation of being surrounded by sound” (surroundedness), a generic definition of envelopment that was used in studies by Soulodre *et al.* (2003) and Lynch and Sazdov (2017). This definition suggests that an ideal diffuse sound field is enveloping, because it is characterized by infinitely many incoherent plane waves impinging from all directions with equal variance (Jacobsen and Roisin, 2000). Listening experiments show that a finite set of directions is sufficient to evoke cues of envelopment, such as a low interaural cross correlation coefficient (IACC) (Cousins *et al.*, 2015; Hiyama *et al.*, 2002; Romblom *et al.*, 2016).

Previous work on envelopment assumed symmetric loudspeaker arrangements, varying either the number of active loudspeakers (Hiyama *et al.*, 2002), or room acoustical parameters such as the reverberation distribution in time and space (Bradley and Soulodre, 1995b). Little experimental evidence is available on perceived envelopment for directionally imbalanced sound fields. Such a directional imbalance can occur at any seat located off-center in a multichannel loudspeaker system and is especially pronounced for seats located in the proximity of loudspeakers at ear height. Our contribution presents and models the results of a listening experiment, revealing how envelopment degrades towards off-center positions when the direct sound of all the surrounding loudspeakers decays with either of these profiles: −6 dB (point source), −3 dB (line source), or 0 dB (constant-pressure source) per doubling of the relative distance. Variably curved line-source arrays may be considered to successfully implement any of these source types in practical applications (Straube *et al.*, 2018).

## 2. Listening experiment

### 2.1 Experimental setup and design

An experiment on the perception of listener envelopment was conducted under anechoic conditions employing a horizontal, circular loudspeaker array of 24 loudspeakers with a radius of 2.50 m, cf. Fig. 1. The 24 loudspeakers (Genelec 8020) were equalized from 200 Hz to 8 kHz to a ±1 dB flat on-axis free-field frequency response by minimum phase filters generated with the room equalization wizard (REW). The filters were computed from time-windowed loudspeaker impulse response measurements (window length of 4 ms), which resulted in filter impulse responses with a length of 256 samples at a sampling rate of 48 kHz. Participants were seated in the center of the setup and were advised to maintain a frontal head orientation while listening. The graphical user interface to collect the ratings was presented on a notebook and allowed the participants to repeatedly switch between the stimuli of a trial, using either keyboard shortcuts or the trackpad of the notebook. Participants were advised to use the keyboard shortcuts for switching between the stimuli to better maintain a horizontal and frontal look direction when listening.

The experimental design is based on the recommendations in ITU-R BS.1116–3 (ITU-R, 2015). Per trial, participants rated two unknown stimuli against a reference and an anchor. Specifically, they were asked to “*rate the perceived envelopment compared to the reference*” for the two stimuli, of which one was a hidden reference. The visible anchor, a monophonic frontal loudspeaker condition, was provided for guidance and to better define the bottom end of the rating scale (not foreseen by the ITU recommendation). The available reference was the 24-loudspeaker setup, which marks the top end of the rating scale. The definition of envelopment given to the participants was “*the sensation of being surrounded by sound*.” Following the ITU recommendation, a 5-point difference scale was defined: “*Reference* (0), *Slightly Different* (−1), *Different* (−2), *Definitely Different* (−3), *Anchor* (−4).” Participants rated the two stimuli with continuous vertical sliders on a software interface, which showed the verbal rating scale adjacent to the sliders and allowed switching between the two stimuli, the reference, and the anchor as often as desired.

Diffuse fields can be modeled by stationary incoherent noise signals evenly distributed over all directions (Jacobsen and Roisin, 2000). Consequently, the present study excluded transient onsets and time lags to focus on the stationary state, whose degree of directionality mainly depends on the angular distribution and the weights of the loudspeaker signals. To allow for comparison with results from Hiyama *et al.* (2002), the loudspeaker signals were incoherent noise signals with a duration of two seconds and fade durations of 500 ms (sine-squared fade-in and cosine-squared fade-out). In the first part of the experiment uniformly distributed white noise signals were presented. In the second part, low-pass filtered noise signals were presented, which were generated by applying a 12th-order Butterworth low-pass filter with a cut-off frequency of 1.8 kHz to uniformly distributed white noise signals. According to Hiyama *et al.* (2002), the low-pass noise signals yield similar perceptual results to reverberated cello signals.
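The stimulus description above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the sampling rate of 48 kHz is assumed from the equalization filters, the causal `sosfilt` call is an assumption (the text does not state how the Butterworth filter was applied), and `make_stimulus` is a hypothetical helper name.

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 48000  # Hz; assumed, taken from the sampling rate quoted for the EQ filters


def make_stimulus(n_channels, lowpass=False, dur=2.0, fade=0.5, seed=0):
    """Return (n_channels, n_samples) mutually incoherent noise bursts."""
    rng = np.random.default_rng(seed)
    n, n_fade = int(dur * FS), int(fade * FS)
    # Independent draws per channel give incoherent, uniformly distributed white noise.
    x = rng.uniform(-1.0, 1.0, size=(n_channels, n))
    if lowpass:
        # 12th-order Butterworth low-pass at 1.8 kHz, in second-order sections for stability.
        sos = butter(12, 1800.0, btype="low", fs=FS, output="sos")
        x = sosfilt(sos, x, axis=-1)
    ramp = np.arange(n_fade) / n_fade
    env = np.ones(n)
    env[:n_fade] = np.sin(0.5 * np.pi * ramp) ** 2   # sine-squared fade-in
    env[-n_fade:] = np.cos(0.5 * np.pi * ramp) ** 2  # cosine-squared fade-out
    return x * env
```

Calling `make_stimulus(12, lowpass=True)` would give one two-second burst per active loudspeaker for the second part of the experiment.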

The first independent variable was the number of active loudspeakers (LS), ranging from a frontal-only setup (2 LS, $\pm 45^\circ$), to quadraphonic (4 LS), to 8 and 12 equiangularly distributed loudspeakers. Each loudspeaker arrangement was rated for both a centered listening position and a *simulated* off-center listening position [relative shift of 0.5 times unit radius to the right, see Fig. 1(b)]. In addition to the actual frontal head orientation of the listener, a virtual $90^\circ$ head rotation to the left was simulated by a correspondingly right-rotated activation of the loudspeakers. In the case of 4/8/12 LS, the lateral off-center condition with rotation can also be interpreted as a back off-center condition without rotation, and the on-center conditions are invariant to the rotation. Generally, the position shift causes a change in the directions of the active loudspeakers and their distance to the listener. The direction change is rendered by remapping of the playback signals to the nearest available direction of the 24 loudspeakers, cf. Fig. 1(c). Given the $15^\circ$ angular resolution of the setup, remapping results in an angular error of up to 7.5°, but we gain a fully blind test design which does not require the participant to move and memorize off-center/on-center conditions. Plots archived online (Riedel, 2022a) outline the influence of angular rounding on the binaural cues, which can safely be assumed to be negligible in a broadband sense. In addition to the angular remapping, distance weights $g_i$ are applied to the noise signals to simulate the effect of source models with direct-sound decay profiles: $g_i = r_i^{-1}$ (point source), $g_i = r_i^{-1/2}$ (line source), or $g_i = r_i^{0} = \mathrm{const.}$ (constant-pressure source), where $r_i$ indicates the distance between the $i$th sound source and the receiver. To ensure equal loudness across the stimuli, the loudspeaker signals were divided by a compensation factor $c = \sum_{i}^{I} g_i^2$, which accounts for the number of active loudspeakers $I \in \{2, 4, 8, 12\}$ and their associated weights $0 < g_i \le 1$.
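The remapping and distance weighting can be illustrated with a short geometric sketch. Several details are assumptions made for illustration: the coordinate convention (x to the front, y to the left, so a shift to the right is negative y), a ring whose loudspeakers sit at multiples of 15° starting at 0°, and the normalization of the weights to a maximum of 1. The helper name `offcenter_weights` is hypothetical.

```python
import numpy as np


def offcenter_weights(az_deg, beta, listener=(0.0, -0.5), n_ring=24):
    """Remap sources on a unit circle as seen from a shifted listener.

    Returns the index of the nearest ring loudspeaker, the distance weight
    r**(-beta) normalized to a maximum of 1 (assumed normalization), and the
    angular rounding error in degrees.
    """
    az = np.deg2rad(np.asarray(az_deg, dtype=float))
    src = np.stack((np.cos(az), np.sin(az)), axis=1)   # source positions, unit radius
    rel = src - np.asarray(listener, dtype=float)      # positions relative to the listener
    r = np.hypot(rel[:, 0], rel[:, 1])                 # relative distances r_i
    seen = np.arctan2(rel[:, 1], rel[:, 0])            # directions seen by the listener
    step = 2 * np.pi / n_ring
    idx = np.round(seen / step).astype(int) % n_ring   # nearest available loudspeaker
    err = np.rad2deg(np.abs(np.angle(np.exp(1j * (seen - idx * step)))))
    g = r ** (-beta)                                   # beta = 1, 1/2, 0
    return idx, g / g.max(), err
```

For the quadraphonic layout, `offcenter_weights([45, 135, 225, 315], 0.5)` returns the remapped loudspeaker indices together with the line-source weights; in decibels these weights are exactly half of the point-source weights obtained with `beta=1.0`.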

The experiment was divided into two parts, separating white noise and low-pass noise stimuli, where the bandwidth of the reference and anchor matched the corresponding stimuli. Each of the two parts contained 32 conditions composed of four different loudspeaker arrangements (2, 4, 8, and 12 LS), two head orientations (frontal and a simulated 90° rotation), and four variations regarding the listener position/source model (on-center vs simulated off-center for three source models). The order of the conditions was randomized within each part.

### 2.2 Experiment results and discussion

There were 24 participants in the listening experiment, all normal-hearing individuals (self-reported) and either academic staff or students enrolled in the audio engineering program of the authors' institution. According to the recommendation of ITU-R (2015), the first step before any statistical analysis is to compute the difference between the rating of the hidden reference and the absolute stimulus rating (difference grade per condition and subject). The median difference grade and its 95% confidence interval are then computed per condition, and the experimental results for both the white noise stimuli and the low-pass noise stimuli are shown in Fig. 2. The confidence intervals were computed via the inverse cumulative distribution function of the binomial distribution. Below, we refer to specific conditions of the experiment by their row and column index in Fig. 2, and the results are followed by a discussion of the relevant perceptual cues, cf. Fig. 3. We refer to interaural coherence (IC) as the maximum absolute value of the normalized interaural cross correlation function per frequency band (frequency-dependent IACC). The interaural level difference (ILD) describes the level difference between the left and right ear signals in decibels (dB). The IC and ILD curves in Fig. 3 are computed by the formulas described below in Eq. (3) using a head-related transfer function (HRTF) database of the KU100 dummy head (Bernschütz, 2013). As the KU100 HRTFs are not precisely symmetric, we replaced right-ear HRTFs with the azimuth-mirrored left-ear HRTFs.
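A distribution-free confidence interval for the median can be obtained from order statistics, with ranks taken from the inverse binomial CDF. The sketch below shows one common convention; the exact rank convention and the sign of the difference grade (stimulus minus hidden reference) are assumptions, and both function names are hypothetical.

```python
import numpy as np
from scipy.stats import binom


def difference_grades(stimulus, hidden_ref):
    """Difference grade per subject; sign convention assumed (0 = like the reference)."""
    return np.asarray(stimulus, dtype=float) - np.asarray(hidden_ref, dtype=float)


def median_ci(x, level=0.95):
    """Median with a conservative order-statistic confidence interval."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # Lower rank (1-based) from the inverse CDF of Binomial(n, 0.5); at least 1.
    j = max(int(binom.ppf((1.0 - level) / 2.0, n, 0.5)), 1)
    # The symmetric upper rank is n - j + 1, i.e. index n - j in 0-based terms.
    return np.median(x), x[j - 1], x[n - j]
```

With 24 ratings per condition, this convention selects the 7th and 18th sorted values as interval bounds, which covers the median with a probability of at least 95%.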

The experiment results show that there is a monotonic increase in perceived envelopment with the number of active loudspeakers (LS), which starts to saturate towards the reference for 8 or more loudspeakers. This tendency holds for white noise and low-pass noise, but only for on-center listening, cf. column 1 in Fig. 2. It can be explained by a decrease in interaural coherence (IC) and the inability to discriminate and localize the incoherent sound sources of the active loudspeaker subset. The curves in Fig. 3 (left) show that for conditions with 12 active loudspeakers the IC is close to that of a diffuse field (black curve), while IC curves for 2 and 4 active loudspeakers are known to deviate clearly from that of a diffuse field, cf. Walther and Faller (2011). Furthermore, the discrimination of distributed incoherent noise sources was shown to be feasible only up to 5 to 7 sources in a frontal equiangular arrangement, and to be heavily impeded by a limited signal bandwidth (Santala and Pulkki, 2011). This is in accordance with the results of this experiment, which show a reduced difference grade for the 2 LS and 4 LS conditions when loudspeaker signals were low-pass filtered.

Comparing the ratings for off-center conditions (columns 2–4) with the corresponding on-center ratings (column 1), only the *line sources* retain high ratings, given 8 or 12 active sources (rows 3 and 4, column 3). The corresponding ratings for the constant-pressure sources (column 2) and the point sources (column 4) are significantly different from the respective on-center ratings ($p < 0.001$, pairwise Wilcoxon signed-rank test). Figure 3 (right) shows that line sources yield the smallest absolute interaural level difference (ILD) at off-center listener positions. In contrast, point sources and constant-pressure sources cause significant ILDs below 6 kHz for the lateral off-center position, which leads to the perception of directionality and a noticeable loss of left-right balance. Considering that the IC curves of the 12 LS conditions remain relatively close to the diffuse field curve below 2 kHz (left in Fig. 3), likely not exceeding the corresponding just-noticeable differences (JNDs) (Walther and Faller, 2013), the magnitudes of the ILD appear to explain the perceived reduction of envelopment, as they clearly exceed the JNDs of ILDs [0.5 to 1 dB according to Hartmann and Constan (2002)]. The sign of the mean ILD in Fig. 3 determines the left or right directionality/imbalance. The reader is invited to experience the conditions via binaural auralizations available for download online (Riedel, 2022b).
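A paired comparison of this kind can be run with `scipy.stats.wilcoxon`. The ratings below are synthetic placeholders generated for illustration only; they are not the study's data, and the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
# Synthetic difference grades for 24 subjects (placeholder values, not measured data).
on_center = rng.uniform(-0.5, 0.0, 24)
off_center = on_center - rng.uniform(0.5, 2.0, 24)  # consistently lower ratings

# Two-sided Wilcoxon signed-rank test on the paired ratings.
result = wilcoxon(off_center, on_center)
print(f"p = {result.pvalue:.2e}")
```

Because every synthetic subject rates the off-center condition lower, the test rejects the null hypothesis of no difference at the 0.001 level.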

The results also reveal the effect of high-frequency signal content. For white noise stimuli and on-center listening (column 1), the 2 LS condition was rated significantly lower than the 4 LS condition ($p < 0.001$, Wilcoxon signed-rank test). In contrast, no significant difference between the 2 LS and 4 LS conditions can be found for the 1.8 kHz low-pass noise stimuli. It is known from the literature that vertical localization and front-back discrimination rely on high-frequency spectral cues. If sufficient high-frequency content is available, listeners are presumably capable of matching the stimulus spectral gradient to a direction-specific template gradient, enabling localization in sagittal planes (Baumgartner *et al.*, 2014). Thus, for white noise signals a frontal-only stimulus like the 2 LS condition can be assumed to be salient. However, for the 1.8 kHz low-pass noise the high-frequency cues are absent or at least attenuated, causing a front-back ambiguity that serves as an explanation for the convergence of the 2 LS and 4 LS conditions, cf. Fig. 2 (column 1, rows 1 and 2). Note that the participants were advised to maintain a static, forward-facing head orientation while listening to stimuli, such that the absence of dynamic perceptual cues can be assumed.

We observe a tendency that conditions simulating an off-center listener with a 90°-rotated head (Fig. 2, columns 5–7) were rated to provide more envelopment than the off-center conditions with a non-rotated head (Fig. 2, columns 2–4, left-right imbalance). This holds for 4, 8, and 12 LS, but not for 2 LS, as in this case the sound field becomes unilateral after rotation. It can be deduced that an imbalance towards a lateral direction is more detrimental to the perceived envelopment than an imbalance towards the front or back, which again stresses the importance of a vanishing overall ILD for envelopment.

## 3. Stationary model for IC and ILD

In the following equations, we provide a closed-form expression for the interaural cross-spectral density $S_{LR}(\omega)$ and the auto-spectral densities $S_{LL}(\omega)$ and $S_{RR}(\omega)$ caused by a distribution of discrete sound sources. It enables us to confirm that line sources minimize the absolute value of the ILD across an extended listening area. We assume a set of sound sources emitting signals $s_i(\omega)$, where $i$ denotes the source index and $\omega$ denotes the angular frequency. The source directions are denoted as $\Omega_i \equiv (\varphi_i, \theta_i)$, where $\varphi_i$ refers to the azimuth angle and $\theta_i$ to the elevation angle ($\theta_i = 0$ in our study). The acoustic transfer to the human ear is described by convolution of the source signal $s_i(\omega)$ with a far-field head-related transfer function (HRTF) $h_\zeta(\omega, \Omega_i)$, where $\zeta \in \{L, R\}$ indicates the left and right ear channel. To account for the distance $r_i$ of the sound sources we apply the weights $r_i^{-\beta}$, where $\beta \in \{1, \tfrac{1}{2}, 0\}$, to model the radiation from either a point source, a line source, or a theoretical constant-pressure sound source. We can write the equations in matrix notation by stacking the HRTFs corresponding to the source directions $\Omega_i$ in the vector $\mathbf{h}_\zeta(\omega)$, and the respective source signals in $\mathbf{s}(\omega)$. The distance attenuation is written as a diagonal matrix $\mathbf{\Lambda}_r = \mathrm{diag}\{r_i^{-\beta}\}$. The frequency-dependent ear signals $\mathbf{x}(\omega)$ derive to

Assuming stationary source signals $s_i(\omega)$, the covariance matrix of the source signals $\mathbf{C}_{ss}$ defines the inter-signal coherence and the signal variances. Note that this is a legitimate simplification, as we model the diffuse field by stationary noise signals. The covariance matrix of the ear signals $\mathbf{C}_{xx}$ becomes
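Under the definitions above, the ear signals and their covariance matrix can be written as follows. This is a reconstruction from the surrounding text rather than a verbatim quotation, and the stacked matrix $\mathbf{H}(\omega)$ is shorthand introduced here.

$$\mathbf{x}(\omega) = \begin{bmatrix} x_L(\omega) \\ x_R(\omega) \end{bmatrix} = \mathbf{H}^{\mathrm{T}}(\omega)\, \mathbf{\Lambda}_r\, \mathbf{s}(\omega), \qquad \mathbf{H}(\omega) = \begin{bmatrix} \mathbf{h}_L(\omega) & \mathbf{h}_R(\omega) \end{bmatrix} \tag{1}$$

$$\mathbf{C}_{xx}(\omega) = \mathrm{E}\{\mathbf{x}\, \mathbf{x}^{\mathrm{H}}\} = \mathbf{H}^{\mathrm{T}}\, \mathbf{\Lambda}_r\, \mathbf{C}_{ss}\, \mathbf{\Lambda}_r\, \mathbf{H}^{*} = \begin{bmatrix} S_{LL}(\omega) & S_{LR}(\omega) \\ S_{LR}^{*}(\omega) & S_{RR}(\omega) \end{bmatrix} \tag{2}$$

For incoherent sources of unit variance, $\mathbf{C}_{ss} = \mathbf{I}$, the entries reduce to weighted sums over the sources, for example $S_{LR}(\omega) = \sum_i r_i^{-2\beta}\, h_L(\omega, \Omega_i)\, h_R^{*}(\omega, \Omega_i)$ and $S_{LL}(\omega) = \sum_i r_i^{-2\beta}\, |h_L(\omega, \Omega_i)|^2$.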

The frequency-dependent formulation allows computing IC and ILD within a desired bandwidth described by frequency-domain magnitude windows $0 \le w_b(\omega) \le 1$, where $b = 1, \ldots, N_b$ is the band index,
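A form of Eq. (3) consistent with these definitions is sketched here. It is reconstructed from the text; details such as the exact normalization, or whether the window enters squared, are assumptions.

$$\mathrm{IC}_b = \max_{\tau} \frac{\left| \int w_b(\omega)\, S_{LR}(\omega)\, e^{\mathrm{i}\omega\tau}\, \mathrm{d}\omega \right|}{\sqrt{\int w_b(\omega)\, S_{LL}(\omega)\, \mathrm{d}\omega \int w_b(\omega)\, S_{RR}(\omega)\, \mathrm{d}\omega}}, \qquad \mathrm{ILD}_b = 10 \log_{10} \frac{\int w_b(\omega)\, S_{LL}(\omega)\, \mathrm{d}\omega}{\int w_b(\omega)\, S_{RR}(\omega)\, \mathrm{d}\omega} \tag{3}$$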

where the lag $\tau$ is typically limited to $-1\,\mathrm{ms} \le \tau \le 1\,\mathrm{ms}$. The computations can be conducted efficiently using a fast Fourier transform (FFT). The integrals are evaluated as inverse FFTs of the windowed cross-spectral density and auto-spectral densities, which yield the cross correlation and auto-correlation functions respectively. In this paper we use a publicly available HRTF set of the KU100 dummy head in place of $\mathbf{h}_\zeta(\omega)$ (Bernschütz, 2013). Note that Fig. 3 used $N_b = 320$ gammatone windows on an equivalent rectangular bandwidth (ERB) frequency scale for detailed curves. This corresponds to an eightfold density of bands (1/8-ERB spacing), cf. *density* parameter in the *pyfilterbank* package. To efficiently simulate numerous listener positions, the model is applied using $N_b = 22$ gammatone magnitude windows $w_b(\omega)$ with center frequencies according to the (onefold) ERB frequency scale from $\omega_1/2\pi = 414$ Hz to $\omega_{N_b}/2\pi = 5.9$ kHz. The proposed frequency range is sufficiently wideband, but intentionally excludes low-frequency bands below 400 Hz due to the bias in IC and irrelevance for ILD and excludes high-frequency bands above 6 kHz to neglect the complex pinna-related directivity of the HRTFs, cf. Fig. 3. The covariance matrix of the source signals is set to be an identity matrix, assuming incoherent signals of unit variance radiated by the sound sources. At each simulated listener position, twelve head orientations are calculated (30° rotations in azimuth) and the worst-case estimate of these orientations is plotted, namely, the max-abs(ILD) and max-IC value after averaging across the frequency bands. Figure 4 shows the resulting contour graphs, computed for a grid of 45 × 45 listener positions. The first contour marks the region where IC is below 0.5 and ILD is below 1 dB. A reference diffuse field in the proposed frequency range of 414 Hz to 5.9 kHz shows frequency-averaged values of $\mathrm{IC}_\mathrm{ref} = 0.14$ and $\mathrm{ILD}_\mathrm{ref} = 0$ dB, and literature reports broadband JNDs of $\Delta\mathrm{IC} = 0.4$ for $\mathrm{IC} \le 0.2$ and of $\Delta\mathrm{ILD} = 0.5 \ldots 1$ dB for $\mathrm{ILD} \approx 0$ dB (Hartmann and Constan, 2002; Pollack and Trittipoe, 1959; Walther and Faller, 2013). We can therefore assume the first contour to be an estimator of the listening area that provides interaural cues for perceived envelopment.
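The FFT-based evaluation described above can be sketched as follows. This is a toy illustration under stated assumptions, not the published implementation: `toy_hrtf` is a hypothetical stand-in for the KU100 data (a pure interaural delay plus a broadband level offset), a single rectangular window from 414 Hz to 5.9 kHz replaces the gammatone windows, only one head orientation is evaluated, and `FS` and `N_FFT` are arbitrary choices.

```python
import numpy as np

FS, N_FFT = 48000, 1024  # sampling rate and FFT length (assumed values)


def toy_hrtf(az, freqs):
    """Hypothetical HRTF pair, shape (n_freq, n_src); az > 0 means left."""
    itd = 7e-4 * np.sin(az)        # interaural time difference in seconds
    ild_db = 6.0 * np.sin(az)      # broadband level offset in dB
    w = 2 * np.pi * freqs[:, None]
    h_l = 10 ** (+ild_db / 40) * np.exp(+0.5j * w * itd)
    h_r = 10 ** (-ild_db / 40) * np.exp(-0.5j * w * itd)
    return h_l, h_r


def ic_ild(h_l, h_r, g, window, max_lag_s=1e-3):
    """IC and ILD in one band for incoherent unit-variance sources (C_ss = I)."""
    p = g ** 2                                   # source powers after weighting
    s_ll = (np.abs(h_l) ** 2) @ p                # auto-spectral density, left
    s_rr = (np.abs(h_r) ** 2) @ p                # auto-spectral density, right
    s_lr = (h_l * np.conj(h_r)) @ p              # cross-spectral density
    r_lr = np.fft.irfft(window * s_lr, N_FFT)    # cross-correlation function
    r_ll0 = np.fft.irfft(window * s_ll, N_FFT)[0]  # zero-lag autocorrelation
    r_rr0 = np.fft.irfft(window * s_rr, N_FFT)[0]
    k = int(round(max_lag_s * FS))               # lag limit of +/- 1 ms
    lags = np.concatenate((r_lr[:k + 1], r_lr[-k:]))
    ic = np.max(np.abs(lags)) / np.sqrt(r_ll0 * r_rr0)
    return ic, 10 * np.log10(r_ll0 / r_rr0)


freqs = np.fft.rfftfreq(N_FFT, 1 / FS)
window = ((freqs >= 414) & (freqs <= 5900)).astype(float)
# Twelve equiangular sources on the unit circle, listener shifted 0.5 to the right.
az0 = 2 * np.pi * np.arange(12) / 12
rel = np.stack((np.cos(az0), np.sin(az0)), axis=1) - np.array([0.0, -0.5])
r = np.hypot(rel[:, 0], rel[:, 1])
h_l, h_r = toy_hrtf(np.arctan2(rel[:, 1], rel[:, 0]), freqs)
results = {beta: ic_ild(h_l, h_r, r ** -beta, window) for beta in (1.0, 0.5, 0.0)}
for beta, (ic, ild) in results.items():
    print(f"beta={beta}: IC={ic:.2f}, ILD={ild:+.2f} dB")
```

Even with this crude HRTF model, the ILD changes sign between the point-source and constant-pressure weights and stays close to zero for the line-source weights, which mirrors the qualitative trend reported for the measured HRTFs.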

Four surrounding sources yield a small area of minimal ILD, even when assuming line sources, cf. Fig. 4(a). However, the line-source model yields a significant increase in the highest contour area for six sound sources, see Fig. 4(b). For eight and twelve sources, the benefit of line sources is even more visible when inspecting the first contour of ILD, which clearly includes the off-center position tested in the experiment, cf. Figs. 4(c) and 4(d). When using line sources instead of point sources, the radius of the 1 dB contour for ILD (at which $IC<0.5$) increases from 0.24 to 0.80 (12 sources), from 0.25 to 0.67 (8 sources), from 0.21 to 0.54 (6 sources), and from 0.15 to 0.20 (4 sources).

These results suggest that multichannel sound systems should be deployed with at least six surrounding vertical line sources, when the target is an extended listening area of envelopment. Since ILDs become effective mostly at high frequencies above 400 Hz, even compact line sources seem beneficial for immersive sound reinforcement, cinema sound systems, and art installations. Simulation plots of rectangular source arrangements confirm the benefit of line sources, and can be found online (Riedel, 2022c). Additionally, binaural auralizations of all experimental conditions are provided for download online (Riedel, 2022b). We provide open access to experimental data and *python* code to create Figs. 2–4 (Riedel, 2022d).

## 4. Conclusion

We have shown by experiment that line sources are able to preserve the perception of envelopment at off-center listening positions, which can be explained by a minimal interaural level difference and low interaural coherence. For the line sources to become effective, a certain number of surrounding sources is required. The experiment showed that four sources are not enough, but that eight or more line sources are successful in preserving the sensation of envelopment at off-center listening positions. A closed-form expression was given for the interaural cross-spectral density and the auto-spectral densities caused by arbitrary sound source distributions, assuming stationary source signals described entirely by their covariance matrix. Simulations of surrounding source arrangements confirm that line sources provide a minimal interaural level difference and a low interaural coherence across an extended listening area. Furthermore, the simulations confirm that four surrounding sources are too few and will not provide an extended listening area of envelopment, and that at least six, better eight or twelve, surrounding vertical line sources should be employed.

## Acknowledgments

This research was partially funded by the Austrian Science Fund (FWF) (Project No. P 35254). The authors thank Benjamin Stahl for fruitful discussions on the formulation of stationary models.

## References and links
