Humans possess mechanisms to suppress distracting early sound reflections, summarized as the precedence effect. Recent work shows that precedence is affected by visual stimulation. This paper investigates possible effects of visual stimulation on the perception of later reflections, i.e., reverberation. In a highly immersive audio-visual virtual reality environment, subjects were asked to quantify reverberation in conditions where simultaneously presented auditory and visual stimuli either match in room identity, sound source azimuth, and sound source distance, or diverge in one of these aspects. While subjects reliably judged reverberation across acoustic environments, the visual room impression did not affect reverberation estimates.
1. Introduction
Our awareness of space, and our subsequent orientation and navigation within it, are dominated by the visual system, which relies on the high-resolution topographic representation of space on the retinae of our eyes. Nevertheless, our auditory system can contribute: while the visual field is limited to the viewing direction, spatial hearing is omnidirectional and often guides head and eye orientation.
Surrounding space affects sounds generated therein. It is very rare that we are in completely anechoic spaces. Instead, most man-made enclosed spaces, as well as natural enclosed spaces like caves and natural surroundings like forests (Traer and McDermott, 2016), produce echoes and reverberation that linearly distort the sound on its way from source to receiver. Humans and many other vertebrates possess dedicated perceptual strategies to compensate for sound reflections, summarized as the precedence effect [reviewed in Blauert (1997), Brown et al. (2015), and Litovsky et al. (1999)]. While some aspects of the precedence effect can be explained as by-products of peripheral auditory processing (Hartung and Trahiotis, 2001), several studies have demonstrated a high-level, cognitive contribution to precedence (Bishop et al., 2014; Clifton, 1987; Clifton and Freyman, 1989; Clifton et al., 2002; Tolnai et al., 2014).
While the precedence effect describes a short-lasting perceptual phenomenon in listening conditions with a single echo that arrives within a few tens of milliseconds, there is also evidence of compensation mechanisms acting in more natural situations, where the effects of reverberation may last for hundreds of milliseconds. Studies have found that familiarity with a room environment can change thresholds in the categorical perception of speech sounds (Watkins, 2005) and improve speech intelligibility (Brandewie and Zahorik, 2010). These results are consistent with the idea of a “de-reverberation” process in auditory perception.
Current theoretical and empirical findings indicate that perceptual information from one sense, such as vision, influences the evaluation and perception of information in other senses, such as hearing (Stein and Meredith, 1993). Examples of such “cognitive associations” between modalities are easily found in everyday situations, such as when visual information (“this room looks like a typical concert hall”) and auditory information (“this room sounds like a typical concert hall”) combine to form an overall impression of the situation. A recent study has shown that auditory distance judgements are affected by vision in audiovisual virtual environments (Postma and Katz, 2017). Moreover, there is evidence that humans can estimate room acoustical parameters based on photos (McCreery and Calamia, 2006), suggesting that we possess an intuitive awareness of the acoustics of rooms which we can see, but not hear.
Supporting the notion of multi-modal integration, it was demonstrated that, well beyond the classical ventriloquism effect, the visual system may also affect the way we deal with reflections of a sound source in enclosed spaces. It was shown that the strength of the precedence effect can be enhanced when the layout of a visual environment is consistent with the acoustically presented sounds and their reflections. Conversely, the precedence effect is diminished when visual and auditory environments are inconsistent (Bishop et al., 2011, 2012). The precedence effect is typically relevant for suppressing spatial information of early reflections. However, it is to date unclear whether the visual impression of a room may affect the perception of later reflections, which overlap in the reverberant tail of a room response. Humans can judge perceived reverberation time and level [e.g., Lindau et al. (2014)], which depend on the temporal decay of the reverberation, technically characterized by the reverberation time (RT60), and on the energy of the reverberation relative to the direct sound and early reflections [direct-to-reverberant ratio (DRR)]. Extrapolating from the documented effects of the visual system on classical precedence, we assess here whether perceived reverberation is affected by the congruence of the visual and auditory environments.
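As a technical aside, the two parameters just mentioned can be estimated from a measured room impulse response. The following minimal Python sketch illustrates one textbook-style approach (Schroeder backward integration for RT60, a short window around the direct-sound peak for DRR); it is a generic illustration, not the analysis pipeline of the present study, and the 2.5-ms direct-sound window is an assumed choice.

```python
import numpy as np

def schroeder_edc(ir):
    """Energy decay curve in dB (backward integration of the squared impulse response)."""
    energy = np.cumsum(ir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0])

def rt60_from_ir(ir, fs):
    """Estimate RT60 by fitting the -5 to -25 dB decay range and extrapolating to -60 dB."""
    edc = schroeder_edc(ir)
    t = np.arange(len(ir)) / fs
    fit = (edc <= -5.0) & (edc >= -25.0)
    slope, _ = np.polyfit(t[fit], edc[fit], 1)     # decay rate in dB per second
    return -60.0 / slope

def drr_db(ir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant ratio: energy in a short window around the direct-sound
    peak relative to all later energy."""
    peak = int(np.argmax(np.abs(ir)))
    win = int(direct_window_ms * 1e-3 * fs)
    direct = np.sum(ir[max(peak - win, 0):peak + win] ** 2)
    reverb = np.sum(ir[peak + win:] ** 2)
    return 10.0 * np.log10(direct / reverb)
```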
To this end, we quantified the extent to which subjects perceive the same auditory environment as more or less reverberant when the visual and auditory environments are congruent, compared with conditions in which no visual environment is shown or the visual environment is incongruent with the auditory environment. We employed state-of-the-art audio-visual stimulation techniques to create highly realistic and immersive environments, thus ensuring that any observed effects would be relevant in everyday listening situations.
2. Methods
2.1 Subjects and reproduction setup
Ten listeners (21–27 years of age, mean 24.3, 5 female) participated in the experiment. They were paid for their participation. Procedures were approved by the ethics committee of the Faculty of Medicine, LMU Munich (project No. 18–327).
The listeners were asked to quantify the perceived degree of reverberation in audiovisual and baseline audio-only conditions on a 1–10 integer rating scale. Five repetitions of each condition were measured. The stimulus in each trial paired one of 12 auditory environments with one of 12 visual environments (or the lack of a visual environment), although not all possible combinations were used (see below).
Listeners were seated in an anechoic chamber (2 m × 2 m base, 2.2 m high) and the auditory stimuli were presented via 36 loudspeakers (Plus XS.2, CANTON Elektronik, Weilrod, DE) mounted at head height near the chamber wall in a horizontal circular arrangement at 10° intervals in azimuth. The speakers were driven by four 12-channel power amplifiers (CI9120, NAD Electronics International, Pickering ON, CA) which received input from a PC via two 24-channel audio interfaces (24I/O, MOTU, Cambridge MA, US) running at a sampling rate of 48 kHz. The loudspeakers were fully equalized in spectral magnitude and phase by application of per-speaker compensation impulse responses. The root-mean-square sound pressure in the loudest conditions was 64 dB sound pressure level.
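The per-speaker compensation can be thought of as a regularized inverse filter derived from each loudspeaker's measured impulse response. The sketch below shows one common way to construct such a filter; the FFT length, the regularization constant, and the causality shift are illustrative assumptions, not the exact procedure used for this setup.

```python
import numpy as np

def compensation_filter(measured_ir, n_fft=4096, beta=1e-3):
    """Regularized frequency-domain inverse of a measured loudspeaker response.

    Corrects both magnitude and phase; beta limits the gain at frequencies where
    the loudspeaker response is weak, avoiding excessive boosts.
    """
    H = np.fft.rfft(measured_ir, n_fft)
    H_inv = np.conj(H) / (np.abs(H) ** 2 + beta)
    comp = np.fft.irfft(H_inv, n_fft)
    # Cyclic shift to make the (originally non-causal) inverse filter causal.
    return np.roll(comp, n_fft // 2)

# Each loudspeaker's output signal would then be convolved with its own
# compensation filter before digital-to-analog conversion.
```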
A head-mounted stereo display (Rift DK2, Oculus VR, Menlo Park CA, US) provided the visual stimuli to the subjects. The reference frames of the virtual visual and auditory environments were aligned with each other through careful placement of the infrared tracking camera. The subject's position in the virtual environment was kept fixed (i.e., the translational component of the head tracking readings was ignored for the real-time updates), and subjects were instructed to rotate their head in the horizontal plane, but not move it translationally, or otherwise rotate it. This was verified by a supervisor from outside the chamber via an infrared camera. Rotational head-tracking data also confirmed that the subjects complied with these instructions.
2.2 Stimuli
Auditory environments were defined by three variables: room identity (bedroom, office room, or factory hall); sound source azimuth (60° left, 0°, or 30° right); and sound source–listener distance (1 or 3 m). We used an improved version of RAZR (Wendt et al., 2016) to simulate the room acoustics for these 18 environments, employing an image-source model for early reflections (up to third order), a scattering module and a feedback delay network for late reverberation. The direct sound and the early reflections were mapped onto 36 channels (corresponding to the locations of the loudspeakers in the experimental chamber) by means of vector-based amplitude panning (Pulkki, 1997). The late reverberation was mapped onto twelve channels (three per chamber wall).
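As an illustration of the panning step, the sketch below computes pairwise 2-D vector-based amplitude panning gains for a source on a horizontal loudspeaker ring such as the 36-speaker setup described above; the function name, the constant-power normalization, and the example call are illustrative choices, not code from RAZR.

```python
import numpy as np

def vbap_2d_gains(source_az_deg, speaker_az_deg):
    """Pairwise 2-D VBAP for a horizontal loudspeaker ring.

    Returns one gain per loudspeaker; only the pair of adjacent loudspeakers
    enclosing the source direction receives nonzero, constant-power gains.
    """
    speaker_az_deg = np.asarray(speaker_az_deg, dtype=float)
    gains = np.zeros(len(speaker_az_deg))
    src = np.array([np.cos(np.radians(source_az_deg)),
                    np.sin(np.radians(source_az_deg))])
    order = np.argsort(speaker_az_deg)
    for k in range(len(order)):
        i, j = order[k], order[(k + 1) % len(order)]
        # Columns of L are the unit vectors of the two candidate loudspeakers.
        L = np.array([[np.cos(np.radians(speaker_az_deg[i])),
                       np.cos(np.radians(speaker_az_deg[j]))],
                      [np.sin(np.radians(speaker_az_deg[i])),
                       np.sin(np.radians(speaker_az_deg[j]))]])
        g = np.linalg.solve(L, src)
        if np.all(g >= -1e-9):              # source lies between this speaker pair
            gains[[i, j]] = g / np.linalg.norm(g)
            return gains
    return gains

# Example: panning gains for a source at 35 deg azimuth on a 36-speaker ring (10 deg spacing).
gains = vbap_2d_gains(35.0, np.arange(0, 360, 10))
```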
Visual environments were defined by the three variables room identity, visual source azimuth, and visual source–listener distance (with the same respective sets of possible values as listed above). The visual environments were rendered from 3-D geometric models to stereo equirectangular panoramic images with the Cycles engine for Blender (Blender Foundation and community, 2018). They depict rooms with the same dimensions and wall materials as the corresponding auditory environments, with a TV set placed at the visual source position.
The visual and auditory stimuli and the experimental chamber are illustrated in Fig. 1.
2.3 Procedure
Two types of trials were presented in the experiment: audiovisual trials and audio-only trials. Each audiovisual trial combined an auditory environment with a visual environment such that
the auditory and the visual room identities, source positions and sound source–listener distances are pairwise identical (congruent condition), or that
the visual room identity differs from the auditory room identity while the other variables match (room identity incongruence), or that
the visual source azimuth differs from the sound source azimuth while the other variables match (azimuth incongruence), or that
the visual source–listener distance differs from the sound source–listener distance while the other variables match (distance incongruence).
Note that audiovisual trials can be incongruent in azimuth by up to 90° (by combining a 60° left auditory environment with a 30° right visual environment, or vice versa).
All auditory environments were also presented in audio-only trials, where the visual stimulus was a uniformly black image.
Trials of both types were presented in a random order. In each trial, a German-language speech signal extracted from a TV news show (no background music or other sounds; loudness-normalized according to EBU R 128) drawn randomly from a pool of 8 signals was convolved with the 36 impulse responses (all of which contained early reflections, and 6 of which also included late reverberation) for the corresponding auditory environment and played back. Simultaneously, a corresponding video clip was shown through the head-mounted display on the virtual TV screen at the visual source location. The video did not show the human speaker. The subjects were instructed to look toward the TV screen during stimulus presentation, and to reproduce the up-down orientation of an arrow displayed on the virtual TV screen with the joystick (while the arrow was visible) to ensure that their eyes were open. Subsequently, they were asked to judge the perceived degree of reverberation (“wahrgenommene Verhalltheit”) on a purely numeric 1–10 rating scale.
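Per trial, the rendering amounts to loudness-normalizing the dry speech and convolving it with one impulse response per loudspeaker channel. The sketch below illustrates this, using the third-party pyloudnorm package as one option for EBU R 128 normalization; the function names and the -23 LUFS target are illustrative assumptions, not the laboratory's actual playback code.

```python
import numpy as np
import pyloudnorm as pyln                 # assumed helper for EBU R 128 loudness
from scipy.signal import fftconvolve

FS = 48000

def normalize_loudness(speech, target_lufs=-23.0):
    """Loudness-normalize a dry mono speech signal (EBU R 128 targets -23 LUFS)."""
    meter = pyln.Meter(FS)
    return pyln.normalize.loudness(speech, meter.integrated_loudness(speech),
                                   target_lufs)

def render_trial(dry_speech, channel_irs):
    """Convolve the normalized speech with one impulse response per channel.

    dry_speech  : (n_samples,) mono signal
    channel_irs : (n_channels, ir_length) per-loudspeaker room impulse responses
    Returns an (n_samples + ir_length - 1, n_channels) playback buffer; level
    calibration of the playback chain is assumed to happen elsewhere.
    """
    dry = normalize_loudness(dry_speech)
    return np.stack([fftconvolve(dry, ir) for ir in channel_irs], axis=-1)
```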
Prior to the experiment, the subjects were given the opportunity to listen to example stimuli representing each value on the rating scale. No visual stimulation took place during this familiarization phase. The familiarization stimuli were the same speech sounds, played through six speakers at 60° offsets in azimuth. The dry speech signal was convolved with a random-noise impulse response with an exponentially decaying envelope for each speaker. The broadband reverberation times of these impulse responses covered a range similar to that of the synthetic rooms (RT60 = 100 to 4000 ms). The carrier noise was spectrally shaped to match the average magnitude spectrum of the synthetic room impulse responses.
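A minimal sketch of how such familiarization impulse responses can be generated is shown below, assuming a 48-kHz sampling rate; the reference magnitude spectrum is imposed on the noise carrier before the exponential decay is applied. The parameter names and the interpolation of the reference spectrum onto the FFT bins are illustrative choices.

```python
import numpy as np

FS = 48000

def decaying_noise_ir(rt60, reference_spectrum=None, fs=FS, rng=None):
    """Random-noise impulse response with an exponential decay reaching -60 dB
    at rt60 seconds, optionally shaped to a reference magnitude spectrum."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(rt60 * fs)                      # truncate where the decay reaches -60 dB
    noise = rng.standard_normal(n)
    if reference_spectrum is not None:
        # Impose the reference magnitude spectrum while keeping the noise phase.
        spec = np.fft.rfft(noise)
        mag = np.interp(np.linspace(0, 1, len(spec)),
                        np.linspace(0, 1, len(reference_spectrum)),
                        reference_spectrum)
        noise = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n)
    t = np.arange(n) / fs
    decay = 10.0 ** (-3.0 * t / rt60)       # amplitude envelope: -60 dB at t = rt60
    return noise * decay
```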
3. Results
To analyze the results from the audio-only control and audiovisually congruent conditions, we fit a linear model with fixed and random effects (mixed-effects model) to the data, with the rating on the 1–10 scale as the dependent variable. The four independent variables auditory room identity, sound source distance, sound source azimuth, and presence of congruent visual stimulation (“visuals on/off”) were included as fixed effects. First-order interactions of these fixed effects were also modelled. The data were grouped by the random factors subject (allowing the slopes for all fixed effects as well as the intercept to vary) and speech signal (allowing only the intercept to vary). All independent variables were treated as categorical.
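An approximate re-implementation of this model is sketched below using statsmodels in Python. Note that this sketch supports only a simplified random-effects structure (per-subject random intercept and slopes; the crossed random intercept for speech signal is omitted), and the Satterthwaite F-tests and Tukey-adjusted contrasts reported below would typically come from dedicated tools such as R's lmerTest and emmeans. The data-frame column names are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed long-format data: one row per trial with columns
# rating (1-10), room, distance, azimuth, visuals (on/off), subject, signal.
df = pd.read_csv("ratings.csv")

# All four fixed effects plus their first-order interactions (patsy's ** 2 operator),
# with a per-subject random intercept and random slopes for each fixed effect.
model = smf.mixedlm(
    "rating ~ (C(room) + C(distance) + C(azimuth) + C(visuals)) ** 2",
    data=df,
    groups="subject",
    re_formula="~ C(room) + C(distance) + C(azimuth) + C(visuals)",
)
result = model.fit(reml=True)
print(result.summary())
```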
An adapted F-test, using the approximation of Satterthwaite (1946) for degrees of freedom, revealed a significant (α = 0.05) main effect of room identity [F(2, 9) = 246.93, p < 10−7], and no other significant main effects (azimuth: p = 0.93, distance: p = 0.43, visuals on/off: p = 0.45). Per-subject mean ratings and standard deviations are shown in Fig. 2, panels #1–10 (audio-only control conditions in light grey, congruent conditions in dark grey). Estimated marginal mean differences were 3.20 ± 0.22 (s.e.m.) for bedroom-office, and 2.38 ± 0.26 for office-factory. Two-sided t-tests, again using the Satterthwaite approximation, established all pairwise room identity differences as significant (bedroom-office: t = 21.08, p < 10−6; office-factory: t = 9.172, p < 10−4; p-values Tukey-adjusted for multiple comparisons).
Further analysis also showed a significant interaction between sound source distance and room identity [F(2, 3217) = 18.00, p < 10−7], and no other significant interactions (room identity and azimuth: p = 0.87; room identity and visuals on/off: p = 0.07; azimuth and sound source distance: p = 0.14; azimuth and visuals on/off: p = 0.37; distance and visuals on/off: p = 0.59). Post hoc t-tests conditioned on room identity showed a significant near-far difference in ratings (0.43 ± 0.12) only for the office room (t = 3.666, p = 0.002; Tukey-adjusted), and no significant differences based on presence or absence of visual stimulation (control conditions perceived as insignificantly more reverberant, 0.12 ± 0.09 for the factory, and insignificantly less reverberant, −0.11 ± 0.09 for the bedroom, −0.14 ± 0.09 for the office).
Thus, statistics confirmed that ratings for the bedroom conditions were overall lower than for the office room conditions, which in turn were lower than for the factory hall conditions. Perceived reverberation at the near distance was lower than at the far distance only in the intermediate office room (RT60 = 1.5 s). Notably, the presence or absence of simultaneous congruent visual stimulation did not affect the perception of reverberation. Modelled overall ratings are shown for the three rooms and six room-distance pairs in the lower right panel of Fig. 2 (averaged over control and congruent conditions).
In order to uncover possible effects of incongruence between the auditory and visual stimuli, we performed two-sided paired t-tests. First, we averaged the ratings from the ten repetitions obtained for each subject and audiovisually congruent condition where the room identity was either “office” or “factory” (“eligible congruent conditions”). We paired these averages with the average ratings over the ten repetitions obtained for the same subject and same auditory condition, but where the visual stimulus suggested a smaller room. In this way, each pairing contrasts two audiovisual conditions from the same listener, with only one difference between them: a fully congruent visual stimulation on one side vs one that differs from the auditory stimulation only in room identity (namely, a smaller visual room identity, “V smaller”) on the other. Note that congruent conditions with a room identity of “bedroom” are not eligible in this specific comparison, as there was no smaller visual room identity.
We repeated this method of analysis for the other six types of audiovisually incongruent condition, in each case pairing the respective eligible congruent conditions with conditions in which: the visual stimulus suggests a larger room (“V larger”); the visual source distance is smaller, or larger, than the sound source distance (“V nearer” and “V farther,” respectively); or the visual source azimuth differs from the sound source azimuth by 30°/60°/90° (“V 30° off,” “V 60° off,” and “V 90° off,” respectively).
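The pairing and testing procedure can be sketched as follows, assuming a long-format table with per-trial ratings and a congruence label; the column names and condition labels are illustrative, not the study's actual data layout.

```python
import pandas as pd
from scipy import stats

def paired_congruence_test(df, incongruence_label):
    """Two-sided paired t-test of congruent vs one type of incongruent condition.

    Ratings are first averaged over repetitions for each subject and auditory
    condition, then paired within subject and auditory condition.
    Assumed columns: subject, acond (auditory condition id), congruence, rating.
    """
    mean_by_cell = (df.groupby(["subject", "acond", "congruence"])
                      .rating.mean().unstack("congruence"))
    paired = mean_by_cell[["congruent", incongruence_label]].dropna()
    return stats.ttest_rel(paired["congruent"], paired[incongruence_label])

# Example: the smaller-visual-room comparison.
# result = paired_congruence_test(df, "V smaller")
```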
Figure 3 reproduces the raw data used in these seven t-tests in seven scatterplots. A systematic effect of a specific type of audiovisual incongruence would be evident in these plots as a shift of the data points away from the identity line, with a shift towards the horizontal/vertical axis indicating a higher/lower perceived degree of reverberation in the congruent conditions, respectively. No type of audiovisual incongruence was rated significantly differently from the congruent conditions.
4. Discussion
The current data show that the subjects could reliably judge the degree of reverberation of a presented virtual environment. They show equally clearly that the subjects' reverberation judgements did not depend on whether a visual representation of an auditory environment was provided to them during listening, and that they were also not systematically affected by the congruence or incongruence of the presented auditory and visual environments.
Considering the dominance of the visual system in spatial awareness and its established influence on the perception of single echoes, it might on the one hand appear surprising that judgements of a room-related attribute of sound are entirely unaffected by congruent or incongruent visual input. On the other hand, this result is consistent with recent data which suggest that other spatial parameters (azimuth and compactness of the auditory image) are also assessed by listeners independently of the simultaneous visual impression (Gil-Carvajal et al., 2016). These authors only detected an effect of visual stimulation on sound source distance, where multi-modal integration probably assigns a larger weight to visual information due to the relatively lower reliability of auditory cues for distance perception (Loomis et al., 1998).
In this context, it might be important to note that it was easily possible for our subjects to determine the azimuthal position of the sound source not only as mediated by the visual stimulation (the position of the TV set in the virtual room), but also as mediated by the auditory stimulation. While the audio-visual stimulation technique employed in this experiment is theoretically suitable to elicit a visual capture effect in conditions where the visual source position deviates from the sound source position, this is unlikely to have taken place considering the magnitude of the azimuthal incongruence (30° or more), which exceeds typical thresholds of perceptual fusion (Hendrickx et al., 2015).
It should not be overlooked that despite a lack of statistical significance at α = 0.05, test statistics and potential effect sizes are largest when comparing conditions that modulate sound or visual source azimuth. Relatively high score differences and comparatively low p-values are observed between stimulus conditions that have a 0° vs either a 60° or a 90° discrepancy between sound and visual source azimuth, with lower perceived degrees of reverberation in the azimuthally incongruent conditions. We do not believe that this is due to audiovisual interactions, given that a better-ear effect in listening in rooms offers a simpler explanation: Because subjects always looked towards the visual source position in audiovisual conditions, one of their ears was turned towards the sound source when its location diverged in azimuth. This behaviour was found to improve spatial release from masking in a single-speaker, single-masker condition (Grange and Culling, 2016) and may well affect the perception of reverberation.
It should be noted that the initial audio-only training for the judgement of reverberation might have primed the listeners' focus on auditory cues and encouraged them to disregard visual cues. It is unclear whether results would differ for more naive listeners who receive no audio-only training ahead of the task, or who judge the perceived reverberation of the overall scenario, or its effects on perception, in a more indirect task.
Acknowledgments
This research was supported by the Munich Center for Neurosciences, the Bernstein Center for Computational Neuroscience Munich, the Graduate School of Systemic Neurosciences, the Deutsche Forschungsgemeinschaft (DFG) FOR 1732 (TPE) to author S.E., and the DFG Grant No. wi1518/17 to author L.W. We thank Baccara Hizli for her assistance in data acquisition. The visual stimuli depicted in Fig. 1 are based on 3D models provided by Rui Teixeira (bedroom; public domain), DragonautX on blendswap.com (office; CC-BY 3.0), and Fatih Eke (factory; CC-BY 3.0).