In a multi-talker situation, listeners have the challenge of identifying a target speech source out of a mixture of interfering background noises. In the current study, it was investigated how listeners analyze audio-visual scenes with varying complexity in terms of number of talkers and reverberation. The visual information of the room was either congruent with the acoustic room or incongruent. The listeners' task was to locate an ongoing speech source in a mixture of other speech sources. The three-dimensional audio-visual scenarios were presented using a loudspeaker array and virtual reality glasses. It was shown that room reverberation, as well as the number of talkers in a scene, influence the ability to analyze an auditory scene in terms of accuracy and response time. Incongruent visual information of the room did not affect this ability. When few talkers were presented simultaneously, listeners were able to detect a target talker quickly and accurately even in adverse room acoustical conditions. Reverberation started to affect the response time when four or more talkers were presented. The number of talkers became a significant factor for five or more simultaneous talkers.

The human auditory system can focus on a speech stream in the presence of interfering speech stimuli. Such a multi-talker scenario has been termed the cocktail-party situation (Bronkhorst, 2000; Cherry, 1953). Many factors are known to reduce the ability to understand speech in such a cocktail-party situation, e.g., the level of the target speech relative to the interferers, the number of talkers, or the type of listening room. These effects are commonly measured by asking the listeners to repeat a word or a sentence or to write down the perceived stimulus. However, the task in a daily-life cocktail-party situation is usually different: it is necessary to follow a conversation and to identify a certain topic or continuous speech stream within an interfering speech mixture. In the current study, we investigated the ability of listeners to analyze a virtual audio-visual scene containing multiple speech sources. The scenes varied in complexity in terms of the number of interfering talkers, room reverberation, and congruency of the audio-visual room information.

The number of interfering talkers has been shown to influence the intelligibility of a target talker when the signal-to-noise ratio (SNR) is kept constant. Simpson and Cooke (2005) showed that intelligibility decreases as the number of interfering speech sources increases, for up to eight interfering talkers: the ability to listen into speech gaps is reduced, while at the same time the interfering speech remains intelligible and can be confused with the target speech. When the number of interfering talkers is increased further, intelligibility was shown to improve, as the interferers become more noise-like and therefore no longer contain understandable speech.

Reflections and reverberation are present in nearly all communication scenarios. Room reverberation has been shown to negatively affect speech perception in a number of studies (Best et al., 2015; Bronkhorst and Plomp, 1990; Moncur and Dirks, 1967; Nabelek and Mason, 1981; Nábĕlek and Pickett, 1974). Particularly, the diffuse reverberation, i.e., the late reverberant tail, has been shown to reduce speech intelligibility, while early reflections do not seem to harm, or might even improve, speech perception (Arweiler et al., 2013; Arweiler and Buchholz, 2011; Warzybok et al., 2013).

Previous studies have investigated the ability of listeners to identify and locate speech in the presence of other speech sources. Kopčo et al. (2010) measured the localization accuracy of a digit spoken by a female talker in the presence of words spoken by male interfering talkers. The target and the interferers were all presented in the frontal area of the listener. They found that the presence of the interferers reduced the localization accuracy. Buchholz and Best (2020) measured localization accuracy with a similar target digit as in Kopčo et al. (2010) but with a more realistic background noise scene. The interfering signals were seven paired conversations (both male and female) at various locations in a simulated cafeteria. Results showed that the localization accuracy was only affected by the noise when the target source was distant but not when it was nearby. This finding suggests an interaction with reverberation, as farther sources have more reverberant energy relative to the direct sound compared to nearby sources.

While these studies focused on the ability to locate a speech signal in a speech background, Hawley et al. (1999) investigated both the localization accuracy of speech as well as the intelligibility. They showed that the inability to correctly locate a source did not limit the ability to correctly understand it. However, the number of interfering sources was limited to three. Weller et al. (2016) presented a novel method to evaluate the ability to analyze a complex acoustic scene. They asked their listeners to judge the location of all talkers presented in a virtual cocktail-party situation by indicating the gender of the talkers. When varying the number of simultaneously presented talkers, they found that normal-hearing listeners were able to correctly locate and count the number of talkers for up to four sources. When six talkers were presented, the accuracy decreased. Fichna et al. (2021) investigated speech intelligibility in an audio-visual virtual environment with a target direction unknown to the listeners. The target material was a matrix sentence test, and the interferers were unintelligible speech. They showed that intelligibility decreased with reverberation as well as with the number of interfering speech sources.

Fichna et al. (2021) presented a visual environment via virtual reality (VR) glasses that was congruent with the room acoustical properties of the virtual acoustic environment. Visual information about the room has previously been shown to affect perceptual auditory features, such as distance (Calcagno et al., 2012; Gil-Carvajal et al., 2016; Postma and Katz, 2017), but not the perceived reverberation (Schutte et al., 2019). Prior exposure to the room acoustical conditions has been shown to improve speech perception (Brandewie and Zahorik, 2010). Since acoustic exposure to a room can affect speech perception, visual room information might as well. To the authors' knowledge, the effect of visual room information on speech perception has not yet been reported.

Most of the aforementioned studies focused on the ability to localize speech but less on the ability to comprehend it. However, in a real-world cocktail party, listeners need to perform both tasks to communicate successfully. In the current study, we asked listeners to locate a talker speaking about a certain topic while presenting a varying number of other simultaneous talkers. Thus, the primary task was to understand the speech and the secondary task was to locate the talker. The experiment was conducted in an audio-visual virtual environment using a loudspeaker array and VR glasses. Virtual rooms were simulated with congruent visual and acoustic information. Additionally, a condition with incongruent audio-visual room information was presented to investigate the effect of visual room information on speech perception.

Thirteen native Danish-speaking normal-hearing listeners aged 20–26 years participated in the experiment (7 female and 6 male). Participants were paid on an hourly basis and gave consent to an ethics agreement approved by the Science-Ethics Committee for the Capital Region of Denmark (reference H-16036391).

The speech material for target and interferers was taken from a database of anechoically recorded monologues in Danish (Lund et al., 2019) (see supplementary materials for recordings). The database consists of 10 stories, each spoken by 10 native Danish speakers, resulting in 100 unique combinations of content and speaker. Five of the speakers are female and five are male. The stories are between 74 and 140 s long, with an average of 93 s. Each story has a different topic that can be distinguished easily by its content and was assigned an identifying visual icon. The story titles were “skiing,” “soccer,” “birds,” “jellyfish,” “Jimi Hendrix,” “coffee,” “Pac-Man,” “pyramids,” “space travel,” and “Vikings.”

Three different acoustic and visual rooms were used in this study: a high-reverberant, a mid-reverberant, and an anechoic room. The dimensions of all three rooms were identical, both acoustically and visually, as shown in Fig. 1. However, the surface materials differed.

FIG. 1.

Top view of the virtual audio-visual room. The listener is wearing VR glasses with a visual simulation of the room including 15 potential talker positions at 2.4 m distance in the frontal hemisphere visualized by the head icons. The height of the room is 2.8 m.

Figure 2 shows the visual appearances of the three rooms. Figure 2(A) shows the anechoic room, with foam wedges as commonly seen in anechoic chambers. Figure 2(B) shows the mid-reverberant room, whose visual and acoustical properties resembled those of a large living room. The highly reverberant room is shown in Fig. 2(C). It was modeled with bare concrete surfaces to simulate a highly reverberant, yet realistic, environment.

FIG. 2.

(Color online) Visual appearance of the three virtual rooms. A: anechoic, B: mid-reverberant, C: high-reverberant. The dimensions in the rooms are identical, whereas the surface materials differ.

The rooms were simulated using the room acoustic simulation software Odeon (Odeon A/S, Kgs. Lyngby, Denmark) with the materials and surface absorption coefficients shown in Table I. For the anechoic room, only the direct sound was considered. Figure 3 shows the reverberation time, clarity, and direct-to-reverberant ratio of the three rooms. The reverberation time and the clarity were calculated using the Institute of Technical Acoustics (ITA) toolbox (Berzborn et al., 2017), and the direct-to-reverberant ratio was calculated as the ratio between the sound pressure level of the direct sound and the sound pressure level of the reflections. In the anechoic condition, the clarity and direct-to-reverberant ratio are infinite, as no reflections are present (indicated by arrows in Fig. 3).
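
As an illustration of the direct-to-reverberant ratio computation described above, the following is a minimal sketch that isolates the direct sound with a short window around the peak of a simulated impulse response; the window length is an assumption for illustration and is not taken from the paper.

```python
import numpy as np

def direct_to_reverberant_ratio(rir, fs, direct_window_ms=2.5):
    """Estimate the DRR (in dB) of a room impulse response.

    The direct sound is taken as a short window around the main peak of the
    impulse response; all later samples are treated as reflections. The
    2.5-ms half-window is an assumed value, not the one used in the study.
    """
    peak = int(np.argmax(np.abs(rir)))
    half_win = int(direct_window_ms * 1e-3 * fs)
    direct_energy = np.sum(rir[max(0, peak - half_win):peak + half_win] ** 2)
    reverberant_energy = np.sum(rir[peak + half_win:] ** 2)
    return 10.0 * np.log10(direct_energy / reverberant_energy)
```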

TABLE I.

Absorption coefficients (α) of the surfaces in the mid-reverberant and high-reverberant room.

Surface      Material (mid-rev/high-rev)    63 Hz       125 Hz      250 Hz      500 Hz      1 kHz       2 kHz       4 kHz       8 kHz
Side walls   Wooden panels/Brick            0.2/0.06    0.2/0.06    0.2/0.06    0.3/0.07    0.4/0.07    0.4/0.07    0.5/0.08    0.5/0.09
Floor        Parquet/Concrete               0.2/0.05    0.2/0.05    0.15/0.05   0.1/0.05    0.1/0.07    0.05/0.07   0.1/0.07    0.1/0.07
Ceiling      Gypsum board/Concrete          0.3/0.05    0.3/0.05    0.35/0.05   0.4/0.05    0.4/0.07    0.4/0.07    0.5/0.07    0.55/0.07

The virtual visual scenes were rendered on the head-mounted display (HMD) of an HTC Vive Pro Eye (HTC Vive system, HTC Corporation, New Taipei City, Taiwan). The visual virtual scenes were modeled and displayed using Unity (Unity Technologies, San Francisco, CA).

FIG. 3.

(Color online) Reverberation time (T30), clarity (C50), and the direct-to-reverberant ratio (DRR) for the three rooms. The T30 and the C50 are shown with respect to octave frequency bands. The DRR is shown with respect to the source azimuth angle. The arrows indicate that the measure is infinite.

The acoustic scenes were reproduced on a 64-channel spherical loudspeaker array housed in an anechoic chamber (see Ahrens et al., 2019a, for details). The loudspeaker signals were generated from the room acoustic simulation using the LoRA toolbox (Loudspeaker-based Room Auralization system; Favrot and Buchholz, 2010). For the loudspeaker playback, nearest-loudspeaker mapping was applied, where the direct sound as well as the early reflections were mapped to the nearest loudspeaker. The late reverberant tail was reproduced using first-order Ambisonics to achieve a diffuse acoustic field (Favrot and Buchholz, 2010). For the acoustic reproduction of the anechoic room, only the direct sound was reproduced from single loudspeakers.
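
As a rough sketch of the nearest-loudspeaker mapping idea, the code below assigns the direct sound or an early reflection to the loudspeaker whose direction is angularly closest. This is a conceptual illustration only, not the actual LoRA implementation, and the Ambisonics rendering of the late reverberant tail is not shown.

```python
import numpy as np

def nearest_loudspeaker(image_source_dir, loudspeaker_dirs):
    """Return the index of the loudspeaker closest in angle to a source direction.

    image_source_dir: 3-vector pointing towards the direct sound or an early
    reflection (image source) in listener-centered coordinates.
    loudspeaker_dirs: (N, 3) array with one direction vector per loudspeaker.
    The nearest loudspeaker maximizes the dot product of the unit vectors,
    i.e., minimizes the angular distance.
    """
    s = np.asarray(image_source_dir, dtype=float)
    s /= np.linalg.norm(s)
    speakers = np.asarray(loudspeaker_dirs, dtype=float)
    speakers = speakers / np.linalg.norm(speakers, axis=1, keepdims=True)
    return int(np.argmax(speakers @ s))
```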

Prior to the experiment, the listeners performed a familiarization phase, in which they familiarized themselves with the speech material and the story content but not with the task itself. The anechoic versions of the 10 stories were played back via headphones in randomized order. Each talker was randomly assigned to one of the stories, so listeners heard each story and each talker once. During this training, listeners were instructed to focus on unique content features or passages of the stories. After completing the training, listeners were seated in the loudspeaker environment and introduced to the listening task and the interaction method using the VR controller.

The listeners' task was to identify the location of a talker amongst concurrent talker(s) in a virtual audio-visual room according to the story in the monologue. Accuracy and completion time of the task were emphasized by instructing the listeners to “find the correct story as fast as possible.” The number of concurrent talkers varied between two and eight; thus, the number of interfering talkers varied between one and seven. An icon visualizing the target story content was displayed on the back wall of the visual virtual room. The 15 possible talker positions were always represented by semi-transparent humanoid shapes, independent of the actual number of concurrent talkers. Figure 1 visualizes the possible talker locations, between −105° and 105° separated by 15° in the frontal hemisphere at a distance of 2.4 m. The task was performed by pointing at the position where the target talker was perceived, using a VR controller that displayed a laser pointer in the virtual room.

Between two and eight sources were presented simultaneously. For each scene, the talkers, stories, and spatial positions were chosen pseudo-randomly, i.e., no talker, story, or position could occur twice. A unique talker, story, and position was randomly chosen as the target. For each trial, the acoustic talkers were presented for 120 s. Each story was started at a random point in time. If a story finished before the listener responded or before the 120-s scene duration elapsed, the story continued from the beginning. Thus, no bias towards the beginning of each story was introduced. The listener could indicate the perceived target talker position at any time, even after the audio had stopped. Each individual talker was presented at a sound pressure level of 55 dB, which corresponds to static SNRs of 0, −3.0, −4.8, −6.0, −7.0, −7.8, and −8.4 dB for 1, 2, 3, 4, 5, 6, and 7 interfering talkers, respectively. However, in the paradigm applied in this study, the concept of static SNRs might be misleading because the sources are at different spatial locations and the listener is moving their head (see more details in the Discussion).
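
The static SNR values follow directly from presenting all talkers at the same level: N equal-power interferers sum to N times the target power, so the long-term SNR is −10·log10(N) dB. A minimal check, reproducing the reported values to within rounding:

```python
import numpy as np

# Each talker is presented at the same level, so N equal-power interferers
# sum to N times the target power: SNR = -10*log10(N) dB.
for n_interferers in range(1, 8):
    snr_db = -10.0 * np.log10(n_interferers)
    print(f"{n_interferers} interferers: {snr_db:.2f} dB")
# Prints: 0.00, -3.01, -4.77, -6.02, -6.99, -7.78, -8.45 dB
```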

Three congruent audio-visual rooms were used as described in Sec. II B: an anechoic, a mid-reverberant, and a high-reverberant room. In addition to the conditions with congruent audio and visual room information, two conditions with incongruent audio-visual cues were considered: anechoic acoustics with the visual appearance of the highly reverberant room, and high-reverberant acoustics with the visuals of the anechoic room. Thus, five room conditions were tested. Each combination of the 5 audio-visual conditions and the 7 talker conditions (between 2 and 8 concurrent talkers) was repeated three times, resulting in 105 trials.

To evaluate the listeners' ability to successfully analyze a cocktail-party scenario, two outcome measures were evaluated. The first was the listeners' ability to correctly identify and locate the target talker, which allows for a binary right/wrong analysis as well as an analysis of the localization error in degrees azimuth. The second outcome measure was the response time of the listeners, i.e., the time from audio onset to decision.
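
The sketch below illustrates how the two outcome measures could be derived from a single trial. The snapping of the pointing response to the nearest of the 15 possible positions is an assumption for illustration; the paper does not specify the exact scoring procedure.

```python
import numpy as np

# The 15 possible talker positions span -105 deg to +105 deg in 15-deg steps.
POSITIONS_DEG = np.arange(-105, 106, 15)

def score_trial(pointed_azimuth_deg, target_azimuth_deg, audio_onset_s, response_s):
    """Return (correct, localization_error_deg, response_time_s) for one trial."""
    # Assumed scoring: snap the pointed direction to the nearest possible position.
    responded = POSITIONS_DEG[np.argmin(np.abs(POSITIONS_DEG - pointed_azimuth_deg))]
    correct = bool(responded == target_azimuth_deg)          # binary right/wrong
    error_deg = float(abs(responded - target_azimuth_deg))   # azimuth error in degrees
    response_time_s = response_s - audio_onset_s             # time from onset to decision
    return correct, error_deg, response_time_s
```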

The response time was analyzed using an analysis of variance of linear mixed-effects models. The computational analyses were done using the statistical computing software R (R Core Team, 2020) and the lmerTest package (Kuznetsova et al., 2017). Within-factor analyses were conducted using estimated marginal means (also known as least-squares means), as implemented in the emmeans package (Lenth, 2020), with Tukey correction for multiple comparisons. Unless stated otherwise, the number of talkers and the reverberation condition were treated as fixed effects, and the participants and the repetitions were treated as random effects.
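
The analysis itself was carried out in R with lmerTest and emmeans. As an illustrative sketch only, the following Python/statsmodels analogue shows the assumed model structure (fixed effects for the number of talkers, the room, and their interaction; random effects for participant and repetition); file and column names are hypothetical, and this does not reproduce the original R analysis.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data file: one row per correct trial with columns
# response_time, n_talkers, room, participant, repetition.
df = pd.read_csv("response_times.csv")

# Linear mixed model: number of talkers, room, and their interaction as fixed
# effects; participant as grouping factor and repetition as an additional
# variance component. This mirrors, but does not reproduce, the lmerTest model.
model = smf.mixedlm(
    "response_time ~ C(n_talkers) * C(room)",
    data=df,
    groups=df["participant"],
    vc_formula={"repetition": "0 + C(repetition)"},
)
result = model.fit()
print(result.summary())
```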

Figure 4 shows the percentage of correctly located stories. Each bar contains 39 datapoints across the 13 participants and three repetitions. When five or fewer talkers were in a scene, the participants were able to accurately locate the correct story in all reverberation conditions (>85% accuracy). In scenes with more than five talkers, the accuracy in the high-reverberant condition (dark blue) decreases monotonically. In the mid-reverberant condition (mid blue), such a decrease can only be observed when eight talkers were in a scene. In the anechoic condition (light blue), the participants were able to accurately locate (>90% accuracy) the target story for all numbers of talkers.

FIG. 4.

(Color online) The percentage of correct response locations. Each bar contains 39 datapoints across subjects and repetitions. The three colors indicate the room conditions.

Figure 5 shows the response times of the correct responses when two to eight talkers were presented simultaneously. The response time is displayed with the reverberation conditions indicated by different colors. With an increasing number of simultaneous talkers, the time needed to identify the target talker increased [F(6,755.2) = 73.1, p < 0.0001]. The response time was also found to depend on the reverberation time [F(2,755.6) = 83.1, p < 0.0001]. Furthermore, the interaction between the number of talkers and the reverberation time was found to be significant [F(12,754.8) = 5.4, p < 0.0001]. Specifically, the high-reverberant condition led to longer response times when four or more talkers were presented (p < 0.05) but not with fewer than four talkers (p > 0.5). The differences between the high-reverberant condition and the anechoic/mid-reverberant conditions increase with larger numbers of talkers. No significant differences in response time between the anechoic and the mid-reverberant condition were found (p > 0.1) across all numbers of talkers.

FIG. 5.

(Color online) Response time of only correct responses with respect to the number of talkers in a scene. The colors indicate the room reverberation conditions. The boxes cover the range between the 25th and the 75th percentile. The horizontal line in the boxes indicates the median. The whiskers extend to 1.5 times the inter-quartile range. Outliers are indicated as dots.

In Fig. 6, the localization error is shown. The size of the dots indicates the number of responses at each azimuth error. Only a few errors occurred when few talkers were presented, and no effect of reverberation was found. In the high-reverberation condition, an increasing localization error was found for six or more talkers, with the eight-talker setting resulting in a median error of 30°, i.e., an error of two source positions from the target location. In the anechoic and mid-reverberant conditions, only small errors were found even in conditions with many simultaneous talkers.

FIG. 6.

(Color online) Absolute localization error with respect to the number of talkers. The three colors indicate the room conditions. The size of the circles indicates the number of responses at a given localization error.

Figure 7 shows the percentage of correctly identified stories, comparing the congruent and the incongruent audio-visual conditions with and without reverberation. The light blue/green bars indicate the acoustically anechoic conditions and the dark bars indicate the acoustically reverberant conditions. Only negligible differences arose from the audio-visual incongruency, with a maximum effect of 7.7% in the high-reverberation condition with six and seven talkers.

FIG. 7.

(Color online) The percentage of correct response locations comparing the congruent and incongruent audio-visual conditions. Each bar contains 39 datapoints across subjects and repetitions. The three colors indicate the room conditions.

Figure 8 shows the response times for the incongruent audio-visual conditions (green boxes), i.e., the conditions with anechoic acoustic stimuli and the visuals of the reverberant room (light green) and with high acoustic reverberation and the visuals of the anechoic room (dark green). Additionally, the response times from the congruent anechoic and reverberant conditions are shown (blue boxes, as in Fig. 5). For the statistical analysis of the influence of the audio-visual congruency, the congruency (audio-visual congruent vs audio-visual incongruent) and the number of talkers were treated as fixed effects. The reverberation condition was added to the random-effects structure of the mixed linear model. The number of talkers [F(6,1070) = 78.5, p < 0.0001] was found to significantly affect the response time. Neither the effect of the congruency [F(1,2) = 0.0029, p = 0.96] nor the interaction between the congruency and the number of talkers [F(6,1064) = 1.9, p = 0.08] was found to be significant. Although the effect did not reach statistical significance, comparing the acoustically high-reverberant conditions (dark blue vs dark green) with seven or eight talkers shows considerably lower median response times when the visual room information represents the anechoic room.

FIG. 8.

(Color online) Response time of only correct responses with respect to the number of talkers in a scene. The light blue and light green boxes indicate the anechoic room acoustic condition with congruent and incongruent visual information, respectively. The dark blue and dark green boxes indicate the high reverberant room acoustic condition. The boxes cover the range between the 25th and the 75th percentile. The horizontal line in the boxes indicates the median. The whiskers extend to 1.5 times the inter-quartile range. Outliers are indicated as dots.

In the current study, we investigated the ability of normal-hearing listeners to identify and locate a story in the presence of other stories. The task of the listeners was to locate a target story in the presence of a varying number of simultaneous interfering talkers. Furthermore, the effect of incongruency between auditory and visual room information was investigated by testing different audio-visually congruent and incongruent reverberant environments. The data showed that the localization accuracy and the response time are affected by the number of simultaneous talkers as well as by reverberation. With an increasing number of interfering talkers and increasing reverberation time, the performance of the listeners decreased. Presenting incongruent audio-visual room information did not affect the outcome measures.

Several factors are likely to contribute to the increase in response time and the decrease in localization accuracy with an increasing number of talkers. In the present study, the speech level of each talker was kept constant independent of the number of talkers, and therefore, the signal-to-noise ratio (SNR) decreased (see the Methods section for static SNR values). Thus, the intelligibility is expected to drop with the number of simultaneous talkers. However, the effective SNR is constantly changing with head motion and fluctuations in the signals (Grange and Culling, 2016). Head motion introduces a variation of the target and interferer angles relative to the head and thus, head shadow and interaural time differences vary. Both head shadow and interaural time differences have been shown to be utilized to separate target and interfering speech sources (Bronkhorst, 2000; Culling et al., 2004). Fluctuations in the speech signals allow for dip listening, which can significantly improve the SNR in some time-frequency bins. Such glimpses can help to better understand speech (Glyde et al., 2013; Miller and Licklider, 1950). When many speech sources are presented, such glimpses are usually reduced (Cooke, 2006; Freyman et al., 2004). Another effect that likely influences the response time is the amount of informational masking, i.e., confusions between the target and the interferers (Carhart et al., 1969; Durlach et al., 2003; Kidd et al., 2008; Watson, 2005). Previous studies have argued that the amount of informational masking decreases with an increasing number of simultaneous talkers (Carhart et al., 1975; Freyman et al., 2004; Simpson and Cooke, 2005). However, in the current study, the target talker needs to be identified by understanding the speech, and to do so, listeners also need to understand the content of the interferers. Thus, the listener needs to employ a strategy to search through the auditory scene, and while performing the search, each interfering talker temporarily becomes the target talker. Therefore, the definition of informational masking, which was already controversial in classic speech perception tasks (Durlach et al., 2003; Kidd et al., 2008; Watson, 2005), becomes even more complex.

No difference in response time was found between the anechoic and the mid-reverberation conditions. Reverberation was found to affect the response time only between the anechoic/mid-reverberation and the high-reverberation conditions, and only when there were four or more talkers in a scene. In the literature, it is reported that reverberation affects speech intelligibility more with few interfering talkers, because potential speech gaps and pauses get “filled” with the reverberant energy (Bolt and MacDonald, 1949; Xia et al., 2018). Such gaps generally do not exist with many overlapping speech sources (Cooke, 2006; Freyman et al., 2004). A potential explanation for this disagreement is that the task remains fairly easy with additional reverberation when few talkers are in a scene and thus, the effect of reverberation is masked.

The absence of a difference between the anechoic and the mid-reverberant condition contradicts results from previous studies, where differences in speech perception between mildly reverberant and anechoic conditions were found (Ahrens et al., 2019b; Duquesnoy and Plomp, 1980; Plomp, 1976). The reason for this discrepancy could be that the test paradigm is not sensitive enough to capture small differences in reverberation time, compared to traditional speech tests. However, Kopčo et al. (2010) discussed a similar finding, that mild reverberation does not affect speech localization in background speech, by comparing their study with data from Simpson et al. (2006). This raises the question of whether there is an effect of mild reverberation on speech intelligibility in everyday situations or whether this effect can only be observed in artificial listening scenarios in the laboratory.

Large localization errors were found in the condition with a large number of talkers and high reverberation but not in the anechoic condition. A similar effect was observed by Buchholz and Best (2020) in a condition with a low direct-to-reverberant ratio and additional noise. In some cases, the errors in the current study exceeded 90°, which is considerably larger than in previous studies. These large errors could be driven by genuinely incorrect location percepts. An alternative explanation is that some listeners might have given up in particularly complicated scenarios.

The spatial scene analysis method employed in this study was similar to Weller et al. (2016). The most significant difference between the approaches is that in the current study, the target speech stimulus needed to be understood while the task in Weller et al. (2016) was to judge the gender of all talkers presented in a scene. Consequently, they used the total number of perceived talkers as their main outcome measure, while we used the response time and the localization accuracy. Furthermore, in their study, the participants needed to translate the spatial percept from an egocentric auditory perception onto a top-down view interface. This translation was not needed in the current study since VR was employed as a user interface.

While the use of VR can allow for a more user-friendly interface, VR could also introduce issues to an experiment. For example, the auditory percept might be affected by the physical presence of the headset, which has been shown to be negligible for setups with sources far from the listener (Ahrens et al., 2019a; Fichna et al., 2021; Gupta et al., 2018). Furthermore, VR glasses might alter the participants' behavior, not only due to their physical presence but also because the visual world is not an exact copy of the real world. However, the influence is likely negligible in this experimental setup.

Contrary to classical speech perception studies where a %-correct or a reception threshold is determined, in the present study, the response time was used as the main outcome measure. Drullman and Bronkhorst (2000) used a similar speech localization/identification paradigm with sentences and words instead of ongoing speech. They showed that the trend of change in intelligibility with increasing number of talkers was similar to the trend of the response times, i.e., with more interfering talkers, the intelligibility decreases, and the response time increases. While the material and the task were not fully comparable between these studies, one can expect a correlation between speech intelligibility and response time.

Visual information is known to affect speech perception (McGurk and MacDonald, 1976). However, the effect of visual room information on auditory perception remains unclear. Previous studies showed that visual information about the room can improve auditory distance perception (Calcagno et al., 2012) and that incongruent audio-visual cues can disrupt distance or externalization percepts (Gil-Carvajal et al., 2016). Visual information has not, however, been shown to affect the perceived amount of reverberation (Schutte et al., 2019). Prior acoustic exposure to the room acoustics has been shown to improve speech perception (Brandewie and Zahorik, 2010), which might be evidence that the auditory system builds an internal representation of the room acoustics. Thus, incongruent visual room information might lead to the development of a wrong internal representation, which might introduce a cost on speech processing. In the current study, no impact on speech perception performance was found with incongruent audio-visual room information. The perception of speech is likely too robust to be affected by incongruent audio-visual information. Furthermore, Brandewie and Zahorik (2013) showed that less than a second of acoustic room exposure can already be enough for speech perception improvements. Thus, the acoustical room information learned from the first seconds of acoustic exposure might weigh more strongly than the incongruent visual room information in the development of a representation of room acoustic features in the auditory system.

The speech material (10 stories spoken by 10 talkers) was recorded specifically for this study with the aim of having distinctly different content that can be visualized with an icon. Furthermore, we aimed for natural speech as opposed to highly controlled recordings with professional speakers. This approach also comes with disadvantages; for example, some stories or talkers might be easier to understand than others. However, as stories and talkers were chosen randomly, their influence is likely to be small over a sufficiently large number of trials.

The playback of the stories was started at a random time within the storyline to remove the bias that listeners learn the first few seconds of the stories. Such learning could have improved performance over the course of the experiment. However, this approach likely introduces random variance in the data, as some parts of the stories might be easier to identify than others. Nevertheless, a random variation was judged to be preferable over a bias.

The test paradigm used in this study arguably reflects real-life listening situations better than most traditional speech intelligibility tests. While the task of understanding and locating a speech stream in the presence of interfering speech is closer to real life than traditional speech tests, it is by no means a replication of a realistic cocktail-party situation. First, all talkers are located at the same distance, are presented at the same speech level, and face the listener. This decision was made to not provide any level, directional, or direct-to-reverberant energy cues other than the information from the room reflections and the talkers themselves. Second, the visual avatars are highly conceptualized human bodies. Technology does not yet allow for visualization of highly realistic human avatars with conventional computational power and effort, and avatars that share similarities with real humans but evidently are not human might distract viewers (cf. the uncanny valley; Diel et al., 2022). Third, lip movements were not included in this study. This choice was made because lip-movement simulations have not, to the authors' knowledge, been evaluated for hearing research purposes. Additionally, the aim of the avatars was to serve more as a “response box” than as an actual simulation of a human talker.

In the present study, we investigated the ability of listeners to analyze a virtual audio-visual spatial scene with multiple talkers. A varying number of simultaneously spoken stories was presented in environments with different amounts of reverberation and with congruent and incongruent audio-visual room information. The listeners' task was to locate a target story out of interfering speech sources. No effect of incongruent audio-visual room information was found. Furthermore, results showed that the number of simultaneous talkers affected the correct identification as well as the response time only when five or more talkers were presented simultaneously. Reverberation only affected the outcome measures when four or more talkers were presented and when the reverberation time was high but not with moderate reverberation.

The authors thank Marton Marschall, Valentina Zapata Rodriguez, and Jakob Nygard Wincentz for their valuable feedback regarding the plausibility of the virtual audio-visual rooms and Torsten Dau for feedback on the experimental design. Furthermore, we thank the two anonymous reviewers for their constructive criticism that helped to improve this article.

1. Ahrens, A., Lund, K. D., Marschall, M., and Dau, T. (2019a). "Sound source localization with varying amount of visual information in virtual reality," PLoS ONE 14(3), e0214603.
2. Ahrens, A., Marschall, M., and Dau, T. (2019b). "Measuring and modeling speech intelligibility in real and loudspeaker-based virtual sound environments," Hear. Res. 377, 307–317.
3. Arweiler, I., and Buchholz, J. M. (2011). "The influence of spectral characteristics of early reflections on speech intelligibility," J. Acoust. Soc. Am. 130(2), 996–1005.
4. Arweiler, I., Buchholz, J. M., and Dau, T. (2013). "The influence of masker type on early reflection processing and speech intelligibility (L)," J. Acoust. Soc. Am. 133(1), 13–16.
5. Berzborn, M., Bomhardt, R., Klein, J., Richter, J. G., and Vorländer, M. (2017). "The ITA-Toolbox: An open source MATLAB toolbox for acoustic measurements and signal processing," Fortschritte der Akustik, Kiel, Germany, pp. 222–225, available at http://www.ita-toolbox.org/publications/ITA-Toolbox_paper2017.pdf.
6. Best, V., Keidser, G., Buchholz, J. M., and Freeston, K. (2015). "An examination of speech reception thresholds measured in a simulated reverberant cafeteria environment," Int. J. Audiol. 54(10), 682–690.
7. Bolt, R. H., and MacDonald, A. D. (1949). "Theory of speech masking by reverberation," J. Acoust. Soc. Am. 21(6), 577–580.
8. Brandewie, E., and Zahorik, P. (2010). "Prior listening in rooms improves speech intelligibility," J. Acoust. Soc. Am. 128(1), 291–299.
9. Brandewie, E., and Zahorik, P. (2013). "Time course of a perceptual enhancement effect for noise-masked speech in reverberant environments," J. Acoust. Soc. Am. 134(2), EL265–EL270.
10. Bronkhorst, A. W. (2000). "The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions," Acta Acust. united Ac. 86(1), 117–128.
11. Bronkhorst, A. W., and Plomp, R. (1990). "A clinical test for the assessment of binaural speech perception in noise," Audiology 29(5), 275–285.
12. Buchholz, J. M., and Best, V. (2020). "Speech detection and localization in a reverberant multitalker environment by normal-hearing and hearing-impaired listeners," J. Acoust. Soc. Am. 147(3), 1469–1477.
13. Calcagno, E. R., Abregú, E. L., Eguía, M. C., and Vergara, R. (2012). "The role of vision in auditory distance perception," Perception 41(2), 175–192.
14. Carhart, R., Johnson, C., and Goodman, J. (1975). "Perceptual masking of spondees by combinations of talkers," J. Acoust. Soc. Am. 58(S1), S35.
15. Carhart, R., Tillman, T. W., and Greetis, E. S. (1969). "Perceptual masking in multiple sound backgrounds," J. Acoust. Soc. Am. 45(3), 694–703.
16. Cherry, E. C. (1953). "Some experiments on the recognition of speech, with one and with two ears," J. Acoust. Soc. Am. 25(5), 975–979.
17. Cooke, M. (2006). "A glimpsing model of speech perception in noise," J. Acoust. Soc. Am. 119(3), 1562–1573.
18. Culling, J. F., Hawley, M. L., and Litovsky, R. Y. (2004). "The role of head-induced interaural time and level differences in the speech reception threshold for multiple interfering sound sources," J. Acoust. Soc. Am. 116(2), 1057–1065.
19. Diel, A., Weigelt, S., and Macdorman, K. F. (2022). "A meta-analysis of the uncanny valley's independent and dependent variables," J. Hum.-Robot Interact. 11(1), 1–33.
20. Drullman, R., and Bronkhorst, A. W. (2000). "Multichannel speech intelligibility and talker recognition using monaural, binaural, and three-dimensional auditory presentation," J. Acoust. Soc. Am. 107(4), 2224–2235.
21. Duquesnoy, A. J., and Plomp, R. (1980). "Effect of reverberation and noise on the intelligibility of sentences in cases of presbyacusis," J. Acoust. Soc. Am. 68(2), 537–544.
22. Durlach, N. I., Mason, C. R., Kidd, G., Jr., Arbogast, T. L., Colburn, H. S., and Shinn-Cunningham, B. G. (2003). "Note on informational masking (L)," J. Acoust. Soc. Am. 113(6), 2984–2987.
23. Favrot, S., and Buchholz, J. M. (2010). "LoRA: A loudspeaker-based room auralization system," Acta Acust. united Ac. 96(2), 364–375.
24. Fichna, S., Biberger, T., Seeber, B. U., and Ewert, S. D. (2021). "Effect of acoustic scene complexity and visual scene representation on auditory perception in virtual audio-visual environments," in 2021 Immersive and 3D Audio: From Architecture to Automotive (I3DA), pp. 1–9.
25. Freyman, R. L., Balakrishnan, U., and Helfer, K. S. (2004). "Effect of number of masking talkers and auditory priming on informational masking in speech recognition," J. Acoust. Soc. Am. 115(5), 2246–2256.
26. Gil-Carvajal, J., Cubick, J., Santurette, S., and Dau, T. (2016). "Spatial hearing with incongruent visual or auditory room cues," Sci. Rep. 6, 37342.
27. Glyde, H., Buchholz, J., Dillon, H., Best, V., Hickson, L., and Cameron, S. (2013). "The effect of better-ear glimpsing on spatial release from masking," J. Acoust. Soc. Am. 134(4), 2937–2945.
28. Grange, J. A., and Culling, J. F. (2016). "The benefit of head orientation to speech intelligibility in noise," J. Acoust. Soc. Am. 139(2), 703–712.
29. Gupta, R., Ranjan, R., He, J., and Gan, W.-S. (2018). "Investigation of effect of VR/AR headgear on head related transfer functions for natural listening," in AES International Conference on Audio for Virtual and Augmented Reality, August 11, 2018, Redmond, WA, available at http://www.aes.org/e-lib/browse.cfm?elib=19697.
30. Hawley, M. L., Litovsky, R. Y., and Colburn, H. S. (1999). "Speech intelligibility and localization in a multi-source environment," J. Acoust. Soc. Am. 105(6), 3436–3448.
31. Kidd, G., Mason, C. R., Richards, V. M., Gallun, F. J., and Durlach, N. I. (2008). "Informational masking," in Auditory Perception of Sound Sources, edited by W. A. Yost, A. N. Popper, and R. R. Fay (Springer Handbook of Auditory Research, New York), Vol. 29, pp. 143–189.
32. Kopčo, N., Best, V., and Carlile, S. (2010). "Speech localization in a multitalker mixture," J. Acoust. Soc. Am. 127(3), 1450–1457.
33. Kuznetsova, A., Brockhoff, P. B., and Christensen, R. H. B. (2017). "lmerTest package: Tests in linear mixed effects models," J. Stat. Softw. 82(13), 1–26.
34. Lenth, R. (2020). emmeans: Estimated Marginal Means, aka Least-Squares Means, available at https://cran.r-project.org/package=emmeans (Last viewed September 5, 2022).
35. Lund, K. D., Ahrens, A., and Dau, T. (2019). "A method for evaluating audio-visual scene analysis in multi-talker environments," in Proceedings of the International Symposium on Auditory and Audiological Research: Auditory Learning in Biological and Artificial Systems, Ballerup, Denmark, Vol. 7.
36. McGurk, H., and MacDonald, J. (1976). "Hearing lips and seeing voices," Nature 264(5588), 746–748.
37. Miller, G. A., and Licklider, J. C. R. (1950). "The intelligibility of interrupted speech," J. Acoust. Soc. Am. 22(2), 167–173.
38. Moncur, J. P., and Dirks, D. (1967). "Binaural and monaural speech intelligibility in reverberation," J. Speech Lang. Hear. Res. 10(2), 186–195.
39. Nabelek, A. K., and Mason, D. (1981). "Effect of noise and reverberation on binaural and monaural word identification by subjects with various audiograms," J. Speech Lang. Hear. Res. 24(3), 375–383.
40. Nábĕlek, A. K., and Pickett, J. M. (1974). "Reception of consonants in a classroom as affected by monaural and binaural listening, noise, reverberation, and hearing aids," J. Acoust. Soc. Am. 56(2), 628–639.
41. Plomp, R. (1976). "Binaural and monaural speech intelligibility of connected discourse in reverberation as a function of azimuth of a single competing sound source (speech or noise)," Acta Acust. united Ac. 34(4), 200–211, available at http://www.ingentaconnect.com/content/dav/aaua/1976/00000034/00000004/art00004.
42. Postma, B. N. J., and Katz, B. F. G. (2017). "The influence of visual distance on the room-acoustic experience of auralizations," J. Acoust. Soc. Am. 142(5), 3035–3046.
43. R Core Team (2020). R: A Language and Environment for Statistical Computing, available at https://www.r-project.org/ (Last viewed September 6, 2022).
44. Schutte, M., Ewert, S. D., and Wiegrebe, L. (2019). "The percept of reverberation is not affected by visual room impression in virtual environments," J. Acoust. Soc. Am. 145(3), EL229–EL235.
45. Simpson, B. D., Brungart, D. S., Iyer, N., Gilkey, R. H., and Hamil, J. T. (2006). "Detection and localization of speech in the presence of competing speech signals," in Proceedings of the 12th International Conference on Auditory Display, June 20–23, 2006, London, UK, pp. 129–133.
46. Simpson, S. A., and Cooke, M. (2005). "Consonant identification in N-talker babble is a nonmonotonic function of N," J. Acoust. Soc. Am. 118(5), 2775–2778.
47. Warzybok, A., Rennies, J., Brand, T., Doclo, S., and Kollmeier, B. (2013). "Effects of spatial and temporal integration of a single early reflection on speech intelligibility," J. Acoust. Soc. Am. 133(1), 269–282.
48. Watson, C. S. (2005). "Some comments on informational masking," Acta Acust. united Ac. 91(3), 502–512.
49. Weller, T., Best, V., Buchholz, J. M., and Young, T. (2016). "A method for assessing auditory spatial analysis in reverberant multitalker environments," J. Am. Acad. Audiol. 27(7), 601–611.
50. Xia, J., Xu, B., Pentony, S., Xu, J., and Swaminathan, J. (2018). "Effects of reverberation and noise on speech intelligibility in normal-hearing and aided hearing-impaired listeners," J. Acoust. Soc. Am. 143(3), 1523–1533.