This letter presents a reaction time analysis of a sound lateralization test. Sounds from various directions were synthesized using interaural time–level difference (ITD–ILD) combinations, and human subjects performed left/right detection. Stimuli from the sides yielded quicker reactions and better left–right classification accuracy than stimuli closer to the front. Congruent ITD–ILD cues significantly improved both metrics. For opposing ITD–ILD cues, the subjects' choices were mostly driven by the ITD, and the responses were significantly slower. The findings, obtained with an easily accessible methodology, corroborate the integrated processing of the binaural cues and promote the use of multiple congruent binaural cues in headphone reproduction.
Spatial hearing evokes multiple facets of sound perception, enabling azimuthal, vertical, and front–back localization, distance and motion perception, and segregation of spatially separated sound streams.1 Spatial hearing, moreover, enriches the real-world listening experience.2 Understanding spatial hearing is important for spatial sound synthesis and reproduction in applications such as enhanced teleconferencing and augmented reality/virtual reality (AR/VR).3
The prominent acoustic cues aiding spatial hearing include binaural cues, such as interaural time and level differences (ITDs and ILDs, respectively), which localize the sound source down to a so-called “cone of confusion,” and monaural spectral cues arising from the shape of the pinna, head, and torso, which carry stimulus elevation information and help resolve front–back confusion.4 Focusing on the localization of sounds perceived within the head along the interaural axis, also referred to as lateralization, numerous psychoacoustic studies have shown that, for pure tones, ITD and ILD cues are best perceived in two largely non-overlapping spectral regions.4,5 Unlike tones, natural sound signals are broadband. For broadband noise-like sounds, there is evidence that the temporal fine-structure ITD is the dominant cue for source lateralization.6,7 However, this dominance diminishes when the coherence between the left- and right-ear signals is low,8 as encountered, for example, in reverberant environments. Neurophysiologically, the ITD and ILD cues are extracted separately by specialized brainstem nuclei.9 To reach a coherent representation of the auditory space, the brain needs to fuse the information provided by the two cues. Evidence supporting this hypothesis is found in neuroimaging studies of human subjects engaged in localization tasks, such as analyses of the auditory brainstem response in the mid-auditory pathways,10 and of cortical-pathway responses using electroencephalography and functional magnetic resonance imaging.11,12
The neurophysiological processes involved in the extraction of ITD and ILD cues entail processing delays. In cognitive science, reaction time (RT), defined as the time from stimulus presentation to subject response acquisition, has been used as a means to understand the mental operations in behavioral tasks.13–15 Focusing on psychoacoustics, early studies analyzed the RT distributions obtained from tone-in-noise detection tasks,16 and a few others established an inverse relationship between RT and loudness.17,18 RT analysis has also been used to investigate the stimulus factors influencing the perception of deviations in the spectral characteristics of noise signals.19 Recently, Sharma et al.20 utilized RT analysis to understand human talker change detection in naturalistic speech utterances, and Sharma et al.21 explored RT for analyzing the impact of language familiarity on talker change detection. The RT analysis approach has also been used to understand the effect of redundant stimuli.22 In this context, it has been shown that when subjects are required to respond as quickly as possible to the onset of any stimulus, they usually respond more quickly when two stimuli are presented than when only one is.23,24 Some studies have also explored localization using RT analysis. In a small-scale study (with six subjects) on the localization of external sound sources, the average RT was found to be approximately 260 ms.25 In a study comparing blind and non-blind listeners' localization accuracy, RT was found to improve with training in the blind listeners.26 In contrast to sound lateralization, the acoustic cues for sound localization are associated with multiple factors, such as the head-related transfer function, head movement, and ambient reverberation. This complicates the stimulus design procedure for sound localization experiments.
Further, collecting subject RT for sound localization is influenced by the search for the response button (or head orientation) corresponding to the perceived angle.25,26 This search time, independent of the acoustic cues, may be biased toward certain locations.
The ease of capturing RT in behavioral listening tests is advancing the scientific understanding of sound perception. A largely unexplored topic here is sound lateralization. In contrast to sound localization, capturing RT for sound lateralization is straightforward, as it only requires the subject to choose between left and right using, for example, a joystick. New findings on sound lateralization can benefit the design of spatial audio standards, which aim to improve the quality of experience of AR/VR applications.27 For example, designing sound that swiftly elicits a lateralization percept is becoming essential for creating realistic AR/VR games.28 Designing interior audio cues for cars and aircraft cockpits is also gaining interest.29,30 These applications aim to increase situational awareness using interior audio cues that can quickly draw attention to the desired direction (and/or information).
To further the scientific insights and facilitate sound design in applications requiring sound lateralization, this study focuses on using RT analysis to study the interaction between the ITD and ILD cues. Four different combinations of ITD and ILD cues were considered: ITD only, ILD only, congruent ITD and ILD, and opposing ITD and ILD. A stimulus set composed of binaural broadband sound stimuli embedded with these cue conditions for eight different angles was designed. A listening test was conducted to capture the accuracy and RTs for left-right lateralization [shown in Fig. 1(a)]. The collected human behavioral response data were analyzed to understand the impact of various interaural cue conditions and source directions on accuracy and RTs.
2.1 Participants
A total of 24 subjects (nine female, 15 male), 24–40 yrs of age and without diagnosed hearing impairment, participated in the listening experiment. All participants provided their written consent for the listening test. The subjects were residents of Erlangen, Germany.
2.2 Sound stimuli
The sound stimuli were synthesized using a parametric binaural renderer (used in Refs. 31 and 32). The ITD and ILD were computed from a pinnaless “blockhead” model33 that has median-plane symmetry. The model consists of a half-sphere stacked on a rectangular prism, with a half-sphere radius of a = 9.08 cm and a prism height of b = 3 cm, i.e., no individualization was performed. No elevation adjustments were included, as all sound sources were located on the horizontal plane at an equal distance from the listener. The processing consisted of a time-varying delay line to embed the ITD and a head-shadowing filter to embed the ILD. For each angle, the two cues were embedded independently into the sound signal. Synthetic sound stimuli with the four binaural cue conditions listed in Fig. 1(b) were synthesized for each angle. To synthesize sound stimuli with opposing ITD and ILD cues, the delays (modeling the ITD) for the left and right channels were swapped compared to those used for the congruent cues. The sound source was a pink noise signal, featuring the characteristic 1/f spectral rolloff.34 The rendered binaural audio had a sound pressure level of 66.99 ± 0.60 dB, with a sampling rate of 48 kHz. (See supplementary material for the resulting average behavior of the ITD and ILD cues estimated from the synthesized sound stimuli.35)
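The rendering chain described above can be sketched as follows. This is a minimal illustration, not the renderer used in the study: it substitutes a Woodworth-type spherical-head formula for the blockhead-model ITD and a crude broadband gain for the head-shadowing filter, and `max_ild_db` is an invented placeholder value.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
HEAD_RADIUS = 0.0908    # a = 9.08 cm, as in the head model above

def woodworth_itd(azimuth_deg, a=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Spherical-head ITD approximation (Woodworth-type formula).

    Positive azimuth (source to the right) yields a positive ITD,
    i.e., the left (far) ear receives the signal later.
    """
    theta = np.deg2rad(azimuth_deg)
    return (a / c) * (theta + np.sin(theta))

def render_lateralized(x, fs, azimuth_deg, embed_itd=True, embed_ild=True,
                       opposing=False, max_ild_db=12.0):
    """Embed ITD and/or ILD cues for a static source on the horizontal plane.

    The ILD here is a crude broadband level difference scaled by
    sin(azimuth); the real renderer uses a head-shadowing filter instead.
    """
    itd = woodworth_itd(azimuth_deg) if embed_itd else 0.0
    if opposing:
        itd = -itd  # swap the left/right delays relative to the congruent case
    delay = int(round(abs(itd) * fs))
    left, right = x.copy(), x.copy()
    if itd > 0:    # source on the right: delay the left (far) ear
        left = np.concatenate([np.zeros(delay), left[:len(x) - delay]])
    elif itd < 0:  # source on the left: delay the right (far) ear
        right = np.concatenate([np.zeros(delay), right[:len(x) - delay]])
    if embed_ild:
        ild_db = max_ild_db * np.sin(np.deg2rad(azimuth_deg))
        g = 10.0 ** (ild_db / 40.0)  # split the level difference between ears
        left, right = left / g, right * g
    return np.stack([left, right])
```

Passing `embed_itd=True, embed_ild=False` (or the reverse) yields the single-cue conditions, and `opposing=True` the opposing-cue condition.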
2.3 Listening test
The listening test was conducted in an isolated listening booth. The sound stimuli were presented using Beyerdynamic DT 770 PRO headphones, for which the frequency responses are available in Ref. 36. The visual instructions (and indications) were provided on a Fujitsu P24T-7 LED monitor kept at a distance of approximately cm, and key-press responses were collected using an Xbox One Controller 1708 connected to the computer with a universal serial bus cable. The graphical user interface for stimulus presentation and data collection was designed using the open-access PsychoPy Python utility37 and run on a Dell computer with an Intel i7-4700K processor, running a Microsoft Windows operating system.
The setup of the listening test is illustrated in Fig. 1(a). Each trial started by displaying a rectangular visual indicator for s, prompting the subject to get ready. Directly afterward, a stereo sound stimulus was presented. Subjects were instructed to listen to the sound stimulus and respond (via a key press) as soon as they decided whether the sound source was left- or right-lateralized. The decision (left or right) and the RT to make the decision (from the start of the stimulus) were recorded. The trial ended as soon as the subject responded with a key press. This approach, termed response-terminated stimulus playback, is also used in other RT-based psychoacoustics studies.16,19,20 The listening test comprised five sessions with 128 trials each. All five sessions featured the same stimulus set: four repeats of each of the four cue conditions and eight angles, presented in random order. After each session, there was a break, and feedback was provided by displaying the session performance in terms of accuracy (computed by pooling all trials from the ITD-only, ILD-only, and congruent cue conditions). The subjects were encouraged to use the break ( min) for flexing their muscles to avoid fatigue. On average, the listening test took 30 min.
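The response-terminated trial logic can be sketched in plain Python as follows. This is a schematic, not the study's PsychoPy implementation; `play`, `stop`, and `get_key` are hypothetical stand-ins for the audio and controller backends, and the 3 s timeout mirrors the slow-RT outlier cutoff used in the preprocessing.

```python
import time

def run_trial(play, stop, get_key, timeout_s=3.0):
    """One response-terminated trial: start playback, record the reaction
    time from stimulus onset to key press, and stop playback immediately
    after the response."""
    play()
    t0 = time.monotonic()
    while time.monotonic() - t0 < timeout_s:
        key = get_key()  # expected to return 'left', 'right', or None
        if key in ("left", "right"):
            rt_ms = 1000.0 * (time.monotonic() - t0)
            stop()
            return key, rt_ms
    stop()
    return None, None  # no response within the timeout
```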
2.4 Data preprocessing
The obtained subject responses were first filtered for outliers: RTs faster than 100 ms or slower than 3000 ms were removed, per the suggestions in Ref. 38. For every subject, the first session was treated as an unsupervised learning iteration, i.e., the subjects listened to each type of stimulus but without a direction label provided, and only data from sessions two to five were used for the rest of the analysis. This resulted in 16 trials per stimulus per subject. After preprocessing, we obtained a total of 12 275 responses pooled from all subjects.
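The preprocessing reduces to a simple filter. A minimal sketch, assuming a hypothetical per-trial record layout (the paper does not specify the data format):

```python
def preprocess(trials):
    """Keep trials from sessions two to five with RTs inside [100, 3000] ms.

    Each trial is a dict such as
    {"session": 2, "rt_ms": 350.0, "correct": True}
    (a hypothetical layout for illustration)."""
    return [t for t in trials
            if t["session"] >= 2 and 100.0 <= t["rt_ms"] <= 3000.0]
```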
2.5 Statistical analysis
After data preprocessing, the accuracy for any specific condition (such as lateralization cue and/or angle) is obtained by computing the percentage of trials with correct responses over all trials corresponding to that condition. Here, accuracy corresponds to left–right class accuracy. Similarly, the subject average RT is computed by pooling the RT data from all correct response trials for the corresponding condition. As a summary result, average accuracy and average RT were computed by pooling the accuracy and average RT from all the subjects.
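The per-condition summary described above can be sketched as follows, again assuming a hypothetical record layout with `correct` and `rt_ms` fields:

```python
from statistics import mean

def condition_summary(trials):
    """Accuracy (% correct) over all trials of a condition, and the mean RT
    pooled from the correct-response trials only."""
    accuracy = 100.0 * sum(t["correct"] for t in trials) / len(trials)
    correct_rts = [t["rt_ms"] for t in trials if t["correct"]]
    avg_rt = mean(correct_rts) if correct_rts else float("nan")
    return accuracy, avg_rt
```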
Bootstrapping (n = 1000 draws with replacement) was used to report 95% confidence intervals (CIs), as described in Ref. 39. To compare RTs across conditions, classic non-parametric tests were utilized, namely, the Mann–Whitney U test40 and the Wilcoxon signed-rank test41 for independent and paired samples, respectively. The p-values and statistics are provided in the result descriptions, and annotations on the plots follow the mapping: n.s. for p > 0.05, * for p < 0.05, ** for p < 0.01, and *** for p < 0.001. To compare cue conditions across multiple angular locations, such as in Fig. 3, p-values were computed by paired comparisons at each angle, and the harmonic mean was then used to aggregate the results across angles, per Ref. 42. The chance accuracy, used in Fig. 4(b), is computed assuming a binomial distribution for the classification errors made by a subject who answers according to a coin toss.43 For N = 16 trials and a statistical significance level of 95%, we obtain a threshold of 68.75%.
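Two of the quantities above are easy to reproduce: the binomial chance-level threshold and the harmonic-mean aggregation of per-angle p-values. A minimal sketch; the exact tail criterion is our assumption, chosen because it reproduces the 68.75% threshold for N = 16:

```python
from math import comb

def chance_threshold(n_trials, alpha=0.05, p=0.5):
    """Smallest accuracy k/N such that a coin-tossing subject exceeds it
    with probability below alpha, i.e., the smallest k with
    P(X > k) < alpha for X ~ Binomial(N, p)."""
    for k in range(n_trials + 1):
        tail = sum(comb(n_trials, i) * p**i * (1.0 - p)**(n_trials - i)
                   for i in range(k + 1, n_trials + 1))
        if tail < alpha:
            return 100.0 * k / n_trials
    return 100.0

def harmonic_mean_p(p_values):
    """Harmonic mean of p-values, used to aggregate paired per-angle tests."""
    return len(p_values) / sum(1.0 / p for p in p_values)
```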
3.1 Task performance
Figure 2(a) depicts the average accuracy computed across subjects as a function of angle. The left–right detection accuracy is greater than 90% for all angles, suggesting that the lateralization task was fairly easy. A comparison of left vs right performance did not show a statistically significant difference in accuracy, suggesting that the subjects performed equally well for sound sources on either side. Figure 2(b) illustrates the average accuracy as a function of absolute angle. There is a larger relative drop in detection performance for ±15° compared to the larger angles, indicating some difficulty at small angles. A statistical comparison showed that the largest significant drop in average accuracy occurs from ±30° to ±15° (N = 24, W = 730, p < 0.001), with a further significant drop from ±60° to ±30° (N = 24, W = 176.5, p < 0.05). The small drop in accuracy from ±90° to ±60° was not significant.
Figure 2(c) depicts the average RT computed across subjects. The average RTs across the eight angles reside in the range of 300–380 ms. Similar to accuracy, there is no significant difference in average RT for sources on the left vs right. Figure 2(d) illustrates average RT as a function of the absolute angle. There is a significant rise in the average RT for smaller angles, that is, from ±60° to ±30° (N = 24, W = 301, p < 0.01) and ±30° to ±15° (N = 24, W = 60, p < 0.001).
3.2 Comparison between single-cue and two-cue combinations
Figure 3(a) depicts the accuracy separately for the three binaural cue conditions: ITD only, ILD only, and congruent ITD and ILD. Both sides exhibit a decrease in accuracy from larger to smaller lateralized angles. This trend holds strictly across all angles for the congruent condition but not for ITD only and ILD only. Comparing the accuracy across the three conditions, the performance for the congruent condition is always higher than that for ITD only and ILD only. A statistical comparison using a one-sided Wilcoxon signed-rank test across all angles suggests this difference is significant. A comparison between ITD only and ILD only also shows a significant difference in a two-sided Wilcoxon signed-rank test.
Figure 3(b) depicts the average RTs. For both the left and right sides, the trend is an increase in average RT from larger to smaller angles. This trend holds strictly for the congruent condition but not across all angles for ITD only and ILD only. Comparing the RTs across the three cue conditions, the RT for the congruent condition is always faster than that for ITD only and ILD only. A statistical comparison across all angles, following the methodology described in Sec. 2.5, suggests that the difference between the congruent condition and the other two is significant in a one-sided Wilcoxon signed-rank test. A comparison between ITD only and ILD only shows a significant difference in a two-sided Wilcoxon signed-rank test, but a one-sided Wilcoxon test does not identify either of the two as significantly larger.
3.3 Impact of the opposing cue combination
Figure 4(a) shows a comparison of average RT between the congruent and opposing cue conditions. Across all angles, the RT for the opposing condition is significantly slower than that for the congruent condition. Also, the CIs are larger for the opposing condition in comparison to the congruent one.
Hypothetically assuming that the ground truth for the opposing condition is dictated by the ITD cue, we evaluate the accuracy for this cue condition. Figure 4(b) shows the performance accuracy of different subjects across different angles. Most subjects have a performance accuracy greater than the computed chance-level threshold of 68.75%.
The analysis of the collected data shows that subjects perform fairly well in the left–right lateralization task. The detection accuracy is affected by the lateralization angle, with a drop for angles closer to the front (that is, ±15° and ±30°). Interestingly, the analysis of the average RT across subjects shows a significant impact of both the lateralization angle and the lateralization cue condition on the reaction time.
The results indicate a slower average RT for the lateralization of sound sources closer to the front (that is, ±15° and ±30°) relative to those toward the sides (that is, ±60° and ±90°). Dissecting this further, the results indicate an impact of the lateralization cue condition on the average RT. Specifically, sound stimuli containing both ITD and ILD cues (that is, the congruent condition) featured a faster average RT than sound stimuli with only one of these cues (that is, ITD only and ILD only). This suggests that driving lateralization using multiple binaural cues can benefit spatial hearing in applications that require time-constrained sensory interactions, such as AR/VR with spatial audio.
The finding of a faster average RT for congruent stimuli is in line with evidence found by Riedel and Kollmeier10 in auditory brainstem response (ABR) neuroimaging data. There, the “wave V” latency in ABR signals was found to be shortest for sound stimuli with congruent ITD–ILD cue combinations. This suggests that, mechanistically, quicker neural processing of lateralization occurs for stimuli with congruent cue combinations. Our results show that this quicker processing is preserved and reflected in the average RT of the behavioral responses as well.
The results from analyzing the average RT for sound stimuli with the opposing cue combination suggest an increased latency in inferring lateralization from such sounds. The disparity between the angle α conveyed by the ITD and that conveyed by the ILD was quite large (≥30°) for the angular positions used in this study. In line with observations in Hafter and Jeffress,44 this resulted in multiple subjects reporting a split-image perception for such sound stimuli. This also contributes to the higher variability in RT across subjects for such sounds. For the opposing condition, most subjects' results are consistent with the notion that the lateralization is associated with the ITD cue. This agreement was even stronger for larger lateralization angles. Surprisingly, the choices of two subjects were consistent with the direction associated with the ILD cue. This suggests that a follow-up study on such subjects might be insightful for understanding lateralization at a more subject-dependent level. Some of these observations follow the findings in Refs. 6 and 45, which suggest that for broadband stimuli, the lateralization is largely driven by the ITD.
Focusing on understanding the interaction between the ITD and ILD cues, we presented an analysis of data gathered from a lateralization listening test. Specifically, the impact of various binaural cue combinations and of the angle on lateralization accuracy and RT was analyzed. The results show shorter RTs for sound stimuli farther from the front. Further, sound stimuli with congruent ITD–ILD cues elicit a quicker response than stimuli with ITD or ILD only. Analyzing the responses to sound stimuli with exactly opposing ITD–ILD cues, the RT is found to be significantly slower, and the lateralization decision is mostly driven by the ITD cue. These results provide additional evidence for an integrated processing of the ITD and ILD cues. The observations also promote the use of multiple congruent cues in spatial audio presentation for applications such as AR/VR with immersive audio.
We are thankful to Dr. Olli Rummukainen for discussions during the initial planning of this study and for sharing resources for simulating the head model for binaural sound synthesis.