The minimum audible angle has been studied with a stationary listener and either a stationary or a moving sound source. The study at hand focuses on a scenario where the angle is induced by listener self-translation relative to a stationary sound source. First, the classic stationary-listener minimum audible angle experiment is replicated using a headphone-based reproduction system. This experiment confirms that the reproduction system provides localization cue resolution comparable to loudspeaker reproduction. Next, the self-translation minimum audible angle is shown to be 3.3° in the horizontal plane in front of the listener.

The minimum audible angle (MAA) in azimuth for a stationary listener and a stationary sound source has been established in multiple studies to be approximately 1° in the frontal listening area and to degrade gradually with increasing angular distance from the median plane (Mills, 1958; Perrott and Saberi, 1990). In these studies, the test participant is typically seated in an anechoic chamber, and their movement is physically limited or otherwise discouraged. The sound event is fixed to a stationary loudspeaker or, in studies of the minimum audible movement angle (MAMA), to a moving boom that carries a single sound source dynamically across space at a fixed distance (Perrott and Musicant, 1977). The angles found in MAMA studies depend heavily on velocity, frequency content, and listener training, and are on average two to three times larger than MAAs for stationary stimuli (Carlile and Leung, 2016).

In contrast to MAA and MAMA studies, a natural way for humans to observe the world is an active process where motor functions support sensory information processing (Engel et al., 2013). Dynamic cues resulting from head rotation have been shown to resolve front-back confusions in binaural sound reproduction (Begault et al., 2001). Dynamic cues due to listener translation in a sound field are less studied, but existing results indicate that a dynamic setting eases the requirements for individualized head-related transfer functions (HRTFs) in binaural audio reproduction (Loomis et al., 1990) and that motion parallax and acoustic time-to-target are informative about the relative motion between observer and source (Speigle and Loomis, 1993). Brimijoin and Akeroyd (2014) studied the minimum moving audible angle (MMAA), where a pair of concurrently active sound sources is either rotated around the listener or the rotation is produced by head movements. They found 1°–2° smaller angles when the rotation was due to head movement. Recently, Genzel et al. (2018) showed that active self-translation improves auditory depth perception via the acoustic parallax phenomenon. They also found that the sensitivity to relative depth deteriorates when subjects are translated by a motion platform or when the sound sources themselves move.

Based on previous research, active self-motion appears to benefit the accuracy of relative judgments of sound event direction and distance in dynamic scenarios. This study explores the absolute perception of sound event stationarity in a dynamic six-degrees-of-freedom (DoF) setting where the listener rotates their head around three axes and self-translates across a three-dimensional space. Binaural reproduction is utilized in the experiment. Such translational movement is typical in everyday life, and it is increasingly important for virtual and augmented reality audio as room-scale tracking becomes commonplace. The goal is to estimate a self-translation minimum audible angle (ST-MAA) and to compare it to a source-translation induced minimum audible movement angle, where the dynamic binaural cues elicited by the two types of translation are identical.

The goal of experiment I is to link our headphone-based reproduction method to existing literature on sound event localization.

In total, 15 people participated in experiment I. Each participant was screened for hearing impairments by standard pure-tone audiometry. Four of the participants were excluded from the final data analysis due to poor audiometry performance, leaving a total of 11 participants (seven female, four male). Their average age was 34.6 years (SD = 14.7). All participants provided written informed consent to take part in the test.

Pink noise, in which each octave band carries an equal amount of energy, was rendered to headphones by a parametric binaural renderer. The interaural time difference (ITD) and interaural level difference (ILD) were computed from a spherical head model (Algazi et al., 2001; Duda and Martens, 1998). The processing consisted of a time-varying delay line and a second-order shelving filter. No individualization was performed, and the radius of the head model was set to 87 mm. All sound events were located on the horizontal plane at an equal distance from the listener. No pinna model, and therefore no elevation adjustment, was included. The signal was played back with an RME Babyface audio interface at 48 kHz with a buffer size of 32 samples via Beyerdynamic DT770 PRO headphones. The HTC Vive headset displayed a virtual landscape and provided real-time position data of the participant. Additionally, the display showed a visual cue indicating where the participant should be facing. The average total latency from movement to stimulus for the HTC Vive system is 22 ms (Niehorster et al., 2017). The experiment logic, interface, and stimuli were implemented, and the experiment controlled, in Max/MSP 7.
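
To make the rendering concrete, the sketch below implements the ITD component of such a renderer using Woodworth's classic rigid-sphere approximation and the 87 mm radius stated above. It is a minimal illustration under those assumptions, not the renderer itself; the delay-line interpolation and the shelving-filter design for the ILD are omitted, and the function names are illustrative only.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value
HEAD_RADIUS = 0.087     # m, the radius stated in the text

def woodworth_itd(azimuth_deg):
    """Interaural time difference (s) for a distant source on a rigid sphere.

    Woodworth's approximation ITD = (a/c) * (sin(theta) + theta) for
    azimuths within +/-90 deg; positive values mean the right ear leads.
    """
    theta = np.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (np.sin(theta) + theta)

# A time-varying delay line would convert this to a fractional sample
# offset, e.g., woodworth_itd(90.0) * 48000 is roughly 31 samples at 48 kHz.
```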

A transformed 3-down, 1-up adaptive staircase procedure was used, where three consecutive correct answers were required to lower the stimulus level (Levitt, 1971). This procedure allows the estimation of a threshold value at the 79.4% correct response rate. Each track was initiated with a step size of 1°, which was reduced to 0.5° after five reversals, and finally to 0.1° after another five reversals. A track was completed after a total of 15 reversals or 120 trials.
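
The staircase logic can be sketched as follows, assuming a hypothetical respond(angle) callback that reports whether a trial was answered correctly; the threshold is taken as the mean of the last five reversal levels, as in the analysis below.

```python
def run_staircase(respond, start_angle=10.0):
    """Transformed 3-down, 1-up track (Levitt, 1971): three consecutive
    correct responses lower the angle, one error raises it, converging on
    the 79.4% correct point. Step size is 1 deg for the first five
    reversals, 0.5 deg for the next five, then 0.1 deg; the track ends
    after 15 reversals or 120 trials."""
    angle, streak, direction, reversals = start_angle, 0, 0, []
    for _ in range(120):
        step = 1.0 if len(reversals) < 5 else 0.5 if len(reversals) < 10 else 0.1
        if respond(angle):
            streak += 1
            if streak < 3:
                continue                       # need three in a row to descend
            streak, new_direction = 0, -1
        else:
            streak, new_direction = 0, +1
        if direction and new_direction != direction:
            reversals.append(angle)            # direction change = reversal
            if len(reversals) == 15:
                break
        direction = new_direction
        angle = max(angle + new_direction * step, 0.0)
    # Threshold estimate: mean of the last five reversal levels.
    return sum(reversals[-5:]) / max(len(reversals[-5:]), 1)
```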

Each participant completed two randomly interleaved adaptive tracks with starting azimuth angles of ±10° from the reference. One trial consisted of two consecutive noise bursts with 1000 ms duration and a 200 ms inter-stimulus interval. The task was to detect whether the second auditory event was located to the right or the left of the first auditory event, regardless of the pair's absolute position in space. The correct direction was varied randomly between trials. Additionally, a pseudo-random offset of ±10° was added to both the reference and the test sound event in each trial.

The mean MAA was calculated from the last five reversal values in the two adaptive tracks for each participant. The stationary MAA averaged across participants was found to be 1.30°. The participant averages together with the mean MAA are displayed in Fig. 1. The value found here corresponds to the threshold angle at the 79.4% correct response rate (Levitt, 1971). This value is in line with previous literature on loudspeaker-based MAA experiments (Mills, 1958) and confirms that our headphone-based reproduction system is capable of producing binaural sounds with similar localization resolution.

Fig. 1.

Mean minimum audible angle for a stationary listener. Each participant's angle is based on the average of the last five reversal levels in two adaptive tracks. The whiskers denote the 95% confidence interval of the mean.


Experiment II establishes an estimate of the self-translation induced minimum audible angle through two two-alternative forced choice (2AFC) discrimination tasks in which either the listener or the source translates across space.

In total, 24 people participated in experiment II. These participants did not take part in experiment I. They were screened for hearing impairments by standard pure-tone audiometry, and all provided written informed consent to participate in the study. Out of the 24 participants, five were excluded from the final analysis due to missing a control condition, which is defined in Sec. 3.3. The remaining participant pool comprised 4 females and 15 males with an average age of 30.1 years (SD = 6.0).

The stimulus was similar to that of experiment I: pink noise rendered to headphones by the parametric binaural renderer. In contrast to experiment I, the sound event was continuous. To introduce onset localization cues, which are known to be critical, the pink noise was pulsed with a pulse duration of 100 ms and an interval of 300 ms. In addition, the spatial resolution was examined by rendering the sound events at distances from 1 to 10 m from the listener. Distance to the source served as a proxy for angular resolution: the farther the source, the smaller the effect of a given listener translation on the rendered signals' localization cues. The participant's head was tracked in 6 DoF, and the rendering adapted in real time to positional and rotational changes. The signal level was kept constant regardless of distance to avoid possible degradation of angular localization cues due to reduced loudness.
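
A minimal sketch of such a stimulus generator follows. The synthesis method is an assumption (the text does not state how the pink noise was produced), and the 300 ms interval is read here as the silent gap between pulses; onset-to-onset spacing would be an equally plausible reading.

```python
import numpy as np

def pulsed_pink_noise(duration_s, fs=48000, pulse_s=0.1, gap_s=0.3):
    """Pink noise (equal energy per octave) gated into 100 ms pulses.

    Pink noise is approximated by shaping white Gaussian noise with a
    1/sqrt(f) magnitude spectrum, which yields a 1/f power spectrum.
    """
    n = int(duration_s * fs)
    spectrum = np.fft.rfft(np.random.randn(n))
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    freqs[0] = freqs[1]                       # avoid dividing by zero at DC
    noise = np.fft.irfft(spectrum / np.sqrt(freqs), n)
    noise /= np.max(np.abs(noise))            # normalize to full scale
    t = np.arange(n) / fs
    gate = (t % (pulse_s + gap_s)) < pulse_s  # 100 ms on, 300 ms off
    return noise * gate
```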

The visual scene on the head-mounted display (HMD) showed a sky-box rendered at infinite distance, which did not react to positional changes. Vertical pillars denoted the end-points of the lateral translation range and the direction the participant should face; a further pillar marked the participant's position within the range, and a virtual carpet denoted the area where the participant was allowed to move. The visual scenery was selected to remove any real-world visual cues about the size of the space and to help the participants imagine distant sound events.

The translating listener session presented the participants with a 2AFC task where the goal was to find the sound event that was stationary in the virtual reality instead of following the participant's translations. The task was implemented with a ±0.25 m lateral translation range, with the sound event rendered at the center of the range at distances from 1 to 10 m at 1-m intervals. The allowed lateral movement range was displayed visually in the HMD, and the participant received continuous visual feedback on their location within the range. The participant stood and either swayed slightly or took small steps sideways. As the participant translated within the range, the sound event was either rendered to be stationary in the virtual world (condition A), with the ILD, ITD, and spectral cues updated correspondingly, or rendered always at the lateral location of the participant's head (condition B), with ILD = 0 and ITD = 0 irrespective of the listener's absolute lateral position, which resulted in the perception of an internalized or centrally located auditory event. In both conditions, head rotations were rendered naturally; only self-translation produced rendering differences between the conditions. The conditions are presented schematically in Fig. 2.
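
Per tracker frame, the difference between the conditions reduces to how the rendered azimuth is derived from the tracked lateral position. The sketch below illustrates this under the simplifying assumption of a listener facing straight ahead; the helper name and default values are illustrative, not taken from the experiment software.

```python
import numpy as np

def rendered_azimuth_deg(listener_x, condition, source_x=0.0, source_dist=5.0):
    """Source azimuth for one tracker frame (hypothetical helper).

    Condition A: the source keeps a fixed world position, so lateral
    self-translation changes the azimuth and hence the rendered ILD/ITD.
    Condition B: the source follows the listener's lateral position, so
    the azimuth is always 0 deg (ILD = ITD = 0) and the event is centered.
    """
    if condition == "B":
        return 0.0
    return np.degrees(np.arctan2(source_x - listener_x, source_dist))

# At the range end-point (0.25 m) and the threshold distance found below,
# rendered_azimuth_deg(0.25, "A", source_dist=4.33) is about -3.3 deg.
```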

Fig. 2.

Two different sessions in experiment II: the translating listener session and the translating source session. Both include two conditions, which result in matching binaural signals between the sessions. Condition A in both sessions results in the perception of a dynamic auditory event that either reacts to self-translation or translates itself, whereas condition B results in the perception of a static auditory event located at the center of the head.


The participant controlled the playback via a hand-held controller, which they could use to switch between the two conditions as many times as desired. The only way to discriminate the two sound events was to translate laterally within the given range (±0.25 m) and listen to both options. The time to complete each trial was not limited. After each trial, the system provided visual feedback indicating whether the response was correct.

The translating source session was the opposite of the translating listener session. Here, the participant was seated, and the sound event either translated within a ±0.25 m translation range (condition A) or remained stationary (condition B) at distances from 1 to 10 m. The participants were instructed to minimize their head movements, but the head was not fixed. The source translation was a periodic oscillation between the range end-points. The task was a similar 2AFC discrimination task in which the participant was required to detect which event was translating. They received visual feedback after each response indicating whether it was correct.
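
The source trajectory in condition A of this session can be sketched as below; the text does not state the waveform or rate of the oscillation, so the sinusoid and the 4 s period are assumptions for illustration only.

```python
import numpy as np

def source_x_at(t_s, amplitude_m=0.25, period_s=4.0):
    """Lateral source position for the translating source session,
    condition A: a smooth periodic oscillation between the +/-0.25 m
    end-points (assumed sinusoid with an assumed 4 s period)."""
    return amplitude_m * np.sin(2.0 * np.pi * t_s / period_s)
```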

The two contrasting sessions (translating listener and translating source) produced similar audio signals at the ear canals, with the only difference being the participant self-translation or the lack thereof. In both sessions, the trial at each distance was repeated four times by each participant, resulting in 40 trials per session. The order of the sessions was counterbalanced, and the order of trials within a session was pseudo-random to reduce learning effects. There were four practice trials in both sessions with a visual cue of the sound event location. The visual cue was a green sphere rendered stereoscopically at eye level and at the same distance as the sound event. The practice trials spanned the distance range from 1 to 10 m. During practice, it was verified that every participant could perceive the difference at the 1 m distance. In the analysis, the 1 m condition was used as a control condition, and missing it in either session was grounds for excluding the participant.

In total, 19 participants were included in the final analysis. Each participant's correct answers were counted for each distance in the two sessions, and the probability of finding the target sound event was modeled by a Weibull psychometric function. The results of fitting the functions to the data from the two sessions are displayed in Fig. 3 together with the average probability of finding the target by distance.
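
A sketch of such a fit is given below. The exact parameterization used in the study is not stated, so a common two-parameter Weibull form with a 0.5 guess rate, decreasing with distance, is assumed here, with scipy's curve_fit standing in for the fitting procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

def weibull_2afc(d, scale, shape):
    """Probability of a correct 2AFC response as a function of source
    distance d: falls from 1.0 toward the 0.5 guess rate as d grows."""
    return 0.5 + 0.5 * np.exp(-(d / scale) ** shape)

def threshold_distance(distances, p_correct, level=0.794):
    """Fit the psychometric function and invert it at the 79.4% correct
    level targeted by a 3-down, 1-up procedure."""
    (scale, shape), _ = curve_fit(weibull_2afc, distances, p_correct,
                                  p0=[5.0, 2.0])
    # Invert: level = 0.5 + 0.5 * exp(-(d/scale)**shape)
    return scale * (-np.log(2.0 * level - 1.0)) ** (1.0 / shape)
```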

Fig. 3.

(Color online) Psychometric functions for the translating source and translating listener sessions modeled by Weibull functions. The data points are the average of each participant's average of four trials at each distance for the ±0.25 m lateral translation range. The whiskers denote the 95% confidence interval of the mean.


Figure 3 shows a clear discrepancy between the translating listener and translating source sessions in the probability of differentiating the target sound event. The threshold distance for the 79.4% correct response level in the translating listener session was found to be 4.33 m, with bootstrapped (10 000 repetitions, p = 0.05) confidence interval bounds at 3.99 m (lower) and 5.19 m (upper). With the ±0.25 m lateral translation range, this distance corresponds to a minimum audible angle of 3.3°, with lower and upper confidence bounds of 2.8° and 3.6°, respectively. The translating source session results remain above this probability level at distances up to 10 m from the listener, so the minimum audible translation angle cannot be conclusively determined from our data, but the value appears to be in the range of the stationary listener MAA.
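
The geometry linking threshold distance and angle is a simple arctangent of the lateral end-point offset over the source distance; the few lines below reproduce both the 3.3° figure above and the 11.02 m prediction used in the next paragraph.

```python
import numpy as np

def maa_from_distance(dist_m, offset_m=0.25):
    """Angle subtended at the listener by the 0.25 m lateral end-point
    offset at a given source distance."""
    return np.degrees(np.arctan(offset_m / dist_m))

def distance_from_maa(maa_deg, offset_m=0.25):
    """Inverse mapping: the critical distance at which a given angle is
    still resolvable for the same lateral range."""
    return offset_m / np.tan(np.radians(maa_deg))

# maa_from_distance(4.33)  -> ~3.3 deg (the ST-MAA reported above)
# distance_from_maa(1.30)  -> ~11.0 m (prediction from the stationary MAA)
```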

The 1.30° spatial resolution found in experiment I would yield a critical distance of 11.02 m at which one could still perceive a direction-of-arrival change given the same ±0.25 m lateral translation range. This is approximately 2.5 times the distance (4.33 m) found in the translating listener session. Similarly, the translating source session yielded a critical distance for the MAMA estimate beyond 10 m, the furthest distance measured in this study. It can therefore be concluded that the ST-MAA is substantially larger than the stationary MAA or the MAMA. The result is striking given that there was no difference in the audio signals presented at the ear canals between the translating listener and translating source sessions; the difference results from listener self-translation or the lack thereof.

In previous studies, self-movement was found to aid relative judgments of source distances and directions (Genzel et al., 2018; Loomis et al., 1990). The scenario presented in this study is fundamentally different in that there is no known auditory reference against which to judge the stationarity of the auditory event. The listener has to rely on their own internal interpretation of the virtual environment and make an absolute judgment about its properties. The type of motion, lateral translation, may explain the worse performance observed here compared to previous studies with rotational self-movement of the head. Potentially, the proprioceptive and vestibular information for small head rotations is more informative than the somatosensory cues available for whole-body translational movement.

Based on the results presented here, self-translation appears to impair absolute judgments of the stationarity of sound events. Humans have been shown to accept highly unnatural visual cues about spatial dimensions as long as they are consistent with self-translation (Glennerster et al., 2006). A similar mechanism may be at play in the auditory system, ignoring noisy sensory data when self-translation cues strongly favor a specific interpretation. A related point of view is the cue conflict introduced by decoupling the rotation and translation cues in the overall sensory-motor loop. This conflict may contribute to the reduced capability to discriminate the conditions.

A few limitations make our experimental design differ from scenarios encountered in the real world. Sound level attenuation due to distance would most likely further increase the minimum audible angle at longer distances. The sound events had no semantic meaning, which may be a factor in real-world listening. Finally, the translation and rotation cues were decoupled in our experimental setup, which would not happen in reality but is an important consideration when designing spatial audio reproduction for virtual reality.

A headphone-based reproduction system was found to reproduce localization cues with a spatial resolution comparable to that of loudspeaker-based stationary minimum audible angle studies in anechoic chambers. In those studies, the established minimum audible angle is approximately 1°, less than a third of the 3.3° self-translation induced minimum audible angle found here in the horizontal plane in front of the listener.

International Audio Laboratories Erlangen, Germany is a joint institution of the Friedrich-Alexander-University Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits (IIS).

1. Algazi, V. R., Avendano, C., and Duda, R. O. (2001). "Estimation of a spherical-head model from anthropometry," J. Audio Eng. Soc. 49(6), 472–479.
2. Begault, D. R., Wenzel, E. M., and Anderson, M. R. (2001). "Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual speech source," J. Audio Eng. Soc. 49(10), 904–916.
3. Brimijoin, W. O., and Akeroyd, M. A. (2014). "The moving minimum audible angle is smaller during self motion than during source motion," Front. Neurosci. 8, 273.
4. Carlile, S., and Leung, J. (2016). "The perception of auditory motion," Trends Hear. 20, 1–19.
5. Duda, R. O., and Martens, W. L. (1998). "Range dependence of the response of a spherical head model," J. Acoust. Soc. Am. 104(5), 3048–3058.
6. Engel, A. K., Maye, A., Kurthen, M., and König, P. (2013). "Where's the action? The pragmatic turn in cognitive science," Trends Cogn. Sci. 17(5), 202–209.
7. Genzel, D., Schutte, M., Brimijoin, W. O., MacNeilage, P. R., and Wiegrebe, L. (2018). "Psychophysical evidence for auditory motion parallax," Proc. Natl. Acad. Sci. U.S.A. 115, 4264–4269.
8. Glennerster, A., Tcheang, L., Gilson, S. J., Fitzgibbon, A. W., and Parker, A. J. (2006). "Humans ignore motion and stereo cues in favor of a fictional stable world," Curr. Biol. 16(4), 428–432.
9. Levitt, H. (1971). "Transformed up-down methods in psychoacoustics," J. Acoust. Soc. Am. 49(2), 467–477.
10. Loomis, J. M., Hebert, C., and Cicinelli, J. G. (1990). "Active localization of virtual sounds," J. Acoust. Soc. Am. 88(4), 1757–1764.
11. Mills, A. W. (1958). "On the minimum audible angle," J. Acoust. Soc. Am. 30(4), 237–246.
12. Niehorster, D. C., Li, L., and Lappe, M. (2017). "The accuracy and precision of position and orientation tracking in the HTC Vive virtual reality system for scientific research," i-Perception 8(3), 1–23.
13. Perrott, D. R., and Musicant, A. D. (1977). "Minimum auditory movement angle: Binaural localization of moving sound sources," J. Acoust. Soc. Am. 62(6), 1463–1466.
14. Perrott, D. R., and Saberi, K. (1990). "Minimum audible angle thresholds for sources varying in both elevation and azimuth," J. Acoust. Soc. Am. 87(4), 1728–1731.
15. Speigle, J., and Loomis, J. (1993). "Auditory distance perception by translating observers," in Proceedings IEEE Symposium on Research Frontiers in Virtual Reality, San Jose, CA, pp. 92–99.