The authors propose a training program that teaches a listener to quantify the horizontal extension of an auditory image, the auditory source width (ASW). The proposed program controls the ASW of a five-channel sound source by spreading it across five front loudspeakers, displays the corresponding change in visual width, and trains listeners to remember the spread angle through an isomorphic mapping to the corresponding visual cue. To evaluate the efficacy of the training, the authors conducted pre- and post-training tests. The results show that the width judgment error in the post-training test was significantly smaller than that in the pre-training test.
1. Introduction
A technical ear-training program refers to a systematic education program designed to allow acoustic engineers to improve their auditory sensitivity (Iwamiya et al., 2003). In an auditory training context, “sensitivity” refers to a listener's ability to perceptually discriminate and cognitively judge fine differences in auditory objects. The goal of any ear-training program is to assist trainees in acquiring such ability through systematic and technical practice.
A systematic ear-training program was initiated at the Chopin Academy of Music in 1979 (Miśkiewicz, 1992). Since then, various timbre training programs have been introduced for students and junior engineers (Moulton, 1992; Quesnel, 1996; Olive, 2001; Iwamiya et al., 2003; Kim, 2015; Basset and William, 2016).
In contrast, systematic training for spatial attributes has received less attention from researchers, despite the importance of these attributes in sound quality evaluation, because a target attribute is difficult to control. For instance, Morimoto and Iida (1995) found that auditory source width (ASW) is an important characteristic of concert hall acoustics that influences the overall sound quality. In audio production, Kim (2009) found that ASW is one of five salient attributes determining listeners' preference among multichannel microphone techniques.
Tobias Neher at Surrey University first investigated and proposed a spatial ear-training method (Neher, 2004). He focused on the training of ASW in two categories: individual source width and ensemble source width (Rumsey, 2002). Neher's training program adjusted the individual source width using direct sound as well as reflections reproduced from three frontal loudspeakers (using the ITU-R BS.775-1 loudspeaker configuration). For the ensemble source width, Neher adopted a new fourth-order Ambisonic panning law and distributed seven synthesized sound sources (including synthesized early reflections and late reverberation) using the five loudspeakers (Genelec 1302A) of ITU-R BS.775-1.
Although Neher's method has been shown to be scientifically valid and practically applicable, no further research has been conducted on the efficacy of the training. The current authors have modified Neher's ASW training program for a new immersive audio format that incorporates five frontal loudspeakers and have assessed the efficacy of the training through a controlled listening experiment.
2. The ASW training program
ASW is correlated with (1) lateral energy fraction (the ratio of lateral to frontal energy), (2) loudness, (3) inter-aural cross correlation (IACC), and (4) frequency region. The current ASW training program adjusts the perceived width of a five-channel sound source by spreading the source across five frontal loudspeakers (Genelec 8020B, placed at 0°, ±30°, and ±60° from the front of the listening position), which in turn changes the IACC values.
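Because the width control is evaluated through IACC, a minimal sketch of a broadband IACC estimate may help. It assumes the commonly used definition (the maximum absolute value of the normalized interaural cross-correlation within ±1 ms of lag) and is not the authors' measurement procedure, which reports means over one-third octave bands. The function name `iacc` and its parameters are hypothetical.

```python
import math

def iacc(left, right, sample_rate, max_lag_ms=1.0):
    """Broadband IACC sketch: max |normalized cross-correlation| for |lag| <= max_lag_ms.

    Assumption: `left` and `right` are equal-rate ear signals given as lists of floats.
    """
    n = min(len(left), len(right))
    left, right = left[:n], right[:n]
    # Normalization by the geometric mean of the two signal energies.
    norm = math.sqrt(sum(x * x for x in left) * sum(y * y for y in right))
    if norm == 0.0:
        return 0.0
    max_lag = int(round(sample_rate * max_lag_ms / 1000.0))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                acc += left[i] * right[j]
        best = max(best, abs(acc) / norm)
    return best

# Hypothetical example: a short tone burst, identical at both ears except for a
# 10-sample interaural delay (about 0.2 ms at 48 kHz).
burst = [math.sin(2.0 * math.pi * 500.0 * k / 48000.0) for k in range(200)]
left_ear = [0.0] * 20 + burst + [0.0] * 20
right_ear = [0.0] * 30 + burst + [0.0] * 10
print(iacc(left_ear, right_ear, 48000))
```

A pure delay inside the ±1 ms window leaves the two ear signals fully correlated, so the estimate stays at its maximum; decorrelating the ear signals, for example by spreading channels across more loudspeakers, lowers it.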
The angular position of each channel is rendered using a pairwise amplitude-based panning method, and the positions of the two outermost channels determine the width. For example, if the two outermost channels are positioned at ±60°, the corresponding width magnitude is set to 60 to indicate symmetric extension from the center loudspeaker position. The two inner channels are positioned at half of the outermost channel angles, and the center channel is always positioned at 0°.
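The channel layout described above can be sketched as follows. The mapping from width to channel angles follows the text; the panning law itself is not specified in the text, so the constant-power sine/cosine law used here is an assumption, as are the names `channel_angles` and `pairwise_gains`.

```python
import math

# Loudspeaker azimuths in degrees, as given in the text (0°, ±30°, ±60°).
SPEAKER_ANGLES = [-60.0, -30.0, 0.0, 30.0, 60.0]

def channel_angles(width):
    """Target angles of the five channels for a given width magnitude (0 to 60)."""
    return [-width, -width / 2.0, 0.0, width / 2.0, width]

def pairwise_gains(angle):
    """Gains for the five loudspeakers when one channel is panned to `angle`.

    Assumption: constant-power panning between the two adjacent loudspeakers
    that bracket the target angle; the paper does not state the exact law.
    """
    gains = [0.0] * len(SPEAKER_ANGLES)
    angle = max(SPEAKER_ANGLES[0], min(SPEAKER_ANGLES[-1], angle))
    for i in range(len(SPEAKER_ANGLES) - 1):
        left, right = SPEAKER_ANGLES[i], SPEAKER_ANGLES[i + 1]
        if left <= angle <= right:
            frac = (angle - left) / (right - left)
            gains[i] = math.cos(frac * math.pi / 2.0)
            gains[i + 1] = math.sin(frac * math.pi / 2.0)
            break
    return gains

# Width 30: outer channels at ±30°, inner channels at ±15°, center at 0°.
for target in channel_angles(30):
    print(target, [round(g, 3) for g in pairwise_gains(target)])
```

With this law, a channel panned halfway between two loudspeakers feeds both with equal gain, and the summed power stays constant as the width changes.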
The training curriculum consists of matching tasks that ask a trainee to adjust the stimulus width until it matches a reference width. The reference width can vary from 0° (minimum width angle) to 60° (maximum), with a 5° resolution. We designed the program so that the difficulty increases with the training level. At the beginning, the width-variation resolution is 15°, and it is reduced to 10° and then 5° as the training progresses. While a trainee adjusts the width, the program displays visual feedback that (1) assists the trainee in developing an isomorphic mapping between auditory and visual images, and (2) quantifies the deviation between the question and answer width. The isomorphic mapping is essential for long-term memory of the trained skill (Corey, 2016).
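The level structure can be illustrated with a short sketch. It assumes that the selectable widths start at 0°, as in Table 1, and that the feedback deviation is the absolute difference between reference and answer; neither detail is confirmed by the text, and the function names are hypothetical.

```python
def reference_widths(resolution, max_width=60):
    """Selectable widths (degrees) at a given training level resolution.

    Assumption: widths start at 0 degrees and step by `resolution`.
    """
    return list(range(0, max_width + 1, resolution))

def deviation(reference, answer):
    """Deviation between question and answer width (assumed absolute, in degrees)."""
    return abs(answer - reference)

# The three resolutions named in the text, from easiest to hardest.
for resolution in (15, 10, 5):
    print(resolution, reference_widths(resolution))

print(deviation(45, 30))
```

Under these assumptions, the coarsest level offers five candidate widths and the finest level offers thirteen, which matches the thirteen angles used in the tests.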
Three stimuli were prepared for the training: a flute ensemble, a pop-band, and a solo cello. The first two stimuli vary the ensemble source (consisting of five distinct sound objects) width, while the last one varies the individual source (consisting of a single sound object and four channels of reverberation) width. The four channels of reverberation were created by convolving the sound object with four room impulse responses measured at a large church.
We measured the mean IACC at one-third octave bands of the pre/post-test stimuli and checked whether the proposed method rendered a perceptually continuous variation of the ASW. As Table 1 indicates, the mean IACC values vary monotonically with the proposed width control for the flute ensemble and pop-band stimuli. Since a series of previous studies have shown that the IACC correlates with perceived width, this monotonic relationship supports the validity of using a panning method to control perceived width. For the solo trumpet, the IACC values decreased overall, but not monotonically.
Table 1. Mean IACC of the pre/post-test stimuli at each width setting.
Stimulus | 0° | 5° | 10° | 15° | 20° | 25° | 30° | 35° | 40° | 45° | 50° | 55° | 60° |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Flute | 0.744 | 0.741 | 0.704 | 0.684 | 0.660 | 0.643 | 0.626 | 0.612 | 0.602 | 0.578 | 0.566 | 0.566 | 0.561 |
Pop | 0.755 | 0.748 | 0.731 | 0.701 | 0.667 | 0.640 | 0.616 | 0.598 | 0.580 | 0.560 | 0.555 | 0.544 | 0.536 |
Trumpet | 0.756 | 0.726 | 0.700 | 0.701 | 0.702 | 0.709 | 0.716 | 0.709 | 0.729 | 0.728 | 0.693 | 0.684 | 0.673 |
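The monotonicity claim can be checked directly from the tabulated values. The sketch below only re-reads the numbers in Table 1; the helper name `is_non_increasing` is hypothetical.

```python
# Mean IACC values copied from Table 1 (widths 0° to 60° in 5° steps).
IACC_TABLE = {
    "Flute": [0.744, 0.741, 0.704, 0.684, 0.660, 0.643, 0.626,
              0.612, 0.602, 0.578, 0.566, 0.566, 0.561],
    "Pop": [0.755, 0.748, 0.731, 0.701, 0.667, 0.640, 0.616,
            0.598, 0.580, 0.560, 0.555, 0.544, 0.536],
    "Trumpet": [0.756, 0.726, 0.700, 0.701, 0.702, 0.709, 0.716,
                0.709, 0.729, 0.728, 0.693, 0.684, 0.673],
}

def is_non_increasing(values):
    """True if each value is less than or equal to the one before it."""
    return all(b <= a for a, b in zip(values, values[1:]))

for name, values in IACC_TABLE.items():
    total_drop = round(values[0] - values[-1], 3)
    print(name, is_non_increasing(values), total_drop)
```

The flute and pop rows never increase with width (the flute row has one tie, at 50° and 55°), whereas the trumpet row rises between 10° and 30° and again at 40° before falling, and its overall drop from 0° to 60° is the smallest of the three.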
3. Experiment
To study the efficacy of the training program, we conducted an identical test before and after the training and compared listeners' width judgments between the two. In the experiment, three trainee groups were formed: treatment group 1 (T1), treatment group 2 (T2), and the control group (C). The T1 and T2 groups received a total of six training sessions (each session lasted about 40 min), whereas the C group did not participate in training sessions. Each group had ten subjects. The T1 group received the training with a five-loudspeaker array, and the T2 group received the training using a pair of headphones and binaural recordings simulated by CATT Acoustics software. We included the T2 group to test whether the headphone-based simulation would be as effective as the loudspeaker training. If so, it would make the training program accessible to many users without the need for a specific loudspeaker-array arrangement. In addition, we used different stimuli for the tests (a different excerpt of a flute ensemble, a different excerpt of a pop-band, and a solo trumpet) to determine whether the training is effective for non-trained stimuli.
4. Results
The pre- and post-training tests each consisted of a total of 39 questions (13 presented angles × 3 stimuli). From the subjects' answers, we calculated, for each question, the angle deviation of the answer width from the reference width. The total deviation sums of the three groups were 2596 (T1 pre-), 1426 (T1 post-), 3673 (T2 pre-), 2836 (T2 post-), 1764 (C pre-), and 1539 (C post-), respectively. These descriptive statistics indicate that the treatment groups reduced the deviation more than the control group, and thus came to perceive the magnitude of auditory width more accurately.
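The size of each group's improvement follows from the reported sums by simple arithmetic. The sketch below uses only the six totals given above; the function name `reduction` is hypothetical.

```python
# Total deviation sums reported in the text: (pre-training, post-training).
DEVIATION_SUMS = {
    "T1": (2596, 1426),
    "T2": (3673, 2836),
    "C": (1764, 1539),
}

def reduction(pre, post):
    """Absolute and relative (percent) reduction of the deviation sum."""
    absolute = pre - post
    return absolute, 100.0 * absolute / pre

for group, (pre, post) in DEVIATION_SUMS.items():
    absolute, percent = reduction(pre, post)
    print(f"{group}: reduced by {absolute} ({percent:.1f}%)")
```

In relative terms, the T1 group reduced its deviation sum by roughly 45%, the T2 group by roughly 23%, and the C group by roughly 13%.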
We conducted three separate Wilcoxon signed-rank tests on the Timing factor (pre- versus post-training test), one for each group (T1, T2, and C). We used this non-parametric analysis because the collected data did not satisfy the assumptions required for analysis of variance. The results show significant differences for the T1 group (p < 0.0001) and the T2 group (p = 0.0009), but no significant difference for the C group (p = 0.1223).
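For readers unfamiliar with the test, the following is a minimal sketch of a Wilcoxon signed-rank test on paired deviations. It uses the large-sample normal approximation without tie or continuity corrections, which may differ from the statistical software the authors used, and the paired values are invented for illustration only; they are not data from this study.

```python
import math

def wilcoxon_signed_rank(pre, post):
    """Return (W+, z, two-sided p) for paired samples using a normal approximation.

    Zero differences are dropped and tied absolute differences share average ranks.
    """
    diffs = [a - b for a, b in zip(pre, post) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return w_plus, z, p

# Hypothetical paired deviations (degrees) for ten listeners, NOT data from this study.
pre_deviation = [30, 25, 40, 35, 20, 45, 15, 50, 10, 28]
post_deviation = [20, 22, 25, 30, 21, 30, 14, 35, 12, 18]
print(wilcoxon_signed_rank(pre_deviation, post_deviation))
```

With small samples such as this one, an exact test is usually preferable to the normal approximation; the approximation is shown here only because it exposes the Z statistic of the kind reported in Table 2.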
Earlier, we found that the measured IACC values of the individual source differed from those of the two ensemble sources. To determine the influence of source type, we analyzed the data using the Friedman test and found no significant difference between the two ensemble sources (p = 0.92), yet a significant difference between the flute ensemble and solo trumpet (p < 0.0001) as well as between the pop-band and solo trumpet (p < 0.0001). The estimated width magnitudes were thus influenced by the source type.
This led the authors to conduct Wilcoxon signed-rank tests separately for each source width type and each group. Table 2 summarizes the six test results. For the T2 group, the deviation of the estimated individual source width decreased after the training, but that of the ensemble source width did not. In contrast, the T1 group decreased the deviation for both individual and ensemble source width. There was no significant difference between the pre- and post-test in the C group. The results support the speculation that the treatment group subjects improved their width-judging performance more than the control group, although the improvement was influenced by the source width type. Figure 1 illustrates the mean answer deviation of the pre- and post-tests for the ensemble (left panel) and individual (right panel) source.
Table 2. Wilcoxon signed-rank test results (pre- versus post-training test) for each group and source width type.
Group | Source width type | Z | Prob > Z |
---|---|---|---|
T1 | Ensemble | 4.75951 | 1.941 × 10⁻⁶ |
T1 | Individual | 4.81453 | 1.475 × 10⁻⁶ |
T2 | Ensemble | 1.71404 | 0.0865 |
T2 | Individual | 3.31029 | 0.0009 |
C | Ensemble | 0.93931 | 0.3476 |
C | Individual | 1.30683 | 0.1913 |
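The probabilities in Table 2 can be reproduced from the Z statistics if one assumes they are two-sided tail probabilities of the standard normal distribution. That interpretation is an assumption of this sketch (the column heading does not say so explicitly), and the function name `two_sided_p` is hypothetical.

```python
import math

# (group, source width type, Z, reported probability) copied from Table 2.
TABLE_2 = [
    ("T1", "Ensemble", 4.75951, 1.941e-6),
    ("T1", "Individual", 4.81453, 1.475e-6),
    ("T2", "Ensemble", 1.71404, 0.0865),
    ("T2", "Individual", 3.31029, 0.0009),
    ("C", "Ensemble", 0.93931, 0.3476),
    ("C", "Individual", 1.30683, 0.1913),
]

def two_sided_p(z):
    """Two-sided standard normal tail probability for a Z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

ALPHA = 0.05  # conventional significance level, assumed here
for group, source_type, z, reported in TABLE_2:
    p = two_sided_p(z)
    print(f"{group} {source_type}: p = {p:.4g} (reported {reported}), "
          f"significant = {p < ALPHA}")
```

Under this assumption, the recomputed values agree with the reported ones to within rounding, and at the 0.05 level only the two T1 rows and the T2 individual-source row are significant.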
5. Discussion and future work
A new training method that is effective for individual source width is needed. As Fig. 1 reveals, the mean deviations for the individual source width of the solo trumpet were higher than those for the ensemble sources. This is probably due to the similarity of the five channels, which keeps the IACC values high (as listed in Table 1) even at maximum spread.
This spatial ear training is strongly influenced and limited by room acoustics; that is, strong lateral reflections always influence perceived width. We created the headphone version (used by the T2 group) to minimize such influence. However, the training effect in this group was not as strong as that in the loudspeaker-based training group (T1). The authors plan to improve the training effect of the headphone version by (1) extending the effective frequency range of the simulation (in particular, the low-frequency range) and (2) incorporating a head-tracker to provide trainees with a dynamic binaural stimulus.
The authors designed the experiment to observe the learning-related improvement while minimizing the post-training familiarization improvement by using different sound sources for the training and the tests (both pre- and post-test). The experiment could have employed one more control condition, a familiarization group, in which the trainees would complete the given ASW training but not receive any feedback from the program. Comparing the performance of this group with that of the other groups would make it possible to determine the influence of post-training familiarization.
6. Conclusion
This study introduces a new ASW training program and evaluates, through a controlled listening test, whether the program improves listeners' ability to judge perceived width. The results show that width-matching training enables participants to judge a given width with less deviation. Participants who received training with a loudspeaker array performed better than both the group that received training with a pair of headphones and the control group that did not receive any training.