This study investigates how virtual head rotations can improve a binaural model's ability to segregate speech signals. The model takes two mixed speech sources, spatialized to unique azimuth positions, and localizes them. It then virtually rotates its head to the orientation that maximizes the signal-to-noise ratio (SNR) for extracting the target. An equalization-cancellation approach is used to generate a binary mask for the target based on localization cues. The mask is then overlaid onto the mixed signal's spectrogram to extract the target from the mixture. Improvements in SNR from head rotation can exceed 30 dB.

Several studies have found that human listeners can improve their performance in auditory tasks by utilizing head movements. Perrett and Noble (1997) showed that head movements can resolve front-back confusions when localizing a sound source. Bronkhorst and Plomp (1988) suggested that speech reception thresholds were lowest for sound sources at angles that offer a “better-ear advantage,” resulting in improved signal-to-noise ratios (SNRs). It therefore stands to reason that listeners might use head movements to achieve optimal performance in challenging auditory environments. However, most binaural models do not include head movements. For this reason, a binaural model was developed that is free to virtually rotate its head in order to optimize its own performance. The model makes quantitative predictions, both in terms of angular head turns and the resulting SNR, that could be compared to behavioral studies in the future to determine whether listeners use head movements to maximize their SNR via this better-ear advantage.

Cherry (1957) coined the term “the cocktail party effect” to describe the human ability to segregate speech from a mixture of concurrent speakers. Bregman (1990) summarized a series of related theories—some of which attempt to explain cocktail party processing—under the general term auditory scene analysis (ASA). Computational auditory scene analysis (CASA) attempts to have a computer perform ASA tasks as a human would—a goal that has proven difficult to achieve. The CASA model presented here is a source-blind localization and segregation model that performs cocktail party processing; it optimizes its performance via virtual head rotation.

The goal of this study was to investigate whether a traditional binaural segregation model could be extended with head movements to improve its performance, or whether a completely new architecture would be required. We used an equalization-cancellation-based source segregation model to separate two sound sources using a binary mask. Two anechoic sound sources are spatialized using head-related transfer functions (HRTFs), with each source's characteristic interaural time difference (ITD) and interaural level difference (ILD) cues serving as segregation criteria.

The model consists of several stages, as shown in Fig. 1: head movement-assisted localization, head rotation, and segregation. All signals and filters used in this model were sampled at 44.1 kHz with a bit depth of 16 bits. For time segmentation, overlapping 45-ms Hanning windows were used, with 22 ms of overlap between neighboring time windows. For frequency segmentation, a gammatone filter bank was used [Hohmann implementation from the Auditory Modeling Toolbox (Søndergaard and Majdak, 2013)], which allows for linear frequency resynthesis. The filter bank consists of forty bands spanning 20 Hz to 20 kHz, each with a bandwidth of one equivalent rectangular bandwidth.
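For illustration, the following is a minimal sketch of how the time-segmentation stage described above could be implemented, assuming the stated 44.1-kHz sampling rate, 45-ms Hanning windows, and 22-ms overlap; the gammatone filter bank itself (taken from the Auditory Modeling Toolbox in the original model) is not reproduced, and the function names are placeholders rather than the authors' code.

```python
import numpy as np

FS = 44100                        # sampling rate (Hz), as stated in the text
WIN_LEN = int(0.045 * FS)         # 45-ms Hanning window
HOP = WIN_LEN - int(0.022 * FS)   # 22 ms of overlap between neighboring windows


def time_segment(x):
    """Split a mono signal into overlapping Hanning-windowed frames."""
    window = np.hanning(WIN_LEN)
    n_frames = max(1, 1 + (len(x) - WIN_LEN) // HOP)
    frames = []
    for m in range(n_frames):
        seg = x[m * HOP:m * HOP + WIN_LEN]
        seg = np.pad(seg, (0, WIN_LEN - len(seg)))  # zero-pad the final frame if short
        frames.append(seg * window)
    return np.array(frames)
```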

Fig. 1. A signal flow diagram of the model.

The model takes a mixture of two spatialized voice signals as its input; however, the architecture allows it to process more simultaneous sources. The voice signals are anechoic mono segments of male and female speech, taken from the “Music for Archimedes” recordings (Bang & Olufsen, 1992); the waveform of each is shown in the first panel of Fig. 2. Both signals s1 and s2 are adjusted to have the same root-mean-square (rms) pressure; the spectrograms of both signals are shown in the second and third panels of Fig. 2. The signals are spatialized via convolution using a library of binaural HRTFs (Gardner and Martin, 1995), set to the same elevation (0°) but unique azimuth angles, denoted θ1 and θ2. After convolution, the two left channels and two right channels are added together to create a stereo mixture of the speech signals.
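As a rough sketch of this spatialization and mixing step (assuming the HRIR pairs have already been loaded as NumPy arrays; the function names and data layout are illustrative, not the authors' implementation):

```python
import numpy as np
from scipy.signal import fftconvolve


def spatialize(source, hrir_left, hrir_right):
    """Convolve a mono source with a left/right HRIR pair -> (left, right) channels."""
    return fftconvolve(source, hrir_left), fftconvolve(source, hrir_right)


def mix_sources(s1, s2, hrirs1, hrirs2):
    """Spatialize two equal-rms sources and sum them into one stereo mixture."""
    l1, r1 = spatialize(s1, *hrirs1)
    l2, r2 = spatialize(s2, *hrirs2)
    n = min(len(l1), len(l2))                      # trim to a common length
    return np.stack([l1[:n] + l2[:n], r1[:n] + r2[:n]])
```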

Fig. 2. The model's EC process, with the target at +30° and the masker at −45°. The top panel shows the male (black) and female (grey) voice signals; the second and third panels show spectrograms of the female and male voice signals, respectively; the fourth shows the binary mask for extracting the male voice; and the fifth shows a spectrogram of the reconstructed male voice.

The mixed stereo signal is sent to the localization algorithm, which time-segments the signal. The angle of a source contained within a given time slice is determined using an interaural cross-correlation (ICC) model (Braasch et al., 2013). Source angles are determined from ITD cues via ICC functions in bands ranging from 20 Hz to 1.5 kHz, and from ILD cues for frequencies from 1.5 kHz to 20 kHz. The ICC functions are remapped from ITD to azimuth using the methods of Braasch et al. (2013). Note that the sound sources are localized from isolated segments before their mixture is segregated, under the assumption that in a real-world scenario the auditory system would search for isolated segments of each speaker in order to localize them. To resolve cones of confusion in the ITD analysis, the model virtually rotates its head according to Braasch et al. (2013), modifying its internal representation of the sources by 30° while holding the external representation of the sources constant. The model combines the two measured ITD functions such that the true ITD peaks align, whereas the peaks belonging to the cones of confusion before and after head rotation do not. The algorithm successfully localizes a single source in all presented cases, as demonstrated by Braasch et al. (2013). For multiple sources, the model must be told how many sources are present. The localization algorithm yields two angles and passes these back to the model.
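As an illustration of the ITD portion of this stage, the sketch below estimates the ITD of one band-limited time slice from the peak of an interaural cross-correlation function; the remapping from ITD to azimuth, the ILD analysis above 1.5 kHz, and the head-rotation disambiguation follow Braasch et al. (2013) and are not reproduced here. The ±1-ms search range is an assumption.

```python
import numpy as np


def estimate_itd(left, right, fs=44100, max_itd=0.001):
    """Return the ITD (in seconds) whose lag maximizes the interaural cross-correlation."""
    max_lag = int(max_itd * fs)            # limit the search to physiologically plausible lags
    lags = np.arange(-max_lag, max_lag + 1)
    icc = [np.dot(left[max(0, -k):len(left) - max(0, k)],
                  right[max(0, k):len(right) - max(0, -k)]) for k in lags]
    return lags[int(np.argmax(icc))] / fs
```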

At this point, the model knows the angles of the sources, θ1 and θ2, as well as the angle between them, Δθ = θ2 − θ1. This is checked against a library, labeled the Head Rotation Library in Fig. 1, that finds the best angle for segregating one source from the other (given a constant Δθ) by maximizing the SNR. This library was generated by running the inverse filter approach on pink noise signals, using multiple HRTF catalogs with the angles of the two sources known and their respective output levels measured in dB-rms. The two pink noise sources were passed independently through the inverse filter process as described in Sec. 2.4. For example, one noise signal was spatialized to 30° and then passed through the model for cancellation at 30°, while a second noise signal was spatialized to −45° and then independently passed through the model for cancellation at 30°. The output SNR was measured as

$\mathrm{SNR}(\theta_1, \theta_2) = 10 \times \log_{10}\left(\dfrac{E_{\text{extracted target}}}{E_{\text{residual masker}}}\right)$,
(1)

where E denotes the energy of the target or masker after each has been deconvolved with the respective HRTFs in the source segregation process. The model rotates its head such that the desired (target) and distractor (masker) speech signals are placed at angles θ1* and θ2* (satisfying the requirement that θ2* − θ1* = Δθ) that maximize the SNR after cancellation. The resulting improvement is calculated as

$\Delta\mathrm{SNR} = \mathrm{SNR}(\theta^{*}_{1,2}\ \text{after rotation}) - \mathrm{SNR}(\theta_{1,2}\ \text{before rotation})$.
(2)

The anechoic speech sources are re-convolved at these new optimal angles, simulating the model “turning its head” to reorient the sources, as if a binaural manikin were rotated while the two sources were held stationary at a constant Δθ.
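A hedged sketch of this lookup step is given below, assuming the Head Rotation Library is stored as a dictionary mapping (target angle, masker angle) pairs in degrees to post-cancellation SNR in dB; the data layout and the 5° search step are assumptions, not details from the original implementation.

```python
def best_orientation(theta1, theta2, snr_library, step=5):
    """Return (theta1*, theta2*, delta_SNR) that maximizes SNR at a constant delta-theta."""
    delta_theta = (theta2 - theta1) % 360
    snr_before = snr_library[(theta1 % 360, theta2 % 360)]
    # Search all candidate head orientations while keeping the angular separation fixed.
    candidates = [(t % 360, (t + delta_theta) % 360) for t in range(0, 360, step)]
    best = max(candidates, key=lambda pair: snr_library[pair])
    return best[0], best[1], snr_library[best] - snr_before
```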

The model's segregation method is similar to Durlach's (1963) equalization-cancellation (EC) model, but with the time and level shifts performed via an inverse-HRTF filter. The desired filter is known from the localization algorithm's output. Decorrelated pink noise was added approximately 20 dB below the rms pressure level of the masker to approximate the performance of the model in a more realistic scenario. To extract the target signal, the signals are segmented in time and frequency to create individual time-frequency units. The model's goal is to compensate the target signal via division in the frequency domain. By dividing both the left and right channels in frequency by the HRTF for the angle of the target signal, the target is de-spatialized to the center, creating a diotic signal in the units that contain only the target. Thus, subtracting the right channel from the left leaves zero residual energy, indicating that only the target was present in such a time-frequency unit. However, if the masker is at all present in a time-frequency unit, the two channels will not align in phase and amplitude after inverse filtering, and subtraction will yield some residual energy. The process can be described by the equations

$x_1 = x_L * h_L^{-1}(\theta_{\text{target}})$,
(3)
$x_2 = x_R * h_R^{-1}(\theta_{\text{target}})$,
(4)
$E = \dfrac{(x_1 - x_2)^2}{(x_L)^2 + (x_R)^2}$,
(5)

where $x_L$ and $x_R$ denote the left and right channels that the model reads in, $h_{L,R}^{-1}$ denotes the inverse head-related impulse response of the left and right channels, respectively, E denotes the measured energy in a unit, and the asterisk denotes convolution. These energy values are used to generate a binary mask, shown in the fourth panel of Fig. 2. Ideally, in units where only the target is present, $x_1 = x_2$ and E = 0, as the inverse filter aligns both channels for ideal cancellation. However, the influence of the masker and the noise floor causes units in which the masker is present to retain residual energy, so E is non-zero. If the normalized energy contained in a unit is above a threshold (set to 0.15 for this model), insufficient cancellation occurred and the masker is present; the binary mask is set to 1 for these units. If, on the other hand, there was sufficient cancellation, only the target is present and the binary mask is set to 0 for the unit. The binary mask is then inverted, BM(t, f) = 1 − BM(t, f), so that target-only units are passed. After applying the mask to the mixed signal, the signal is recombined in time and frequency, and only the target signal remains, with the masker signal removed. The output of this resynthesis is shown in the bottom panel of Fig. 2.
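The following sketch illustrates the per-unit decision of Eqs. (3)–(5), with the inverse-HRTF filtering approximated by spectral division; the variable names, the FFT-based deconvolution, and the handling of the normalization are assumptions rather than the authors' exact implementation.

```python
import numpy as np

THRESHOLD = 0.15  # normalized-energy threshold quoted in the text


def unit_mask_value(x_left, x_right, H_left, H_right, eps=1e-12):
    """Return 1 if a time-frequency unit is kept for the target (sufficient cancellation).

    H_left/H_right: target-angle HRTFs sampled at the rfft bins of the unit (assumption).
    """
    n = len(x_left)
    X_L, X_R = np.fft.rfft(x_left), np.fft.rfft(x_right)
    x1 = np.fft.irfft(X_L / (H_left + eps), n)   # Eq. (3): de-spatialize the left channel
    x2 = np.fft.irfft(X_R / (H_right + eps), n)  # Eq. (4): de-spatialize the right channel
    E = np.sum((x1 - x2) ** 2) / (np.sum(x_left ** 2) + np.sum(x_right ** 2) + eps)
    return 0 if E > THRESHOLD else 1             # masker present -> unit rejected (0)
```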

Figure 3 shows the model's performance in SNR [according to Eq. (1)] as a function of head orientation. The horizontal axis shows the location of the target, θ1. Each row depicts a specific angular separation between masker and target (Δθ), denoted in each plot's title. The left graphs show the results for the signals low-pass filtered at 1.5 kHz; the right graphs show the results for the signals high-pass filtered at 1.5 kHz (filtered to account for the ITD envelope). The left-hand vertical axis and solid black line denote the SNR performance after cancellation; the right-hand vertical axis and dashed line denote the absolute value of the difference in level of the target/masker pair between the right and left ears. These measures demonstrate the model's ability to perform cancellation with the first source placed at the angle θ1 on the horizontal axis and the second source placed at θ1 + Δθ. When the sources are directly in front of and behind the head, the model is unable to perform cancellation; the SNR is undefined in these cases and was set to 0 dB in the plots. As the graphs show, there is a correlation between the difference in energy between the two ears and the model's performance at extracting the target signal. After head rotation, output SNRs saw maximal improvements ranging from 16 to 40 dB, depending on experimental conditions. It is worth noting that the SNRs presented in Fig. 3 are taken from the ideal binary mask. To compare these to experimentally computed results, for the test case shown in Fig. 2 (target at +30°, masker at −45°), the SNRs after head rotation were measured at 19 dB and 24.75 dB for the left and right channels, respectively, while the ideal binary mask yields an SNR of 34.13 dB.

Fig. 3. The model's signal-to-noise ratio after the target has been extracted, as a function of head orientation. See text for details.

Without head movement, the model in practice shows performance similar to the binary mask model of Roman et al. (2003). While higher SNRs are sometimes observed in this study, the differences in performance between the two models are likely the result of using different speech signals and different angles from the HRTF catalog.

The EC process used by the model is unusual in that the binary mask is used to isolate the target rather than to cancel the masker. The model's EC process uses a null-antenna approach, exploiting the fact that the two-channel sensor (representing the binaural hearing system) is more effective at rejecting a signal than at filtering and extracting one. As a result, the model's architecture is not limited to two sources. This is why the model generates the mask to remove the target and then inverts it: more than one masker can be rejected by searching, via cancellation, for the time-frequency units in which the target exists. The EC process is further optimized by using inverse HRTF filters and deconvolving the signals, rather than iteratively adjusting time and level cues. Since the sources are localized prior to segregation, the required HRTF filters are already known to the model.

Head rotation significantly improves the model's performance. In localization, it resolves cones of confusion by reinforcing the true ITD peaks and negating the secondary ITD peaks. For segregation, the trends in Fig. 3 demonstrate that the model's performance correlates with the difference in energy between the left and right ears. The model's ability to perform segregation supports the better-ear hypothesis (Bronkhorst and Plomp, 1988): head rotation improves performance by orienting the head to the angle at which the better ear is at a distinct advantage. The model's performance predicts that humans may turn their heads to the angle where the head shadow is greatest in order to optimize source extraction, as suggested by Bronkhorst and Plomp (1988).

This material is based upon work supported by the National Science Foundation (Grant Nos. IIS-1320059 and BCS-1539276), a Rensselaer HASS Fellowship, and is affiliated with the Two!Ears Project FP7-ICT-2013-C, No. 618075 supported by the European Research Council. The authors would like to thank Torben Pastore for his helpful comments.

1. Bang & Olufsen (1992). “Music for Archimedes,” Audio Compact Disk, Bang & Olufsen 101.
2. Braasch, J., Clapp, S., Parks, A., Pastore, T., and Xiang, N. (2013). “A binaural model that analyses aural spaces and stereophonic reproduction systems by utilizing head movements,” in The Technology of Binaural Listening (Springer, Berlin), pp. 201–223.
3. Bregman, A. S. (1990). Auditory Scene Analysis (MIT Press, Cambridge, MA), pp. 1–46.
4. Bronkhorst, A. W., and Plomp, R. (1988). “The effect of head-induced interaural time and level differences on speech intelligibility in noise,” J. Acoust. Soc. Am. 83, 1508–1516.
5. Cherry, E. C. (1957). On Human Communication (MIT Press, Cambridge, MA).
6. Durlach, N. I. (1963). “Equalization and cancellation theory of binaural masking-level differences,” J. Acoust. Soc. Am. 35, 1206–1218.
7. Gardner, W., and Martin, K. (1995). “HRTF measurements of a KEMAR dummy head,” J. Acoust. Soc. Am. 97, 3907–3908.
8. Perrett, S., and Noble, W. (1997). “The contribution of head motion cues to localization of low-pass noise,” Percept. Psychophys. 59, 1018–1026.
9. Roman, N., Wang, D., and Brown, G. J. (2003). “Speech segregation based on sound localization,” J. Acoust. Soc. Am. 114, 2236–2252.
10. Søndergaard, P. L., and Majdak, P. (2013). “The auditory modeling toolbox,” in The Technology of Binaural Listening, edited by J. Blauert (Springer, Berlin), Chap. 2, pp. 33–56.