Listeners' head movements were measured during speech recognition with simultaneous maskers. Both the target and masker were behind the listener, separated by 30°. Frequent head turns during speech recognition were observed for four of the ten listeners. For those four, head turns were more frequent at lower target-to-masker ratios (TMRs) and were oriented toward the target speech source. When the masker was competing speech, the final head orientation angle was larger at lower TMRs. The observed head movements are not consistent with a strategy that maximizes either the target level at a single ear or the binaural unmasking of speech.

A target speech stream presented with simultaneous maskers can be made more intelligible by spatially separating the target and masker sources (e.g., Hirsh, 1950). This spatial release from masking has been studied extensively in the past with the listener's head orientation fixed during the stimulus presentation. In everyday listening situations, however, listeners are free to move their heads. The current study focuses on situations in which the target and masker sources are both located behind the listener and investigates the characteristics of head movements during a speech recognition task.

Relatively few studies have investigated how listeners move their heads in multi-source environments (e.g., Brimijoin et al., 2012; Grange and Culling, 2016). These previous studies suggest that (1) not all listeners move their heads during speech recognition under masking when not directed to do so, and (2) undirected head movements may not optimize the acoustic input at one or both ears to gain a release from masking. With regard to the second point, the comparison was with two previously proposed strategies: maximization of the target level at a single ear and optimization of the binaural spatial release from masking (Brimijoin et al., 2012). Here we extend these studies to explore the prevalence and features of head movements when the target and masker are both presented behind the listener. Additionally, unlike previous studies, in the current study the speech targets are of relatively short duration (∼3 s), both noise (energetic) and speech (informational) maskers are tested, and there is uncertainty as to the target location. The results indicate differences in head-turning behavior depending on both the type of masker tested and the target-to-masker ratio (TMR). Moreover, those listeners who turned their heads do not appear to have done so only to alter the acoustic inputs at the ears.

Ten listeners (2 male), ranging in age from 18 to 24 years, participated. All but two listeners had absolute thresholds for pure tones of 15 dB hearing level (HL) or lower at audiometric frequencies between 250 and 8000 Hz in both ears. S1 had a threshold of 20 dB HL at 6000 Hz in the left ear, and S9 had thresholds of 25 dB HL at 6000 Hz in both ears. Across the frequencies of 500, 1000, 2000, and 3000 Hz, there were only four occasions on which the absolute thresholds differed across ears by more than 5 dB: at 2000, 500, 3000, and 500 Hz for S1, S6, S7, and S9, respectively, with an advantage for the right ear in all cases but S7. Listeners were drawn from the University of California, Irvine student population and were reimbursed for participation.

The target stimuli were from the coordinate response measure corpus (Bolia et al., 2000) in which sentences were of the form “Ready (call sign) go to (color) (number) now.” The corpus included eight talkers, four male and four female, eight call signs, four colors, and eight numbers, for a total of 2048 sentences. The average duration of each sentence was approximately 1.5 s. To provide sufficient time for head movements, the target speech on each trial was the concatenation of two unique sentences spoken by the same talker. The first sentence of the target stimulus (Target St1) always had the call sign “Baron,” while the second target sentence (Target St2) could have any of the eight call signs. The call sign in Target St2 was not limited to Baron so that the identity of the target was only provided by the call sign in Target St1. The listeners' task was to indicate the color and number for Target St2.

During the experiment, the target sentences and maskers were presented simultaneously. The masker was either a steady-state noise or competing sentences, in separate conditions. In the noise-masker condition, the masker was spectrally shaped to match the long-term spectrum of the speech material and was gated on at the onset of Target St1 and off at the offset of Target St2 using 10-ms raised-cosine ramps. In the speech-masker condition, the masker was constructed in the same manner as the target but with additional constraints such that (1) the first masker sentence (Masker St1) did not begin with the call sign Baron; (2) the color and number in Masker St1 were different from those in Target St1; and (3) the color and number in Masker St2 were different from those in Target St2. The call signs in Target St1 and Masker St1 were temporally aligned to have a common onset.
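The masker constraints above amount to simple rejection sampling over the corpus. A minimal sketch of how a speech-masker trial could be assembled is given below; the corpus representation (a list of records with talker, call_sign, color, and number fields) and the assumption that the target and masker talkers always differ are illustrative, not taken from the paper.

```python
import random

def pick_sentence(corpus, **required):
    """Draw one sentence record matching the required keyword fields.

    `corpus` is a hypothetical list of dicts with keys 'talker', 'call_sign',
    'color', and 'number', standing in for the CRM files (Bolia et al., 2000).
    """
    pool = [s for s in corpus if all(s[k] == v for k, v in required.items())]
    return random.choice(pool)

def build_speech_masker_trial(corpus):
    """Assemble target and masker sentence pairs under the stated constraints."""
    talker_t, talker_m = random.sample(range(8), 2)   # distinct talkers (assumption)
    # Target: St1 always carries the call sign "Baron"; St2 may carry any call sign.
    tgt1 = pick_sentence(corpus, talker=talker_t, call_sign="Baron")
    tgt2 = pick_sentence(corpus, talker=talker_t)
    # Masker St1: not "Baron", and color/number differ from Target St1.
    while True:
        msk1 = pick_sentence(corpus, talker=talker_m)
        if (msk1["call_sign"] != "Baron"
                and msk1["color"] != tgt1["color"]
                and msk1["number"] != tgt1["number"]):
            break
    # Masker St2: color/number differ from Target St2.
    while True:
        msk2 = pick_sentence(corpus, talker=talker_m)
        if msk2["color"] != tgt2["color"] and msk2["number"] != tgt2["number"]:
            break
    return (tgt1, tgt2), (msk1, msk2)
```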

The experiment was conducted in a sound-deadened, single-walled sound isolation booth (IAC Acoustics, North Aurora, IL; length 3.8 m, width 3.8 m, and height 2.1 m). The walls and ceiling of the booth were treated with panels of 10.2-cm Sonex acoustic foam wedges and grooves, and the floor was covered with carpet. During the experiment, the listener was seated in a non-swiveling chair in the center of the booth. The direction that the chair was facing defined the frontal-center reference (0°) in the azimuthal plane (see Fig. 1). Two loudspeakers were placed behind the listener, 1.5 m from the listener's head, at azimuthal angles of −165° and +165° (boxes in Fig. 1) and approximately at the height of the listener's ears. On each trial the target was presented from one of the two loudspeakers, chosen at random (Fig. 1 shows the case in which the target is presented from the +165° loudspeaker). The azimuthal positions of the loudspeakers were chosen so that they were outside a human's field of view (which typically extends to about ±100°–110° when eye rotations are taken into account). In addition, the two loudspeakers were separated by 30°, which is greater than the average localization error for sound sources behind the listener [about 19°–20°; see Wightman and Kistler (1989) and Brungart et al. (1999)].

Fig. 1.

The locations of the loudspeakers relative to 0° azimuth during the experiment. The target speech was presented from either the −165° or the +165° loudspeaker, randomly determined from trial to trial with equal probability. Only one of these two cases is illustrated here. The filled boxes indicate the loudspeaker presenting the target and the unfilled boxes indicate the loudspeaker presenting the masker. Left: The listener's approximate head orientation before each experimental trial (i.e., the initial head orientation) is illustrated. Right: An example of a positive head turn during the experimental trial is illustrated.


To reduce listeners' reliance on the absolute level of the masker to determine which loudspeaker contained the target, on every trial the masker level was chosen at random between 62 and 68 dB SPL (free-field measurement with a Bruel and Kjaer 2236 sound level meter and a 4144 condenser microphone placed approximately at the location of the listener's midriff). Six TMRs, ranging from −18 to −3 dB in 3-dB steps, were tested, with the TMR drawn at random from trial to trial.
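A brief sketch of the per-trial level assignment is given below, under two labeled assumptions: the masker rove is drawn uniformly and continuously between 62 and 68 dB SPL, and the playback chain is described by a single hypothetical calibration constant relating digital RMS to SPL.

```python
import random
import numpy as np

TMRS_DB = [-18, -15, -12, -9, -6, -3]           # TMRs tested (dB)

def draw_trial_levels():
    """Rove the masker level and draw a TMR; the target level follows."""
    masker_db_spl = random.uniform(62.0, 68.0)  # assumption: uniform, continuous rove
    tmr_db = random.choice(TMRS_DB)
    target_db_spl = masker_db_spl + tmr_db      # target sits below the masker at negative TMRs
    return masker_db_spl, target_db_spl, tmr_db

def scale_to_level(x, desired_db_spl, unit_rms_db_spl=100.0):
    """Scale waveform x so that its RMS corresponds to desired_db_spl.

    unit_rms_db_spl is a hypothetical calibration constant: the SPL produced
    by a unit-RMS signal on this particular playback chain.
    """
    gain_db = desired_db_spl - unit_rms_db_spl
    return x / np.sqrt(np.mean(np.square(x))) * 10.0 ** (gain_db / 20.0)
```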

The listeners entered the darkened testing booth, were seated, and were blindfolded, with the aim of minimizing the influence of visual information on head-turning behavior. Listeners wore a plastic headband on which a motion sensor was secured; the experimenter ensured that the headband was positioned correctly before data collection began. Listeners were informed of the task and given the instruction that "some, but not all, listeners initiate head movements in this task," but that they must remain seated. Note that this instruction indicated to the listener that head movements during the experiment were not prohibited, similar to the instruction adopted by Thurlow et al. (1967) in their localization study. Given these instructions, the head movements observed in the current study cannot be treated as reflecting either strictly undirected or strictly directed behavior (Brimijoin et al., 2012; Grange and Culling, 2016).

During the experiment, the experimenter sat outside the booth and communicated with the listener via an intercom system. Each experimental session started with several practice trials at high TMRs. Once data collection began, the experimenter initiated each experimental trial and the stimulus was presented. After the stimulus presentation, the listener's verbal response (i.e., the color and number associated with Target St2) was recorded by the experimenter using a graphical user interface. On rare occasions (usually within the first several trials of data collection), the listener's response was not one of the 32 possible choices (4 colors × 8 numbers), and the experimenter selected one of the 32 choices using a prescribed random pattern. Data collection was completed in 20 blocks of 30 trials, typically requiring two 1.5-h sessions. Eight blocks were run with the noise masker and 12 blocks with the speech masker. Listeners were randomly assigned to complete either the noise- or speech-masker condition first. A greater number of trials was collected for the speech masker because (a) a shallower psychometric function relating recognition score to TMR was expected for the speech masker than for the noise masker (Brungart, 2001), so smaller errors in the performance scores were desirable, and (b) the additional trials allowed an evaluation of the effects of competing-speech features (e.g., talker gender) on the results (Brungart, 2001).

All stimuli were generated digitally at a sampling rate of 44 100 Hz. Stimuli were delivered to the loudspeakers (model 198-4, Atlas Sound L.P.) from the experimental computer via a sound card (model EMU 0202 USB, Creative Technology Ltd.) and a power amplifier (CP660, Crown International, Inc.). Head-movement data were collected using a motion sensor (model 1044-0, Phidgets, Inc.) attached to a headband. The device measured acceleration and magnetic field in three linear dimensions, and angular velocity around the axes of those dimensions, at a sampling rate of 250 Hz. The recording of the sensor data was synchronized to the stimulus presentation so that the delay from the onset of the sensor data stream to the onset of stimulus presentation was constant. The delay was measured by attaching the sensor to the diaphragm of a loudspeaker connected to the sound card and monitoring the sensor's response to an impulse; the resulting value was used to temporally align the sensor data to the onset of the sound stimulus. The head orientation, in terms of pitch (up/down rotation), roll (shoulder-to-shoulder rotation), and yaw (left/right rotation) angles, was derived from the acceleration and magnetic-field data. The data analysis here focuses on the head orientation angle around the yaw axis.
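Because the sensor reports raw acceleration and magnetic-field vectors rather than orientation angles, the yaw angle must be derived, for example as a tilt-compensated compass heading. The sketch below is a generic reconstruction of that step, assuming a slow, gravity-dominated accelerometer signal and a particular axis convention; it is not the authors' processing code.

```python
import numpy as np

def yaw_from_accel_mag(acc, mag):
    """Tilt-compensated yaw (deg) from accelerometer and magnetometer samples.

    acc, mag: arrays of shape (N, 3) with x, y, z columns. Assumes the
    accelerometer is dominated by gravity (head movements are slow relative
    to 1 g); the actual axis assignment and signs depend on how the sensor
    is mounted on the headband.
    """
    ax, ay, az = acc[:, 0], acc[:, 1], acc[:, 2]
    mx, my, mz = mag[:, 0], mag[:, 1], mag[:, 2]
    pitch = np.arctan2(-ax, np.sqrt(ay**2 + az**2))
    roll = np.arctan2(ay, az)
    # Project the magnetic-field vector onto the horizontal plane.
    bx = (mx * np.cos(pitch)
          + my * np.sin(pitch) * np.sin(roll)
          + mz * np.sin(pitch) * np.cos(roll))
    by = my * np.cos(roll) - mz * np.sin(roll)
    yaw = np.degrees(np.arctan2(-by, bx))
    # A constant offset (e.g., the mean yaw while facing the chair direction)
    # would then be subtracted so that 0 deg corresponds to frontal center.
    return yaw
```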

Because of the orientation of the chair, the listeners spontaneously oriented their heads to an azimuth angle that was close to 0° before each trial, and the two loudspeakers could be assumed to be symmetrically located about the listeners' heads (the left panel of Fig. 1). The average head orientation during the 400-ms period before the onset of the sound stimulus (i.e., the initial head orientation) was computed for each trial and the trials with initial head orientation beyond ±10° were discarded from the analyses of the head movements (approximately 20% of all trials). This was implemented to ensure that the locations of the loudspeakers relative to the listener's initial head orientation were reasonably consistent across trials.
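A short sketch of this screening step follows, assuming each trial provides a yaw trace sampled at 250 Hz with at least 400 ms of data preceding stimulus onset; the array layout and names are illustrative.

```python
import numpy as np

FS = 250          # sensor sampling rate (Hz)
PRE_WIN_S = 0.4   # 400-ms window preceding stimulus onset

def initial_orientation(yaw_pre_onset):
    """Mean yaw (deg) over the 400 ms immediately preceding stimulus onset."""
    n = int(PRE_WIN_S * FS)
    return float(np.mean(yaw_pre_onset[-n:]))

def keep_trial(yaw_pre_onset, limit_deg=10.0):
    """Retain a trial only if the initial head orientation is within +/-10 deg."""
    return abs(initial_orientation(yaw_pre_onset)) <= limit_deg
```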

The occurrences of head turns were estimated for each listener and each of the experimental conditions (2 masker types × 6 TMRs). A head turn was identified if the head orientation angle deviated from the initial head orientation of that trial by more than 10° (in either the positive or the negative direction) for 25 ms or longer. For the speech masker, the latencies of the head turns were referenced to the common onset of the call signs of Target St1 (Baron) and Masker St1. For the noise masker, the latencies were referenced to the sound onset. Thus, in both cases the 0-s latency indicates the point at which the target location begins to be identifiable.
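A minimal sketch of this head-turn criterion is shown below, with the latency taken as the first sample of the first excursion that remains beyond 10° for at least 25 ms; the handling of multiple excursions is a reconstruction rather than the authors' exact rule.

```python
import math

def detect_head_turns(yaw, initial_deg, fs=250, thresh_deg=10.0, min_dur_s=0.025):
    """Return a list of (direction, latency_s) for sustained yaw excursions.

    yaw: trace in degrees; index 0 is the latency reference (call-sign onset
    for the speech masker, sound onset for the noise masker). direction is
    +1 for positive turns and -1 for negative turns (sign convention of Fig. 1).
    """
    dev = [y - initial_deg for y in yaw]
    min_len = math.ceil(min_dur_s * fs)            # samples spanning >= 25 ms
    turns = []
    for sign in (+1, -1):
        run = 0
        for i, d in enumerate(dev):
            run = run + 1 if sign * d > thresh_deg else 0
            if run == min_len:                     # excursion has lasted long enough
                turns.append((sign, (i - min_len + 1) / fs))
                break                              # keep only the first turn per direction
    return sorted(turns, key=lambda t: t[1])       # earliest turn first
```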

Based on the proportion of trials with head turns in the 12 experimental conditions, the ten listeners were classified into two groups using k-means clustering (MATLAB), which yielded four members in the Head Turner (HT) group (S1, S6, S7, and S9) and six in the Non-Head Turner (NHT) group. The left panel of Fig. 2 plots the head-turn proportion for the HT (triangles) and NHT (circles) groups for the speech (unfilled) and noise (filled) maskers. On average, the HT listeners initiated head turns on approximately 68% of trials, while the NHT listeners turned their heads on 11% of trials. The left panel of Fig. 2 also illustrates the effects of TMR and masker type on the head-turn proportion. A mixed-design analysis of variance (ANOVA) revealed a significant effect of TMR [F(2.01, 16.04) = 9.22, p = 0.002, Greenhouse-Geisser corrected], indicating that head turns were more frequent at lower TMRs. The effect of masker type was not significant [F(1, 8) = 0.25, p = 0.633], nor was there a significant interaction between TMR and masker type [F(5, 40) = 1.56, p = 0.192]. As expected, the effect of group was significant [F(1, 8) = 38.89, p < 0.001]. However, listener group did not interact significantly with TMR [F(5, 40) = 1.70, p = 0.156] or masker type [F(1, 8) = 0.03, p = 0.873].
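The grouping can be reproduced with a two-cluster k-means on each listener's 12-dimensional vector of head-turn proportions (2 masker types × 6 TMRs). The sketch below uses scikit-learn rather than MATLAB's kmeans and is therefore an equivalent reconstruction, not the authors' script.

```python
import numpy as np
from sklearn.cluster import KMeans

def classify_listeners(head_turn_props):
    """Classify listeners into HT and NHT groups.

    head_turn_props: array of shape (n_listeners, 12) holding the proportion
    of trials with a head turn in each masker-type x TMR condition.
    Returns a boolean array that is True for Head Turner (HT) listeners.
    """
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(head_turn_props)
    # Label the cluster with the larger mean head-turn proportion as HT.
    cluster_means = [head_turn_props[km.labels_ == k].mean() for k in (0, 1)]
    ht_cluster = int(np.argmax(cluster_means))
    return km.labels_ == ht_cluster
```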

Fig. 2.

Left: The average head-turn proportion as a function of TMR. Right: The average recognition scores for identifying both the color and number correctly (in RAUs) as a function of TMR. The horizontal dashed line indicates chance performance. In each panel, results are shown for the two groups (HT: triangles, NHT: circles) and the two masker types (speech: unfilled; noise: filled).


Concerning the behavioral results, the right panel of Fig. 2 plots the average recognition score for identifying both the color and number correctly as a function of TMR for the two groups (HT and NHT) and the two masker types (speech and noise). The performance scores are based on proportion correct converted to rationalized arcsine units (RAUs; Studebaker, 1985). These data are in agreement with the results of Brungart (2001), except that the proportion correct with a noise masker is slightly higher in the current data. This difference likely reflects a spatial release from masking in the current experiment and potentially the concatenation of two sentences, which provided listeners with more time to develop the target and masker streams. Although not detailed here, the results were also in accord with Brungart (2001) in that, when the masker was speech, recognition scores were higher when the target and masker were spoken by talkers of different genders than by talkers of the same gender. A mixed-design ANOVA indicated significant main effects of TMR [F(5, 40) = 128.66, p < 0.001] and masker type [F(1, 8) = 9.10, p = 0.017] on recognition scores, as well as a significant interaction between TMR and masker type [F(5, 40) = 69.02, p < 0.001]. Neither the group effect [HT versus NHT, F(1, 8) = 1.23, p = 0.299] nor any of the remaining interactions approached significance. Thus, although the average scores showed a slight performance advantage for the NHT listeners over the HT listeners, this difference did not reach statistical significance.
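For reference, the rationalized arcsine transform maps a score of X correct out of N trials onto an approximately variance-stabilized scale running from about −23 to 123 RAU; a direct implementation of the published transform is sketched below.

```python
import numpy as np

def rau(x_correct, n_trials):
    """Rationalized arcsine units (Studebaker, 1985) for x_correct of n_trials."""
    theta = (np.arcsin(np.sqrt(x_correct / (n_trials + 1.0)))
             + np.arcsin(np.sqrt((x_correct + 1.0) / (n_trials + 1.0))))
    return (146.0 / np.pi) * theta - 23.0
```

For example, rau(15, 30) returns exactly 50.0; i.e., a score of 50% correct out of 30 trials maps to 50 RAU.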

Brungart (2001) and Arbogast et al. (2002) observed that listeners' incorrect responses included colors and numbers from the speech masker, an effect that will be referred to here as intrusion errors. It is possible that these errors reflect errors of recall (closer to the typical use of "intrusion") rather than errors of encoding. These alternative explanations can be explored here because the target and masker were each the concatenation of two sentences, so errors in memory retrieval could be revealed by responses matching not just Masker St2 but also Target St1 or Masker St1. Intrusion errors were evaluated separately for the identification of the color and the number, and the two showed consistent results. Intrusions from the earlier sentences (i.e., Target St1 and Masker St1) did not exceed chance, while intrusions from Masker St2 did. Overall, about 75% of errors in color identification and 71% of errors in number identification agreed with the color and number of Masker St2. Therefore, intrusion errors were restricted to the sentence simultaneously competing with Target St2, suggesting that they are errors of encoding rather than errors of recall.

For further analyses of head orientation, the orientation angles on trials with the target presented from the −165° loudspeaker were multiplied by −1, assuming symmetric head-turn behavior for the two trial types (−165° targets and +165° targets). After this conversion, all trials can be treated as the case illustrated in Fig. 1, with the target at +165°.

To characterize the observed head movements, the first question is how frequently listeners initiated positive and negative head turns. Negative head turns were much rarer than positive head turns. For the noise masker, the rates of positive versus negative head turns were 62.6% versus 18.1% for the HT listeners and 11.7% versus 3.1% for the NHT listeners; for the speech masker, the corresponding rates were 61.6% versus 25.9% and 8.0% versus 0.8%. Negative head turns typically occurred on trials that also contained positive head turns, and on such trials the negative head turn typically preceded the positive head turn. Because, even for the HT listeners, negative head turns were not observed at all TMRs, the analysis of the prevalence of positive and negative head turns among the HT listeners was based on data collapsed across TMR. A repeated-measures ANOVA was run with masker type and head-turn direction (positive versus negative) as independent variables and head-turn proportion as the dependent variable. As expected, the proportion of positive head turns was significantly larger than the proportion of negative head turns [F(1, 3) = 44.01, p = 0.007]. The effect of masker type was not significant [F(1, 3) = 0.10, p = 0.774], nor was there a significant interaction between the two factors [F(1, 3) = 0.33, p = 0.609].

The left panel of Fig. 3 plots the final head orientation on trials with head turns as a function of TMR, averaged across the four HT listeners. For each listener, the final head orientation was the head orientation angle averaged over the final 400 ms of the stimulus on each trial and then averaged across all trials with head turns in the same masking condition. On the majority of trials with head turns, the HT listeners ultimately oriented their heads toward a positive azimuth angle near the end of the stimulus presentation, i.e., approximately when the color and number of Target St2 were presented. The traces of orientation angle as a function of time for the four HT listeners are provided in the supplementary material.1 A two-way repeated-measures ANOVA revealed no significant effect of masker type or TMR on final head orientation [F(1, 3) = 0.12, p = 0.749; F(5, 15) = 0.31, p = 0.898, respectively], though there was a significant interaction between the two factors [F(5, 15) = 3.34, p = 0.032]. Subsequent post hoc one-way ANOVAs indicated a marginally significant effect of TMR on final head orientation for the speech masker [F(5, 15) = 3.20, p = 0.073, Bonferroni corrected] but not for the noise masker [F(5, 15) = 1.51, p = 0.489, Bonferroni corrected]. Thus, the interaction mainly reflected the effect of TMR on final head orientation for the speech masker.

Fig. 3.

Left: The average final head orientation on trials with head turns as a function of TMR. Right: The average latency for positive head turns as a function of TMR. In each panel, results from the HT group are shown for speech (unfilled triangles) and noise (filled triangles) maskers. Error bars indicate the standard error of the mean.


The right panel of Fig. 3 plots the positive head-turn latency as a function of TMR, averaged across the four HT listeners. The average latency was about 1 s and was consistent for the speech and noise maskers at TMRs above −12 dB. At the lowest TMR (i.e., −18 dB), however, a longer latency was observed for the noise masker than for the speech masker. A two-way repeated-measures ANOVA revealed no significant effect of masker type or TMR on positive head-turn latency [F(1, 3) = 3.12, p = 0.176; F(5, 15) = 0.55, p = 0.736, respectively], though there was a significant interaction between the two factors [F(5, 15) = 4.17, p = 0.014]. A post hoc paired t-test indicated significantly longer positive head-turn latencies for the noise masker than the speech masker at −18 dB TMR [t(3) = 3.81, p = 0.032]. At this lowest TMR, the location of the target speech may have been revealed by the relatively large intensity difference between the target and masker before the onset of the target call sign Baron (i.e., before the 0-s latency), which may have allowed positive head turns to be initiated earlier in the speech-masker condition than in the noise-masker condition. Whatever the case, the HT listeners' relatively rapid and consistent head turns suggest efficient spatial judgments.

The current study investigated head movements during speech recognition with competing maskers when the target speech and the masker were both behind the listener. Four of the ten listeners had head movements that were sufficiently consistent and sufficiently large to be categorized as "head turners"; the remaining six did not. The head-orientation characteristics of the HT listeners, in terms of frequency of occurrence, final head orientation, and head-turn latency, were studied for speech and noise maskers at various TMRs.

It has previously been proposed that listeners may initiate head turns to alter acoustic inputs at the ears. For example, head movements may be used to maximize the signal level received at an ear (Brimijoin et al., 2012). The current data are not consistent with a dependence on just this strategy because various features of the observed head movements depended on masker type, TMR, and/or their interactions, and these factors are not expected to affect the azimuth angle associated with the maximum target level at either ear.

Moreover, the head movements observed in the current study do not seem to serve the purpose of optimizing spatial release from masking. Using a model of binaural speech unmasking (Lavandier and Culling, 2010; Jelfs et al., 2011), the expected speech-intelligibility benefit was calculated from the final head orientation for each listener for the noise masker. The model predicts unmasking from both the effect of head shadow and binaural interactions. The head-related transfer functions (HRTFs) measured by Gardner and Martin (1995) from a KEMAR dummy head under anechoic conditions were used. Although individualized HRTFs were not measured, verification measurements indicated that these published HRTFs produced predictions near the empirically measured TMRs. The model predictions were restricted to the noise masker because the model is not assured to be suitable for speech maskers.
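For illustration of how such a prediction can be set up, the sketch below computes only the better-ear component: the target and noise masker are filtered through head-related impulse responses for their head-relative azimuths, and the benefit of a candidate head orientation is the change in broadband better-ear TMR relative to facing 0°. This is a deliberate simplification, since the Lavandier and Culling (2010) / Jelfs et al. (2011) model also includes a frequency-weighted binaural-unmasking term; the hrir_for loader is a hypothetical placeholder for the Gardner and Martin (1995) KEMAR measurements.

```python
import numpy as np
from scipy.signal import fftconvolve

def wrap_deg(a):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (a + 180.0) % 360.0 - 180.0

def rms_db(x):
    """Broadband RMS level in dB (arbitrary reference)."""
    return 20.0 * np.log10(np.sqrt(np.mean(np.square(x))))

def better_ear_tmr(target, masker, head_deg, hrir_for, tgt_az=165.0, msk_az=-165.0):
    """Broadband TMR (dB) at the better ear for a given head yaw angle.

    hrir_for(azimuth_deg) must return a (left, right) pair of head-related
    impulse responses; here it is a hypothetical placeholder. Positive
    head_deg turns the head toward the +165 deg loudspeaker, as in Fig. 1.
    """
    t_l, t_r = (fftconvolve(target, h) for h in hrir_for(wrap_deg(tgt_az - head_deg)))
    m_l, m_r = (fftconvolve(masker, h) for h in hrir_for(wrap_deg(msk_az - head_deg)))
    return max(rms_db(t_l) - rms_db(m_l), rms_db(t_r) - rms_db(m_r))

def better_ear_benefit(target, masker, final_head_deg, hrir_for):
    """Better-ear benefit (dB) of the final head orientation re: facing 0 deg."""
    return (better_ear_tmr(target, masker, final_head_deg, hrir_for)
            - better_ear_tmr(target, masker, 0.0, hrir_for))
```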

The expected speech-intelligibility benefit at the final head orientation, relative to a head orientation of 0°, was on average approximately −0.45 dB when all listeners were included [t(9) = −3.16, p = 0.012] and −0.94 dB for the HT listeners [t(3) = −9.95, p = 0.002]. Notably, the expected speech-intelligibility benefit was higher for the NHT group than for the HT group [t(8) = 7.90, p < 0.001]. This difference in predicted speech intelligibility may contribute to the slight difference in speech-recognition performance observed between the average scores of the HT and NHT groups (right panel of Fig. 2). From this analysis, it seems that the head movements observed for the four HT listeners are unlikely to have been initiated to optimize spatial release from masking.

The lack of evidence that head turns depended only on acoustic features may reflect the experimental procedure implemented here. First, the target and masker were only 30° apart, leading to somewhat modest TMR variations at either ear as a function of head orientation angle. Second, optimizing speech unmasking may require exploratory head movements across a wide span of angles and over a long period. In the current study, such exploration was limited by the use of the non-swiveling chair and by the uncertainty introduced by the roving masker level and the random target location. Nonetheless, some listeners did move their heads, and some did so consistently. When they did turn their heads, however, the behavior was not consistent with the expectations of acoustic-based optimization.

It is apparent that strategies not directly related to acoustic factors contributed to the initiation of head movements. These may reflect, for example, an effort to direct selective attention to the target source, or a tendency to bring the target source into the field of view to enhance audio-visual speech perception, even though the listeners were blindfolded.

This work was supported by the Summer Undergraduate Research Program at the University of California, Irvine awarded to M.L.F. and NIH Grant No. R21 DC013406 (MPIs: V.M.R. and Y.S.).

1. See supplementary material at http://dx.doi.org/10.1121/1.4976111 (E-JASMAN-141-514702) for the traces of head orientation angle as a function of time for the four HT listeners.

1. Arbogast, T. L., Mason, C. R., and Kidd, G., Jr. (2002). "The effect of spatial separation on informational and energetic masking of speech," J. Acoust. Soc. Am. 112, 2086–2098.
2. Bolia, R. S., Nelson, W. T., Ericson, M. A., and Simpson, B. D. (2000). "A speech corpus for multitalker communications research," J. Acoust. Soc. Am. 107, 1065–1066.
3. Brimijoin, W. O., McShefferty, D., and Akeroyd, M. A. (2012). "Undirected head movements of listeners with asymmetrical hearing impairment during a speech-in-noise task," Hear. Res. 283, 162–168.
4. Brungart, D. S. (2001). "Informational and energetic masking effects in the perception of two simultaneous talkers," J. Acoust. Soc. Am. 109, 1101–1109.
5. Brungart, D. S., Durlach, N. I., and Rabinowitz, W. M. (1999). "Auditory localization of nearby sources. II. Localization of a broadband source," J. Acoust. Soc. Am. 106, 1956–1968.
6. Gardner, W. G., and Martin, K. D. (1995). "HRTF measurements of a KEMAR," J. Acoust. Soc. Am. 97, 3907–3908.
7. Grange, J. A., and Culling, J. F. (2016). "The benefit of head orientation to speech intelligibility in noise," J. Acoust. Soc. Am. 139, 703–712.
8. Hirsh, I. J. (1950). "The relation between localization and intelligibility," J. Acoust. Soc. Am. 22, 196–200.
9. Jelfs, S., Culling, J. F., and Lavandier, M. (2011). "Revision and validation of a binaural model for speech intelligibility in noise," Hear. Res. 275, 96–104.
10. Lavandier, M., and Culling, J. F. (2010). "Prediction of binaural speech intelligibility against noise in rooms," J. Acoust. Soc. Am. 127, 387–399.
11. Studebaker, G. A. (1985). "A 'rationalized' arcsine transform," J. Speech Lang. Hear. Res. 28, 455–462.
12. Thurlow, W. R., Mangels, J. W., and Runge, P. S. (1967). "Head movements during sound localization," J. Acoust. Soc. Am. 42, 489–493.
13. Wightman, F. L., and Kistler, D. J. (1989). "Headphone simulation of free-field listening. II: Psychophysical validation," J. Acoust. Soc. Am. 85, 868–878.
