Analysis of pupil dilation has been used as an index of attentional effort in the auditory domain. Previous work has modeled the pupillary response to attentional effort as a linear time-invariant system with a characteristic impulse response, and used deconvolution to estimate the attentional effort that gives rise to changes in pupil size. Here it is argued that one parameter of the impulse response (the latency of response maximum, tmax) has been mis-estimated in the literature; a different estimate is presented, and it is shown how deconvolution with this value of tmax yields more intuitively plausible and informative results.

Pupillometry, the tracking of pupil diameter, has been used to measure attentional effort,1,2 including in the auditory domain.3–5 The pupillary response to transient effort- or load-inducing stimuli is slow, with latency of maximum response on the order of several hundred milliseconds.6,7 However, the pupillary response can be modeled as a linear time-invariant system comprising a train of theoretical “attentional pulses” and a characteristic impulse response approximated by an Erlang gamma function

$$h = t^{n} e^{-nt/t_{\max}} \tag{1}$$

The impulse response h has empirically-determined parameters for the latency of response maximum tmax and the shape parameter of the Erlang distribution n; the latter is proposed to be analogous to the number of steps in the neural signaling pathway transmitting the attentional pulse to the pupil.7 This model allows estimation of the timing and magnitude of the attentional signal by deconvolving the measured pupillary response using the estimated impulse response function as a deconvolution kernel,8 in a method similar to that used in fMRI analysis of the BOLD response. Such techniques are valuable for relating the temporal dynamics of (delayed) physiological responses to the unfolding of stimulus events in time.
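As a concrete illustration of Eq. (1), the following minimal sketch evaluates the kernel on a discrete time axis and checks numerically that its maximum falls at tmax. The parameter values are those reported by Hoeks and Levelt; the function name `pupil_kernel`, the sampling rate, the kernel duration, and the unit-peak normalization are assumptions made for this example only.

```python
import numpy as np

def pupil_kernel(fs, t_max=0.93, n=10.1, dur=4.0):
    """Erlang gamma impulse response h(t) = t**n * exp(-n * t / t_max).

    fs    : sampling rate in Hz (assumed for this sketch)
    t_max : latency of the response maximum, in seconds
    n     : shape parameter of the Erlang distribution
    dur   : kernel duration in seconds (assumed; long enough to cover the tail)
    """
    t = np.arange(0, dur, 1.0 / fs)
    h = t ** n * np.exp(-n * t / t_max)
    # Scale to unit peak; this normalization is a choice made for illustration.
    return t, h / h.max()

t, h = pupil_kernel(fs=1000.0)
peak_latency = t[np.argmax(h)]  # latency (s) of the kernel maximum
```

Setting the derivative of Eq. (1) to zero gives t = tmax, so `peak_latency` should agree with the `t_max` argument to within one sample.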

Hoeks and Levelt have empirically estimated the kernel parameters n = 10.1 and tmax = 0.93 s using both auditory and visual stimuli, but a crucial shortcoming was the inclusion of button-press responses in all trials used for parameter estimation (non-button-press trials were included in their experimental design, but they report pupillary responses to these trials were “too small and noisy for further data analysis”).7 This is problematic in light of recent findings showing that up to 70% of the pupil response can be attributed to preparatory and motor commands in tasks with button presses, with effects beginning as early as 400 ms prior to the button press event.9 In consequence, Hoeks and Levelt's estimate of the latency of response maximum (tmax) may be inappropriate for processing pupillary responses to stimuli that do not involve motor responses. For this reason, we re-estimated tmax for both target (with button press) and non-target (no button press) auditory stimuli (Experiment 1), and show how our estimate of tmax yields better temporal alignment of stimulus and deconvolved pupil response in an auditory attention switching task (Experiment 2), when compared to deconvolution using previous estimates. We expect the improvement in temporal alignment between stimulus and pupil response to be useful in addressing questions related to cognition, listening effort, and auditory attention.

All procedures were performed in a sound-treated booth illuminated only by the LCD monitor on which visual stimuli were presented. Auditory stimuli were delivered over Etymotic ER-2 insert earphones via a TDT RP2 real-time processor (Tucker Davis Technologies, Alachua, FL) at a level of 65 dB sound pressure level (SPL). Pupil size was measured continuously at a 1000 Hz sampling frequency using an EyeLink 1000 infra-red eye tracker (SR Research, Kanata, ON). Participants were seated 50 cm away from the EyeLink camera with their heads stabilized by a chin rest and forehead bar. All participants had normal audiometric thresholds (20 dB hearing level or better at octave frequencies from 250 Hz to 8 kHz), were compensated at an hourly rate, and gave informed consent to participate as overseen by the University of Washington Institutional Review Board.

Experiment 1 tested the pupillary response to a simple auditory target detection task. The aim was to compare pupillary response to non-target tones versus response to target tones (with button press response to the target tones) and estimate the latency of maximum pupil response (tmax). Ten adults (5 female) aged 21 to 35 yrs (mean 26.6) participated in Experiment 1.

To maximize our ability to detect changes in pupil size, we assessed the dynamic range of each participant's pupil, then selected a background gray scale value for the visual display that yielded a resting dilation near the middle of a participant's pupil size range where the pupil's response was steepest, as a safeguard against ceiling effects.10,11 We began by presenting a 10-s rest period comprising a black screen with a centered, dark gray fixation dot (value 0.2 on 0–1 scale; 1 = maximum luminance). Next, a series of monochromatic screens with central fixation dots were presented for 3 s each, with background values ranging from 0 (black) to 0.5 (mid-gray) in 8 exponential (base-2) steps; on each step the luminance value of the fixation dot was 0.2 higher than the background. After reaching the brightest level, the rest period and series of increasing luminance steps was repeated. To choose the best background value, we calculated median pupil size between 1.25 and 3.0 s after each change of screen luminance, averaged those median values across the two repetitions of the calibration sequence, and selected the background value exhibiting the greatest change in pupil size compared to the (darker) level preceding it.
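The selection rule described above (median pupil size in the 1.25 to 3.0 s window, averaged over the two repetitions, then choosing the level with the largest change from the darker level before it) can be sketched as follows. This is an illustrative reading of the procedure, not the code used in the experiment; the function name `choose_background`, the array layout, and the gray levels and pupil sizes in the usage example are all hypothetical.

```python
import numpy as np

def choose_background(levels, pupil, times, window=(1.25, 3.0)):
    """Pick the background level with the largest pupil change from the previous level.

    levels : (n_levels,) background gray values, ordered dark to bright
    pupil  : (n_reps, n_levels, n_samples) pupil size after each luminance change
    times  : (n_samples,) seconds relative to each luminance change
    """
    levels = np.asarray(levels, dtype=float)
    pupil = np.asarray(pupil, dtype=float)
    mask = (times >= window[0]) & (times <= window[1])
    medians = np.median(pupil[:, :, mask], axis=-1)  # shape (n_reps, n_levels)
    mean_size = medians.mean(axis=0)                 # average over repetitions
    change = np.abs(np.diff(mean_size))              # change vs. the darker level
    best = int(np.argmax(change)) + 1                # +1 because diff drops the first level
    return levels[best], mean_size

# Hypothetical calibration data: two identical repetitions of six gray levels.
times = np.arange(0, 3.0, 0.01)
levels = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
true_size = np.array([8.0, 7.8, 7.0, 5.0, 4.2, 4.0])
pupil = np.tile(true_size[None, :, None], (2, 1, times.size))
level, mean_size = choose_background(levels, pupil, times)
```

In this toy data set the steepest drop in pupil size occurs between the third and fourth levels, so the fourth level is selected.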

To determine the pupil response to auditory stimuli, participants were asked to respond by button press to tones with frequency modulation (FM) and ignore constant frequency tones. Steady tones were 1000 Hz with a 10 ms cosine-squared window taper at both ends and a total duration of 100 ms. Target tones had a frequency centered at 1000 Hz that varied sinusoidally with a range of 200 Hz and a period matching the duration of the stimulus, and were otherwise identical to the steady tones. Tones were presented in 4 blocks of 75 stimulus presentations with breaks between blocks; each block began with a 10-s rest period to allow pupil size to stabilize. One-fourth of all tones were target tones, randomly distributed through the task. Inter-stimulus interval was randomly and evenly distributed between 3 and 5 s. Examples of both tone types were played for the listener prior to the task. Three participants repeated the task with standard and target tones swapped, to confirm that pupil responses were insensitive to the small differences between the tone types; swapping target and standard tones had no noticeable effect on pupil responses (these data are not presented).
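For readers wishing to reproduce similar stimuli, a minimal sketch of the two tone types follows. It assumes a 44.1 kHz sampling rate (the actual playback rate is not stated here) and interprets the 200 Hz modulation range as a peak-to-peak excursion, i.e., 900 to 1100 Hz; the function name `make_tone` and its arguments are hypothetical.

```python
import numpy as np

def make_tone(fs=44100, dur=0.1, fc=1000.0, fm_range=0.0, ramp=0.01):
    """Synthesize a tone with optional sinusoidal FM and cosine-squared ramps.

    fm_range is the assumed peak-to-peak frequency excursion in Hz;
    fm_range=0 gives the steady tone.
    """
    t = np.arange(int(round(fs * dur))) / fs
    # One modulation cycle over the stimulus: fc +/- fm_range / 2.
    inst_freq = fc + 0.5 * fm_range * np.sin(2 * np.pi * t / dur)
    phase = 2 * np.pi * np.cumsum(inst_freq) / fs  # integrate frequency to get phase
    tone = np.sin(phase)
    n_ramp = int(round(fs * ramp))
    ramp_up = np.sin(0.5 * np.pi * np.arange(n_ramp) / n_ramp) ** 2
    env = np.ones_like(tone)
    env[:n_ramp] = ramp_up           # onset taper
    env[-n_ramp:] = ramp_up[::-1]    # offset taper (mirror image)
    return tone * env

steady = make_tone()                 # 1000 Hz steady tone
target = make_tone(fm_range=200.0)   # FM target tone
```

Integrating the instantaneous frequency, rather than modulating the phase directly, keeps the frequency excursion equal to the stated range regardless of modulation rate.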

Pupil size measurements were time-aligned to the onset of each tone and epoched from −0.5 to 3.0 s. Pupil size was then baseline-corrected relative to the period from −0.5 to 0.0 s and z-score normalized within each epoch, consistent with Wierda and colleagues' procedure.8 The first epoch of each block was excluded, as were epochs with an incorrect behavioral response (ranging from 2 to 5 across participants), and epochs beginning less than 2.5 s after a button press (10–16 across participants). The total number of trials excluded ranged from 17 to 21 (5%–7%).
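The epoching and normalization steps can be sketched as below. This is one reading of the procedure under stated assumptions: each epoch has the mean of its pre-stimulus baseline subtracted and is then divided by its own standard deviation. The function name `epoch_pupil`, its arguments, and the synthetic recording in the usage example are hypothetical, and the trial-exclusion criteria are not implemented.

```python
import numpy as np

def epoch_pupil(pupil, onsets, fs=1000.0, t_start=-0.5, t_stop=3.0):
    """Cut epochs around tone onsets, baseline-correct, and scale each epoch.

    pupil  : 1-D continuous pupil size recording
    onsets : tone onset times in seconds
    """
    pupil = np.asarray(pupil, dtype=float)
    start = int(round(t_start * fs))
    stop = int(round(t_stop * fs))
    times = np.arange(start, stop + 1) / fs
    epochs = []
    for onset in onsets:
        i = int(round(onset * fs))
        if i + start < 0 or i + stop >= pupil.size:
            continue  # skip epochs that run off the recording
        ep = pupil[i + start:i + stop + 1].copy()
        ep -= ep[times < 0].mean()  # subtract mean of the pre-stimulus baseline
        ep /= ep.std()              # scale by the epoch's own SD (assumed form of z-scoring)
        epochs.append(ep)
    return times, np.array(epochs)

# Synthetic 10 s recording at 100 Hz, used only to exercise the function.
fs = 100.0
rng = np.random.default_rng(0)
recording = np.cumsum(rng.standard_normal(1000))
times, epochs = epoch_pupil(recording, onsets=[1.0, 5.0], fs=fs)
```

Because the standard deviation is unaffected by subtracting a constant, the order of baseline correction and scaling does not change the result in this formulation.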

Plots of pupil response to standard and target tones are shown in Fig. 1. Response to standard tones shows a peak around 0.5 s after stimulus onset, whereas response to target tones shows an early peak around 0.75 s and a larger, later peak around 1.4 s. Differences in both magnitude and peak latency are attributable to the behavioral response (button press) in the target trials; the differences are consistent with previous work showing that, when button press responses occur, up to 70% of the pupillary response is attributable to them.9

Given the simplicity of the stimulus design in this experiment, we can suppose that tmax in the non-target condition [512 ms; Fig. 1(a)] is close to the minimum possible latency for a pupillary change resulting from an auditory stimulus. It should be noted that our stimulus in the no-button-press condition is virtually identical to that used by Hoeks and Levelt7 in their auditory task (a 100 ms duration 1000 Hz pure tone), so the larger value of tmax (930 ms) derived by Hoeks and Levelt (and subsequently used by Wierda and colleagues8 in their deconvolution analysis in a visual attention task) likely reflects contributions to pupil dilation from a combination of stimulus, motor planning, and motor command activities [as does our estimate of tmax to target tones; Fig. 1(b)]. As such, our estimate of tmax for non-target tones should yield a more appropriate deconvolution kernel for analysis of pupil responses to auditory stimuli absent a rapid motor response, and should also be better suited to deconvolution analyses for continuous auditory stimuli (this follows from the characterization of the pupillary response as a linear time-invariant system).7 Moreover, this does not preclude using our estimate of tmax when analyzing auditory tasks that do include rapid motor responses: as long as button presses are balanced across experimental conditions, it should still be possible to analyze the difference in (deconvolved) pupil size across conditions by treating the pupillary response to motor planning and execution as noise.

To illustrate the effect of appropriate parameterization of the deconvolution kernel in pupillometric analysis, we applied the deconvolution technique of Wierda and colleagues8 to measurements of pupil size from an auditory attention switching experiment, using estimates of tmax from both Experiment 1 and from Hoeks and Levelt.7 Sixteen adults (8 female) aged 19 to 35 yrs (mean 25.5) were recruited for Experiment 2. The experiment included two stimulus manipulations (number of noise-vocoder bands; mid-trial gap duration) and one cued behavioral manipulation (maintain attention to one talker throughout, or switch attention between talkers); methods for all three manipulations are described, but for brevity the deconvolution analysis will only be shown for the behavioral manipulation.

Stimuli comprised spectrally degraded spoken alphabet letters ADEGOPUV from the ISOLET v1.3 corpus12 from one female and one male talker. The mean fundamental frequencies of the unprocessed recordings were 103 Hz for the male talker and 193 Hz for the female talker. Letter durations ranged from 351 to 478 ms, and were silence-padded to a uniform duration of 500 ms, normalized by equating root-mean-square amplitude, and windowed at the edges with a 5 ms cosine-squared envelope. Two streams of four letters each were generated for each trial, with a gap of either 200 or 600 ms between the second and third letters of each stream.

Spectral degradation of the letters followed conventional noise vocoding strategy, maintaining temporal and amplitude cues and removing fine structure.13 The stimuli were fourth-order bandpass filtered into 10 or 20 spectral bands of equal equivalent rectangular bandwidths,14 with lower and upper bounds of 200 and 8000 Hz. The amplitude envelope of each band was extracted with half-wave rectification and a 160 Hz low-pass fourth-order Butterworth filter. The resulting envelopes were used to modulate white noise that had been bandpass filtered at the same frequencies as the extracted bands, and the resulting modulated noise bands were summed and presented diotically at 65 dB SPL. A white-noise masker with π-interaural-phase was played continuously during experimental blocks, to provide additional masking of environmental sounds (e.g., friction between earphone tubes and subject clothing) and to provide parity with follow-up MEG neuroimaging experiments. The masking noise was presented at a level of 45 dB SPL, yielding a stimulus-to-noise ratio of 20 dB.
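To make the band layout concrete, the sketch below computes edge frequencies for 10 or 20 contiguous bands that are equally wide on an ERB-number scale between 200 and 8000 Hz. It uses a commonly cited ERB-number approximation (Glasberg and Moore's 1990 formula) as a stand-in, which may differ from the formulae actually used for the stimuli; the function names are hypothetical, and the filtering, envelope extraction, and noise modulation stages are not shown.

```python
import numpy as np

def erb_number(f_hz):
    """Frequency (Hz) to ERB-number; an assumed stand-in for the scale used."""
    return 21.4 * np.log10(4.37 * np.asarray(f_hz, dtype=float) / 1000.0 + 1.0)

def inverse_erb_number(erb):
    """ERB-number back to frequency in Hz (algebraic inverse of erb_number)."""
    return (10.0 ** (np.asarray(erb, dtype=float) / 21.4) - 1.0) * 1000.0 / 4.37

def band_edges(n_bands, f_lo=200.0, f_hi=8000.0):
    """Edges of n_bands contiguous bands, equally wide on the ERB-number scale."""
    erb_edges = np.linspace(erb_number(f_lo), erb_number(f_hi), n_bands + 1)
    return inverse_erb_number(erb_edges)

edges_10 = band_edges(10)  # 11 edge frequencies for the 10-band condition
edges_20 = band_edges(20)  # 21 edge frequencies for the 20-band condition
```

With equal spacing on this scale, every second edge of the 20-band layout coincides with an edge of the 10-band layout, so each 10-band channel is split into two narrower channels.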

Participants were instructed to maintain their gaze on a white fixation dot centered on a black screen throughout test blocks. Each trial began with a 1 s auditory cue (spoken letters “AA” or “AU”) indicating (by the sex of the talker) whether to attend first to the male or female voice, and additionally indicating whether to maintain attention to that talker throughout the trial (AA cue) or to switch attention to the other talker at the mid-trial gap (AU cue). The cue was followed by 0.5 s of silence, followed by the main portion of the trial: two concurrent, diotic 4-letter streams (1 male voice, 1 female voice), with a variable-duration gap between the second and third letters (the gap duration was varied across trials, but was always the same for the 2 streams within a trial). The task was to respond by button press to the letter “O” spoken by the target talker (Fig. 2). To allow unambiguous attribution of button presses, the letter O was always separated from another O (in either stream) by at least 1 s, and its position in the letter sequence was balanced across trials and conditions.

Deconvolution kernels were calculated as in Eq. (1), with n = 10.1 (following Hoeks and Levelt) and values of tmax from both Hoeks and Levelt (930 ms) and from Experiment 1 (512 ms). Fourier analysis of the deconvolution kernels and subject-level mean pupil size time series indicated no appreciable energy at frequencies above 3 Hz, so for efficiency of computation (and to parallel the procedure of Wierda and colleagues) deconvolved signals were generated as a best-fit linear sum of kernels spaced at 100 ms intervals, as implemented in pyeparse.15 Statistical comparison of pupil dilation time series was performed using a non-parametric cluster-level one-sample t-test on the within-subject differences in deconvolved pupil size between experimental conditions (clustering across time only),16 as implemented in mne-python.17
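The idea of expressing a pupil trace as a best-fit linear sum of kernels spaced at 100 ms intervals can be sketched with ordinary least squares, as below. This is a simplified illustration and not the pyeparse implementation, whose fitting routine and constraints are not reproduced here; the function names, the 100 Hz sampling rate, and the simulated pulse train are assumptions of the example.

```python
import numpy as np

def erlang_kernel(fs, t_max, n=10.1, dur=4.0):
    """Unit-peak kernel h(t) = t**n * exp(-n * t / t_max), sampled at fs Hz."""
    t = np.arange(0, dur, 1.0 / fs)
    h = t ** n * np.exp(-n * t / t_max)
    return h / h.max()

def deconvolve(pupil, fs, t_max, n=10.1, spacing=0.1):
    """Least-squares weights for kernels placed every `spacing` seconds.

    Ordinary least squares is an assumption of this sketch.
    """
    pupil = np.asarray(pupil, dtype=float)
    kernel = erlang_kernel(fs, t_max, n)
    step = int(round(spacing * fs))
    onsets = np.arange(0, pupil.size, step)
    design = np.zeros((pupil.size, onsets.size))
    for col, start in enumerate(onsets):
        seg = kernel[:pupil.size - start]         # truncate at the end of the trace
        design[start:start + seg.size, col] = seg
    weights, *_ = np.linalg.lstsq(design, pupil, rcond=None)
    return onsets / fs, weights, design @ weights

# Simulate two "attentional pulses" (at 1.0 s and 2.0 s) and recover them.
fs = 100.0
n_samp = 600
pulses = np.zeros(n_samp)
pulses[100] = 2.0
pulses[200] = 1.0
simulated = np.convolve(pulses, erlang_kernel(fs, t_max=0.512))[:n_samp]
times, weights, fit = deconvolve(simulated, fs, t_max=0.512)
```

When the kernel used for fitting matches the one that generated the data, the largest weights fall at the simulated pulse times; real data would additionally contain noise, for which some form of regularization or constraint on the weights may be preferable.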

Deconvolved pupil size for the behavioral contrast “maintain” versus “switch” is presented in Fig. 3(b); the effects of gap duration and number of vocoder bands are not discussed. Mean deconvolved pupil size was statistically significantly larger in trials requiring mid-trial switches of attention than in trials where subjects maintained attention to the same talker throughout the trial. Z-score normalized pupil size exhibits the same pattern of statistically significant difference between maintain and switch trials [i.e., a single cluster from point of divergence to end of trial; Fig. 3(a)].

However, the divergence of the z-score normalized pupil size time series occurs around 1.3 s [vertical dotted line, Fig. 3(a)], whereas the divergence of the deconvolved signals is temporally aligned with the offset of the AA/AU cue [vertical dotted line in Fig. 3(b)]. The arrow along the horizontal axis in Fig. 3(b) indicates time of significant divergence if data are deconvolved using a kernel computed with the estimate of tmax from Hoeks and Levelt;7 such early divergence indicates acausal behavior (different effort associated with different trial types occurs before listeners have heard the portion of the cue that differentiates maintain trials from switch trials). The temporal alignment of the trial type cue and the divergence of the pupil size time series using our estimate of tmax is consistent with the view that pupil dilation reflects cognitive load or attentional effort, and that effort/load increases as soon as listeners know they are hearing a (more difficult) switch trial.
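A toy simulation may help show why an over-long tmax shifts the deconvolved signal earlier in time. In the sketch below, a response is generated by a single pulse at a known onset using the 512 ms kernel, and a single scaled kernel is then fit at each candidate onset using either the 512 ms or the 930 ms kernel. The simulation is noise-free and uses a single-pulse fit rather than the full deconvolution, so it illustrates the direction of the timing bias only; all function names and simulation settings are assumptions.

```python
import numpy as np

def erlang_kernel(t, t_max, n=10.1):
    """h(t) = t**n * exp(-n * t / t_max) for t >= 0, zero before; unit peak."""
    tt = np.clip(t, 0.0, None)
    h = tt ** n * np.exp(-n * tt / t_max)
    return h / (t_max ** n * np.exp(-n))  # analytic peak value at t = t_max

def best_single_onset(times, response, t_max, candidates):
    """Onset whose scaled kernel best fits `response` in the least-squares sense."""
    errors = []
    for onset in candidates:
        k = erlang_kernel(times - onset, t_max)
        amp = (k @ response) / (k @ k)  # optimal amplitude for this onset
        errors.append(np.sum((response - amp * k) ** 2))
    return candidates[int(np.argmin(errors))]

times = np.arange(0.0, 5.0, 0.01)
true_onset = 1.0
response = erlang_kernel(times - true_onset, t_max=0.512)  # simulated response
candidates = np.arange(0.0, 2.0, 0.01)

onset_short = best_single_onset(times, response, 0.512, candidates)
onset_long = best_single_onset(times, response, 0.93, candidates)
```

With the matching kernel the estimated onset coincides with the true onset, whereas the slower 930 ms kernel must start earlier for its peak to line up with the simulated response, so `onset_long` falls before `true_onset`.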

Deconvolution of pupil size measurements allows insight into the unfolding of attentional effort over the course of an experimental trial, by temporally aligning the measured response with the stimulus events that induced it. However, pupil size is also affected by non-stimulus events; motor planning and execution associated with rapid button press responses are a particularly likely source of noise in the pupillometric signal in experimental settings. Nonetheless, careful attention to experimental design—combined with appropriate parameterization of the deconvolution kernel—preserves the ability to make inferences from the temporal relationship between stimulus events and (deconvolved) pupillary response.

This research was supported by NIH Grant No. R01-DC013260 (AKCL) and NIH LRP awards (DRM and EDL). The authors are grateful to Zach Smith for the spectral degradation code used in Experiment 2, and to Matt Winn and two anonymous reviewers for helpful suggestions on an earlier draft of this paper.

1. E. H. Hess and J. M. Polt, “Pupil size in relation to mental activity during simple problem-solving,” Science 143(3611), 1190–1192 (1964).
2. D. Kahneman and J. Beatty, “Pupil diameter and load on memory,” Science 154(3756), 1583–1585 (1966).
3. S. E. Kuchinsky, J. B. Ahlstrom, K. I. Vaden, S. L. Cute, L. E. Humes, J. R. Dubno, and M. A. Eckert, “Pupil size varies with word listening and response selection difficulty in older adults with hearing loss,” Psychophysiol. 50(1), 23–34 (2013).
4. T. Koelewijn, B. G. Shinn-Cunningham, A. A. Zekveld, and S. E. Kramer, “The pupil response is sensitive to divided attention during speech processing,” Hear. Res. 312, 114–120 (2014).
5. M. B. Winn, J. R. Edwards, and R. Y. Litovsky, “The impact of auditory spectral resolution on listening effort revealed by pupil dilation,” Ear Hear. 36(4), e153–e165 (2015).
6. J. Beatty, “Task-evoked pupillary responses, processing load, and the structure of processing resources,” Psychol. Bull. 91(2), 276–292 (1982).
7. B. Hoeks and W. J. M. Levelt, “Pupillary dilation as a measure of attention: A quantitative system analysis,” Behav. Res. Meth. Ins. C. 25(1), 16–26 (1993).
8. S. M. Wierda, H. van Rijn, N. A. Taatgen, and S. Martens, “Pupil dilation deconvolution reveals the dynamics of attention at high temporal resolution,” Proc. Natl. Acad. Sci. U.S.A. 109(22), 8456–8460 (2012).
9. J.-M. Hupé, C. Lamirel, and J. Lorenceau, “Pupil dynamics during bistable motion perception,” J. Vision 9(7), 1–19 (2009).
10. M. P. Janisse, Pupillometry: The Psychology of the Pupillary Response (Hemisphere, Washington, 1977), pp. 9–12.
11. C. R. Chapman, S. Oka, D. H. Bradshaw, R. C. Jacobson, and G. W. Donaldson, “Phasic pupil dilation response to noxious stimulation in normal volunteers: Relationship to brain evoked potentials and pain report,” Psychophysiol. 36(1), 44–52 (1999).
12. R. A. Cole, Y. Muthusamy, and M. Fanty, “The ISOLET spoken letter database,” Technical Report 90-004, Oregon Graduate Institute, Hillsboro, OR (1990), paper 205.
13. R. V. Shannon, F.-G. Zeng, V. Kamath, J. Wygonski, and M. Ekelid, “Speech recognition with primarily temporal cues,” Science 270(5234), 303–304 (1995).
14. B. C. J. Moore and B. R. Glasberg, “Formulae describing frequency selectivity as a function of frequency and level, and their use in calculating excitation patterns,” Hear. Res. 28(2–3), 209–225 (1987).
15. E. D. Larson and D. A. Engemann, “pyeparse: 0.1.0,” (2015).
16. E. Maris and R. Oostenveld, “Nonparametric statistical testing of EEG- and MEG-data,” J. Neurosci. Meth. 164(1), 177–190 (2007).
17. A. Gramfort, M. Luessi, E. D. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, R. Goj, M. Jas, T. Brooks, L. Parkkonen, and M. S. Hämäläinen, “MEG and EEG data analysis with MNE-Python,” Front. Neurosci. 7, 267 (2013).