Unfamiliar second-language (L2) accents present a common challenge to speech understanding. However, the extent to which accurately recognized unfamiliar L2-accented speech imposes a greater cognitive load than native speech remains unclear. The current study used pupillometry to assess cognitive load for native English listeners during the perception of intelligible Mandarin Chinese-accented English and American-accented English. Results showed greater pupil response (indicating greater cognitive load) for the unfamiliar L2-accented speech. These findings indicate that the mismatches between unfamiliar L2-accented speech and native listeners' linguistic representations impose greater cognitive load even when recognition accuracy is at ceiling.
1. Introduction
Listening to unfamiliar second language- (L2-) accented speech is often described by listeners as an effortful process, even when L2 speakers are highly proficient (Munro and Derwing, 1995). Both L2- and regionally accented speech are characterized by systematic segmental and suprasegmental deviations from standard native pronunciations of a given language or dialect. For example, when produced by nonnative Mandarin Chinese speakers of English, tense (e.g., /i/) and lax (e.g., /ɪ/) American English vowel pairs tend to be less phonetically distinct than when produced by native speakers (Wang and Heuven, 2006). Deviations such as these result in mismatches between L2 speech and native listeners' representations, which may cause reduced recognition accuracy (the proportion of words in the speech stream that can be correctly identified) and/or increased cognitive load (the degree to which cognitive resources are recruited at a given moment to meet processing demands; Pichora-Fuller et al., 2016). That is, resolving such mismatches may require additional cognitive resources not needed for native-accented speech [see Van Engen and Peelle (2014) for an executive recruitment account of accented speech]. Thus, even when unfamiliar L2-accented speech is accurately recognized, processing it may nonetheless require greater cognitive load than processing native speech.
Our research question was the following: Does unfamiliar L2-accented speech that is accurately recognized impose greater cognitive load than native speech? Behavioral studies have primarily relied on performance measures such as recognition accuracy (also referred to as “intelligibility”) and reaction time (e.g., processing speed for true vs false sentence verification; Adank et al., 2009) to determine the costs of processing unfamiliar L2-accented speech. However, examining the cognitive load associated with the online processing of speech from highly proficient L2 speakers, including those whose speech can be accurately recognized, requires a different methodological approach. To address our research question, we used pupillometry (the measure of pupil dilation over time) to examine cognitive load for native English listeners during the perception of accurately recognized Mandarin Chinese-accented English and standard American-accented English. By using pupillometry in place of recognition accuracy and/or reaction time measures, we were able to examine the time-course of effortful processing for accurately recognized speech. Notably, our definition of “recognition accuracy” is the proportion of keywords that the participant is able to correctly repeat, and does not necessarily reflect what the participant initially perceived; in fact, our hypothesis is predicated on the assumption that listeners are recruiting additional cognitive resources to re-process their initial perceptions. The mechanism(s) hypothesized to support this process are discussed further in Sec. 5.
Pupillometry has been used for several decades as a non-intrusive, temporally sensitive psychophysiological index of cognitive load (Beatty, 1982) and, more recently, has been applied within the domain of speech perception [for a review, see Van Engen and McLaughlin (2018)]. Of interest for cognitive research is the task-evoked pupil response (from here forward, simply pupil response), which is a phasic increase of pupil dilation linked temporally to a specific cognitive event (Beatty, 1982). These increases in pupil dilation are much smaller than changes due to luminance, and are related to up-regulation of sympathetic, and inhibition of parasympathetic, activity (Steinhauer et al., 2004; Beatty and Lucero-Wagoner, 2000; Peysakhovich et al., 2017). Larger pupil response during a cognitive task indicates greater cognitive load. For example, Beatty (1982) found that pupil response was systematically larger for longer digit spans during a recall task.
Using acoustically degraded speech, it has been demonstrated that the pupil response increases systematically as speech becomes less intelligible (Zekveld et al., 2010; Zekveld and Kramer, 2014). Additionally, for Dutch-English bilingual listeners, pupil response for 50% intelligible English (i.e., L2) speech is greater when presented in English babble than in Dutch babble, and is greater when the speaker's language is not known in advance (i.e., when trials are randomized as opposed to blocked; Francis et al., 2018). Porretta and Tucker (2019) examined the effect of intelligibility using L2-accented speech, demonstrating for isolated words that there is greater and more sustained pupil response for unintelligible accented speech than intelligible accented speech. However, the potential difference in pupil response for accurately recognized, unfamiliar L2-accented speech and native speech remains to be investigated.
In a study using noise-vocoded speech, Winn et al. (2015) found that even when analyses were limited to accurately perceived trials, there was a systematic increase in pupil response as the spectral quality of the speech signal decreased. Thus, pupillometry is capable not only of capturing cognitive load as a function of recognition accuracy but also of revealing more nuanced differences in cognitive load when recognition is at ceiling. The latter capability makes pupillometry a valuable method for addressing our research question, and offers a novel way to quantify the challenge posed by accented speech.
In the present study, we measured pupil response during listening to speech from a native, standard American-accented speaker of English and a nonnative, Mandarin Chinese-accented speaker of English. We predicted that growth curve analysis (Mirman, 2014) of the pupil response would show an effect of speaker type (i.e., overall pupil dilation would be greater for the L2-accented speaker condition), and an interaction between speaker type and the linear time term (i.e., indicating a difference in the rate of change of pupil dilation between conditions), reflecting greater cognitive load for processing L2-accented speech than native speech. Additionally, we predicted that subjective ratings of listening effort would be greater for the L2-accented speaker condition. Study preregistration is available (see https://osf.io/bgcx7/). All analysis scripts and data are also available (see https://osf.io/7dajv/files/).
2. Methods
2.1 Participants
Participants were young adults (N = 52, ages 18–22, 39 female, and 13 male) recruited from Washington University's Psychology Subjects Pool. Hearing thresholds were measured for each participant to confirm they had clinically normal hearing. An additional 11 subjects participated but were excluded and replaced, either because of data loss from equipment malfunction or because they were not native speakers of American English, had parents who were not native speakers of American English, or had extensive exposure to Mandarin Chinese.
2.2 Stimuli
Sentence recordings of a native speaker of standard American English and a Mandarin Chinese-accented speaker of English were taken from Van Engen et al. (2012). All sentences were six words long, with four keywords (e.g., “the gray mouse ate the cheese”). Results from a pilot study (see supplemental materials¹) indicated that the Mandarin Chinese-accented stimuli had high recognition accuracy but were also rated as more accented and more effortful to understand than the native-accented stimuli. Root mean squared amplitudes for all stimuli were equalized using Praat (Boersma and Weenink, 2019). To control for natural differences in speaking rate, which is crucial for interpreting the time-course of the pupil response, the native speaker files were acoustically lengthened. Adjustments were made using the Stretch function in Adobe Audition CC, which can adjust file lengths while controlling for pitch and speech characteristics. Files were lengthened by a constant amount, so that the average length of the native speaker files would equal the average length of the L2 speaker files. The L2 speaker files were not altered. A pilot experiment with unprocessed native stimuli is reported in the supplemental materials;¹ the same pattern of results was found.
2.3 Procedure
Participants provided written informed consent prior to participation in the study. Following the consent process, they completed language and demographic questionnaires and underwent audiometric testing. Pure-tone thresholds at 500, 1000, and 2000 Hz were used to establish a pure-tone average (PTA) for each participant. All participants had PTAs at or below 10 dB in at least one ear.
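The PTA screening rule can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the threshold values, the data frame layout, and the variable names are hypothetical and do not come from the study's data or scripts.

```r
# Hypothetical thresholds (in dB) at 500, 1000, and 2000 Hz for one participant.
thresholds <- data.frame(
  ear = c("left", "left", "left", "right", "right", "right"),
  freq_hz = c(500, 1000, 2000, 500, 1000, 2000),
  threshold_db = c(5, 10, 10, 15, 15, 20)
)

# PTA per ear: the mean threshold across the three test frequencies.
pta <- tapply(thresholds$threshold_db, thresholds$ear, mean)

# Criterion described in the text: PTA at or below 10 dB in at least one ear.
meets_criterion <- any(pta <= 10)
```

With these made-up values only the left ear falls at or below 10 dB, which is sufficient under an "at least one ear" rule.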
For the pupillometry task, participants were seated in a sound-attenuating room facing a monitor and EyeLink 1000 Plus camera. All equipment was positioned following EyeLink specifications. During the experiment, participants rested their chins on a head-mount. They were instructed to fixate on a cross located in the center of the screen at all times. Two colors, red and blue, were used as cues for the participants. Whenever the cross was red, participants were instructed to reduce their blinking comfortably (if possible), while still prioritizing the listening task. Whenever the cross was blue, they were instructed to blink freely. Each stimulus was preceded by a baseline period (used during analysis to establish pupil size prior to input) with 3000 ms of silence and a red cross. During the stimulus the cross remained red, and after the stimulus there was a delay period with 3000 ms of silence and a (continued) red cross. After the recording window, the color of the cross changed to blue, and participants were instructed to repeat what they heard aloud. An audio recorder was used to record their responses. After the participant pressed the spacebar to initiate a new trial, there was a final 3000 ms silent delay with a blue cross to allow the pupil response to recover.
Sixty stimuli (30 per condition) were presented in a randomized order.¹ For subjective ratings of listening effort, participants were probed every three trials about the last sentence they had listened to and used the keyboard to respond, resulting in approximately 10 responses per condition. Participants responded using a scale of 1 to 9 where: 1 = “not effortful,” 5 = “moderately effortful,” and 9 = “extremely effortful.” All participants were debriefed after the experiment.
3. Analyses
3.1 Subjective responses
For each participant, there were approximately ten responses to the effort probes for each speaker. Their mean was used as a composite rating score. Quantile-quantile plots of the scores demonstrated that they were not normally distributed, so Wilcoxon rank-sum tests were used to compare ratings for the two speakers.
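A minimal sketch of this comparison in base R follows. The ratings are simulated purely for illustration (the means, spreads, and variable names are assumptions, not the study's data), and the call mirrors the rank-sum test named in the text; a paired signed-rank variant would add `paired = TRUE`.

```r
set.seed(1)

# Hypothetical composite scores: one mean effort rating (1 to 9 scale)
# per participant and speaker, for 52 participants.
native_rating <- pmin(pmax(rnorm(52, mean = 2.5, sd = 1.0), 1), 9)
l2_rating <- pmin(pmax(rnorm(52, mean = 4.2, sd = 1.2), 1), 9)

# Quantile pairs behind a normal Q-Q plot (plot.it = FALSE skips the drawing).
qq_native <- qqnorm(native_rating, plot.it = FALSE)

# Wilcoxon rank-sum test with a confidence interval for the location shift.
# exact = FALSE uses the normal approximation, which tolerates tied ratings.
result <- wilcox.test(native_rating, l2_rating, conf.int = TRUE, exact = FALSE)
result$p.value
result$conf.int
```

Because the first argument is the native speaker, the estimated shift is native minus L2, so an interval lying entirely below zero corresponds to higher effort ratings for the L2 speaker.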
3.2 Pupil data maintenance
Recognition accuracy was scored by a single researcher, and trials in which any keywords were missed were excluded from the pupillometry analyses (2.1% of total trials). A custom script written in R was used to manage pupil data.² This maintenance pipeline began by aligning data from each trial at sentence onset, and then averaging the pupil dilation during the 500 ms preceding each target sentence to determine a baseline value for each trial. Timepoints with missing data due to blinks were omitted when calculating the baseline value, and trials with fewer than 50% valid points during this window were omitted. Next, these baseline values were subtracted from pupil dilation throughout the rest of each trial to create a measure of absolute change. To remove effects of blinking, intervals of missing values were identified and then expanded to remove noisy readings caused by the eyelids moving prior to (by 30 ms) and following (by 160 ms) blinks (Winn et al., 2015; Zekveld et al., 2010). This expanded blink interval was then interpolated across using linear approximation. Last, the pupil data was smoothed using a 10 Hz moving average (Winn et al., 2018). Trials that required more than 50% interpolation were excluded from analyses. Across all participants and trials, 2.9% of timepoints were identified as blinks and interpolated. Across participants, fewer than one trial was excluded on average for this reason, and no participant exceeded 20% trial loss.
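The baseline, blink-expansion, and interpolation steps can be sketched in base R as follows. This is not the authors' script: the synthetic trial, the helper `expand_blinks`, and all variable names are hypothetical, and the smoothing step is left out because the text does not give the window length that corresponds to the 10 Hz moving average.

```r
# Hypothetical single trial sampled at 500 Hz (2 ms per sample); 0 = sentence onset.
sample_ms <- 2
time_ms <- seq(-500, 1998, by = sample_ms)
pupil <- 1000 + 0.05 * pmax(time_ms, 0)       # synthetic pupil area (arbitrary units)
pupil[time_ms >= 600 & time_ms < 700] <- NA   # simulated blink: missing samples

# 1. Baseline: mean of the valid samples in the 500 ms before sentence onset.
in_baseline <- time_ms >= -500 & time_ms < 0
keep_trial <- mean(!is.na(pupil[in_baseline])) >= 0.5  # at least 50% valid points
baseline <- mean(pupil[in_baseline], na.rm = TRUE)

# 2. Subtract the baseline to obtain absolute change.
change <- pupil - baseline

# 3. Expand every missing sample by 30 ms before and 160 ms after.
expand_blinks <- function(is_missing, before_n, after_n) {
  expanded <- is_missing
  for (i in which(is_missing)) {
    lo <- max(1, i - before_n)
    hi <- min(length(is_missing), i + after_n)
    expanded[lo:hi] <- TRUE
  }
  expanded
}
blink <- expand_blinks(is.na(change),
                       before_n = 30 / sample_ms,
                       after_n = 160 / sample_ms)

# 4. Linearly interpolate across the expanded blink interval.
clean <- change
clean[blink] <- approx(time_ms[!blink], change[!blink], xout = time_ms[blink])$y

# 5. Proportion of the trial that was interpolated (compare with the 50% cutoff).
prop_interpolated <- mean(blink)
```

The simulated 100 ms blink grows to a 290 ms interpolated interval once the 30 ms and 160 ms margins are added, which is why the interpolation proportion is computed after expansion.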
3.3 Growth curve analysis
Growth curve analysis (GCA) was used to model the pupil data in R (version 3.5.1) using the lme4 package. GCA is similar to polynomial regression, but controls for potential collinearity issues by orthogonalizing the polynomial time terms (Mirman, 2014). For the present study, we opted to use GCA instead of more traditional mean pupil dilation and peak pupil dilation measures for the following reasons: (1) GCA allows for analysis of the full shape of the pupil response, and analysis of all data in the time-course without averaging across trials (as is necessary with analysis of variance), (2) calculation of the peak pupil dilation from pupil data averages (i.e., averages across trials within a given subject) can introduce noise into the analysis process, and (3) tests of mean pupil dilation are still present in GCA models (for our model, the main effect of condition), and peak pupil dilation can be indirectly tested by examining the slope and shape of a given GCA model. The window of data used for model fitting began at 500 ms after the stimuli began (based on visual inspection of the data) and ended 1000 ms after the average sentence offset time (3344 ms). The pupil data was time-binned, reducing the sampling rate from 500 to 50 Hz. Table 1 shows the full model and its specifications. The time-course of pupillary response was modeled up to the third-order (cubic) orthogonal polynomial, based on hierarchical model comparisons that indicated significant contributions of the quadratic (χ²(1) = 61.19; p < 0.001) and cubic (χ²(1) = 5.84; p = 0.016) polynomials to model fit. For the speaker condition manipulation, dummy coding specified the native speaker as the reference (i.e., zero) group. The random effects structure included random intercepts and slopes for subject and item. Correlations between these slopes and intercepts were included in the random effects (as recommended by Mirman, 2014).
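To illustrate what orthogonalizing the time terms means, the sketch below builds cubic orthogonal polynomials with base R's `poly()` and contrasts them with raw powers of time. The time grid is an assumption (20 ms bins over the stated window); the authors' exact bins may differ. The last two lines recompute p-values from the chi-square statistics reported above.

```r
# ASSUMPTION: 20 ms bins (50 Hz) from 500 ms after onset to 1000 ms after
# the average sentence offset (3344 ms).
bin_ms <- 20
time_ms <- seq(500, 3344 + 1000, by = bin_ms)

# Orthogonal polynomial time terms up to cubic; columns stand in for ot1-ot3.
ot <- poly(time_ms, degree = 3)[, 1:3]
colnames(ot) <- c("ot1", "ot2", "ot3")

# Cross-products form an identity matrix: the terms are mutually uncorrelated.
orthogonality <- crossprod(ot)

# Raw (natural) polynomial terms, by contrast, are strongly correlated.
raw <- cbind(t1 = time_ms, t2 = time_ms^2, t3 = time_ms^3)
raw_cor_linear_quadratic <- cor(raw)["t1", "t2"]

# p-values for the likelihood-ratio chi-square statistics in the text (df = 1).
p_quadratic <- pchisq(61.19, df = 1, lower.tail = FALSE)
p_cubic <- pchisq(5.84, df = 1, lower.tail = FALSE)
```

Because each orthogonal term also sums to zero over the window, the model intercept can be read as the mean pupil response across the window rather than the value at the first time bin.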
Table 1. Full growth curve model: R code and summary output.

R code:
    lmer(Dilation ~ (ot1 + ot2 + ot3) * Condition +   # fixed effects
      (1 + ot1 + ot2 + ot3 | Subject) +               # random effects
      (1 + ot1 + ot2 + ot3 | Item),                   # random effects
      data = PRA_E2R_TB50_AOI.df)

Term | Estimate | SE | DF | t | p
---|---|---|---|---|---
Intercept (Native) | 180.73 | 16.30 | 77.31 | 11.085 | <0.001ᵃ
ot1 | 1061.73 | 121.17 | 77.89 | 8.763 | <0.001ᵃ
ot2 | −596.76 | 87.84 | 102.36 | −6.793 | <0.001ᵃ
ot3 | −118.10 | 44.58 | 96.16 | −2.649 | 0.009ᵇ
Condition (Nonnative) | 59.85 | 10.75 | 58.14 | 5.565 | <0.001ᵃ
ot1:Condition (Nonnative) | 453.50 | 83.53 | 58.02 | 5.429 | <0.001ᵃ
ot2:Condition (Nonnative) | −14.24 | 86.48 | 58.41 | −0.165 | 0.870
ot3:Condition (Nonnative) | 51.40 | 47.37 | 58.08 | 1.085 | 0.282
ᵃ Significant at the p < 0.001 level.
ᵇ Significant at the p < 0.01 level.
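As a rough aid to reading Table 1, the fixed-effect estimates can be combined with orthogonal time terms to trace the model-implied curve for each speaker. The sketch below is illustrative only: the 20 ms time grid is an assumption, and because the scaling of the orthogonal terms depends on the grid, the curves approximate the fitted shape rather than reproduce the authors' figure.

```r
# Fixed-effect estimates copied from Table 1 (arbitrary pupil-area units).
b_native <- c(intercept = 180.73, ot1 = 1061.73, ot2 = -596.76, ot3 = -118.10)
b_condition <- c(intercept = 59.85, ot1 = 453.50, ot2 = -14.24, ot3 = 51.40)

# ASSUMPTION: 20 ms bins across the analysis window described in Sec. 3.3.
time_ms <- seq(500, 4344, by = 20)
X <- cbind(1, poly(time_ms, degree = 3)[, 1:3])

# Model-implied curves: native is the reference level, nonnative adds the
# Condition main effect and its interactions with the time terms.
native_curve <- as.vector(X %*% b_native)
nonnative_curve <- as.vector(X %*% (b_native + b_condition))

# The orthogonal terms average to zero, so the mean difference between the
# curves equals the Condition estimate.
mean_difference <- mean(nonnative_curve - native_curve)

# Consistency check on the table: t is the estimate divided by its SE.
t_condition <- 59.85 / 10.75
t_linear_interaction <- 453.50 / 83.53
```

Plotting `native_curve` and `nonnative_curve` against `time_ms` gives a quick visual of the larger overall response and steeper linear rise for the nonnative condition.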
4. Results
4.1 Pupillometry
The summary output from the full model is reported in Table 1, and the model fit is shown in Fig. 1(B). We predicted that there would be greater overall pupil response and a steeper increase of pupil response for the L2 speaker condition than the native speaker condition, resulting in a higher peak pupil response. The results confirmed both of these predictions, with a significant main effect of condition (β = 59.85, SE = 10.75, p < 0.001) and a significant interaction between the linear polynomial term and condition (β = 453.50, SE = 83.53, p < 0.001). The interactions between condition and the quadratic and cubic terms were not significant (β = −14.24, SE = 86.48, p = 0.870 and β = 51.40, SE = 47.37, p = 0.282, respectively), indicating that the shape of the pupil response for each condition was similar. Overall, the pattern of results in the full model clearly demonstrates larger pupil response, and thus greater cognitive load, for unfamiliar L2-accented speech.
4.2 Subjective responses
As shown in Fig. 1(A), the L2 speaker was rated as more effortful to understand than the native speaker (p < 0.001, 95% CI = [−2.03, −1.31]).
5. Discussion
The results of the current study indicate that processing accurately recognized, unfamiliar L2-accented speech imposes greater cognitive load than native speech. These findings are consistent with an executive recruitment account for accented speech, under which we would predict that mismatches between the speech patterns of a speaker and a listener's representations will require more cognitive resources to process—even when recognition accuracy is at ceiling. The pupillometry data thus provides a novel and more nuanced account of unfamiliar accented speech processing, capturing differences in effort that standard measures of recognition accuracy (i.e., “intelligibility”) cannot. The one way in which recognition accuracy measures have been used to study highly intelligible speech is by presenting the speech in noise (to bring intelligibility off ceiling). Pupillometry has the distinct advantage of not requiring acoustic signal degradation. However, the limitations of the study design should also be noted. Most importantly, the current study uses a single talker in each condition, so it will be important to verify that these results can be replicated with other types of unfamiliar accented speech. Future work could expand upon these results by measuring pupil response for multiple speakers of a given accent, and for multiple types of unfamiliar nonnative accents and regional dialects.
Our findings also indicate that listeners are aware that processing unfamiliar L2-accented speech is more cognitively demanding than processing native speech. Complementing the pupillometric findings, participants also rated the highly proficient L2 speaker as significantly more effortful to understand than the native speaker (see also Munro and Derwing, 1995). However, while these subjective ratings match the physiological evidence, it is also possible that they reflect participants' expectations and/or biases regarding accented speech, and not their awareness of increased cognitive load. That is, participants may have rated the unfamiliar L2-accented trials as more effortful because accented speech is stereotypically more difficult to understand than native speech.
Investigating the executive function(s) underlying the increased cognitive load for unfamiliar L2-accented speech will also be an important direction for future research. Current speech perception models, such as the Ease of Language Understanding (ELU) model (Rönnberg et al., 2008; Rönnberg et al., 2013), propose a dual role of working memory during speech perception, emphasizing that it is important both for storage of the unfolding speech stream and for top-down processing of phonetic and semantic ambiguity. In line with this model, the current literature suggests that working memory may be particularly important for resolving perceptual ambiguities in unfamiliar accented speech. Both McLaughlin et al. (2018) and Janse and Adank (2012) found that working memory capacity is positively related to recognition accuracy for accented speech in young and older adults, respectively. While the ELU model primarily focuses on acoustically degraded speech, the current findings motivate exploration of how individual differences in working memory may play a critical role in unfamiliar accent processing, and, possibly, be reflected in pupil response during listening.
In summary, our findings indicate that processing unfamiliar L2-accented speech imposes greater cognitive load than processing native speech, as evidenced by a larger and more rapid pupil response. Most notably, recognition accuracy for the unfamiliar L2-accented speech stimuli used in the current study was at ceiling, indicating that mismatches between the speech patterns of L2-accented speech and native listeners' representations require greater support from executive resources (such as working memory) to reconcile. Additionally, while the accented stimuli in the present study were recordings of a Mandarin Chinese-accented speaker of American English, we would predict these findings to replicate with other unfamiliar L2 and regional accents. If, as we have proposed, increased cognitive load stems from the mismatches between the speech patterns of a speaker and a listener's linguistic representations, then there should be greater pupil response for other unfamiliar accent types as well. Testing the boundaries of this prediction will be an important next step for this area of inquiry.
Acknowledgments
This project was supported by a National Science Foundation Graduate Research Fellowship awarded to D.J.M. (Grant No. DGE-1745038).
¹ See supplementary material at https://doi.org/10.1121/10.0000718 for a pilot experiment with a blocked design. The main effect of accent is present in both designs.
² All pupil dilation measures are reported in the unit of measurement automatically provided by the EyeLink 1000 Plus system, called “arbitrary units,” which are measures of pupil area.