Higher-level factors, including the contextual plausibility of competing word candidates, interact with lower-level phonetic cues to influence how listeners interpret the speech signal. This work shows that listeners' phonetic categorization (e.g., coat versus goat) is more heavily influenced by sentential context when listening to a non-native versus native talker. Further, the effect of context on phonetic categorization decreases as the listener becomes familiar with the talker's phonetic characteristics, for both native and non-native talkers. Overall, results suggest that listeners adjust their perceptual strategies to optimize accurate perception of a talker's message.
1. Introduction
Speech perception requires the integration of multiple sources of information, from lower-level (acoustic) to higher-level (e.g., lexical, syntactic, social) cues (e.g., Ganong, 1980). For example, Borsky et al. (1998) found that listeners perceive speech sounds differently based on the context of the surrounding sentence, such that a word that is phonetically ambiguous between goat and coat is more likely to be classified as goat in “The boy milked the ___,” but as coat in “The girl put on the ___.” In this paper, we test (1) whether listeners rely more on sentential context when processing foreign-accented speech, and (2) whether listeners reweight contextual vs acoustic cues as they become familiar with the phonetic characteristics of a talker. In doing so, we examine a relatively unexplored strategy that listeners may adopt to optimize adaptation to new talkers or accents: adjusting the relative reliance on higher- vs lower-level cues.
Listeners' weighting of higher- vs lower-level cues varies based on properties of the signal. Gianakas and Winn (2016), for example, report a larger influence of higher-level, lexical cues in phonetic categorization when the acoustic signal is degraded via noise-vocoding. At the same time, listeners' knowledge and/or stereotypes about talker characteristics influence speech perception independently of acoustics, such that acoustically identical sounds can be interpreted differently based on social factors such as the talker's perceived gender (e.g., Strand and Johnson, 1996). When a listener expects the talker's phonetic features to be unreliable because they are a non-native speaker, they may therefore decrease their reliance on the acoustic signal, and, in turn, place more emphasis on higher-level cues.
Work by Lev-Ari (2015) suggests that listeners do indeed adjust their comprehension strategies based on different expectations for non-native talkers. Specifically, Lev-Ari proposes that listeners have “less detailed” linguistic representations for non-native talkers and compensate with heavier reliance on high-level, contextual information. For example, listeners accepted a picture of a mermaid as referent for [fɛɹi] (fairy) when the word was spoken by a non-native talker who was prone to talking about mythical creatures, even when a more correct referent (a ferry boat) was present. Although this work is in the lexical domain, the same reasoning extends to phonetic perception: phonetic non-prototypicality is a defining characteristic of non-native speech, so paying less attention to low-level acoustic cues and putting more weight on higher-level contextual cues may be a good strategy for accurately interpreting the message of a talker with a non-native accent.
Contrary to this prediction, Clopper (2012) found that listeners showed a smaller effect of higher-level information when listening to non-standard regional accents. Listeners were more accurate at identifying a word in a speech-in-noise task when the target word was contextually probable, but this contextual “benefit” was smaller when the talker had a less familiar regional accent. Using a similar speech-in-noise paradigm, Holt and Bent (2017) found no difference in children's use of semantic information when identifying words spoken with a native vs non-native accent. However, dialect-specific differences in intelligibility are more pronounced in noise (Clopper and Bradlow, 2008), so listeners may not show a contextual benefit for unfamiliar accents in speech-in-noise tasks simply because little contextual information is available. In the current work, we use clear listening conditions to directly test whether listeners will rely more heavily on contextual information when listening to talkers with discernable non-native accents.
The idea that listeners rely more heavily on higher-level information when they are less confident about the phonetic characteristics of the talker leads to a second prediction: as a listener gains familiarity with a talker across time, reliance on higher-level information should decrease. The large body of work on perceptual learning (e.g., Norris et al., 2003) demonstrates that listeners “retune” their phonetic categories to mirror the idiosyncratic pronunciations of the talker, showing that listeners do learn and adapt to talker-specific phonetic norms. Although most of this work has involved native talkers, Reinisch and Holt (2014) report similar results for a non-native talker, suggesting that similar processes govern learning of talker-specific phonetic characteristics in native and non-native speech. Therefore, while we predict that the extent of reliance on higher-level information will initially be stronger for non-native talkers, we expect this reliance to decrease in a similar manner for all talkers, regardless of language background.
In the current work, we use the paradigm of Borsky et al. (1998) to test whether the relative weighting of low-level acoustic vs high-level contextual information differs when listening to a native English versus non-native English (native Mandarin) talker, and how this changes as the listener gains experience with each talker. Participants are asked to identify target words that vary in both their phonetic characteristics (an acoustic continuum ranging from goat to coat) and their contextual plausibility (sentences that bias interpretation toward either goat or coat) across natively- and non-natively-accented talkers. First, we hypothesize that listeners will rely more heavily on higher-level information when they expect the lower-level, phonetic information to be less reliable or predictable, i.e., when confronted with a talker with a discernable accent. Second, we expect that this reliance on higher-level information will decrease across the course of exposure to each talker, if the weighting of higher-level information is correlated with listeners' confidence in the phonetic characteristics of a given talker's accent.
2. Methods
Participants were 20 female undergraduate students (ages 19–23 yrs old, mean = 21) from the University of Mississippi. All were native English speakers with no substantial exposure to other languages, and all reported normal hearing. An additional four participants were omitted for failing to follow task instructions.
Methods were adapted from Borsky et al. (1998). Participants heard sentences ending in target words drawn from a continuum from goat to coat, followed by appearance of the word GOAT or COAT on the screen. Listeners indicated whether the final word of the sentence matched the visual probe by pressing a button on a keyboard.
Stimuli varied by talker, sentential context, and VOT (voice onset time, the primary acoustic dimension differentiating voiced stops like /g/ from voiceless stops like /k/) of the initial stop in the target word. Two female talkers recorded ten sentences ending in the goat and the coat: one was a native speaker of Midwestern American English, and the other was a native speaker of Mandarin, chosen because she had a discernable non-native accent. A Mandarin-accented speaker was chosen in part because Mandarin realizes the /g/-/k/ contrast similarly to English (Lin, 2007).¹ The semantic context of the sentences biased the interpretation toward goat half the time (e.g., “The young girl milked the ____”) and coat half the time (e.g., “The lazy brother unbuttoned the ____”) (see the Appendix).
The final two words of each sentence (the g/coat) were spliced out and replaced with a target phrase, consisting of the plus a target word drawn from a 9-step VOT continuum ranging from 5 to 85 ms. The continuum was created by systematically manipulating the word-initial stop from one natural production of the coat from each talker, using the PSOLA algorithm in Praat (Boersma and Weenink, 2016) to increase or decrease VOT, measured from before the stop burst to the onset of voicing of the following vowel (as in Borsky et al., 1998). The entire phrase the g/coat was used as the baseline for manipulation, as opposed to the target word in isolation, so that co-articulatory information preceding the stop closure would be identical for all stimuli. Manipulated target phrases were spliced into each of the ten carrier sentences produced by the same talker.
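As a rough illustration of the continuum described above, the following sketch lists the nominal VOT value at each step. The text gives only the endpoints (5 and 85 ms) and the number of steps (9); equal spacing between steps is an assumption made here, and the function name `vot_continuum` is hypothetical.

```python
def vot_continuum(start_ms=5.0, end_ms=85.0, n_steps=9):
    """Return nominal VOT values (ms) for each continuum step.

    Assumption: steps are equally spaced between the two endpoints.
    """
    step_size = (end_ms - start_ms) / (n_steps - 1)
    return [start_ms + i * step_size for i in range(n_steps)]


steps = vot_continuum()
print(steps)
```

Under the equal-spacing assumption, adjacent steps differ by 10 ms, and the midpoint of the continuum falls at 45 ms.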
Half of the participants heard the native talker first, and half heard the non-native talker first. Within each talker, there were 180 trials, randomized and divided across two blocks: 9 VOT steps × 10 sentences (5 coat-biasing and 5 goat-biasing) × 2 repetitions.² Across the two talkers, there were a total of 4 blocks and 360 trials. Listeners took a self-paced break between blocks.
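The trial structure above can be sketched as a crossed design. This is only an illustration: the dictionary keys, the function name `build_talker_trials`, the fixed random seed, and the split into two equal halves of 90 trials are assumptions, since the text states only that the 180 trials were randomized and divided across two blocks.

```python
import itertools
import random

VOT_STEPS = range(1, 10)  # continuum steps 1-9
SENTENCES = [(bias, idx) for bias in ("coat", "goat") for idx in range(1, 6)]
REPETITIONS = (1, 2)


def build_talker_trials(talker, seed=0):
    """Return two randomized blocks of trials for one talker.

    Assumption: the 180 trials are split into two equal blocks.
    """
    trials = [
        {"talker": talker, "vot_step": vot, "bias": bias,
         "sentence": idx, "repetition": rep}
        for vot, (bias, idx), rep in itertools.product(
            VOT_STEPS, SENTENCES, REPETITIONS)
    ]
    rng = random.Random(seed)
    rng.shuffle(trials)
    half = len(trials) // 2
    return trials[:half], trials[half:]


native_blocks = build_talker_trials("native", seed=1)
non_native_blocks = build_talker_trials("non-native", seed=2)
total_trials = sum(len(block) for block in native_blocks + non_native_blocks)
print(total_trials)
```

Counterbalancing of talker order across participants would then amount to presenting the two talkers' blocks in one order for half of the listeners and in the reverse order for the other half.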
3. Analysis and results
Fourteen trials with reaction times above 6 s were excluded as outliers. The remaining responses were converted to coat (/k/) or goat (/g/) for analysis. For example, when the visual probe was COAT, a match response was interpreted as coat, and a mismatch response as goat. We then examined how our factors of interest, Talker (native vs non-native), Context (goat- vs coat-biased), VOT (steps 1–9), and Block (first or second block for a given talker), influenced the listeners' phonetic categorization.
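The exclusion and recoding steps can be sketched as follows. The function names are hypothetical, and the representation of a response as a boolean `matched` flag is an assumption; only the 6 s cutoff and the match/mismatch logic come from the text.

```python
RT_CUTOFF_S = 6.0  # trials with reaction times above 6 s are excluded


def keep_trial(reaction_time_s):
    """Return True if the trial survives the reaction-time exclusion."""
    return reaction_time_s <= RT_CUTOFF_S


def recode_response(probe, matched):
    """Convert a match/mismatch judgment into the perceived word.

    probe:   the visual probe, "COAT" or "GOAT"
    matched: True if the listener reported that the final word matched it
    """
    probe = probe.lower()
    if probe not in ("coat", "goat"):
        raise ValueError(f"unexpected probe: {probe!r}")
    other = "goat" if probe == "coat" else "coat"
    return probe if matched else other


print(recode_response("COAT", matched=True))
print(recode_response("COAT", matched=False))
```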
Figure 1 shows listener responses across the VOT range, broken down by Talker, Context, and Block, with the y axis indicating the percentage of coat (vs goat) responses. As expected, listeners' choice of /g/ vs /k/ was highly predictable based on VOT, as shown by the prevalence of goat responses at low VOTs and coat responses at high VOTs. The two lines in each panel represent the two sentential biasing contexts. A larger gap between the two lines indicates a larger influence of Context on listeners' responses. As can be seen in Fig. 1, the effect of Context appears to be larger for the non-native talker (right) than the native talker (left), and larger in Block 1 (top) than Block 2 (bottom).
Fig. 1. (Color online) Average percentage of coat (vs goat) responses across the VOT range and across the first and second half of the trials for each talker. Error bars show 95% confidence intervals based on by-participant means.
Responses were analyzed in a logistic mixed-effects regression model, using the lme4 package in R (Bates et al., 2015; R Core Team, 2013). The model predicted the log odds of listeners' choice of coat (vs goat) as a function of the fixed factors of Talker, Context, Block, and their interactions. VOT was included as a fixed covariate. The random effects structure was determined following recommendations of Matuschek et al. (2017), and included random by-subject and by-item intercepts, as well as by-subject random slopes for Context and VOT.
Statistical results are presented in Table 1. All categorical predictors were coded as (−0.5, 0.5), and VOT was centered prior to analysis. Thus, the intercept represents the log odds of a coat response, while estimates for categorical factors represent the difference in log odds for a coat response between the two levels of a given factor, collapsed over all levels of the other factor(s). The estimate for VOT represents the change in log odds for a one-unit increase in centered VOT. In the case of significant interactions, we performed follow-up Wald Chi-square tests, using the phia package in R (De Rosario-Martinez, 2015). Along with the statistical results, we also report the 50% crossover points (i.e., the /g/-/k/ category boundary) of response curves, as a descriptive measure of the extent to which each factor influences categorization.
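To make the coding scheme concrete, the sketch below works through a simplified linear predictor with one (−0.5, 0.5)-coded factor and a centered VOT covariate. It is an illustration rather than the fitted model: random effects and interactions are ignored, the function names are hypothetical, and the numeric values (the intercept, VOT, and Context estimates from Table 1) are used only as example inputs.

```python
import math


def log_odds_coat(intercept, vot_slope, vot_centered,
                  factor_estimate=0.0, level=0.0):
    """Log odds of a coat response for one (-0.5, 0.5)-coded factor."""
    return intercept + vot_slope * vot_centered + factor_estimate * level


def probability(log_odds):
    """Inverse logit: convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def crossover(intercept, vot_slope):
    """Centered VOT at which the predicted coat probability is 0.5."""
    return -intercept / vot_slope


# With (-0.5, 0.5) coding, the gap between the two levels is the estimate.
upper = log_odds_coat(1.557, 0.117, 0.0, factor_estimate=0.665, level=0.5)
lower = log_odds_coat(1.557, 0.117, 0.0, factor_estimate=0.665, level=-0.5)
difference = upper - lower

boundary_centered = crossover(1.557, 0.117)
print(round(difference, 3), round(boundary_centered, 1))
```

A negative crossover value means that the 50% point lies below the mean of the VOT covariate, which is what a positive intercept implies.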
Table 1. Results from the logistic mixed effects model: /k/ (vs /g/) response ∼ VOT + Context * Talker * Block + (VOT + Talker | participant) + (1 | item). Reference levels for categorical variables are in italics.
Predictor | Estimate | SE | z | p |
---|---|---|---|---|
(Intercept) | 1.557 | 0.165 | 9.422 | <0.001 *** |
VOT | 0.117 | 0.011 | 11.048 | <0.001 *** |
Context (goat- vs coat-biasing sentence) | 0.665 | 0.078 | 8.579 | <0.001 *** |
Talker (native vs non-native) | 0.515 | 0.124 | 4.138 | <0.001 *** |
Block (first vs second block for a talker) | 0.030 | 0.074 | 0.411 | 0.681 |
Context * Talker | 0.331 | 0.147 | 2.252 | 0.024 * |
Context * Block | −0.368 | 0.148 | −2.490 | 0.013 * |
Talker * Block | 0.101 | 0.147 | 0.687 | 0.492 |
Context * Talker * Block | −0.073 | 0.295 | −0.247 | 0.805 |
There was a significant positive intercept, indicating more coat than goat responses overall. As expected, the main effects of VOT and Context were significant, with more coat responses for higher VOTs and for coat-biasing sentences. However, the magnitude of the Context effect differed across the other two factors, as indicated by the significant two-way interactions of Context by Talker and Context by Block. While the effect of Context was significant for both talkers (non-native: χ² = 59.6, p < 0.001; native: χ² = 22.2, p < 0.001), it was greater for the non-native than the native talker: there was an 8.3 ms difference in the category boundary between goat- and coat-biasing sentences for the non-native talker, as compared with a 4.6 ms difference for the native talker (collapsed over both Blocks). Likewise, Context significantly influenced listeners' responses in both blocks (Block 1: χ² = 61.7, p < 0.001; Block 2: χ² = 20.5, p < 0.001), but its effect was larger in Block 1 (7.8 ms higher category boundary for goat- than coat-biasing sentences) than in Block 2 (5.0 ms boundary difference).
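One way to see how the interaction estimates relate to these simple effects is to combine the Table 1 fixed effects by hand, as in the sketch below. Several assumptions are made: the level coded +0.5 is taken to be the non-native talker and the second block (inferred from the signs of the estimates and the direction of the reported effects), the VOT estimate is taken to be in log odds per ms, and random effects are ignored. The resulting boundary shifts are therefore approximations and are not expected to reproduce the reported values, which are derived from the response curves.

```python
BETA_VOT = 0.117              # log odds per unit of VOT (assumed: per ms)
BETA_CONTEXT = 0.665          # Context main effect
BETA_CONTEXT_TALKER = 0.331   # Context * Talker
BETA_CONTEXT_BLOCK = -0.368   # Context * Block


def context_effect(interaction_estimate, level):
    """Context effect (log odds) at one level of a +/-0.5-coded moderator."""
    return BETA_CONTEXT + interaction_estimate * level


def boundary_shift_ms(effect_log_odds):
    """Approximate category-boundary shift implied by a Context effect."""
    return effect_log_odds / BETA_VOT


effects = {
    "native talker": context_effect(BETA_CONTEXT_TALKER, -0.5),
    "non-native talker": context_effect(BETA_CONTEXT_TALKER, 0.5),
    "block 1": context_effect(BETA_CONTEXT_BLOCK, -0.5),
    "block 2": context_effect(BETA_CONTEXT_BLOCK, 0.5),
}

for label, effect in effects.items():
    shift = boundary_shift_ms(effect)
    print(f"{label}: {effect:.3f} log odds, about {shift:.1f} ms")
```

The ordering matches the reported pattern: the implied Context effect is larger for the non-native talker than for the native talker, and larger in Block 1 than in Block 2.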
These interactions are in line with our predictions. (1) While sentential context always informed responses, it played a greater role when the talker had a discernable non-native accent (Context by Talker interaction). (2) Listeners decreased their reliance on context across time (Context by Block interaction). Further, these effects were not modulated by a 3-way interaction of Context by Talker by Block, suggesting that the talker-driven discrepancy in use of Context did not differ significantly by Block, and that listeners showed equal (or at least not significantly different) adaptation to the native and non-native talkers.
Finally, there was a significant, unpredicted, main effect of Talker, with more coat responses for the non-native vs native talker. Inspection of the data indicated that the difference in categorization can be attributed to a lower /g/-/k/ category boundary for the non-native talker: 34 ms for the native talker vs 29 ms for the non-native talker.
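The talker-specific boundaries can be approximated from the Table 1 fixed effects, as sketched below. This rests on assumptions not stated in the text: VOT is taken to be centered at 45 ms (the midpoint of the 5 to 85 ms continuum), the non-native talker is taken to be the level coded +0.5, the other factors are held at their average (0), and random effects are ignored. The function name is hypothetical.

```python
def talker_boundary_ms(talker_level, intercept=1.557, vot_slope=0.117,
                       talker_estimate=0.515, vot_center_ms=45.0):
    """Approximate 50% crossover (ms) for one talker.

    Solves intercept + talker_estimate * level + vot_slope * (vot - center) = 0.
    Assumption: VOT was centered at 45 ms.
    """
    log_odds_at_center = intercept + talker_estimate * talker_level
    return vot_center_ms - log_odds_at_center / vot_slope


native_boundary = talker_boundary_ms(-0.5)
non_native_boundary = talker_boundary_ms(0.5)
print(round(native_boundary, 1), round(non_native_boundary, 1))
```

Under these assumptions the approximate boundaries come out close to the reported 34 ms and 29 ms.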
4. Discussion
The findings are in line with our prediction that listeners will rely more heavily on higher-level (contextual) information when there is reason to be less confident about the lower-level (acoustic) information. First, we found that phonetic categorization is more influenced by sentential context when listening to a non-native than a native talker. These results support Lev-Ari's (2015) proposal that contextual expectations can trump the linguistic content of a non-native talker's message in language processing more generally. Listeners may expect non-native pronunciation to be less reliable than that of a native speaker, and put more weight on higher-level information. On the other hand, Holt and Bent (2017) report no difference in children's use of context for native versus non-native talkers, while Clopper (2012) reported a decreased effect of contextual information by adult listeners listening to a talker with an unfamiliar regional accent. However, the speech-in-noise task used in these studies may disproportionately impair intelligibility of unfamiliar accents (e.g., see Clopper and Bradlow, 2008). When overall intelligibility is sufficiently reduced, adequate contextual information is no longer transmitted in the signal for it to be useful for phoneme or word recognition. Overall, then, our results support a model of speech perception that includes a role of dynamic, talker-driven tuning of speech perception strategies. Future work with multiple talkers per accent is an important next step in assessing the generalizability of our observed effect of a talker-specific influence on phonetic categorization. Furthermore, including talkers with different degrees of accentedness would provide a more complete account of how specific properties of the accent influence perception.
Our second finding is that use of higher-level information decreases as the listener gains experience with the talker. This suggests that greater certainty about an individual talker's phonetic characteristics causes listeners to reduce their reliance on higher-level information in favor of acoustic information. This effect of experience (Block) did not differ significantly across the two talkers, suggesting that qualitatively similar learning processes may be involved in adaptation to native vs non-native talkers, in line with previous work showing similar phonetic retuning processes in both native and non-native speech (Reinisch and Holt, 2014). However, the effect of Block was numerically smaller for the native talker, so the lack of a significant Block by Talker interaction may be due to lack of sufficient power. Future work should explore the time-course of adaptation in the use of higher- vs lower-level information, and the extent to which the time-course may differ based on characteristics of the talker.
Our final, unexpected, finding is that listeners had a lower overall VOT boundary for the non-native talker. This could be due to differences in the stimuli themselves: VOT is just one of many phonetic cues to the stop voicing contrast (e.g., f0, formant transitions, see Lisker, 1986). However, a separate population of listeners from Toronto with substantial experience with Mandarin-accented English do not show this talker-related difference in VOT boundary (Schertz and Hawthorne, 2017), suggesting that the effect is most likely not due solely to differences in the stimulus items. One potential reason for the unexpected finding could be the participants' experience with non-native accents in general. Many commonly-heard non-native accents in the U.S. (e.g., Spanish, French) have a categorically lower VOT boundary than English (e.g., Flege and Eefting, 1987). Upon hearing a non-native English talker, listeners without extensive Mandarin-accented English experience may therefore rationally posit that the talker will have a lower VOT boundary. However, with only one talker per accent and one listener group, this remains speculative.
Overall, our results highlight listeners' sensitivity to the talker and the communicative situation, as well as their ability to flexibly adjust their perceptual strategies. This fits in well with a Bayesian “ideal listener” model of speech perception (e.g., Kleinschmidt and Jaeger, 2015), which predicts situationally-specific adjustments in ways that would rationally optimize comprehension. In other words, listeners' assumptions about the relative predictability of various sources of information influence their reliance on these different sources.
An important implication of this work is that differences found in the processing of native vs non-native speech may be due not only to phonetic differences in the talker (i.e., non-prototypical pronunciations by non-native speakers), but also to different strategies used by the listener. Our results also show that this increased reliance on higher-level information is mitigated through short-term experience with the phonetic characteristics of the talker across the time-course of the experiment. Taken together, these two findings suggest the additional possibility that listeners' long-term experience with accented speech could also shape perception. In other words, listeners who differ in their extent of experience with a specific accent, or with accented speech in general, may show different perceptual strategies. We hope that the current findings stimulate more detailed explorations of how listeners' long-term experience with accents shapes both their initial cue-weighting strategies and the trajectory of adaptation.
Acknowledgments
We would like to thank Beka Bosley, Sarah Fischer, Lori Hendrix, and Alexis Zosel (University of Mississippi), as well as Crystal Chow and Shukri Nur (University of Toronto Mississauga) for their help running the experiment. We would also like to thank several anonymous reviewers for comments on previous drafts of this paper. Support for the preparation of this manuscript was provided in part by the School of Applied Science at the University of Mississippi.
Appendix
Coat-biasing sentences
The young girl learned to put on the coat/goat.
The handsome man forgot to wear the coat/goat.
The wise grandmother remembered to wear the coat/goat.
The lazy brother did not unbutton the coat/goat.
The old grandfather liked to put on the coat/goat.
Goat-biasing sentences
The young girl learned how to milk the coat/goat.
The handsome man forgot to feed the coat/goat.
The wise grandmother remembered to feed the coat/goat.
The lazy brother did not chase the coat/goat.
The old grandfather liked to milk the coat/goat.
¹ Both languages realize the /g/-/k/ contrast as an aspiration contrast, phonetically [k] (/g/) vs [kʰ] (/k/). The average natural production values for the English speaker in our recordings were 26 ms VOT for /g/ and 88 ms for /k/; the Mandarin speaker had a VOT of 24 ms for /g/ and 103 ms for /k/.
² We also performed a manipulation of the pitch of the vowel in the target word. This was for a different purpose and did not interact with the sentential context effect; for the purposes of this work we therefore consider the two levels of this manipulation as a repetition.