Listeners can rapidly adapt to an unfamiliar accent. For example, following exposure to a speaker whose /f/ sound is ambiguous between [s] and [f], they categorize more sounds along an [s]–[f] phonetic continuum as /f/. We investigated the adaptation mechanism underlying such perceptual changes—do listeners shift the target sound in phonetic space (category shift), or do they adopt a more general mechanism of broadening the category (category expansion)? In experiment 1, we trained listeners on an accent containing ambiguous /θ/ = [θ/s] and then tested them on categorizing phonetic continua spanning [θ]–[s] or [θ]–[f]. Listeners tested on the [θ]–[s] continua showed a significant increase in proportion of /θ/ responses vs controls, while those tested on [θ]–[f] did not. Experiment 2 investigated how acoustic-phonetic similarity may modulate the mechanism underlying recalibration. Listeners were trained on the same /θ/ = [θ/s] accent as in experiment 1 but were tested on a different continuum, [θ]–[ʃ]. This time, trained listeners showed a significant increase in proportion of /θ/ responses with the novel phonetic contrast. This suggests that phonetic recalibration involves some degree of non-uniform category expansion, constrained by phonetic similarity between training and test sounds.

As listeners, we sometimes encounter speech that deviates from what we typically hear in our everyday lives. Whether due to a non-native accent, a regional dialect, or an idiosyncratic pronunciation, such variation poses a potential challenge for listeners, resulting in decreased comprehension and/or increased processing time. Fortunately, listeners can rapidly adapt to an atypical pronunciation, reducing or eliminating this “accent cost” with sufficient exposure to the speaker (Bradlow and Bent, 2008; Clarke and Garrett, 2004; Melguy and Johnson, 2021; Sidaras et al., 2009; Vaughn, 2019).

Such perceptual adaptation is well-established in the literature and has been found using a variety of tasks. Clarke and Garrett (2004), for instance, exposed listeners to Spanish- or Chinese-accented English where the final word in the sentence was unpredictable, and listeners' task was to indicate whether a visual word target on their computer screen matched the spoken word. By the end of the task, listeners' response times matched the baseline found for natively accented speech. Another study by Bradlow and Bent (2008) measured sentence transcription accuracy of Chinese-accented speech for trained and untrained groups of listeners, finding that listeners previously exposed to the accent showed improved accuracy vs controls, with listeners trained on multiple speakers able to generalize learning to a new speaker of the same accent. Sidaras et al. (2009) utilized a similar paradigm, exposing listeners to sentences produced by six speakers of Spanish-accented English and then testing transcription accuracy on novel sentences and words produced by either six different speakers or the same group of speakers they heard in training. Trained listeners showed improved performance for both the familiar and novel speaker groups vs untrained controls. Melguy and Johnson (2021) also found evidence of adaptation to a speaker of Mandarin-accented English, as measured by improved sentence transcription accuracy over the course of a 60-trial exposure period. Vaughn (2019) obtained a similar result for Spanish-accented English, finding a significant improvement in transcription accuracy across just 40 sentence-level trials.

The common pattern in such studies is that listeners can rapidly adapt to a specific speaker's accent following a relatively brief period of exposure, a result that is consistent across different measures of language processing, from word or sentence transcription accuracy (Alexander and Nygaard, 2019; Bradlow and Bent, 2008; Melguy and Johnson, 2021; Sidaras et al., 2009; Vaughn, 2019) to reaction times in online lexical processing tasks (Clarke and Garrett, 2004; Xie and Myers, 2017; Xie et al., 2017).

How do listeners achieve such adaptation? One commonly assumed mechanism is lexically guided recalibration or “retuning” of phonetic category boundaries, which allows the perceptual system to dynamically adjust to changes in the linguistic environment (Norris et al., 2003). Such adaptation facilitates subsequent speech processing by adjusting prelexical categories to more closely match recently encountered tokens, thereby improving accuracy for a particular speaker or accent. In their seminal study, Norris et al. (2003) argued that lexical feedback could allow listeners to perceptually adapt to a novel regional dialect or accent via adjustment of phonetic categories. They created an artificial accent by mixing together two sounds, [s] and [f], and embedding the resulting ambiguous sound [s/f] in naturally recorded words, replacing either /s/ or /f/. Following exposure to this pronunciation, listeners shifted their categorization boundary on an [s]–[f] phonetic continuum, with the direction of the shift depending on the training condition (whether they heard [s/f] in /s/- or /f/-words). Subsequent studies have added several key findings about the nature of recalibration. First, they show that phonetic recalibration is often speaker-specific, failing to transfer to a new speaker with the same pronunciation (Eisner and McQueen, 2005; Kraljic and Samuel, 2005, 2007). Authors of these studies have argued that acoustic similarity between training and test pronunciations constrains the likelihood of generalization, as the phonetic realization of fricatives can differ substantially between speakers. This observation sees further support in later recalibration research involving voiceless fricatives, which has shown that cross-speaker generalization is possible when exposure and generalization speakers' productions span a similar range of perceptual space (Reinisch and Holt, 2014). 
In other cases, generalization appears to occur more freely, both across speakers and across categories for the same speaker. For instance, Kraljic and Samuel (2006) showed that learning for an ambiguous pronunciation /d/ = [d/t] transferred to a novel speaker and to a novel place of articulation—trained listeners classified more sounds as voiced on both [d]–[t] and [b]–[p] phonetic continua. The authors also explain this generalization asymmetry between stops and fricatives in acoustic terms—the phonetic implementation of voicing differs minimally across speakers and across places of articulation, whereas fricatives show much larger differences and thus tend to resist generalization.

Schuhmann (2014) found that within-speaker generalization was possible with fricatives in some cases: listeners trained on an ambiguous voiceless fricative pronunciation [f/s] generalized the same pattern to a voiced fricative contrast [v]–[z]. She argued that generalization occurred, despite the voicing difference, because both [f]–[s] and [v]–[z] contrasts involve similar acoustic-phonetic cues to place of articulation. Generalization was also found by Mitterer et al. (2016), who trained listeners on a pronunciation involving tensified (underlyingly lax) Korean stops that were phonetically ambiguous in place of articulation (e.g., /t/ = [t*/p*]). They found that listeners generalized learning to phonetically similar non-tensified lax stops [t]–[p], but not to the more distinct aspirated stops [tʰ]–[pʰ], concluding that acoustic similarity was an important predictor for generalization of learning.

Additional research suggests that listeners can also generalize learning across prosodic positions, but again such transfer seems to be tightly constrained. For instance, Jesse and McQueen (2011) found that listeners trained on an ambiguous fricative pronunciation [f/s] in word-final position generalized learning to word-initial instances of the sound, but Mitterer et al. (2013) failed to find similar transfer of learning for Dutch liquids, which, unlike fricatives such as /s/ or /f/, are realized differently in onset vs offset positions. They trained listeners on an ambiguous liquid pronunciation involving word-final /r/ and /l/, where /r/ can occur as an approximant, while /l/ is velarized (e.g., /l/ = [ɹ/ɫ]). They then tested generalization of learning to different allophonic variants of these sounds in word-final and word-initial positions. They found no generalization of learning in either case, concluding that learning occurred at the level of position-dependent allophones and that generalization is constrained by acoustic similarity with exposure sounds. A more recent study by Mitterer and Reinisch (2017) also failed to find generalization of learning across word position for German stops. Listeners were exposed to (underlyingly) voiced stops that were ambiguous in place of articulation. These sounds occurred word-finally and, due to a German devoicing process, were realized as phonetically voiceless (e.g., /d/ = [t/k]). Results showed that listeners generalized learning to phonetic contrasts involving (underlyingly) voiced stops [d]–[g] as well as (underlyingly) voiceless stops [t]–[k] in word-final position. However, there was no generalization when the same sounds occurred word-medially. Again, the conclusion was that learning is contextually specific and dependent on acoustic similarity between exposure and test segments.

The current body of work on phonetic recalibration has shown that the perceptual system is flexible and able to adapt to atypical pronunciations of a wide variety of speech segments, as well as to generalize such learning to novel segments. However, generalization of learning appears to be tightly constrained by phonetic similarity and the distribution of relevant acoustic cues, which may differ across contexts, varying by word position, by segment, and by speaker. This mirrors recent findings from natural accent accommodation studies (Alexander and Nygaard, 2019; Xie and Myers, 2017), which suggest that acoustic similarity between exposure and test speakers is a key ingredient for successful generalization of perceptual learning to novel speakers.

Despite extensive evidence that listeners can successfully “retune” or “recalibrate” perceptual categories following exposure to an atypical pronunciation (Kraljic and Samuel, 2005, 2006; Mitterer et al., 2013; Norris et al., 2003; Reinisch et al., 2013) and that perceptual categories have internal structure (Iverson and Kuhl, 2000; Miller and Volaitis, 1989; Xie et al., 2017), there is no consensus about the underlying mechanisms involved. In this study, we explore two possible adaptation strategies that have been proposed in prior research: phonetic category boundary shift vs category boundary expansion (e.g., Kleinschmidt and Jaeger, 2015). Under the first account, listeners are said to make targeted adjustments to phonetic category representations following exposure to a non-canonical pronunciation. Such shifts are expected to be specific to the sound pattern involved in training. Under the second, listeners may simply become more tolerant of atypical pronunciations of a given category. This process has been described as a general relaxation of default phonetic categorization criteria. This means that listeners may generalize learning beyond the specific phonetic pattern in the exposure accent, accepting multiple different realizations of the trained category as instances of that category. Importantly, such expansion could be uniform or non-uniform. A uniform expansion of the trained category would involve a general broadening of the category in phonetic space, potentially altering subsequent perception of all neighboring sounds. A non-uniform expansion, by contrast, would involve expansion of just one part of the category toward the region of phonetic space where listeners had encountered the unusual tokens of the target.1

These different mechanisms are schematized in Fig. 1, where we illustrate three possible ways by which listeners may restructure their /θ/ category boundaries following exposure to an artificial accent where /θ/ = [θ/s]. Here, we show hypothetical perceptual category distributions for /θ/ and two neighboring fricatives, /f/ and /s/, which are similar to it perceptually (Cutler et al., 2004; Miller and Nicely, 1955) and acoustically (Stevens, 1998). Crucially, we assume that /θ/ lies in between these two categories in perceptual space (see Sec. I D for a detailed discussion of these sounds and a justification for placing them along a continuum of perceptual similarity). If listeners recalibrate by shifting the category [change in mean; see Fig. 1(a)], we expect a shift in both sides of the /θ/ distribution toward /s/. If listeners utilize a uniform expansion strategy [increased category variance; see Fig. 1(b)], then we expect a shift on both sides of the category distribution, toward /s/ on the right and toward /f/ on the left. Finally, if recalibration occurs via non-uniform expansion [added skew of the distribution; see Fig. 1(c)], then we expect to see a shift in the category boundary toward /s/ but no change in the /f/-/θ/ boundary.

FIG. 1.

(Color online) Possible recalibration strategies following exposure to an ambiguous /θ/ = [θ/s] pronunciation (dotted line indicates category boundaries prior to accent exposure): (a) shows recalibration by category shift, while (b) and (c) show recalibration by uniform or non-uniform category expansion. Category shift (a) and uniform expansion (b) predict that both the /f/-/θ/ and the /θ/-/s/ boundaries will shift after exposure, while non-uniform category expansion (c) allows for only a shift on the /θ/-/s/ boundary.


Previous work has noted the possibility that listeners could be relying on either category shift or expansion to achieve recalibration. As Kleinschmidt and Jaeger (2015) point out, both are plausible mechanisms that can account for the recalibration effect: listeners can either shift a category in phonetic space (change the mean) or expand it (increase the variance). Moreover, they argue that both strategies are available to listeners, and both may be useful depending on the speech context and listeners' prior assumptions about how pronunciation varies across sound categories and across speakers—note, however, that their modeling assumes that categories have a normal distribution defined by just two parameters (mean and variance), and they do not discuss skewed distributions. Yet despite the extensive body of work on phonetic recalibration, evidence to support a category shift vs some version of category expansion is lacking—existing findings are compatible with all three possible strategies proposed here. This is because most recalibration studies have either focused on testing just the sound pattern listeners heard during exposure (Eisner and McQueen, 2006; Norris et al., 2003), testing generalization of the same pattern to a different speaker (Eisner and McQueen, 2005; Kraljic and Samuel, 2005; Reinisch and Holt, 2014), or looking for cross-series generalization of the trained sound pattern to another pair of sounds related on some feature dimension (Kraljic and Samuel, 2006; Mitterer et al., 2016; Mitterer and Reinisch, 2017; Reinisch and Mitterer, 2016; Schuhmann, 2014). However, despite this ambiguity in the existing recalibration literature, other research on perceptual learning more generally does offer some evidence for each of these strategies.
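The three candidate mechanisms can be made concrete with a toy computational sketch. Below, /f/, /θ/, and /s/ are modeled as likelihood distributions along a single hypothetical perceptual dimension, with category boundaries placed at the equal-likelihood crossover points. All numerical values are illustrative assumptions rather than fitted parameters, and the skewed (non-uniform) case is approximated by a two-component mixture that adds mass on the /s/ side.

```python
from scipy.stats import norm
from scipy.optimize import brentq

# Illustrative 1-D perceptual axis with /f/, /θ/, /s/ as likelihood
# distributions (means -4, 0, +4; unit variance). Toy values only.
f_pdf = norm(-4, 1).pdf
s_pdf = norm(4, 1).pdf

def boundaries(theta_pdf):
    """Equal-likelihood crossovers: the /f/-/θ/ and /θ/-/s/ boundaries."""
    left = brentq(lambda x: theta_pdf(x) - f_pdf(x), -4, 0)
    right = brentq(lambda x: theta_pdf(x) - s_pdf(x), 0, 4)
    return left, right

def skewed_pdf(x):
    # Non-uniform expansion sketched as a mixture: most probability mass
    # stays put, some is added toward /s/ (the trained [θ/s] region).
    return 0.7 * norm(0, 1).pdf(x) + 0.3 * norm(1.5, 1).pdf(x)

bounds = {
    "baseline": boundaries(norm(0, 1).pdf),
    "shift":    boundaries(norm(1, 1).pdf),    # mean moves toward /s/
    "uniform":  boundaries(norm(0, 1.5).pdf),  # variance increases
    "skew":     boundaries(skewed_pdf),        # right flank only
}
for name, (left, right) in bounds.items():
    print(f"{name:9s} /f/-/θ/ boundary {left:+.2f}, /θ/-/s/ boundary {right:+.2f}")
```

Under these toy assumptions, a mean shift moves both boundaries toward /s/, a variance increase moves both boundaries outward, and the skewed mixture moves the /θ/–/s/ boundary toward /s/ while leaving the /f/–/θ/ boundary nearly unchanged, paralleling the three predictions schematized in Fig. 1.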

1. Category shift

For instance, several studies utilizing either natural or artificial accents have provided evidence for the type of pattern-specific learning that is suggested by a category shift mechanism. Maye et al. (2008) trained listeners on an artificial accent where front vowels were systematically lowered (e.g., witch → wetch). Following exposure, listeners were more likely to process novel words with lowered front vowels (matching the exposure pattern) as real words. However, listeners who were tested on a novel accent where front vowels were raised did not show increased acceptance of these items as real words, suggesting pattern-specific learning. Analogously, Bradlow and Bent (2008) showed that exposure to multiple talkers of Chinese-accented English led to improved transcription accuracy with a novel Chinese-accented speaker, but not a Slovakian-accented one. This suggests that listeners learned an accent-wide schema that was specific to the common phonetic characteristics of Chinese-accented English, but not Slovakian-accented English. Alexander and Nygaard (2019) also found relative specificity of learning, which they took as evidence for a boundary shift learning mechanism—listeners who were trained on single-word utterances produced within a given accent (either Korean- or Spanish-accented English) and subsequently tested on a novel speaker of the same accent saw a significant training benefit, while those tested on a novel speaker of a different accent did not. Meanwhile, results from multiple-accent training were mixed and yielded a smaller training benefit that was less consistent, showing up for Spanish-accented tokens but not Korean-accented ones. Subsequent analyses of vowel productions across the different accents suggested that acoustic similarity between exposure and test items may have facilitated generalization where this was observed, regardless of training/test accent pairings.

2. Category expansion

Other studies show evidence for a less specific learning strategy that may support a version of the category expansion mechanism proposed here. This kind of general strategy has been presented as either relaxation of phonetic categorization criteria (e.g., Zheng and Samuel, 2020) or as general category expansion (Kleinschmidt and Jaeger, 2015; Schmale et al., 2012). The idea is that rather than shifting phonetic category boundaries in a specific direction, listeners may just expand them, which may help them with diverse or unfamiliar accents. Lev-Ari (2015) presents an analogous view of non-native speech comprehension—listeners expect non-native speakers to be less competent in general and therefore adjust processing at all levels (phonetic, lexical, syntactic, etc.) to accept more deviations from native-speaker norms. Evidence for this mechanism is shown by generalization of perceptual learning beyond the specific segmental patterns listeners are exposed to. For instance, White and Aslin (2011) found that infants trained on a vowel fronting pattern, /ɑ/ → /æ/ (dog → dag), showed some generalization of learning to a similar but distinct pattern of fronting, /ɑ/ → /ɛ/ (dog → deg). The authors suggested that accent exposure may have resulted in category expansion but that this was constrained by similarity to the trained pronunciation. Weatherholtz (2015) found evidence for more general category expansion, showing that listeners exposed to a novel pattern of back vowel lowering generalized learning to a novel accent where back vowels were raised [contrary to the findings of Maye et al. (2008)], as well as to a structurally parallel pattern where front vowels were lowered. However, listeners who were exposed to a pattern of back vowel raising showed no generalization to raised front vowels—learning was specific to the exposure pattern.
Such findings suggest that while phonemes may expand in phonetic space due to accent exposure, there are still clear constraints on the degree to which listeners are willing to relax categorization criteria and generalize beyond the specific phonetic pattern(s) they are exposed to.

The primary aim of this study is to investigate the mechanisms responsible for the change in a phonetic category boundary observed in a typical lexically guided recalibration experiment. Crucially, we assume that in either scenario listeners are making an adjustment to a phonetic category representation following exposure to an atypical realization of that category. The question is whether such recalibration is contrast-specific (boundary shift) or more general (category expansion).

The specific sounds used in the training and test phases of the current study were selected based on their close correspondence to actual pronunciation differences in non-native speakers of English, as well as their proximity to one another in perceptual and acoustic space. Because the dental fricative /θ/ is a cross-linguistically unusual sound, typically absent from languages' phoneme inventories, L2 learners across different language backgrounds struggle to realize it in a native-like manner. Even within a group of L2 speakers sharing a common L1 background, it is normal to see a wide range of variation in how this segment is realized. Two frequent substitutions involve other (more typologically common) fricatives that are phonetically close to [θ], such as [s] or [f] (Seibert, 2011).

The experiments reported here thus make use of the fricatives [f], [θ], [s] (experiment 1), and [ʃ] (experiment 2) and critically assume that [θ] is perceptually intermediate between [f] and [s]. This assumption is justified by perceptual data (see Fig. 2, left panel). The perceptual space illustrated there was derived from the 0 dB signal-to-noise ratio (SNR) confusion matrix published by Miller and Nicely (1955), using the multidimensional scaling method for confusion matrix data developed by Shepard (1972) [see also Johnson (2003) for a short explication of the method]. The first perceptual dimension that emerges in the analysis of English fricative confusions separates the voiced fricatives from the voiceless fricatives. The second perceptual dimension separates the sibilant fricatives [s, ʃ, z, ʒ] from the non-sibilants [f, θ, v, ð]. Importantly, in the perceptual space, [θ] lies between [f] and [s]. One key acoustic cue related to this perceptual configuration is illustrated by the spectra in the right panel of Fig. 2, which shows acoustic power spectra taken from the midpoint of each of the fricatives [f], [s], and [θ] used in experiment 1 below. The sibilance of [s] is apparent in how much louder (vertically displaced) the [s] spectrum is; [f] and [θ] are similar to each other in having lower amplitude. In addition, [s] and [θ] are similar to each other in having a peak amplitude between 5 and 7 kHz. This peak reflects the resonance of the short cavity in front of the tongue constriction in [s] and [θ], a cavity that is somewhat shorter in [θ]. The effective length of the cavity in front of the constriction is much shorter still in [f], which resonates above 9 kHz (Stevens, 1998). The spectrum of [ʃ] (not shown in Fig. 2 for the sake of clarity) has high amplitude like [s] and lower frequency resonant peaks at about 3 and 5 kHz.
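The multidimensional scaling step behind the perceptual space in Fig. 2 can be sketched as follows. The confusion counts below are invented for illustration (they are not Miller and Nicely's data), and the similarity-to-distance conversion is one common Shepard-style choice, normalizing each confusion rate by the self-identification rates before taking a negative log; the published analysis may differ in its details.

```python
import numpy as np
from sklearn.manifold import MDS

# Invented stimulus-response confusion counts (rows = sound spoken,
# columns = sound reported) for four voiceless fricatives. These are
# NOT Miller and Nicely's (1955) data, merely an illustration.
labels = ["f", "θ", "s", "ʃ"]
C = np.array([
    [60, 25, 10,  5],
    [28, 52, 15,  5],
    [ 8, 12, 55, 25],
    [ 4,  6, 22, 68],
], dtype=float)

P = C / C.sum(axis=1, keepdims=True)  # confusion proportions per stimulus
S = (P + P.T) / 2                     # symmetrize the confusion rates
# Shepard-style conversion: similarity relative to self-identifications,
# then -log to obtain a distance-like quantity.
D = -np.log(S / np.sqrt(np.outer(np.diag(S), np.diag(S))))
np.fill_diagonal(D, 0.0)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
for lab, (x, y) in zip(labels, coords):
    print(f"[{lab}] at ({x:+.2f}, {y:+.2f})")
```

With these made-up counts, [θ] comes out closest to [f], closer to [s] than [f] is, and farthest from [ʃ], so the two-dimensional solution places [θ] between [f] and [s], as in the left panel of Fig. 2.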

FIG. 2.

(Color online) Perceptual space (left panel) and fricative noise spectra (right panel) illustrating the similarity of the non-sibilants [θ] and [f] and of the coronal fricatives [s] and [θ]. [θ] is plotted with a solid black line, [s] is plotted with a dashed blue line, and [f] is plotted with a red dotted line.


The proximity of these sounds in the phonetic space surrounding [θ] makes for a clear test of a category shift vs expansion strategy, with potential real-world implications for adaptation to natural accents. We investigated this question by exposing two groups of listeners to the same artificial accent: a pronunciation of the voiceless dental fricative /θ/ that was ambiguous between [s] and [θ]. Previous recalibration literature has used a similar training accent involving /θ/, intended to approximate a common pronunciation of this sound by L1 Mandarin speakers of English (Charoy, 2021; Zheng and Samuel, 2020). Listeners were then tested on different minimal-pair continua: one group heard tokens along [θ]–[s] continua (e.g., think–sink), while the other group heard [θ]–[f] continua (e.g., thought–fought). Predictions for the changes in categorization under each type of possible recalibration mechanism are summarized in Fig. 1. If recalibration is achieved by category shift (shift in mean, no change in variance), then we should expect to see a shift toward [s] in the [θ]–[s] test group and a shift toward [θ] in the [θ]–[f] test group. However, if training induces a uniform broadening of the /θ/ category, then we can expect an increase in the proportion of /θ/ responses in both test groups, reflecting listeners' willingness to generally accept more atypical tokens as exemplars of /θ/, regardless of phonetic match to the exposure pronunciation. Finally, if listeners expand the /θ/ category in a non-uniform way following accent exposure (skewed distribution), we would expect to see a shift in the [θ]–[s] group toward [s] but no change in the categorization results for the [θ]–[f] group.

This experiment was implemented via a custom web program,2 and data were collected via Amazon's Mechanical Turk (MTurk), a platform where workers complete Human Intelligence Tasks (HITs) for compensation. In recent years, MTurk has become an increasingly popular platform for conducting linguistics research, including auditory perception tasks (Cooper and Bradlow, 2018; Melguy and Johnson, 2021; Vaughn, 2019; Xie and Myers, 2017) as well as phonetic recalibration tasks (Charoy, 2021; Liu and Jaeger, 2019). In contrast to data obtained in the lab, MTurk offers the benefit of a more demographically diverse participant pool and efficient collection of large amounts of data. Studies have shown that MTurk can yield high-quality psychometric data comparable to that obtained in laboratory studies (Buhrmester et al., 2011), despite the inability of researchers to precisely control factors such as the experimental environment or the hardware used for stimulus presentation. Moreover, when differences in audio quality are accounted for, online participants can closely match the quantitative and qualitative response patterns of their lab counterparts (Cooke and García Lecumberri, 2021).3

One hundred eighty-nine participants were initially recruited via MTurk to participate in the main experiment. Exclusion of participants who failed to meet experimental criteria left a total of 120 participants whose data were retained for analysis (see Sec. II E for an explication of exclusion criteria). All participants lived in the U.S., were native speakers of English, reported normal speech and hearing, and gave informed consent prior to participating in the study. They were paid $15/h for their participation. An additional 22 participants were recruited under the same selection criteria for a norming task, which was used to determine the most ambiguous critical /θ/-words to be used in the exposure task. Experimental participants were primarily male (M = 80, F = 37), with three subjects declining to report and no subjects selecting the “other” option. Most participants reported white ethnicity (N = 94), with seven self-reporting as Asian, nine as Black, six as Hispanic, and four declining to answer. Participants spanned a wide range of age groups (21–25: 6; 26–30: 19; 31–35: 33; 36–40: 17; 41–45: 11; 46–50: 10; 51–55: 12; 56–60: 8; 61+: 4). Participants self-reported moderate experience with accented English on a seven-point scale [mean = 4.64, standard deviation (SD) = 1.71, range = 1–7].

The materials for the exposure task consisted of 100 English words and 100 non-words. Stimuli were based on the lists used in the lexical decision exposure task from Zheng and Samuel (2020) and were modified to be appropriate for testing the present question (e.g., words containing /f/ were removed from the critical /θ/-items, and additional stimuli were added to bring the total number of items to 200).

All stimuli were recorded in a quiet room at a sampling rate of 44.1 kHz using an Audio-Technica (Tokyo, Japan) AT2020USB+ condenser microphone. The speaker (27, M) was a native speaker of American English and was living in Berkeley, California at the time. Stimuli consisted of 20 critical words containing an ambiguous /θ/ (phonetically between [θ] and [s]), 20 unambiguous words containing /s/, 60 filler words, and 100 filler non-words (see Appendix A for a full list of training materials). Tokens from the three word-lists (/θ/-words, /s/-words, and word fillers) were equated in mean lexical frequency, with an average word frequency in each list of 4.1 occurrences per million in the SUBTLEX-US corpus (Brysbaert et al., 2012). Words ranged in length from 2 to 4 syllables, and the lists were closely matched in mean syllable length (/θ/-words = 3.1, /s/-words = 3.1, fillers = 3.2). The majority of non-words (80 items) overlapped with those used in Zheng and Samuel (2020) and were supplemented with 20 additional items, created from filler words by replacing approximately 1 sound per syllable, following the method used in Kraljic et al. (2008). Non-word fillers were also between 2 and 4 syllables long and matched the real-word items in mean syllable length (3.1). None of the filler items contained the sounds /θ/, /f/, or /s/.

Ambiguous realizations of /θ/ for the 20 critical exposure items were created using STRAIGHT (Kawahara et al., 2008), following recent phonetic recalibration work (Reinisch and Holt, 2014; Reinisch et al., 2013). Using STRAIGHT allows morphing of word stimuli in their entirety (rather than excising and mixing just the fricative portions). This means that the resulting morphed stimuli are ambiguous on all cues to consonant identity (spectral characteristics, duration, formant transitions, etc.).

For each critical item, two recordings were made: one naturally produced /θ/-word (e.g., empathy) and one non-word counterpart where /θ/ was replaced with /s/ (e.g., empassy). The two recordings were then mixed in their entirety, generating a seven-step phonetic continuum for each item (e.g., empathy–empassy). Prior to mixing, temporal anchors were placed at major acoustic landmarks (e.g., the onset of voicing for vowels, the onset/offset of silence for stops) to ensure that segments of the same type were mixed together and to obtain more natural-sounding morphed stimuli (Reinisch et al., 2013). A norming task (see Sec. II D 1) was then used to select the most phonetically ambiguous step along the resulting continuum.

The same morphing procedure was used to create the minimal pairs used in the test phase, following Reinisch and Holt (2014) and Reinisch et al. (2013), who report that using minimal pairs rather than dummy syllables during the test phase reduced between-subject variability and facilitated cross-talker generalization. We created four /θ/–/s/ minimal-pair continua and four /θ/–/f/ minimal-pair continua. In each case, two of the continua had the critical sound occur word-finally, and two had it word-initially (see  Appendix B for a full list of experiment 1 test materials). The same speaker used to record training materials also produced all test materials. Finally, both test and exposure items were normalized in amplitude, resampled to 16 kHz, and converted to mp3 format to ensure between-browser compatibility during audio presentation.

1. Pretest

To select the most ambiguous critical /θ/ items for the exposure task, the 20 continua were subjected to a pretest by a separate listener group. For this purpose, 22 participants were initially recruited via MTurk. Data from two participants were removed due to failure to complete the task, and one additional participant was excluded due to inability or unwillingness to perceive the difference between items (i.e., giving the same response for all tokens within a continuum for over 50% of continua).

For each continuum, the middle five steps were selected (the end points were not used). Stimulus presentation was blocked by continuum, and four lists were created, each with a different random order of continua and step numbers within a given continuum. Participants were randomly assigned to one of these lists. Each participant was thus asked to judge a total of 100 tokens (20 continua × 5 tokens each). Once the participant had read the task instructions, they clicked a button on their web browser to begin the task. Listeners were asked to complete the task in one sitting and to make their response as quickly and accurately as possible. For each trial, listeners heard a token and were asked to indicate whether they heard a real word (e.g., empathy) or a non-word where the /θ/ had been replaced with /s/ (e.g., empassy). Responses were made by pressing one of two keyboard keys (“z” = non-word, “m” = real word). The response labels were present on the screen for the duration of the whole trial. Once the listener submitted their response, playback of the next item began automatically [inter-stimulus interval (ISI) = 1000 ms]. The selected ambiguous step number for each continuum was based on the step closest to the category boundary between response options (i.e., the point at which 50% of participants indicated hearing the /θ/-word). In cases where the percentage of /θ/-word responses for this token fell below 50%, the previous step on the continuum was selected (e.g., if token 4 had an average of 45% /θ/-word responses and token 3 had an average of 60% /θ/-word responses, token 3 was selected). This was done to compensate for the real-word bias in speech perception (Ganong, 1980) and to ensure that the target phoneme was phonetically ambiguous, following Reinisch et al. (2013).
Results for the selected step closely matched those obtained in their study: the average step number of the ambiguous token was near the middle of the seven-step continuum (3.90), and the mean percentage of /θ/-word responses at this step was 73%.
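The selection rule described above can be sketched as follows (an illustrative Python sketch, not the authors' code; the function name and data layout are assumptions). The input maps each of the middle five steps to the mean proportion of /θ/-word responses; step 1 is the unambiguous /θ/ end of the continuum.

```python
def select_ambiguous_step(theta_props):
    """Pick the step nearest the 50% category boundary; if that step's
    /θ/-word proportion falls below 50%, back off to the previous
    (more /θ/-like) step to offset the real-word bias (Ganong, 1980)."""
    # step whose /θ/-word proportion is closest to 0.5
    best = min(theta_props, key=lambda s: abs(theta_props[s] - 0.5))
    if theta_props[best] < 0.5 and (best - 1) in theta_props:
        best -= 1  # previous step, toward the /θ/ end of the continuum
    return best

# The worked example from the text: token 4 at 45%, token 3 at 60%
props = {2: 0.80, 3: 0.60, 4: 0.45, 5: 0.20, 6: 0.05}
print(select_ambiguous_step(props))  # -> 3
```

Applied this way, the rule guarantees that the selected token sits at or just above the 50% crossover, matching the reported 73% mean /θ/-word rate at the selected step.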

2. Screening

Prior to proceeding to the experimental task, all participants completed a screening task designed to ensure that they were wearing headphones. Previous web-based studies have found a significant difference in transcription accuracy based on the quality of audio hardware participants used (Cooke and García Lecumberri, 2021; Melguy and Johnson, 2021). While it is not possible to completely control for the quality of audio hardware with a crowd-sourced participant pool, a headphone check ought to significantly reduce the amount of between-listener variation in audio quality.

For the audio check, we used a custom JavaScript program created by Woods et al. (2017), which utilizes a three-alternative forced choice (3AFC) task with 200 Hz pure tones. In each trial, one of the three tones is presented at a lower level than the others, and one of the remaining tones is in antiphase across the stereo channels. The antiphase tone undergoes acoustic cancellation when played through loudspeakers, making it sound significantly quieter, but not when played through headphones. Listeners must decide which of the three tones is quietest, a judgment that only headphone users can make reliably. This method has been previously demonstrated to be highly effective in both lab and web-based studies with a very small number of trials [see Woods et al. (2017) for more detailed information on stimuli, protocol, and validation results]. Questionnaire results suggest that the task was effective, with the vast majority of participants self-reporting use of high-quality headphones (N = 93) or earbuds (N = 24), although, as in Woods et al. (2017), a small number purportedly passed the task using loudspeakers (N = 3).
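The antiphase manipulation at the heart of the screen can be illustrated with a minimal sketch (plain Python, not the Woods et al. code; names and parameters are illustrative). Summing the two channels, as happens acoustically in free-field playback, cancels the antiphase tone completely, while the in-phase tone is unaffected:

```python
import math

def tone(freq=200.0, n=1000, rate=44100.0, invert=False):
    """One channel of a pure tone; invert flips the phase by 180 degrees."""
    sign = -1.0 if invert else 1.0
    return [sign * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]

def rms(samples):
    return math.sqrt(sum(v * v for v in samples) / len(samples))

left = tone()
right_inphase = tone()
right_antiphase = tone(invert=True)

# Superposing the channels simulates loudspeaker playback (mono sum):
mix_in = [l + r for l, r in zip(left, right_inphase)]
mix_anti = [l + r for l, r in zip(left, right_antiphase)]
print(rms(mix_anti) < 0.01 < rms(mix_in))  # antiphase tone cancels -> True
```

Over headphones each ear receives one channel in isolation, so no such cancellation occurs and the tone presented at a genuinely lower level remains the quietest.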

3. Training

Subjects were randomly assigned to one of two conditions (training vs no training; cf. Kraljic et al., 2008). All participants in the training condition completed a lexical decision task designed to expose them to the speaker's accent. Each heard the same 100 non-words, 60 filler words, 20 ambiguous /θ/-words, and 20 unambiguous /s/-words. Exposure items were presented in a separate random order for each participant.

Listeners in the training condition were instructed that they would hear either a word or non-word, and their task was to decide which they heard for each trial. Once the participant had read the task instructions, they clicked a button on their web browser to begin the task. Listeners were asked to complete the task in one sitting and to make their response as quickly and accurately as possible. For each trial, listeners heard a token and were asked to indicate whether they heard a real word or a non-word. Responses were made by pressing one of two keyboard keys (“z” = non-word, “m” = real word). Once the listener submitted their response, playback of the next item began automatically (ISI = 1000 ms). Upon task completion, participants were taken to a separate page with instructions for the test phase of the experiment.

4. Test

All participants (both control and training groups) were randomly assigned to one of two test conditions: one group heard tokens from seven-step continua involving /θ/-/s/ minimal pairs (e.g., think-sink), while the other group heard tokens from /θ/-/f/ minimal-pair continua (e.g., thought-fought).

All participants were given instructions explaining the task and informed on the number of trials. They were asked to complete the task in one sitting and to respond as quickly and accurately as possible. For each trial, participants heard a token and were asked to decide which of two words (presented textually on the screen) they had heard. Responses were again made via keyboard press (“z” = /θ/-word, “m” = /f/- or /s/-word, depending on group). Textual response options were presented 500 ms prior to auditory stimulus onset (ISI = 1000 ms), following previous work (Reinisch and Holt, 2014; Reinisch et al., 2013). To reduce response variability, six separate lists were created for each group, with each list blocked by continuum and order of continua and step number randomized. Participants were randomly assigned to one of these lists. Each step within a continuum was presented four times, making for a total of 112 tokens per list (4 continua × 7 steps × 4 repetitions).
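The list structure described above (blocked by continuum, with continuum order and within-block step order randomized) can be sketched as follows; this is an illustrative Python sketch under assumed names, not the authors' presentation code:

```python
import random

def build_test_list(n_continua=4, n_steps=7, n_reps=4, seed=0):
    """Return (continuum, step) trial tuples: continuum order is
    randomized, and within each continuum block the step tokens
    (each step repeated n_reps times) are shuffled."""
    rng = random.Random(seed)
    continua = list(range(1, n_continua + 1))
    rng.shuffle(continua)
    trials = []
    for c in continua:
        block = [(c, step) for step in range(1, n_steps + 1)] * n_reps
        rng.shuffle(block)
        trials.extend(block)
    return trials

trials = build_test_list()
print(len(trials))  # 4 continua x 7 steps x 4 repetitions = 112 tokens
```

Each of the six lists per group in the experiment would correspond to one such randomization (here controlled by the seed).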

5. Questionnaire

After completing the categorization task, participants were asked to complete a questionnaire with basic demographic information [age, race, gender, languages spoken, audio speaker type used to listen to stimuli, and experience with foreign-accented speech (from 1 to 7)].

Prior to statistical analysis, data were first processed by removing subjects who failed to meet experimental criteria. Of the 189 subjects initially recruited, 34 were removed based on performance in the training task: 33 subjects for failure to reach a 70% accuracy threshold on the exposure task, following Kraljic and Samuel (2006), and one subject for categorizing over 50% of ambiguous /θ/ items as non-words (Norris et al., 2003). An additional four subjects were removed for failing to complete the test (categorization) task. Several subjects (N = 5) who had only a small number of missing trials (<5) due to a technical error were retained. An additional ten subjects were removed for failure to complete the questionnaire, and two subjects were removed for indicating they were not native speakers of English. Subjects who had over 20% of trials with excessively quick responses [response time (RT) < 200 ms] were also removed (N = 2) [following Reinisch and Holt (2014), but using a less stringent exclusion criterion to minimize data loss]. Finally, a further 17 subjects were discarded due to unwillingness or inability to reliably perceive the difference between continuum end points in the test task. Similar to Zheng and Samuel (2020), we calculated a difference score by taking the per-subject proportion of [θ] responses for step 1 of each continuum (unambiguous [θ]) and then subtracting the proportion of [θ] responses at step 7 (unambiguous [s] or [f], depending on test condition). Based on visual inspection of the data, we set the difference score exclusion threshold at 20%. This is less stringent than the threshold used by Zheng and Samuel (2020) but reflects the greater perceptual similarity of the [θ] and [f] end points used in the current study and was used primarily to exclude participants who were not performing the task in good faith. 
A total of 69 subjects were thus excluded, and data from the remaining 120 subjects were further processed to remove trials with very fast (<200 ms) or very slow RTs (>2500 ms), following Reinisch and Holt (2014). This resulted in exclusion of 6.28% of trials in the lexical decision task and 5.27% of trials in the categorization task.
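The end-point difference score used for exclusion can be sketched as follows (an illustrative Python sketch, not the authors' analysis code; function names and the response encoding are assumptions, with 1 = /θ/ response):

```python
def difference_score(responses):
    """Per-subject proportion of [θ] responses at step 1 (unambiguous [θ])
    minus the proportion at step 7 (unambiguous [s] or [f]).
    responses maps step -> list of 1/0 categorization responses."""
    def prop(step):
        r = responses[step]
        return sum(r) / len(r)
    return prop(1) - prop(7)

def keep_subject(responses, threshold=0.20):
    """Retain subjects whose end points are separated by at least 20%."""
    return difference_score(responses) >= threshold

# A subject responding /θ/ on 15/16 trials at step 1 and 2/16 at step 7:
resp = {1: [1] * 15 + [0], 7: [1] * 2 + [0] * 14}
print(keep_subject(resp))  # -> True
```

A subject responding at chance at both end points would score near zero and be excluded under the 20% threshold.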

Data were analyzed via mixed-effects logistic regression modeling using the lme4 package (Bates et al., 2014) in R (R Core Team, 2022). Maximal random effect structure was used for each model where this did not result in convergence issues, with random slopes fitted for all within-item predictors (Barr, 2013). The inclusion of fixed effects and interactions between them was decided via a stepwise selection process using a likelihood ratio test—terms that did not significantly improve model fit were removed.

Categorical variables were dummy-coded and included condition (training vs control) and target word position (word-initial or word-final). Continuum step (1–7) was also included as a centered numeric variable (such that the reference level corresponded to the middle of the continuum). Random effects included by-listener intercepts and slopes for condition and for continuum step. Issues with model convergence were addressed by first simplifying random effect structure and then by simplifying interaction terms.
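The predictor coding can be sketched as follows (illustrative Python; the actual models were fit with lme4 in R, and the particular reference levels shown here are an assumption, as the text does not state them):

```python
def code_predictors(condition, position, step):
    """Dummy-code condition and word position, and center continuum
    step (1-7) so that 0 corresponds to the continuum midpoint."""
    cond = 1 if condition == "training" else 0   # assumed reference: control
    pos = 1 if position == "initial" else 0      # assumed reference: word-final
    step_c = step - 4                            # midpoint of a 7-step continuum
    return cond, pos, step_c

print(code_predictors("training", "final", 1))  # -> (1, 0, -3)
```

Centering step this way means the condition effect is evaluated at the middle of the continuum, where responses are most ambiguous.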

1. Lexical decision

Participants showed a high accuracy rate for all stimulus types, including the ambiguous /θ/ item types (see Table I for a summary of RTs and proportion items correct by stimulus type). RTs were somewhat higher for the non-word fillers compared to the other stimulus types, but there appeared to be no difference between filler words and critical /θ/ items, suggesting that the ambiguous pronunciation sounded relatively natural.

TABLE I.

Experiment 1 lexical decision task results: Mean accuracy rates and response times (in milliseconds) for correct items, by word type.

                     Correct (%)   RT (ms)
Filler non-words         94.2       1788
Filler words             94.5       1646
Critical /s/-words       97.4       1640
Critical /θ/-words       93.2       1685

2. Categorization

Results of the categorization task for the /θ/-/s/ test group showed a significant effect of condition [χ2(1) = 4.14, p < 0.05], with participants exposed to ambiguous /θ/ = [θ/s] categorizing an average of 6.11% more tokens as /θ/ compared to controls (see Fig. 3). There was also a significant effect of word position [χ2(1) = 21.59, p < 0.001], with word-initial [θ/s] targets showing a higher proportion of /θ/ responses overall (54.18%) compared to word-final tokens (48.34%). However, although the size of the training effect was larger with word-final continua (8.43% more /θ/ responses) than with word-initial ones (3.82%), there was no significant interaction of group and word position [χ2(1) = 1.35, p = 0.24], so it is unclear if word position affected the degree of perceptual learning (see Appendix C for a visualization of the recalibration effect by word position). Continuum step was also significant, with the likelihood of a /θ/ response decreasing with step number [χ2(1) = 278.49, p < 0.001], and there was a significant interaction of step and word position [χ2(1) = 12.84, p < 0.001], indicating that the effect of step was weaker with word-initial /θ/ targets.

FIG. 3.

Proportion of /θ/ responses for groups tested on categorizing seven-step [θ]–[s] phonetic continua, by exposure condition.


Results for the [θ]–[f] group showed a small difference (1.33%) in the mean proportion of /θ/ responses between exposure and control groups (see Fig. 4), but this was not significant [χ2(1) = 0.01, p = 0.93], suggesting that phonetic recalibration did not occur in this test condition. As in the [θ]–[s] condition, model results showed a significant effect of step [χ2(1) = 430.11, p < 0.001]. There was also a significant effect of word position [χ2(1) = 6.07, p < 0.05], with a lower likelihood of /θ/ responses predicted for word-initial words—the reverse of the pattern shown for the [θ]–[s] test group. Results showed that listeners categorized 64.92% of tokens in the word-final continua as /θ/, but only 60.11% for word-initial continua.

FIG. 4.

Proportion of /θ/ responses for groups tested on categorizing seven-step [θ]–[f] phonetic continua, by exposure condition.


Again, there was a small difference in proportion of /θ/ responses by word position across conditions: trained listeners showed 2.34% more /θ/ responses for word-initial tokens vs controls and 0.39% more for word-final ones. There was no significant interaction of word position and exposure condition, although word-initial continua showed a larger difference between trained listeners and controls (see  Appendix C).

Results from experiment 1 are consistent with a non-uniform category expansion mechanism of phonetic recalibration. Participants trained on an ambiguous [θ/s] realization of the phoneme /θ/ showed a significant increase in the proportion of /θ/ responses compared to controls. However, participants trained on [θ/s] and tested on a distinct continuum [θ]–[f] did not show such a shift, which suggests that the learning strategy is contrast-specific and does not result in perceptual changes in other, neighboring parts of phonetic space—there was no evidence for a general expansion of phoneme categories.

Many previous tests of generalization in phonetic recalibration have focused on whether listeners generalize learning to a novel speaker with the same pronunciation. In this case, by contrast, we are essentially testing for generalization of a novel pronunciation within the same speaker. We do this by looking at the possibility of generalization of learning to a novel phonetic contrast involving the trained category. Results suggest that learning is contrast-specific, with no transfer of learning for the distinct fricative contrast [θ]-[f].

Given evidence from experiment 1 for a non-uniform category expansion and prior evidence (White and Aslin, 2011) that recalibration may generalize to phonetically similar segments, experiment 2 tests for an effect of exposure to ambiguous [θ/s] on listeners' categorization of a [θ]–[ʃ] continuum. The question addressed by experiment 2 is whether the effect seen in experiment 1 is contrast-specific or represents a category expansion that could have an impact on the perception of a sound [ʃ] that is acoustically similar to [s].

Previous research has shown high acoustic similarity of [θ] and [f]: both are low-amplitude and characterized by relatively flat acoustic power spectra (Jongman et al., 2000; Stevens, 1998). They have also been shown to be perceptually confusable (Cutler et al., 2004; Miller and Nicely, 1955) as illustrated in Fig. 2 (left panel). Categorization results for [θ]–[f] in the present study support this, with the flat, linear categorization functions (Fig. 4) indicating that listeners found these stimuli highly confusable—even at the [f] end point of the continuum, listener responses are near-chance. This contrasts sharply with results for the [θ]–[s] test groups, which display an unambiguous s-curve typical of categorical perception (Fig. 3). It is thus plausible that generalization failed due to low acoustic similarity between [s] and [f] but that there could still be a change in categorization of other fricatives that were more acoustically similar to [s]. For instance, the sibilant [ʃ] shares with [s] a relatively high amplitude and a prominent spectral peak (Jongman et al., 2000; Stevens, 1998), and it is perceptually similar (see Fig. 2, left panel). While the results from experiment 1 are consistent with the non-uniform boundary expansion hypothesis proposed in the current study, they do not conclusively answer the question of whether learning is contrast-specific or not. Testing another portion of phonetic space proximal to [θ] but more acoustically similar to [s] could thus shed additional light on the nature of the recalibration mechanism.

An additional 112 participants were initially recruited via MTurk following the same inclusion criteria as in experiment 1. None had participated in experiment 1. Exclusion of participants who failed to meet experimental criteria left a total of 68 participants whose data were retained for data analysis (see Sec. III D for an explication of exclusion criteria). Participants had a similar demographic profile to those in experiment 1. They were primarily male (M = 42, F = 25), with one subject declining to report and no subjects selecting the “other” option. Most participants self-reported as White (N = 51), with three self-reporting as Asian, six as Black, three as Hispanic, and five declining to answer. Participants spanned a wide range of age groups (18–20: 1; 21–25: 6; 26–30: 11; 31–35: 21; 36–40: 13; 41–45: 6; 46–50: 4; 51–55: 4; 61+: 2). Participants self-reported moderate experience with accented English on a seven-point scale (mean = 4.97, SD = 1.73, range = 1–7).

The same training materials used in the lexical decision task in experiment 1 were also used here. Each subject group was then tested on four separate minimal-pair continua involving the /θ/-/ʃ/ contrast, e.g., math-mash (see Appendix B for a full list of test materials). The same procedure was used to morph naturally produced recordings into seven-step continua. All audio stimuli used in both experiments are included as supplementary materials.4

The same training/test procedure used in experiment 1 was used, with the presence of a perceptual learning effect in the categorization phase tested by comparing an exposure group to a corresponding control with no training. Training items were identical to those used in experiment 1, and the number of tokens in the test phase also matched that of experiment 1.

Data from the 112 participants initially recruited were screened following the same criteria used in experiment 1. First, 33 subjects were removed for failure to meet the 70% accuracy criterion or to classify over 50% of the /θ/ target words as real words in the training task. An additional two participants were removed because they had already participated in the previous experiment. One subject was excluded for failure to complete the test task, three were excluded for failure to complete the questionnaire, and one was excluded for indicating that they were not a native speaker of English. An additional two subjects were excluded because over 20% of their trials had excessively fast reaction times (<200 ms). Finally, two subjects were excluded for failure to meet the difference score threshold, indicating that they could not reliably perceive the continuum end points. A total of 44 subjects were thus excluded from analysis. As in experiment 1, data from the remaining 68 subjects were filtered to exclude trials with excessively fast (<200 ms) or slow (>2500 ms) reaction times. This resulted in the removal of 10.9% of trials for the categorization task.

Model selection procedures, coding scheme for predictor variables, and random effect structure were all identical to experiment 1.

1. Lexical decision

As in experiment 1, participants showed a high accuracy rate for all stimulus types, including the ambiguous /θ/ item types (see Table II). Unlike in experiment 1, however, critical /θ/ items appeared to pattern with filler non-words, showing higher RTs than for filler /s/-words. Accuracy rates for critical /θ/-words were nonetheless comparable to other items, indicating that participants accepted these items as sufficiently natural.

TABLE II.

Experiment 2 lexical decision task results: mean accuracy rates and response times (in milliseconds) for correct items, by item type.

                     Correct (%)   RT (ms)
Filler non-words         90.0       1780
Filler words             91.6       1640
Critical /s/-words       92.4       1658
Critical /θ/-words       90.7       1798

Analysis of the categorization data showed a significant effect of condition [χ2(1) = 5.33, p < 0.05], with participants in the exposure group classifying an average of 7.08% more tokens as /θ/ compared to controls (see Fig. 5). Participants showed almost the same size training effect for word-initial continua (7.37% more /θ/ responses vs controls) as for word-final ones (6.79% more /θ/ responses vs controls), with no significant interaction between word position and exposure condition [χ2(1) = 0.31, p = 0.58]. Again, a visualization of the training effect by word position can be found in  Appendix D. Results also showed a significant effect of continuum step [χ2(1) = 231.68, p < 0.001], with a significant interaction of step and target word position [χ2(1) = 23.51, p < 0.001] as well as step and condition [χ2(1) = 7.38, p < 0.01].

FIG. 5.

Proportion of /θ/ responses for groups tested on categorizing a seven-step [θ]–[ʃ] phonetic continuum, by exposure condition.


Results of experiment 2 showed that listeners exposed to an accent involving /θ/ = [θ/s] generalized learning to neighboring phonetic sounds, as shown by the clear categorization shift observed with [θ]-[ʃ] minimal pairs. These results build upon what was found in experiment 1, where learning appeared to be specific to the trained accent, with a categorization shift observed with [θ]–[s] minimal-pair continua, but not with [θ]–[f]. They illustrate that generalization of learning beyond the specific accent presented in exposure is possible, provided that the test sounds are phonetically similar enough. Thus, recalibration can result in changes to the perception of neighboring phonetic categories when those are perceived to be sufficiently similar to sounds used in the exposure accent. These results add to the existing recalibration literature that has found generalization of learning to novel phonetic contrasts (Kraljic and Samuel, 2006; Schuhmann, 2014).

The goal of the current study was to investigate the perceptual mechanism underlying phonetic recalibration. In particular, we wanted to test whether the change in categorization behavior following exposure to an atypical pronunciation of a particular phoneme was due to a targeted shift in the phonetic category or whether it was due to a more general mechanism of relaxing phonetic categorization criteria or expanding the category into neighboring perceptual space. Results of the current study provide support for a category expansion strategy, albeit one that is sensitive to the phonetic similarity between the exposure accent and neighboring sounds, consistent with the non-uniform category expansion proposed here [schematized in Fig. 1(C)].

The fact that perceptual learning in the present study generalized to a neighboring phonetic contrast involving the same target category /θ/ used in the exposure phase suggests that a simple boundary shift is not the underlying mechanism at work in recalibration—there must be some degree of category expansion, affecting subsequent perception of neighboring sounds. However, listeners cannot be doing this indiscriminately. That is, the strategy cannot simply be “this speaker has an unusual /θ/ pronunciation.” If this were the case, we would expect any atypical realization of /θ/ to be acceptable as an instance of this phoneme, but instead we see that the changes in perception appear to be constrained by (perceived) phonetic distance from the exposure accent. Listeners seem to be relying on some dimension of phonetic similarity to limit the degree of category expansion. In this case, it is possible that the spectral characteristics of [ʃ] and [s] are close enough for listeners, with the result that both categories show more overlap with neighboring /θ/. Meanwhile, because [f] is acoustically (Jongman et al., 2000; Stevens, 1998) and perceptually (Cutler et al., 2004; Miller and Nicely, 1955) distinct from [s], it appears to be unaffected by recalibration.

These results align with a large body of work on phonetic recalibration and perceptual learning more generally, which suggests that phonetic similarity between exposure and test sounds plays a critical role in generalization of learning. For instance, studies by Kraljic and Samuel (2005, 2007) and Eisner and McQueen (2005) have shown that generalization of ambiguous fricative pronunciations [s/f] or [s/ʃ] to new speakers is highly constrained. A possible explanation for this (Kraljic and Samuel, 2005, 2006, 2007) is that fricatives encode speaker-specific spectral information, and listeners are sensitive to acoustic differences in the way that these sounds are realized. More recent work bolsters this hypothesis, showing that when such differences between exposure and test speakers are minimized, generalization is possible. Reinisch and Holt (2014), for instance, found that listeners exposed to an ambiguous [f/s] accent in a Dutch-accented English speaker generalized to a novel Dutch-accented speaker when exposure and test speakers' fricatives were sampled from a similar perceptual space. Finally, an accent accommodation task by Xie and Myers (2017) found that successful transfer of learning for accented speech depended on the acoustic similarity between exposure and generalization speakers. Listeners did not appear to be learning an accent-wide schema that could be applied to new speakers, but rather more specific characteristics of the target sounds they were tested on.

The finding of generalization in the current study from a /θ/ = [θ/s] pronunciation to the /θ/-/ʃ/ contrast is particularly interesting as most previous work suggested that generalization of perceptual learning for fricatives is limited (Eisner and McQueen, 2005; Kraljic and Samuel, 2005, 2007) and has been found only when perceptual similarity between exposure and test speakers is controlled for (Reinisch and Holt, 2014). If, as earlier work has suggested, the specificity of learning in these cases has to do with listener sensitivity to the different acoustic realization of fricatives by male and female speakers, it is surprising to see generalization in the current study, where the differences between training and test contexts may be of similar magnitude. In the current study, the transfer of learning from an /s/-like realization of /θ/ to an /ʃ/-like one suggests that perceptual learning, at least for some fricatives, may be coarser-grained. Importantly, however, within-speaker transfer of learning across categories could be a process fundamentally different from the cross-speaker generalization tested in earlier work, so comparisons between the two should be made cautiously.

As we have already noted, the finding of generalization of perceptual learning in the present study is not novel on its own. The key finding in the current study is that learning can generalize to a novel phonetic contrast involving the same underlying category manipulated in training, a scenario that has received little attention in prior work. Generalization observed in previous recalibration studies typically involves sound contrasts where both categories are distinct from the trained phoneme, e.g., where listeners are trained on /d/ = [d/t] and generalize to the [b]-[p] contrast (Kraljic and Samuel, 2006, 2007; Mitterer et al., 2016; Mitterer and Reinisch, 2017; Schuhmann, 2014). Closer to the current study, generalization has also been found across languages for bilingual speakers, where learning-induced categorization shifts transfer from L1 to L2 or vice versa (Reinisch and Holt, 2014; Schuhmann, 2016). However, as the authors note, differences between each of the trained sounds /s/ and /f/ across these languages (Dutch, German, and English) are minimal (especially if speakers produce the L2 versions with an L1 accent). More relevant would be an example of transfer across languages with more distinct realizations of a given category (e.g., Spanish dental /d/ vs English alveolar /d/), but we are not aware of work that has tested this.

As we have seen, existing evidence for a category expansion account of perceptual learning across different tasks is mixed. In some cases, what we see looks more like an indiscriminate relaxation of categorization criteria that may extend beyond the trained category or phonetic pattern. In the face of uncertainty about the input or extensive variation, the perceptual system might adopt an “anything goes” adaptive strategy. However, the current study does not support such a view. While results clearly show that changes in categorization criteria can extend beyond the exact phonetic pattern in exposure stimuli, they also show that these changes are constrained by similarity to the trained accent. Results of the current study thus provide support for (a version of) the category expansion hypothesis discussed in earlier work (Kleinschmidt and Jaeger, 2015; Schmale et al., 2012).

However, additional research is necessary to show that this strategy is not specific to the particular sound categories and contrasts used in these experiments. Earlier work suggests that recalibration strategies may differ depending on the sounds involved (Kraljic and Samuel, 2006, 2007). Crucially, we have assumed in the current study that learning occurs at the level of the trained category (in this case, /θ/). However, other studies indicate that learning may occur at a more abstract (e.g., feature) level. For instance, Kraljic and Samuel (2006, 2007) have found that listeners can generalize a stop voicing contrast across place of articulation, while Schuhmann (2014) found generalization of a fricative pronunciation involving [f/s] to their voiced counterparts [v]-[z]. Therefore, it is plausible that listeners in the current study are abstracting over acoustic differences between [s] and [ʃ] and instead focusing on their shared properties as sibilants—relatively loud fricatives with spectral similarities.

Such results have implications for the learning of natural accents. Recent work on phonetic recalibration has investigated whether this mechanism is responsible for perceptual adaptation to non-native accents (Reinisch and Holt, 2014; Zheng and Samuel, 2020), and this remains an open question in the literature. While the present study does not directly address this question, results suggest that the perceptual mechanism underlying the recalibration effect is flexible enough to accommodate differences between exposure and generalization contexts. This is important, because to adapt to a natural accent, listeners must be able to handle the substantial variation that exists within speakers of a given accent group, where the same target L2 category may be realized in multiple distinct ways (Seibert, 2011). Thus, it can behoove listeners to generalize to some extent or to maintain a scope of learning that is relatively coarse. In other words, it is plausible that recalibration is not based on the precise acoustic properties of heard exemplars, but rather on rougher measures of phonetic similarity. Taken together, these results support a view of the perceptual system as flexible and able to adjust to variation in the environment, but simultaneously constrained to not overgeneralize learning.

This work was supported by a National Science Foundation (NSF) Graduate Research Fellowship to Y.V.M. (Grant No. 1752814). A portion of this work was presented at the 181st Meeting of the Acoustical Society of America and to the Stanford Phonetics Research Group and has benefited from audience feedback in both cases. We also thank Arthur Samuel and an anonymous reviewer for their thoughtful and constructive feedback and Ronald Sprouse for technical assistance.

See Table III for a list of training materials.

TABLE III. Training materials. Each row lists one /θ/-word, one /s/-word, three filler words, and five non-words.

/θ/-word | /s/-word | Fillers | Non-words
anthem | reconcile | multitude, aluminum, kimono | adgendoy, dioryle, oudrenoa, hartacko, ayarbik
apathetic | eraser | polymer, undertaker, defending | akelen, udanaco, pelade, pelnimated, nererant
apathy | rigorous | topical, maternal, outnumber | altartalized, dynrem, pleope, bulerame, bonimaded
beneath | clandestine | durable, domination, Caribbean | altercole, elember, potler, toalbinade, contaluow
breakthrough | admissible | undertow, mutilated, parachute | tulable, tonker, pocorome, nomikord, odanatar
commonwealth | disparaged | untoward, detachment, commoner | amahate, etoced, polacual, caltacate, tumbodel
Dorothy | hallucinate | workable, gunpowder, broadway | ampoter, etugant, premetor, okenel, cumpamer
earthquake | intelligence | challenged, abandoned, engineer | anapt, guncore, prodabanga, pontradashing, odecogo
empathy | episode | pretended, marketable, optional | ancarrunt, haderate, radorcattoon, cayarac, nobelake
hypothetical | elements | unarmed, compilation, termination | ancorrack, pabutda, rapombargad, coerpage, dadratar
Jonathan | democracy | adorable, degree, moderate | darkackood, andegaul, reatonape, arunimung, oggander
marathon | pregnancy | determine, opportune, dominated | anguilder, horabtalane, reibonairo, cleniot, dargora
Neanderthal | narcotics | underwater, independent, laundry | nantor, kedoac, relecker, puroucly, omblatal
tablecloth | Tennessee | abortion, argument, ambush | noda, lampunger, garempit, umberpater, rallad
telepathy | absence | awarded, unlucky, coronation | becoor, leggarer, klepentip, collattar, omblegon
Timothy | outsmart | terminated, uploading, Kennedy | bocowlable, meloded, ungarnet, nannotad, demurea
twentieth | Arkansas | coordinate, retrograde, commendation | conkartat, mebable, tepelnim, nadelmar, omparkandar
unauthorized | amnesty | opener, bureaucrat, hibernate | malatad, molorat, umbelyaper, combeter, dentakter
undergrowth | prosper | monitored, abnormal, mutilation | booktugner, morachable, abolaper, negryhad, otler
unethical | participate | congregation, Abraham, recommended | rorana, motolad, adorshem, ponimashum, connontnor
1. Experiment 1 test continua

See Table IV for a full list of experiment 1 test materials.

TABLE IV. Experiment 1 test continua.

Group | Minimal pair | Target word position
[θ]–[f] | oath–oaf | Word-final
[θ]–[f] | death–deaf | Word-final
[θ]–[f] | thin–fin | Word-initial
[θ]–[f] | thought–fought | Word-initial
[θ]–[s] | mouth–mouse | Word-final
[θ]–[s] | math–mass | Word-final
[θ]–[s] | thigh–sigh | Word-initial
[θ]–[s] | think–sink | Word-initial
2. Experiment 2 test continua

See Table V for a full list of experiment 2 test materials.

TABLE V. Experiment 2 test continua.

Minimal pair | Target word position
math–mash | Word-final
wrath–rash | Word-final
thought–shot | Word-initial
thin–shin | Word-initial

See Figs. 6 and 7 for visualizations of the recalibration effect by word position.

FIG. 6. Proportion of /θ/ responses for groups tested on categorizing seven-step [θ]–[s] phonetic continua, by exposure condition and target word position (experiment 1).
FIG. 7. Proportion of /θ/ responses for groups tested on categorizing seven-step [θ]–[f] phonetic continua, by exposure condition and target word position (experiment 1).

See Fig. 8 for a visualization of the training effect by word position.

FIG. 8. Proportion of /θ/ responses for groups tested on categorizing seven-step [θ]–[ʃ] phonetic continua, by exposure condition and target word position (experiment 2).
1. These different views on how recalibration might occur have a parallel in earlier research on selective adaptation (SA), a perceptual phenomenon whereby repeated presentation of a given stimulus (e.g., /ga/) reduces subsequent perception of that stimulus (resulting in fewer /ga/ responses when listeners categorize a [ga]–[ka] phonetic continuum). The earliest account of SA (Eimas and Corbit, 1973) posited that the basic mechanism was fatigue of phonetic feature detectors. So, for instance, a detector tuned to the feature [+voice] was predicted to show a reduction in output strength following presentation of a sound like [ga], across the entire range of VOT values to which that detector is sensitive. This view thus suggests a more general mechanism structurally parallel to uniform category expansion. An alternative to the feature detector explanation was the adaptation-level (AL) or contrast account of SA (Diehl, 1981; Diehl et al., 1978). Under this view, repeated presentation of a given stimulus (the anchor, or adaptation level) causes listeners to shift their categorization boundary toward it, a mechanism more closely resembling the category shift or non-uniform category expansion hypotheses considered in the present study.

2. A version of this program can be accessed at https://github.com/keithjohnson-berkeley/perception_on_the_web (Last viewed October 5, 2022).

3. As noted by a reviewer, audio quality may be especially important in the perception of certain sounds, such as voiceless fricatives, where differences between sounds lie primarily in the higher frequencies of the spectrum. While we utilized a headphone check in the present study, it was not possible to precisely control for differences in audio hardware quality; participant self-reports suggest, however, that the majority (over 75%) used high-quality headphones. Nonetheless, previous research (Charoy, 2021; Liu and Jaeger, 2019) has shown that it is possible to conduct successful online recalibration experiments with voiceless fricatives, even when a headphone check is not utilized.

4. See supplementary materials at https://www.scitation.org/doi/suppl/10.1121/10.0014602 for all training and test audio.

1. Alexander, J. E. D., and Nygaard, L. C. (2019). "Specificity and generalization in perceptual adaptation to accented speech," J. Acoust. Soc. Am. 145(6), 3382–3398.
2. Barr, D. J. (2013). "Random effects structure for testing interactions in linear mixed-effects models," Front. Psychol. 4, 328.
3. Bates, D., Mächler, M., Bolker, B., and Walker, S. (2014). "Fitting linear mixed-effects models using lme4," arXiv:1406.5823.
4. Bradlow, A. R., and Bent, T. (2008). "Perceptual adaptation to non-native speech," Cognition 106(2), 707–729.
5. Brysbaert, M., New, B., and Keuleers, E. (2012). "Adding part-of-speech information to the SUBTLEX-US word frequencies," Behav. Res. 44(4), 991–997.
6. Buhrmester, M., Kwang, T., and Gosling, S. D. (2011). "Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data?," Perspect. Psychol. Sci. 6(1), 3–5.
7. Charoy, J. (2021). "Accommodation to non-native accented speech: Is perceptual recalibration involved?," Ph.D. thesis, State University of New York at Stony Brook, Stony Brook, NY.
8. Clarke, C. M., and Garrett, M. F. (2004). "Rapid adaptation to foreign-accented English," J. Acoust. Soc. Am. 116(6), 3647–3658.
9. Cooke, M., and García Lecumberri, M. L. (2021). "How reliable are online speech intelligibility studies with known listener cohorts?," J. Acoust. Soc. Am. 150(2), 1390–1401.
10. Cooper, A., and Bradlow, A. (2018). "Training-induced pattern-specific phonetic adjustments by first and second language listeners," J. Phon. 68, 32–49.
11. Cutler, A., Weber, A., Smits, R., and Cooper, N. (2004). "Patterns of English phoneme confusions by native and non-native listeners," J. Acoust. Soc. Am. 116(6), 3668–3678.
12. Diehl, R. L. (1981). "Feature detectors for speech: A critical reappraisal," Psych. Bull. 89(1), 1.
13. Diehl, R. L., Elman, J. L., and McCusker, S. B. (1978). "Contrast effects on stop consonant identification," J. Exp. Psychol. Hum. Percept. Perform. 4(4), 599.
14. Eimas, P. D., and Corbit, J. D. (1973). "Selective adaptation of linguistic feature detectors," Cogn. Psychol. 4(1), 99–109.
15. Eisner, F., and McQueen, J. M. (2005). "The specificity of perceptual learning in speech processing," Percept. Psychophys. 67(2), 224–238.
16. Eisner, F., and McQueen, J. M. (2006). "Perceptual learning in speech: Stability over time," J. Acoust. Soc. Am. 119(4), 1950–1953.
17. Ganong, W. F. (1980). "Phonetic categorization in auditory word perception," J. Exp. Psychol. Hum. Percept. Perform. 6(1), 110–125.
18. Iverson, P., and Kuhl, P. (2000). "Perceptual magnet and phoneme boundary effects in speech perception: Do they arise from a common mechanism?," Percept. Psychophys. 62(4), 874–886.
19. Jesse, A., and McQueen, J. M. (2011). "Positional effects in the lexical retuning of speech perception," Psychon. Bull. Rev. 18(5), 943–950.
20. Johnson, K. (2003). Acoustic and Auditory Phonetics, 2nd ed. (Blackwell, Oxford, UK).
21. Jongman, A., Wayland, R., and Wong, S. (2000). "Acoustic characteristics of English fricatives," J. Acoust. Soc. Am. 108(3), 1252–1263.
22. Kawahara, H., Morise, M., Takahashi, T., Nisimura, R., Irino, T., and Banno, H. (2008). "Tandem-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation," in Proceedings of the 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, March 31–April 4, Las Vegas, NV, pp. 3933–3936.
23. Kleinschmidt, D. F., and Jaeger, T. F. (2015). "Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel," Psychol. Rev. 122(2), 148–203.
24. Kraljic, T., Brennan, S. E., and Samuel, A. G. (2008). "Accommodating variation: Dialects, idiolects, and speech processing," Cognition 107(1), 54–81.
25. Kraljic, T., and Samuel, A. G. (2005). "Perceptual learning for speech: Is there a return to normal?," Cogn. Psychol. 51(2), 141–178.
26. Kraljic, T., and Samuel, A. G. (2006). "Generalization in perceptual learning for speech," Psychon. Bull. Rev. 13(2), 262–268.
27. Kraljic, T., and Samuel, A. G. (2007). "Perceptual adjustments to multiple speakers," J. Mem. Lang. 56(1), 1–15.
28. Lev-Ari, S. (2015). "Comprehending non-native speakers: Theory and evidence for adjustment in manner of processing," Front. Psychol. 5, 1546.
29. Liu, L., and Jaeger, T. F. (2019). "Talker-specific pronunciation or speech error? Discounting (or not) atypical pronunciations during speech perception," J. Exp. Psychol. Hum. Percept. Perform. 45(12), 1562–1588.
30. Maye, J., Aslin, R. N., and Tanenhaus, M. K. (2008). "The Weckud Wetch of the Wast: Lexical adaptation to a novel accent," Cogn. Sci. 32(3), 543–562.
31. Melguy, Y. V., and Johnson, K. (2021). "General adaptation to accented English: Speech intelligibility unaffected by perceived source of non-native accent," J. Acoust. Soc. Am. 149(4), 2602–2614.
32. Miller, G. A., and Nicely, P. A. (1955). "An analysis of perceptual confusions among some English consonants," J. Acoust. Soc. Am. 27, 338–352.
33. Miller, J. L., and Volaitis, L. E. (1989). "Effect of speaking rate on the perceptual structure of a phonetic category," Percept. Psychophys. 46(6), 505–512.
34. Mitterer, H., Cho, T., and Kim, S. (2016). "What are the letters of speech? Testing the role of phonological specification and phonetic similarity in perceptual learning," J. Phon. 56, 110–123.
35. Mitterer, H., and Reinisch, E. (2017). "Surface forms trump underlying representations in functional generalisations in speech perception: The case of German devoiced stops," Lang. Cogn. Neurosci. 32(9), 1133–1147.
36. Mitterer, H., Scharenborg, O., and McQueen, J. M. (2013). "Phonological abstraction without phonemes in speech perception," Cognition 129(2), 356–361.
37. Norris, D., McQueen, J. M., and Cutler, A. (2003). "Perceptual learning in speech," Cogn. Psychol. 47(2), 204–238.
38. R Core Team (2022). R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria).
39. Reinisch, E., and Holt, L. L. (2014). "Lexically guided phonetic retuning of foreign-accented speech and its generalization," J. Exp. Psychol. Hum. Percept. Perform. 40(2), 539–555.
40. Reinisch, E., and Mitterer, H. (2016). "Exposure modality, input variability and the categories of perceptual recalibration," J. Phon. 55, 96–108.
41. Reinisch, E., Weber, A., and Mitterer, H. (2013). "Listeners retune phoneme categories across languages," J. Exp. Psychol. Hum. Percept. Perform. 39(1), 75–86.
42. Schmale, R., Cristia, A., and Seidl, A. (2012). "Toddlers recognize words in an unfamiliar accent after brief exposure," Dev. Sci. 15(6), 732–738.
43. Schuhmann, K. S. (2014). "Perceptual learning in second language learners," Ph.D. thesis, State University of New York at Stony Brook, Stony Brook, NY.
44. Schuhmann, K. S. (2016). "Cross-linguistic perceptual learning in advanced second language listeners," Proc. Linguist. Soc. Am. 1, 31.
45. Seibert, A. (2011). "A sociophonetic analysis of L2 substitution sounds of American English interdental fricatives," master's thesis, Southern Illinois University at Carbondale, Carbondale, IL.
46. Shepard, R. N. (1972). "Psychological representation of speech sounds," in Human Communication: A Unified View, edited by E. David and P. Denes (McGraw-Hill, New York), pp. 67–113.
47. Sidaras, S. K., Alexander, J. E. D., and Nygaard, L. C. (2009). "Perceptual learning of systematic variation in Spanish-accented speech," J. Acoust. Soc. Am. 125(5), 3306–3316.
48. Stevens, K. N. (1998). Acoustic Phonetics (MIT, Cambridge, MA).
49. Vaughn, C. R. (2019). "Expectations about the source of a speaker's accent affect accent adaptation," J. Acoust. Soc. Am. 145(5), 3218–3232.
50. Weatherholtz, K. (2015). "Perceptual learning of systemic cross-category vowel variation," Ph.D. thesis, The Ohio State University, Columbus, OH.
51. White, K. S., and Aslin, R. N. (2011). "Adaptation to novel accents by toddlers," Dev. Sci. 14(2), 372–384.
52. Woods, K. J. P., Siegel, M. H., Traer, J., and McDermott, J. H. (2017). "Headphone screening to facilitate web-based auditory experiments," Atten. Percept. Psychophys. 79(7), 2064–2072.
53. Xie, X., and Myers, E. B. (2017). "Learning a talker or learning an accent: Acoustic similarity constrains generalization of foreign accent adaptation to new talkers," J. Mem. Lang. 97, 30–46.
54. Xie, X., Theodore, R. M., and Myers, E. B. (2017). "More than a boundary shift: Perceptual adaptation to foreign-accented speech reshapes the internal structure of phonetic categories," J. Exp. Psychol. Hum. Percept. Perform. 43(1), 206–217.
55. Zheng, Y., and Samuel, A. G. (2020). "The relationship between phonemic category boundary changes and perceptual adjustments to natural accents," J. Exp. Psychol. Learn. Mem. Cogn. 46(7), 1270–1292.
