Limited exposure to ambiguous auditory stimuli results in perceptual recalibration. When unambiguous stimuli are used instead, selective adaptation (SA) effects have been reported, even after few adaptor presentations. Crucially, selective adaptation by an ambiguous sound in biasing lexical contexts had previously been found only after massive adaptor repetition [Samuel (2001). Psychol. Sci. 12(4), 348–351]. The present study shows that extensive exposure is not necessary for lexically driven selective adaptation to occur: it can arise after as few as nine adaptor presentations. Additionally, inspection of the build-up course reveals several parallels with the time course observed for SA with unambiguous stimuli.

Selective adaptation in speech (henceforth SA), like other psychophysical adaptation phenomena, stems from the fact that the repeated presentation of stimuli with the same acoustic characteristics reduces the likelihood of subsequently perceiving similar sounds as members of the same category. SA has been found to occur after the repeated presentation of both unambiguous auditory stimuli (i.e., endpoints of a given category in an acoustic continuum) and congruent bimodal stimuli, that is, when unambiguous auditory stimuli are paired with videos of their articulation (Bertelson et al., 2003; Eimas and Corbit, 1973; Vroomen et al., 2007). A central tenet in studies of SA with auditory stimuli is that substantial exposure to adaptor stimuli is needed for SA to arise. Eimas and Corbit's (1973) seminal work presented 1350 adaptor trials distributed over 16 adaptation-test phases. For congruent bimodal stimuli, in contrast, studies by Vroomen and colleagues (Bertelson et al., 2003; Vroomen et al., 2007; Vroomen et al., 2004) showed that SA could be found reliably when exposure to the audio-visual adaptors was limited to 64 repetitions. In fact, there is evidence of SA arising as early as after only eight presentations of the critical adaptor stimulus (Vroomen et al., 2007).

While adaptation to unambiguous bimodal stimuli triggers SA, using ambiguous auditory stimuli together with disambiguating visual information as adaptors results in perceptual recalibration, which increases responses consistent with the biasing stimulus experienced in the adaptation phase (Bertelson et al., 2003). A case in point is Vroomen et al. (2007), where it was shown that, using the exact same experimental procedure, unambiguous auditory stimuli with congruent visual information (AbVb, AdVd) produced SA in a subsequent categorization task, while an ambiguous sound paired with the same visual information (A?Vb, A?Vd) triggered perceptual recalibration. This diverging pattern of results for ambiguous and unambiguous stimuli was also later replicated by Kleinschmidt and Jaeger (2015a) using a larger data-set. These authors argue that recalibration occurs rapidly after the presentation of ambiguous stimuli due to a shift of the phoneme boundary. By contrast, SA is characterized as a product of distributional learning (Kleinschmidt and Jaeger, 2015b) that occurs due to the sharpening of the phoneme boundary after presentation of unambiguous stimuli, which is caused by the shrinking of within-category variance.

Recalibration has also been found when the ambiguous sound appears in biasing lexical contexts (Kraljic and Samuel, 2007; Norris et al., 2003). Norris et al. (2003) replaced the final fricative (/f/ or /s/) from Dutch words with an ambiguous sound, half-way between /f/ and /s/. Listeners heard this ambiguous sound embedded either in /f/-final words (e.g., witlo?; witlof/*s) or in /s/-final words (e.g., naaldbo?; naaldbos/*f). Listeners who heard the ambiguous sound in /f/-final words were more likely to select /f/ in a following categorization task, whereas those who heard it in /s/-final words reported hearing more /s/ sounds. Critically, several parallels have been established between visually driven and lexically driven perceptual recalibration (van Linden and Vroomen, 2007), and it has been shown that both can occur after very limited exposure to the critical stimuli (i.e., after fewer than 10 repetitions in Vroomen et al., 2007; after 10 word adaptors in Kraljic and Samuel, 2007). There is, however, one study that reported SA by an ambiguous sound embedded in biasing lexical contexts after prolonged exposure to adaptors containing that sound (Samuel, 2001).

Samuel (2001) used the selective adaptation paradigm as a probe for lexical effects on phonemic processing. During the adaptation phases participants heard multiple presentations of an ambiguous /s/–/ʃ/ sound embedded in either /s/-final words (e.g., embarra?), or /ʃ/-final words (e.g., dimini?). Participants took part in two experimental sessions, one for each set of words, and were exposed to a total of 768 adaptor stimuli per session. In the post-adaptation test phases involving identification of sounds in an /ɪs/ - /ɪʃ/ continuum, participants were less likely to select the segment present in the words to which they had been exposed during adaptation. Samuel concluded that lexical activation filled in for the missing acoustic information in the signal, triggering SA in the same manner that an acoustically unambiguous sound does. Samuel and Frost (2015) have recently provided a replication of this finding for highly proficient non-native speakers of English. Nonetheless, alternative accounts suggest that the results of Samuel (2001), and by extension Samuel and Frost (2015), may stem from the very high number of adaptor repetitions in the experiment, which could be obscuring initial perceptual recalibration effects (Norris et al., 2003). This claim is further motivated by the fact that retuning in visually driven recalibration has been found to start dissipating quickly and be overtaken by SA after prolonged exposure (van Linden and Vroomen, 2007; Vroomen et al., 2007).

Building on Samuel (2001), the present study aims to test whether lexically driven selective adaptation occurs when the amount of adaptation is drastically reduced, closely matching the number of adaptors used in visual recalibration through lip-read information (Bertelson et al., 2003; Vroomen et al., 2007) and lexically driven perceptual recalibration (van Linden and Vroomen, 2007). By decreasing the number of adaptor presentations from 768 to 48, this work addresses three fundamental questions. First, it will be determined whether lexically driven SA can arise rapidly and with limited exposure. Second, and relatedly, our results will shed light on how limiting adaptor repetition may affect the magnitude of the SA effect. Finally, examining the changes in the participants' responses over time will allow us to assess whether SA by ambiguous auditory stimuli may be occurring as a by-product of recalibration, as suggested by Vroomen et al. (2007). If this were the case, one would expect early effects in the direction opposite to SA (i.e., recalibration-like effects) that fade away over time.

To shed light on these questions we examined Southwestern American English (SWAE) speakers' categorization of a continuum of voice onset times (VOT), the endpoints of which were bilabial stops differing in voicing embedded in lexical contexts (i.e., body-potty; minimal pair in SWAE). Responses were analyzed after exposure to two sets of adaptors in two separate experimental sessions. One adaptor set contained three words whose first sound was unambiguously /b/ and the other set consisted of three words whose first sound was unambiguously /p/. Importantly, three conditions differing in the VOT of the initial stop consonant were created for each set of adaptors. In addition to the main experimental condition with acoustically ambiguous adaptors (VOT = 10 ms) there were two control conditions in which the initial stop had clear /b/ and /p/ VOT values (−40 and 60 ms, respectively). These control conditions served as a baseline for a response shift in the experimental condition. Participants were only exposed to one of the three conditions throughout the experiment.

Participants presented with the adaptors with unambiguous VOT values were not expected to show a shift in categorization as a function of the adaptor set (/b/-words, /p/-words) because (i) SA is not presumed to be affected by lexical influences itself (Samuel, 1997), and (ii) the unambiguous acoustic information should be enough to determine the identity of the first consonant in each case (Connine and Clifton, 1987), regardless of the lexicality of the resulting adaptors (e.g., posit vs *bosit). In the ambiguous VOT condition, however, lexical effects on SA were put to test. If lexical activation triggered SA in this condition, participants would be more likely to choose body after exposure to the set of /p/-adaptors and more potty responses would be elicited by the repetition of /b/-adaptors.

A total of 57 native speakers of SWAE participated in this experiment. None reported any hearing deficiencies. They were undergraduate students at the University of Arizona who received credit for their participation. Each listener participated in two experimental sessions separated by at least 48 h, one for each lexical set of adaptors (/b/-words, /p/-words).

Seven productions of eight target words (body, baller, bomber, bother, potty, pollen, posit, and possum) were elicited from a female speaker of SWAE. All tokens were recorded in a sound-attenuated booth at The University of Arizona by means of a Marantz PMD660 digital speech recorder, a Shure SM10A head-mounted microphone and a SD MM-1 preamplifier. The data were sampled at 44.1 kHz with 16-bit quantization. Two of the target words (body and potty) were used to create the categorization continuum, while the other six items served as the two sets of adaptor stimuli. The adaptors were three disyllabic words that start with /b/ followed by the vowel /ɑ/ (baller, bomber and bother) and three disyllabic words that start with /p/ followed by the same vowel (pollen, posit and possum). Three native speakers of SWAE verified the match in vowel following the critical phoneme. These words were chosen because they share syllabic structure and word length (mean /p/-adaptors: 432 ms, mean /b/-adaptors: 416 ms; initial stops excluded), do not contain any released stop consonant other than the critical phoneme, and, most importantly, they do not result in a minimal pair if their initial phoneme is swapped in voicing.

An 11-step body/potty continuum ranging from −40 to 60 ms of VOT was constructed from natural speech productions by progressive cross-splicing at zero crossings (Ganong, 1980; among others). The endpoints chosen were those that were best matched in pitch, sound quality and formant frequencies. The selected body token was prevoiced and had a VOT of close to −40 ms, whereas the endpoint for potty had a VOT of 60 ms. Portions of 10 ms were removed from the prevoicing in body for the first five steps (−40 to 0 ms). Then, for the following steps, portions of 10 ms of aspiration from the onset of the selected potty token were cross-spliced onto the selected body token. In other words, the [ɑɾi] portion of all stimuli in the continuum originated from the same, single production of body.

Adaptor stimuli were modified following the same procedure in order to establish three VOT conditions for all adaptor words. Continua from −40 to 60 ms were created for each target word but only the endpoints and the most ambiguous step were used in the experiment. The prevoiced condition consisted of adaptors with −40 ms of VOT, which was congruent for the /b/-adaptors, but incongruent for the /p/-adaptors. Conversely, the aspirated condition had VOTs of 60 ms, a congruent value for the /p/-adaptors, but incongruent for the /b/-adaptors. The experimental condition required adaptor stimuli in which the VOT of the initial consonant was ambiguous for /p/-words and /b/-words alike. To establish the most ambiguous step on the /b/-/p/ continua, a group of 10 monolingual speakers of SWAE categorized all target stimuli (from −40 to 60 ms) in a pilot experiment. The 50% cross over point between /b/ and /p/ occurred at approximately 8 ms, that is, closest to the continuum step of 10 ms of VOT. Consequently, tokens from this step were chosen as the ambiguous stimuli for the adaptors in the experimental condition.
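The selection of the three adaptor VOT values can be sketched as follows. This is only an illustration of the logic described above, not the procedure actually scripted for the study; the variable names are hypothetical, and the 8 ms crossover is the approximate pilot value reported in the text.

```python
# Hypothetical sketch: choose the adaptor VOT values from the 11-step continuum.
VOT_STEPS_MS = list(range(-40, 70, 10))  # -40, -30, ..., 60 ms

# Approximate 50% crossover point reported for the pilot categorization task.
crossover_ms = 8

# The most ambiguous step is the continuum step closest to the crossover.
ambiguous_step = min(VOT_STEPS_MS, key=lambda v: abs(v - crossover_ms))

# The two control conditions use the continuum endpoints.
adaptor_vots = {
    "prevoiced": VOT_STEPS_MS[0],
    "ambiguous": ambiguous_step,
    "aspirated": VOT_STEPS_MS[-1],
}
print(adaptor_vots)
```

Picking the nearest step rather than synthesizing an 8 ms token keeps the adaptors on the same grid as the test continuum.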

Participants were randomly assigned to one of the three adaptor VOT conditions (prevoiced, aspirated or ambiguous) and were tested in two sessions, one for each set of adaptor stimuli (i.e., /b/-adaptors, /p/-adaptors), separated by at least 48 h. Hence, the adaptor stimuli belonged to the same VOT condition in the two sessions. Each session consisted of 16 adaptation-test blocks. In each adaptation phase listeners were exposed to one repetition of the three adaptor words for that session in randomized order. The inter-stimulus interval was 300 ms. Adaptation phases were immediately followed by a test phase. In each test phase participants were presented with all eleven steps of the body/potty continuum once, in randomized order. Their task was to categorize the stimuli as /b/ or /p/-initial by pressing two labeled buttons on a computer keypad. Each session lasted approximately 8 min and the order of the sessions was counterbalanced: Half of the listeners were exposed to the /p/-adaptors in the first experimental session and the other half heard the /b/-adaptors first. Nine participants' data were eliminated from further analyses due to missing files or because they selected only one response option throughout the experiment. Therefore, data from 48 participants were used for the final analyses: 15 in the prevoiced condition, 17 in the aspirated condition, and 18 in the ambiguous (i.e., experimental) condition.
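The structure of a single session can be made concrete with a short sketch. It assumes only what is stated above (16 blocks, three adaptors per adaptation phase, eleven test steps per test phase, both in randomized order); the function name, the tuple layout, and the seed are hypothetical, and timing (e.g., the 300 ms inter-stimulus interval) is not modeled.

```python
import random

# Adaptor words for each lexical set, as listed in the text.
ADAPTORS = {
    "b": ["baller", "bomber", "bother"],
    "p": ["pollen", "posit", "possum"],
}
VOT_STEPS_MS = list(range(-40, 70, 10))  # 11 test steps
N_BLOCKS = 16

def build_session(lexical_set, seed=None):
    """Return a list of (phase, block, item) tuples for one session."""
    rng = random.Random(seed)
    trials = []
    for block in range(N_BLOCKS):
        # Adaptation phase: one repetition of each adaptor, random order.
        adaptors = ADAPTORS[lexical_set][:]
        rng.shuffle(adaptors)
        trials += [("adapt", block, word) for word in adaptors]
        # Test phase: every continuum step once, random order.
        steps = VOT_STEPS_MS[:]
        rng.shuffle(steps)
        trials += [("test", block, vot) for vot in steps]
    return trials

session = build_session("p", seed=1)
n_adapt = sum(1 for phase, _, _ in session if phase == "adapt")
n_test = sum(1 for phase, _, _ in session if phase == "test")
print(n_adapt, n_test)  # 48 adaptor presentations, 176 test trials
```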

Data from the categorization task were analyzed using a generalized linear mixed-effects model with a binomial linking function (as implemented in the lme4 package 1.1–10 in R 3.2.2). The model included response (body/potty) as the dependent variable, and adaptor condition (prevoiced, ambiguous, aspirated), lexical set (/b/-adaptors, /p/-adaptors), and continuum step (ranging from −40 to +60 VOT in 10 ms increments) as fixed factors. A “body” response was coded as “0” and a “potty” response was coded as “1.” Causal priority was given to lexical set and the model included random intercepts for each subject. Significance of main effects and all possible interactions was assessed using hierarchical partitioning of the variance via nested model comparisons. Orthogonal contrast coding directly compared the participants' responses in each adaptor condition as a function of lexical set. We report p-values with alpha set at 0.05 and include confidence intervals of parameter estimates in order to provide an assessment of effect sizes.
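As a reading aid, the model just described can be written out schematically. The formula below is a sketch that assumes a logit link (the default for binomial models in R) and a full factorial fixed-effects structure; the symbols are illustrative and are not taken from the analysis script.

```latex
\operatorname{logit}\, P\bigl(y_{ij} = \text{``potty''}\bigr)
  = \beta_0 + \beta_{S}\, s_{ij} + \boldsymbol{\beta}_{C}^{\top} \mathbf{c}_{j} + \beta_{V}\, v_{ij}
  + (\text{all two- and three-way interactions}) + u_{j},
\qquad u_{j} \sim \mathcal{N}(0, \sigma_{u}^{2})
```

Here i indexes trials and j participants, s is the lexical set of the session, c is the vector of contrast codes for the between-participants adaptor condition, v is the continuum step, and u is the by-participant random intercept.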

The three panels in Fig. 1 show the proportion of “potty” responses for each step of the VOT continuum (from −40 to +60 VOT) as a function of adaptor condition (prevoiced, ambiguous, aspirated) and lexical set (/b/-adaptors, /p/-adaptors).

Fig. 1.

Proportion of “potty” responses as a function of continuum step (−40 to +60 VOT), adaptor condition (prevoiced, ambiguous, aspirated—Panels A, B, and C) and lexical set (/b/-adaptors, /p/-adaptors). Error bars represent the 95% confidence intervals.

The model revealed a main effect of continuum step [χ²(1) = 16 389.00; p < 0.001], but not of adaptor condition [χ²(2) = 0.842; p = 0.656], or lexical set [χ²(1) = 0.825; p = 0.363]. There was, however, an adaptor condition × lexical set interaction [χ²(2) = 8.20; p = 0.017], an adaptor condition × step interaction [χ²(2) = 9.50; p = 0.008], as well as an adaptor condition × lexical set × step interaction [χ²(2) = 13.40; p = 0.001]. The omnibus model directly compared the two lexical sets for each adaptor condition. The planned orthogonal contrasts revealed that the participants in the ambiguous condition responded to the VOT continuum differently as a function of lexical set. Specifically, participants were less likely to select “potty” when exposed to /p/-adaptors than /b/-adaptors (β = −0.55; CI = −0.79, −0.27; SE = 0.14; p < 0.001). The parameter estimates of the model indicated that, for the ambiguous condition, a one-unit increase in the continuum yielded a change in the log odds of selecting “potty” by −1.4 in the /b/-adaptor context, and by −1.6 in the /p/-adaptor context. Neither the prevoiced (β = 0.00; CI = −0.26, 0.26; SE = 0.13; p = 0.999) nor the aspirated (β = 0.20; CI = −0.03, 0.43; SE = 0.12; p = 0.092) conditions varied as a function of lexical set. In sum, the model revealed that the participants of the prevoiced and aspirated conditions identified the eleven steps of the VOT continuum in a similar manner in the two lexical sets. This is illustrated in Fig. 1 (panels A and C), as the 95% confidence intervals overlap at every step of the VOT continuum. Regarding the ambiguous condition (panel B), the boundary between “body” and “potty” shifted slightly to the right in the /p/-adaptor context, indicating that the participants were less likely to select “potty” at the crucial 10 ms step of the continuum.
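Because the contrast estimates are on the log-odds scale, it may help to see the lexical-set effect in the ambiguous condition as an odds ratio. The snippet below is a minimal sketch that simply exponentiates the estimate and confidence limits reported above; it assumes the model used a logit link, and the variable names are hypothetical.

```python
import math

# Reported contrast for the ambiguous condition (log-odds scale).
beta, ci_low, ci_high = -0.55, -0.79, -0.27

# Exponentiating a log-odds coefficient gives an odds ratio.
odds_ratio = math.exp(beta)
or_ci = (math.exp(ci_low), math.exp(ci_high))

print(f"OR = {odds_ratio:.2f}, 95% CI = [{or_ci[0]:.2f}, {or_ci[1]:.2f}]")
```

Under that assumption the odds of a “potty” response after /p/-adaptors are about 0.58 times the odds after /b/-adaptors.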

Additionally, the examination of the lexical shift's build-up course suggested that lexically driven SA in the ambiguous condition arose rapidly. A post hoc analysis examining the responses to the 10 ms stimuli in the ambiguous condition revealed a trial × lexical set interaction [χ²(1) = 8.19; p < 0.05]. Visual inspection of Fig. 2 shows a categorization pattern consistent with the effects expected of SA (i.e., fewer “potty” responses after /p/-adaptors than after /b/-adaptors) as early as in block number 2, that is, after only nine adaptor presentations (blocks 0, 1 and 2 × 3 adaptors).

Fig. 2.

Proportion of “potty” responses as a function of lexical set (/b/-adaptors, /p/-adaptors) and block for most ambiguous stimulus (VOT = 10 ms). The grey band represents the 95% confidence interval.

In the current study we tested whether lexically driven SA would be found after exposure to a limited number of adaptor stimuli. We adapted the procedure in Samuel (2001) such that participants heard only three adaptors per block (vs 32 in the original study) for a total of 48 presentations. The main finding is that lexical context triggered a small shift in categorization responses in the direction of SA (i.e., fewer /p/ responses after /p/-adaptors than after /b/-adaptors) in the ambiguous VOT condition (10 ms) but not in the two conditions where the VOT of the adaptors' initial stops was unambiguous (−40 and 60 ms). Our results show that SA effects can be found even when adaptor presentation in each block is limited to 9% of the exposures in Samuel (2001). Moreover, the results support the notion that lexically driven SA by ambiguous auditory stimuli occurs rapidly, with observable effects on categorization as early as after exposure to nine adaptor stimuli. This finding is in line with results from SA with congruent bimodal stimuli (Vroomen et al., 2007).
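The exposure figures in this comparison follow directly from the counts given in the text, as the short calculation below illustrates. The variable names are hypothetical; the numbers are those reported for the present design and for Samuel (2001).

```python
# Adaptor presentations per block and per session, as reported in the text.
present_per_block, samuel_per_block = 3, 32
present_per_session, samuel_per_session = 48, 768

per_block_ratio = present_per_block / samuel_per_block
per_session_ratio = present_per_session / samuel_per_session

print(f"per block:   {per_block_ratio:.1%}")
print(f"per session: {per_session_ratio:.1%}")
```

The per-block figure corresponds to the roughly 9% mentioned above, while the per-session total amounts to 6.25% of the exposure in the original study.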

Crucially, the results suggest that the lexical effect reported in the current study is not easily attributable either to massive repetition of distorted phonemes or to SA overcoming perceptual recalibration due to excessive exposure (Norris et al., 2003; Vroomen et al., 2007). The former hypothesis is not supported given the fact that SA effects could be found from early on in the present experiment. The latter proposal, on the other hand, seems unlikely due to critical differences between the present build-up course and that of perceptual recalibration through lip-read information in Vroomen et al. (2007). Perceptual recalibration effects (i.e., more /aba/ responses after A?Vb) were found to start decreasing after 32 exposures and were only cancelled out after the 256 exposures of which the experiment consisted. Since limited exposure was needed for the appearance of SA in our study, our results are not likely to stem from lexically driven recalibration having faded over time. Further evidence of this can be found by contrasting the present build-up course with that of SA due to congruent bimodal stimuli (i.e., AbVb) in Vroomen et al. (2007). In the two cases, an initial shift in the direction of recalibration is quickly done away with and SA already appears after nine and eight exposures, respectively (see Fig. 2, and Figs. 1 and 2 in Vroomen et al., 2007). Of note, a similar shift is also found after the first block, with 32 adaptor presentations, in Samuel (2001). As Vroomen et al. (2007) put it, a plausible explanation for the initial recalibration-like effect in all three studies is that the presentation of a stimulus perceived as unambiguous produces a priming or repetition effect in the opposite direction of SA after the first adaptor presentations, which is then rapidly overtaken by SA.

In sum, it has been shown that the results of Samuel (2001) can be replicated even when the number of adaptor stimuli is sharply reduced. The present study found that SA by adaptors with an ambiguous sound in unambiguous lexical contexts arose rapidly, and its build-up course was found to be similar to that of SA by unambiguous bimodal stimuli. The results are in line with previous research that finds that ambiguous sounds in unambiguous lexical contexts are able to render effects comparable to acoustically unambiguous stimuli due to the effect of lexical activation. That being said, given that Samuel (2001) and the present work both made use of an experimental paradigm in which repetition played a central role, it remains to be seen whether this lexical effect on phonemic processing is a result of the interaction between prelexical and lexical levels during online speech perception (McClelland et al., 2006), or if it stems from feedback for learning, which impacts the development of perceptual representations and processes over time (Norris et al., 2000, 2003).

We are thankful to Eva Reinisch and Miquel Simonet for their helpful comments on a previous version of this article. We would also like to thank Raquel de Horna for her help with participant recruiting.

1. Bertelson, P., Vroomen, J., and De Gelder, B. (2003). “Visual recalibration of auditory speech identification: A McGurk aftereffect,” Psychol. Sci. 14(6), 592–597.
2. Connine, C. M., and Clifton, C., Jr. (1987). “Interactive use of lexical information in speech perception,” J. Exp. Psychol.: Human Percept. Perform. 13(2), 291–299.
3. Eimas, P., and Corbit, J. (1973). “Selective adaptation of linguistic feature detectors,” Cognit. Psychol. 4(1), 99–109.
4. Ganong, W. F. (1980). “Phonetic categorization in auditory word perception,” J. Exp. Psychol.: Human Percept. Perform. 6(1), 110–125.
5. Kleinschmidt, D. F., and Jaeger, T. F. (2015a). “Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel,” Psychol. Rev. 122(2), 148–203.
6. Kleinschmidt, D., and Jaeger, T. F. (2015b). “Re-examining selective adaptation: Fatiguing feature detectors, or distributional learning?,” Psychon. Bull. Rev. 1–14.
7. Kraljic, T., and Samuel, A. (2007). “Perceptual adjustments to multiple speakers,” J. Mem. Lang. 56, 1–15.
8. McClelland, J. L., Mirman, D., and Holt, L. L. (2006). “Are there interactive processes in speech perception?,” Trends Cognit. Sci. 10(8), 363–369.
9. Norris, D., McQueen, J., and Cutler, A. (2000). “Merging information in speech recognition: Feedback is never necessary,” Behav. Brain Sci. 23(3), 299–325.
10. Norris, D., McQueen, J. M., and Cutler, A. (2003). “Perceptual learning in speech,” Cognit. Psychol. 47, 204–238.
11. Samuel, A. (1997). “Lexical activation produces potent phonemic percepts,” Cognit. Psychol. 32(2), 97–127.
12. Samuel, A. (2001). “Knowing a word affects the fundamental perception of the sounds within it,” Psychol. Sci. 12(4), 348–351.
13. Samuel, A., and Frost, R. (2015). “Lexical support for phonetic perception during nonnative spoken word recognition,” Psychon. Bull. Rev. 22, 1746–1752.
14. van Linden, S., and Vroomen, J. (2007). “Recalibration of phonetic categories by lipread speech versus lexical information,” J. Exp. Psychol.: Human Percept. Perform. 33(6), 1483–1494.
15. Vroomen, J., van Linden, S., de Gelder, B., and Bertelson, P. (2007). “Visual recalibration and selective adaptation in auditory-visual speech perception: Contrasting build-up courses,” Neuropsychologia 45(3), 572–577.
16. Vroomen, J., van Linden, S., Keetels, M., De Gelder, B., and Bertelson, P. (2004). “Selective adaptation and recalibration of auditory speech by lipread information: Dissipation,” Speech Commun. 44, 55–61.