Abstract
Listeners use lexical information to guide the mapping between acoustic signals and representations of speech sounds. This process, known as perceptual learning, results in the recalibration of phonetic categories. The current work examines the effect of the lexical frequency of exposure words on the magnitude of recalibration. Results showed comparable levels of perceptual learning for listeners exposed to high-frequency vs low-frequency critical words, in line with empirical findings suggesting that, if frequency affects recalibration, such effects may be difficult to detect. These findings warrant further empirical probing and theoretical characterization of the role of lexical frequency in perceptual learning.
1. Introduction
Speech perception involves mapping from input acoustics to phonetic categories. This mapping can be challenged if a listener finds the input from a talker ambiguous between categories or maps it to a category different from what the talker intended. The problem can often be resolved in online processing by resorting to contextual cues to identify the intended category. Repeated exposure to examples of variation can lead the listener to recalibrate their mapping. Lexically guided perceptual learning (LGPL) refers to a special case of recalibration in which the listener's lexical knowledge serves as the contextual cue. To illustrate, following an experimental paradigm by Norris et al. (2003) that has become a standard in the LGPL literature, suppose an English listener hears a talker say tenni- followed by some consonant [?] that sounds ambiguous between [f] and [s]. The listener identifies the intended category as /s/ because tennis is a word but tennif is not. With more exposure to such critical words (e.g., jealous ending in [?]), the listener adjusts their category boundary along the [f]-[s] continuum to classify future [?] from the talker as /s/. Various factors that could affect the recalibration process have been examined in the literature (see Samuel and Kraljic, 2009). Here, we examine the effect of the lexical frequency of critical words on the extent of recalibration by testing whether more frequent words (e.g., tennis ending in [?]) lead listeners to adjust the boundary more than less frequent words (e.g., hubris ending in [?]). Lexical frequency is controlled for, and thus assumed a priori to be a confound, in many LGPL experiments (e.g., Kataoka and Koo, 2017; O'Mahony and Möbius, 2019; Norris et al., 2003). For example, Norris et al. (2003) biased one group of participants to categorize [?] as /f/ and another group to categorize it as /s/ and compared how the two groups subsequently categorized tokens along an [ɛf]-[ɛs] continuum to look for evidence of recalibration. In doing so, they made efforts to match the mean frequency of critical words between the two groups. The motivation for doing so is not explicitly stated, but one can imagine potential issues of both quantity (less frequent words run a higher risk of going unrecognized by listeners, forcing them to recalibrate from fewer examples) and quality (a less frequent word, even when recognized, may be less effective at inducing recalibration than a more frequent one). Our study addresses the issue of quality explicitly.
To the authors' knowledge, there is no study that directly examines whether and how lexical frequency affects the extent of phonetic recalibration. In theory, existing models predict that more frequent words should lead to more recalibration. First, Mirman et al. (2006) simulate LGPL by augmenting the TRACE model of speech perception (McClelland and Elman, 1986) with a Hebbian learning algorithm. The model consists of three layers of units representing features, phonemes, and words. Activation spreads along cross-layer connections between features and phonemes as well as between phonemes and words. The learning algorithm adjusts feature-to-phoneme connection weights so that an activation pattern in the feature layer representing an ambiguous sound activates a phoneme unit more if the phoneme unit receives more activation from the word layer (i.e., more lexical support), and as the learning algorithm is Hebbian, the more active the phoneme unit, the greater the weight change. Lexical frequency was not implemented in Mirman et al. (2006)'s simulation, but one could follow Dahan et al. (2001) and set the baseline activation level of each word unit to be proportional to its lexical frequency. In that case, a phoneme unit embedded in a more frequent word would receive more activation from the word layer and subsequently lead the learning algorithm to change the weight more. Second, Kleinschmidt and Jaeger (2015) propose an ideal adapter framework in which phonetic recalibration occurs as listeners probabilistically update their beliefs about how acoustic cues are distributed for each phonetic category. Upon observing some acoustic cue x as evidence of a category, say, ci, the factor by which listeners update their belief in the distribution for category ci is proportional to how well the distribution can predict the observation x and how certain they are that x is from category ci. The latter, i.e., categorization certainty, will depend on various things, one of which may be lexical frequency. If listeners observed x in place of phoneme ci in a word that was more frequent and, therefore, more familiar, they would categorize x as ci with more certainty, and as a result, their belief in the distribution for ci would exhibit greater change.
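To make the second prediction concrete, the update can be sketched as a conjugate normal update with soft category assignment. The notation below is a simplified rendering of this idea, not the exact formulation of Kleinschmidt and Jaeger (2015):

```latex
% Belief about category c's mean cue value: \mu_c, with pseudo-count \kappa_c.
% The assignment weight r is the listener's certainty that x realizes c;
% lexical support, including the carrier word's frequency, enters via p(c).
\begin{align}
  r &= p(c \mid x) \propto p(x \mid c)\, p(c), \\
  \mu_c &\leftarrow \frac{\kappa_c\, \mu_c + r\, x}{\kappa_c + r},
  \qquad
  \kappa_c \leftarrow \kappa_c + r.
\end{align}
```

Under this sketch, a more frequent carrier word yields a larger r, pulling μ_c further toward the ambiguous token x, i.e., producing more recalibration.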
Empirical findings in the LGPL literature suggest that such frequency effects on phonetic recalibration may be difficult to observe. First, disambiguation cues may come from sources other than the lexicon, such as phonotactic knowledge (Cutler et al., 2008) and lip reading (Van Linden and Vroomen, 2007). For example, listeners may recognize [?]-tick as stick rather than ftick not because the latter is absent from the lexicon but because it is phonotactically illegal in English. Lexical frequency would not factor into phonetic recalibration in such cases, as the lexicon is not accessed to begin with. Second, even if the disambiguation cue came from the lexicon, frequency information may not be accessed in the process. McQueen et al. (2006) argue that recalibration is prelexical and automatic because it occurs even when listeners merely count the number of trials while hearing stimulus words and nonwords and never explicitly decide their lexical status. Similar findings are reported elsewhere: Recalibration also occurs through passive listening (Eisner and McQueen, 2005) and through counting the number of syllables in each stimulus (Samuel, 2016). Lexical access in the process may be too superficial for listeners to retrieve frequency information. Third, even if frequency information were accessed, its effect on recalibration may not be substantial enough. More frequent words are more predictable a priori, and higher predictability, in turn, should facilitate word recognition and subsequent processes, such as learning. However, the effect of predictability on perceptual learning appears rather nuanced or hard to detect. For example, while sentential context can help listeners predict and resolve phonetic ambiguities, its facilitative effect on recalibration is found when it is required for word recognition but not when it merely helps. Jesse (2021) found that preceding context boosts recalibration when [?] (a sound ambiguous between [f] and [s]) occurs in sentences like “That she had met Lady Gaga at the airport would be her only claim to [?]ame.” Since both same and fame are real words, the preceding context was necessary for disambiguation. On the other hand, Luthra et al. (2021) did not find such facilitative effects when [?] (here, a sound ambiguous between [s] and [ʃ]) occurs in sentences like “I love The Walking Dead and eagerly await every new epi[?]ode.” Since episode is a word but epishode is not, the preceding context was not necessary for disambiguation, although it may have helped (e.g., resulting in shorter latency in subsequently deciding whether the word is a concrete noun). Put differently, facilitative effects of sentential context on recalibration are hard to observe when the context merely provides additional predictive value. The same may be true for word frequency: If lexical frequency merely offers additional predictive value, its effects on recalibration, even if they exist, may be hard to observe.
Thus, theory predicts that effects of lexical frequency on phonetic recalibration should exist, while the empirical record suggests they may be hard to observe. However, the theoretical prediction has not been tested, and the empirical findings are suggestive but indirect. In response, we conducted an exploratory study endeavoring to observe facilitative effects of lexical frequency on phonetic recalibration. Lexical frequency may facilitate recalibration in various ways, e.g., faster learning, broader generalizability, and longer retention. Here, we test whether more frequent words shift the category boundary more, as measured immediately after exposure. To that end, we created a novel English accent in which /s/ is pronounced [?], i.e., acoustically shifted so that it sounds ambiguous between [f] and [s], and tested the effect of lexical frequency on its learnability by conducting an LGPL experiment following Norris et al. (2003). The experiment consisted of two blocks: (1) an exposure block in which participants listened to words and nonwords and performed a lexical decision task and (2) a test block in which they phonetically categorized stimuli on an 11-step [ɛf]–[ɛs] (henceforth F–S) continuum. Participants were divided into four groups, with each group hearing different critical words in the exposure block: high-frequency shifted (HS), high-frequency clear (HC), low-frequency shifted (LS), and low-frequency clear (LC). The critical words were higher in frequency for HS and HC than for LS and LC. The critical words ended in the acoustically shifted [?] for HS and LS, while they ended in clear [s] for HC and LC. Crucially, the critical words would be nonwords if their final /s/ were replaced by /f/, as in the tennis example. Participants in HS and LS were, thus, biased to interpret [?] as /s/. Perceptual learning should lead them to categorize more stimuli as /s/ than those in HC and LC.1 Any effect of lexical frequency on perceptual learning should result in a difference in the extent to which they do so between HS and LS: e.g., HS would categorize more stimuli as /s/ relative to HC after exposure than LS relative to LC.
2. Experiment
2.1 Participants
A total of 198 listeners2 participated in the experiment online via Prolific (https://www.prolific.co) and were paid $5.00 as remuneration. All were adult native speakers of American English between 18 and 30 years of age with normal hearing, and all gave informed consent to participate in the experiment. Each participant was randomly assigned to one of the four groups: 47 in HS, 50 in HC, 51 in LS, and 50 in LC.
2.2 Materials
Stimuli for the exposure block consisted of 96 recordings from a female native speaker of American English: 12 critical words ending in /s/; 36 filler words, 12 of which ended in /f/; and 48 nonwords. The stimuli were identical across the participant groups except for the critical words, which differed in their lexical frequency and the acoustic quality of the final consonant. The critical words for HS and HC were higher in frequency than those for LS and LC, although the two sets were comparable in part of speech, number of syllables, location of primary stress, and the vowel before the final /s/. According to the CELEX lexical database of English (Baayen et al., 1995), the mean frequency per million was 23.148 [standard deviation (SD) = 19.631] for the high-frequency words and 1.956 (SD = 1.513) for the low-frequency words: t = 3.729, p = 0.003. According to the Corpus of Contemporary American English (COCA; Davies, 2010), the mean frequency per million was 16.038 (SD = 13.452) for the high-frequency words and 0.899 (SD = 0.554) for the low-frequency words: t = 3.900, p = 0.002. We also confirmed in a survey that speakers are more familiar with the high-frequency words than with the low-frequency words. Prior to the creation of the experiment, 88 participants from the same population who did not later take part in the main experiment rated the familiarity of each word on a five-point scale, where higher means more familiar. The mean rating was 4.752 (SD = 0.542) for the high-frequency words and 3.547 (SD = 0.912) for the low-frequency words: t = 12.909, p < 0.001. See Fig. 1 for the corpus frequencies and familiarity ratings.

The final /s/ in the critical words was realized as [s] for HC and LC but as [?] for HS and LS. The aforementioned female speaker recorded two versions of each critical word: The final /s/ was pronounced [s] in one version and [f] in the other. The [s]-version was used as is for HC and LC. For HS and LS, the two versions were blended as follows. Each version was divided into a prefix and a suffix at the midpoint of the vowel preceding the final consonant. The two suffixes were blended by directly interpolating their waveforms in varying proportions and appended to the prefix from the [s]-version to generate 41 different tokens of the word along an [f]–[s] continuum. The most ambiguous token, to be used as the critical word ending in [?], was identified through a norming pretest. Forty native speakers of English at San José State University heard a subset of 11 tokens from the continuum and decided whether each token ended in [f] or [s]. We pooled their responses for each word, fit a logistic curve to model the distribution of [s]-responses across the continuum, and chose the token closest to the curve's 50% point.
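The two steps, waveform interpolation and midpoint selection, can be illustrated with a minimal R sketch. It assumes the suffixes were already cut at the vowel midpoint and loaded as tuneR Wave objects; all names are illustrative rather than the actual processing scripts:

```r
library(tuneR)

# Blend the [f]- and [s]-suffix waveforms by direct interpolation:
# w = 0 gives the [f] suffix, w = 1 the [s] suffix.
blend_suffixes <- function(f_wave, s_wave, w) {
  n <- min(length(f_wave@left), length(s_wave@left))
  mixed <- (1 - w) * f_wave@left[1:n] + w * s_wave@left[1:n]
  Wave(left = round(mixed), samp.rate = f_wave@samp.rate, bit = f_wave@bit)
}

# Fit a logistic curve to pooled pretest responses (resp: 1 = heard [s];
# step: position on the 41-step continuum) and pick the step whose
# predicted P([s]) is closest to 0.5, i.e., the most ambiguous token.
pick_midpoint <- function(pretest) {
  fit <- glm(resp ~ step, family = binomial, data = pretest)
  p_s <- predict(fit, newdata = data.frame(step = 1:41), type = "response")
  which.min(abs(p_s - 0.5))
}
```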
Stimuli for the test block consisted of 11 recordings forming the F–S continuum. They were generated similarly to the critical words ending in [?]. The female speaker recorded F [ɛf] and S [ɛs], and the two recordings were blended to generate a 41-step continuum. Participants in the norming pretest mentioned above heard a subset of 21 tokens (steps 1, 3,…, 39, 41) from the continuum and decided whether they heard F or S; these trials appeared in a block separate from those used to select the most ambiguous critical-word tokens. We pooled their responses, averaged the percentage of S-responses for each token, and chose the tokens closest to 0%, 5%, 10%, 25%, 33%, 50%, 67%, 75%, 90%, 95%, and 100%.
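The selection of the 11 test tokens can be sketched in the same vein (p_s holds each pretested step's mean proportion of S-responses; names are illustrative):

```r
# Pick, for each target proportion, the pretested step whose mean
# proportion of S-responses is closest to it.
targets <- c(0, 0.05, 0.10, 0.25, 0.33, 0.50, 0.67, 0.75, 0.90, 0.95, 1)
pick_test_steps <- function(p_s, steps, targets) {
  steps[sapply(targets, function(t) which.min(abs(p_s - t)))]
}
# e.g., pick_test_steps(p_s, steps = seq(1, 41, by = 2), targets)
```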
2.3 Procedures
The experiment was hosted and run online using the Gorilla platform (Anwyl-Irvine et al., 2020). Participants started the experiment with a headphone test following Woods et al. (2017), in which they judged the quietest of three pure tones per trial for a total of six trials. Those with accuracy lower than 67% were asked to adjust their headphone volume and repeat the test once more. In the exposure block that followed, the stimuli were randomly ordered and presented to the participants one at a time. Participants were asked to decide whether each stimulus was an English word or not by pressing 0 (No) or 1 (Yes) on their keyboard within 1 s. Finally, in the test block, the F–S continuum was repeated ten times with its stimuli presented in a random order within each repetition. Participants were asked to decide whether each stimulus was F or S by pressing 0 (F) or 1 (S) on their keyboard within 1 s after hearing the stimulus in full. The whole experiment took around 20 min to complete.
2.4 Results
We defined separate exclusion criteria for the headphone test and exposure block. For the headphone test, participants were excluded if they failed to reach 67% accuracy in both attempts. This applied to 22 participants: 5 in HS, 5 in HC, 6 in LS, and 6 in LC. For the exposure block, following O'Mahony and Möbius (2019), participants were excluded if their overall accuracy in the lexical decision task was lower than 70% or if they endorsed (recognized as English words) less than 50% of the critical words. This applied to one participant in HS (low overall accuracy) as well as two in LS and three in LC (low endorsement rate). After applying both criteria, data from 170 participants in total were included in the subsequent analyses: 41 in HS, 45 in HC, 43 in LS, and 41 in LC.
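The exclusion logic amounts to the following R sketch over a per-participant summary table (column names are illustrative):

```r
library(dplyr)

included <- participants %>%
  filter(best_headphone_acc >= 2/3) %>%  # passed 67% on at least one attempt
  filter(ldt_accuracy >= 0.70,           # overall lexical decision accuracy
         endorsement_rate >= 0.50)       # endorsed at least half the critical words
```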
In the exposure block, the participants endorsed the high-frequency critical words more often than the low-frequency critical words, but there was no difference between the acoustically shifted critical words ending in [?] and the clear critical words ending in [s]. To test for significance, we fit a logistic mixed-effects model on trial-level responses (0 = not a word, 1 = word) for the critical words in the exposure block pooled from all four groups. For fixed effects, the model included frequency (effect-coded as –0.5 for low frequency and 0.5 for high frequency), acoustics, i.e., how /s/ was rendered (effect-coded as –0.5 for clear [s] and 0.5 for acoustically shifted [?]), and their interaction. Following the maximal random effect structure justified by the design, the model included a random intercept for each participant and a random intercept for each critical word. The main effect of frequency was significant [standard error (SE) = 0.531, z = 4.348, p < 0.001]. However, the main effect of acoustics and the interaction between frequency and acoustics were not (SE = 0.334, p = 0.882 and SE = 0.666, p = 0.647, respectively). The results suggest that the extent to which listeners endorsed items as words varied only as a function of lexical frequency. Analyzing participants' reaction times (RTs) revealed the same trend. The analysis was performed after removing RTs three or more SDs away from each participant's mean: 300 data points (1.838%) excluded in total. We fit a linear mixed-effects model on log-transformed trial-level RTs for the critical words with the same model structure as before. The main effect of frequency was significant (SE = 0.034, p = 0.042). However, the main effect of acoustics and the interaction between frequency and acoustics were not (SE = 0.025, t = 0.708, p = 0.480 and SE = 0.050, p = 0.695, respectively). See Table 1 for mean endorsement rates and RTs for the critical words.
Table 1. Mean endorsement rates and RTs in milliseconds for the critical words by group (SDs in parentheses).

| | HS | HC | LS | LC |
|---|---|---|---|---|
| Endorsement rate | 0.982 (0.134) | 0.983 (0.128) | 0.882 (0.323) | 0.868 (0.339) |
| RT (ms) | 1380.198 (295.295) | 1391.904 (457.495) | 1505.684 (379.524) | 1475.635 (446.711) |
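The exposure-block models just described correspond to the following minimal lme4 sketch (variable names are illustrative; each data frame holds one row per trial):

```r
library(lme4)

# Endorsement: logistic mixed-effects model with effect-coded predictors
# (-0.5/0.5) and by-participant and by-word random intercepts.
m_endorse <- glmer(
  endorsed ~ frequency * acoustics + (1 | participant) + (1 | word),
  data = exposure, family = binomial
)

# Reaction times: linear mixed-effects model on log RTs with the same
# fixed- and random-effects structure.
m_rt <- lmer(
  log(rt) ~ frequency * acoustics + (1 | participant) + (1 | word),
  data = exposure_trimmed  # RTs > 3 SDs from each participant's mean removed
)
```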
In the test block, the participants exposed to the acoustically shifted critical words categorized more stimuli on the F–S continuum as S than those exposed to the clear critical words, but the magnitude of the contrast did not differ between the high-frequency groups and the low-frequency groups. Mean proportions of S-responses were 0.561 in HS, 0.529 in HC, 0.571 in LS, and 0.549 in LC. See Figs. 2(a) and 2(b) for how they are distributed over the continuum and repetitions (recall that the continuum was repeated ten times). To test for significance, we fit a logistic mixed-effects model on trial-level responses (0 = F, 1 = S) in the test block pooled from all four groups. For fixed effects, the model included continuum step (continuously coded from −5 to 5), repetition (continuously coded from 0 to 9), frequency (effect-coded as –0.5 for low frequency and 0.5 for high frequency), acoustics (effect-coded as –0.5 for clear [s] and 0.5 for shifted [?]), and their interactions. Following the maximal random effect structure justified by the design, the model included a random intercept and random slopes over step and repetition for each participant. The model revealed evidence of recalibration but no evidence of frequency effects. The main effect of acoustics was significant (SE = 0.253, z = 2.821, p = 0.005). However, the main effect of frequency and the interaction between acoustics and frequency were not (SE = 0.253, p = 0.109 and SE = 0.505, p = 0.873, respectively). Simple slopes analyses revealed similar trends when the effects of acoustics and frequency were each assessed at the levels of the other factor. The effect of acoustics was marginally significant in the high-frequency groups and significant in the low-frequency groups (SE = 0.353, z = 1.906, p = 0.057 and SE = 0.362, z = 2.084, p = 0.037, respectively). The effect of frequency was significant in neither the shifted groups nor the clear groups (SE = 0.362, p = 0.219 and SE = 0.352, p = 0.300, respectively). Consistent with previous findings (e.g., Van Linden and Vroomen, 2007; Liu and Jaeger, 2018), the effect of recalibration declined over repetitions: The main effect of repetition and the interaction between repetition and acoustics were both significant (SE = 0.016, p < 0.001 and SE = 0.031, p = 0.007, respectively). However, this decline did not appear to be affected by frequency, as the three-way interaction between repetition, acoustics, and frequency was not significant (SE = 0.062, z = 0.587, p = 0.557).
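In lme4 syntax, the test-block model corresponds to this sketch (variable names are illustrative):

```r
library(lme4)

# step is centered (-5..5), repetition runs 0..9, and frequency/acoustics
# are effect-coded (-0.5/0.5); by-participant random intercept plus random
# slopes for step and repetition.
m_test <- glmer(
  s_response ~ step * repetition * frequency * acoustics +
    (1 + step + repetition | participant),
  data = test, family = binomial
)
```

Simple slopes can then be obtained, for example, by re-fitting with one factor dummy-coded at each of its levels.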
2.5 Discussion
In sum, we found no evidence of frequency effects on the extent of phonetic recalibration. Participants who heard [?] in place of /s/ in critical words subsequently categorized more stimuli on the F–S continuum as S, but the trend was no different between those who heard [?] in high-frequency words and those who did so in low-frequency words. We present follow-up analyses below to better interpret the results.
As we clarified in Sec. 1, we are interested in the difference in quality between high-frequency and low-frequency words provided they are both recognized as words. To maintain rigor, we should compare participants who endorsed the same number of critical words, ideally all of them. We, therefore, revised our exclusion criteria to consider only data from those who endorsed 100% of the critical words, leaving data from 100 participants in total: 33 in HS, 39 in HC, 13 in LS, and 15 in LC. Mean proportions of S-responses were 0.570 in HS, 0.518 in HC, 0.538 in LS, and 0.475 in LC. See Figs. 2(c) and 2(d) for how they are distributed over the continuum and repetitions. Despite the changes in summary statistics, mixed-effects analyses (using weighted effect coding to address the large difference in sample size between the high-frequency and low-frequency groups) revealed the same trend as before, i.e., evidence of recalibration but no frequency effect. The main effect of acoustics was significant (SE = 0.158, z = 3.941, p < 0.001). However, neither the main effect of frequency nor the interaction between acoustics and frequency was significant (SE = 0.092, z = 0.592, p = 0.554 and SE = 0.099, p = 0.300, respectively). In simple slopes analyses, the effect of acoustics was significant in both the high-frequency groups and the low-frequency groups (SE = 0.187, z = 2.801, p = 0.005 and SE = 0.301, z = 2.953, p = 0.003, respectively). The effect of frequency was significant in neither the shifted groups nor the clear groups (SE = 0.136, p = 0.721 and SE = 0.123, z = 1.152, respectively). It was less clear whether the effect of recalibration declined over repetitions for this subset of participants: The main effect of repetition was significant, but the interaction between repetition and acoustics was marginally significant at best (SE = 0.033, p < 0.001 and SE = 0.044, z = 1.848, p = 0.065, respectively). At any rate, there was no evidence of frequency effects, as the three-way interaction between repetition, acoustics, and frequency was not significant (SE = 0.085, z = 1.465, p = 0.143).
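Weighted effect coding can be sketched as follows for the unbalanced "perfect" subset (72 high-frequency vs 28 low-frequency participants): each group's code is scaled by the opposite group's share, so the codes sum to zero over participants while preserving a one-unit high-minus-low contrast. The snippet shows frequency; acoustics is coded analogously. Names are illustrative:

```r
n_high <- 33 + 39  # HS + HC
n_low  <- 13 + 15  # LS + LC
perfect$frequency <- ifelse(
  perfect$freq_group == "high",
   n_low  / (n_high + n_low),  #  0.28 for high-frequency
  -n_high / (n_high + n_low)   # -0.72 for low-frequency
)
```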
The main effect of frequency and its interaction with acoustics did not reach statistical significance. We are left asking whether the absence of significant results is evidence of no frequency effect. In response, we conducted a Bayes factor analysis to quantify the strength of evidence for null effects of frequency and of the interaction between acoustics and frequency. We first defined a Bayesian logistic regression model (henceforth BM) with the same model structure as the mixed-effects model above (henceforth MM). To set the model priors, we created a small artificial dataset consisting of mean proportions of S-responses from a hypothetical experiment with evidence of both recalibration and frequency effects. Figures 2(e) and 2(f) illustrate what the response curves from the artificial data would look like. We translated the mean proportions of S-responses in the data to log-odds and fit a linear model (henceforth LM) consisting of the fixed-effects parameters of BM and MM. We chose normal distributions as priors for all the fixed-effects parameters in BM, set their means to the corresponding regression coefficients in LM, and fixed their SDs to 0.1. We fit BM on our actual experiment data (either the full data from all participants or the data from just the "perfect" participants) using R (version 4.2.2) and the brms package (version 2.19.0), running four Markov chains with 2000 discarded warm-up iterations and 8000 posterior draws each. Posterior predictions from the model mimic the data well, as can be seen in Figs. 2(a)–2(d): Transparent response curves for 50 random draws from the model fit snugly around those from the original data. The Bayes factors tell different stories depending on which data the model was fit on, i.e., the full data vs the data from only the "perfect" participants. Evidence for recalibration (the effect of acoustics) was strong in both datasets. Evidence against frequency effects was "anecdotal" at best in the full data, both for frequency and for the interaction between acoustics and frequency; however, it was strong in the data from just the "perfect" participants. For the "perfect" participants, at least, our study seems to suggest that lexical frequency does not affect the post-exposure magnitude of the recalibration effect.
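A minimal brms sketch of this analysis follows. The prior means shown are placeholders standing in for LM's coefficients (only two of the fixed effects are spelled out), and Savage-Dickey density ratios via hypothesis() require sampling from the priors:

```r
library(brms)

bm <- brm(
  s_response ~ step * repetition * frequency * acoustics +
    (1 + step + repetition | participant),
  data = test, family = bernoulli(),
  prior = c(set_prior("normal(0.7, 0.1)", coef = "acoustics"),   # placeholder means
            set_prior("normal(0.3, 0.1)", coef = "frequency")),
  sample_prior = "yes",
  chains = 4, warmup = 2000, iter = 10000  # 8000 post-warmup draws per chain
)

hypothesis(bm, "frequency = 0")             # BF for the null via Savage-Dickey
hypothesis(bm, "frequency:acoustics = 0")
```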
The lack of evidence for or against frequency effects in the full data warrants further discussion. First, the evidence against frequency effects found among the "perfect" participants appears to have been diluted by the "imperfect" participants, 80% of whom are from the low-frequency groups. Perhaps this dilution is a quantitative effect of frequency on recalibration: The low-frequency "imperfect" participants recognized fewer examples and, therefore, did not learn as well as the high-frequency participants. However, mixed-effects model and Bayes factor analyses of data from everyone in the high-frequency groups and just the "imperfect" participants in the low-frequency groups revealed no evidence of such a quantitative effect. The interaction between acoustics and frequency was not significant (SE = 0.130, z = 0.645, p = 0.519), and there was anecdotal evidence against, rather than for, the quantitative effect. This brings us to our second point: The lack of significant results for frequency effects in the full data and for the quantitative effect just discussed may reflect type II errors due to insufficient sample size. In response, we conducted a power analysis following Kumle et al. (2021) by building an artificial mixed-effects model with the same structure as our original mixed-effects model (i.e., MM) and running a power simulation using the R packages simr (version 1.0.7) and mixedpower (version 0.1.0). The idea was to estimate power for a hypothetical experiment in which evidence is found for both recalibration and frequency effects, just like in the Bayes factor analysis above. We set the coefficients for the fixed-effects parameters using the regression coefficients from LM. We were less informed a priori about the random-effects covariance matrix, so we used approximate values based on those from MM. The results suggest that a substantially larger sample size would be required to detect frequency effects using our current experiment design. Estimated power for a significant main effect of frequency with sample sizes of 200, 300, and 400 would be 0.132, 0.158, and 0.224, respectively. Estimated power for a significant interaction between acoustics and frequency with the corresponding sample sizes would be 0.169, 0.235, and 0.274.
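The simulation can be sketched with simr alone as follows. The fixed-effect vector stands in for LM's 16 coefficients and the random-effects covariance approximates MM's; all values shown are placeholders:

```r
library(simr)

# Build a design: 200 participants x 11 steps x 10 repetitions,
# with participants split across effect-coded groups.
design <- expand.grid(participant = factor(1:200), step = -5:5, repetition = 0:9)
groups <- data.frame(
  participant = factor(1:200),
  frequency = rep(c(0.5, -0.5), each = 100),
  acoustics = rep(c(0.5, -0.5), times = 100)
)
design <- merge(design, groups, by = "participant")

fixef_lm <- c(0.2, 1.5, -0.05, 0.3, 0.7, rep(0.1, 11))  # placeholders for LM's coefficients
vc_mm <- diag(c(1.0, 0.2, 0.01))                        # intercept, step, repetition

art <- makeGlmer(
  s_response ~ step * repetition * frequency * acoustics +
    (1 + step + repetition | participant),
  family = "binomial", fixef = fixef_lm, VarCorr = vc_mm, data = design
)
art <- extend(art, along = "participant", n = 400)      # grow the sample
powerSim(art, test = fixed("frequency", "z"), nsim = 200)
```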
3. Conclusion
We presented an exploratory study that set out to test the theoretical prediction that lexical frequency should facilitate phonetic recalibration in the LGPL paradigm. We were not able to find evidence of such frequency effects. Our results are in line with empirical findings in the literature suggesting that frequency effects, if they do exist, may be difficult to detect. Results from the follow-up analyses raise further questions and inform directions for future confirmatory studies. There is evidence for the absence of frequency effects among the "perfect" participants. This may be a form of ceiling effect: Perhaps the magnitude of recalibration is barely distinguishable as long as listeners recognize a word above some threshold familiarity, and our low-frequency words were familiar enough to the "perfect" participants. We could explore various ways to avoid such ceiling effects: use words with even lower frequency, hinder word recognition by adding noise, etc. There is an absence of evidence for frequency effects in the full data. Aside from recruiting more participants, we could explore ways to increase the potential effect size: place participants in two opposite biasing conditions (e.g., /s/-biasing vs /f/-biasing), reduce the number of critical words since there may be a ceiling effect in the amount of exposure (Liu and Jaeger, 2018), shorten the test phase since the recalibration effect decays over time, encourage participants to rely on frequency information for disambiguation by using critical words that form minimal pairs (e.g., lea-[?] for lease vs leaf), etc. Further findings from future studies thus informed will add to the body of knowledge on LGPL.
Acknowledgments
This research was supported by the Office of Research at San José State University through the University Faculty RSCA Assigned Time Program awarded to the first author. The research protocol was approved by San José State University Institutional Review Board (IRB Tracking No. 21061). The authors have no conflicts of interest to declare that are relevant to the content of this article. The data that support the findings of this study are openly available in Open Science Framework at https://doi.org/10.17605/OSF.IO/ES9YK.
1. In typical LGPL studies, comparisons are made between two (sets of) groups who each hear [?] in critical words that bias listeners in opposite directions: e.g., /s/-biasing vs /f/-biasing. The effect of recalibration would be easier to observe in such a design. However, English words ending in /f/ turn out to be much more limited than those ending in /s/, and we were unable to identify a well-balanced contrasting set of /f/-biasing words comparable to our /s/-biasing words (see Sec. 2.2). The approach may have worked for a different pair of phonemes, and we will revisit the issue in future research.
2. The choice of the number of participants was based on the effect size we estimated from an F-statistic reported for experiment 1 in Norris et al. (2003), which compared participants in an /s/-biasing condition with those in an /f/-biasing condition. We compare participants in an /s/-biasing condition with those in a non-biasing condition, so we aimed for roughly half that effect size for recalibration. Loosely assuming that the recalibration effect would be tested with an analysis of variance (ANOVA), the total sample size required would be 199 for 80% power. We primarily aimed to create an experimental setting comparable to Norris et al. (2003), so we did not factor in frequency effects at this stage, but see Sec. 2.5 for a follow-up power analysis.
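This calculation can be sketched in R with the pwr package: convert the reported F-statistic to Cohen's f, halve it, and solve for the group size giving 80% power. F_val, df1, and df2 are placeholders (we do not reproduce the reported values here), and the 0.05 significance level is an assumption:

```r
library(pwr)

F_val <- 10; df1 <- 1; df2 <- 40                 # placeholders, not the reported values
f_full <- sqrt(F_val * df1 / df2)                # Cohen's f from a one-way ANOVA F
f_half <- f_full / 2                             # aiming for half the effect size
pwr.anova.test(k = 2, f = f_half, sig.level = 0.05, power = 0.80)  # returns n per group
```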