Different speakers produce the same intended vowel with very different physical properties. Fundamental frequency (F0) and formant frequencies (FF), the two main parameters that discriminate between voices, also influence vowel perception. While it has been shown that listeners comprehend speech more accurately if they are familiar with a talker's voice, it is still unclear how such prior information is used when decoding the speech stream. In three online experiments, we examined the influence of speaker context via F0 and FF shifts on the perception of /o/-/u/ vowel contrasts. Participants perceived vowels from an /o/-/u/ continuum shifted toward /u/ when F0 was lowered or FF increased relative to the original speaker's voice and vice versa. This shift was reduced when the speakers were presented in a block-wise context compared to random order. Conversely, the original base voice was perceived to be shifted toward /u/ when presented in the context of a low F0 or high FF speaker, compared to a shift toward /o/ with high F0 or low FF speaker context. These findings demonstrate that F0 and FF jointly influence vowel perception in speaker context.

When different speakers articulate identical words, the physical properties of the produced sounds can vary substantially. There is no static mapping between words and sounds that we could simply learn in order to speak and understand effectively—a fundamental property of speech known as the “lack of invariance” problem (Liberman, 1957). Nevertheless, in everyday circumstances, we have no difficulties decoding a speaker's intended word from the audio stream. This is possible even if we operate in challenging conditions, for example, noisy environments such as the classic “cocktail party,” where multiple speakers talk simultaneously (Johnsrude et al., 2013; Kreitewolf et al., 2018).

Two dominant features account for most of the variability between voices (Kreitewolf et al., 2014). One is vocal tract length, which correlates strongly with body size. The vocal tract shapes the spectral envelope of speech, and its length influences the scaling of formant frequencies (FF), the peak resonances of speech sounds, which are the most important cues for vowel perception. The other parameter is the glottal pulse rate or fundamental frequency (F0), which is the basis of speech prosody. Vocal tract length and glottal pulse rate are naturally correlated, so that children usually have high and adult males have low F0 and FF values (Peterson and Barney, 1952). At the same time, FF and F0 account for a large proportion of the speech variance. It is still an open question how the brain maps acoustic inputs that vary on these two voice parameters to the same perceived vowel. The aim of this study was to examine how the influence of F0 and FF voice characteristics on vowel perception depends on context. To do this, we systematically manipulated combinations of speakers' FF and F0 and tested how these parameters influenced perception of an /o/-/u/ vowel continuum without and within consistent voice context.

Theories of “intrinsic normalization” posit that the speech signal itself carries enough information to infer the underlying voice characteristics and decode the intended speech signals correctly—hence, without assuming any influence of voice context. There are several approaches to transforming formant information so that vowels separate more distinctly than in simple F1/F2 clusters, in which different vowel categories strongly overlap and would be confusable (Peterson and Barney, 1952). For example, Johnson and Sjerps (2021) describe a range of normalization methods that can, theoretically, transform vowel patterns into spaces that are easier to segment. However, these schemes often rely on clearly separable higher formants to serve as normalization references and have difficulty explaining results in which formants are merged together yet still produce the percept of a vowel (Chistovich and Lublinskaya, 1979).

According to F0-free formant-based normalization schemes, F0 should have no effect on vowel categorization. As an example of how the impact of vocal tract length could be removed from the speech signal, a Mellin transform has been used to separate the resonance spectrum into a size and a shape component using the estimated F0 track (Irino and Patterson, 2002). The size is an estimate of vocal tract length, while the shape is a size-normalized representation of resonances that should identify the intended vowel. On this account, F0 is needed only to correctly execute the transformation that removes size-related confounds from the signal. Therefore, except for other physical phenomena at very high or low frequencies, it should not matter what combination of F0 and FF a speaker has, as long as FF can be normalized adequately using the F0 estimate. According to indirect F0 normalization schemes, F0 influences vowel perception indirectly via the perceived identity of the speaker (Johnson, 1990b). However, the underlying mechanism by which these factors operate remains to be determined (Barreda and Nearey, 2012).

By systematically shifting FF and F0, we tested whether these voice characteristics influence perception of otherwise identical vowels or whether these shifts have no effect on vowel categorization in line with successful intrinsic normalization.

There are several observations that contradict a pure intrinsic normalization strategy to understand vowels independent of FF and F0 properties. While proponents of F0-free approaches claimed that vowel perception works well even beyond natural ranges of FF and F0 (Smith and Patterson, 2004), others have demonstrated that recognition performance drops off the further away a voice is from the diagonal that connects men, women, and children in F0/FF space (Assmann and Nearey, 2008). In addition, speech perception is sensitive to a large array of contextual cues, which cannot be explained by intrinsic normalization. For example, vowel perception is influenced by the spectral properties of preceding acoustic contexts such that perception of a “bit-bet” vowel contrast was shifted by elevating or lowering the first and second formants in a synthesized context sentence (Ladefoged and Broadbent, 1957). Specifically, a test word was usually perceived as “bit” if context F1 was increased and “bet” if it was decreased. Similar results were demonstrated with a vowel contrast between /u/ and /o/, which could be manipulated by elevating or lowering the first formant (Sjerps et al., 2019; Sjerps and Smiljanić, 2013; Sjerps et al., 2018). In addition to context effects arising from acoustic properties of precursor sounds, other research has shown that listeners benefit from a general familiarity with target speakers when comprehending speech (Adank et al., 2009; Holmes et al., 2018; Kleinschmidt and Jaeger, 2015). These findings exemplify that speech comprehension is not a generic process which transforms input sounds into linguistic structures in a pure bottom-up fashion, but that speaker-related prior expectations are used to improve speech decoding.

By presenting vowels systematically shifted in FF or F0 either in a mixed context (i.e., without a consistent voice context) or in a blocked context (i.e., within a consistent voice context), we tested the context-dependent influence of FF and F0 on vowel perception (Fig. 1).

FIG. 1.

(Color online) The study design. We recorded speech samples from one center speaker and based on those, we created four new edge speakers by shifting the original voice in either F0 or FF. These four speaker positions in FF/F0 space were chosen so that they formed a cross centered on the original speaker. Speaker context was provided by presenting these edge speakers in blocks (i.e., blocked condition in comparison to mixed condition). To examine how the influence of F0 and FF voice characteristics on vowel perception depends on context, we tested how these parameters influenced perception of an /o/-/u/ vowel continuum without and within speaker context.


We recorded speech samples from one speaker and based on these, we created four new speakers by shifting the original voice in either FF or F0 (Fig. 1). The speaker positions in FF/F0 space were chosen so that they formed a cross centered on the original speaker, which lay approximately between the main male and female distribution (Assmann and Nearey, 2008; Hillenbrand et al., 1995; Peterson and Barney, 1952). Each edge speaker shared either FF or F0 with the center speaker while the distances on the other parameter were chosen to be large enough to matter for vowel perception but close enough to be confusable with the center speaker (Gaudrain et al., 2009). In this way, each edge speaker extended into the more “unlikely” areas of the FF/F0 space, i.e., combinations of FF and F0 that are less likely to be found in real speakers.

To investigate the effects of speaker expectations on vowel perception, we conducted three versions of a vowel-perception task. In the first experiment (the mixed condition), the center speaker and all four edge speakers were played in random order, thereby prohibiting any precise expectation about the upcoming speaker. In the second experiment (the blocked condition), all edge speakers were grouped into blocks so that within each block, the current speaker was predictable. This way, we could compare how having the correct speaker expectation influences or improves vowel discrimination on an /o/-/u/ continuum. In addition, we added occasional appearances of the original center speaker in each edge block to measure how speech from the same speaker is perceived given a different FF or F0 context. In the third experiment (the precursor condition), we established context not by repeatedly playing same-speaker stimuli but by prefacing each test stimulus with a full sentence spoken by the same speaker. In this condition, we also added occasional center speaker test words following an edge speaker precursor so that we could compare the context influence with that in experiment 2.

First, we investigated how perception of vowels shifted in FF or F0 depends on speaker context. For the mixed condition, we expected the following main effects, i.e., that high F0 shifts induce more /u/ judgments and low F0 shifts induce more /o/ judgments (Johnson, 1990a), whereas high FF shifts lead to more /o/ judgments and low FF shifts lead to more /u/ judgments (Sjerps et al., 2019). These hypotheses are based on the assumption that speakers with high FF relative to their F0 (high FF or low F0 speakers) should appear to speak the vowel with higher FF, /o/, more often because listeners expect a certain FF range for /o/ and /u/ given a speaker's F0. Conversely, speakers with low FF (low FF or high F0 speakers) should appear to speak the vowel with lower FF, /u/, more often. The central speaker should not cause biased /o/ or /u/ perception as the general expectation of FF given the speaker's F0 should match their real FF. We hypothesized that the shift of the edge speakers toward /o/ or /u/ in context (i.e., the blocked and precursor conditions) should be less strong than without context (i.e., in the mixed condition) since expectations based on speaker context should override the general expectation of FF given F0. This effect might be stronger for the precursor condition than for the blocked condition as it should be easier to correctly assess a reference FF based on the semantic constraints provided in real sentences compared to ambiguous single words without feedback in the blocked condition.

Second, we investigated how perception of identical vowel stimuli depends on different voice context. To do this, we compared perception of interleaved center speakers in edge speaker contexts in blocked and mixed conditions. We hypothesized that a high FF context block should shift perception of the center speaker toward more /u/ and a low FF context block should shift perception of the center speaker toward more /o/ responses (Ladefoged and Broadbent, 1957; Sjerps et al., 2019). With the F0 context effect, we did not have a clear, single hypothesis as previous work provided support for three possible alternatives. First, there could be no context effect at all because the context establishes the correct FF for the center speaker, which is not further influenced by F0 (Irino and Patterson, 2002). Second, the context shift could go in the same direction as the pure F0 effect, toward /u/ for the high F0 context and toward /o/ for the low F0 context, as Johnson (1990b) showed that an F0 precursor can have a reduced effect on target stimuli if it is likely that both speakers are perceived to have the same identity. Third, there could be a contrastive effect in which the general tendency of the center speaker to sound more like /o/ than the high F0 speaker is amplified such that high F0 context shifts perception toward /o/ and the low F0 context shifts perception toward /u/.

Finally, we formulated two Bayesian models to explain the observed behavioral effects of speaker context on vowel perception. These computational models can help us distinguish the distinct mechanisms underlying the influence of voice characteristics in mixed and blocked context conditions. We expected to see the usage of a general population prior in the mixed condition, whereas in the blocked condition, a more precise speaker-specific prior based on the deviation from the population prior should be used. To test this hypothesis, we implemented two Bayesian models, one incorporating the use of the general population prior only and one additionally incorporating a speaker-specific prior. Thus, we expected a better prediction of the model with the speaker-specific prior, in particular, in the blocked condition. Both models were based on priors for the correspondence and variance of FF and F0 in naturally occurring speakers, derived from Hillenbrand et al. (1995). In the mixed condition, we could investigate whether the model predictions based on the FF and F0 population prior coincide with our measured data. In the blocked condition, when speaker-specific information was available, the inclusion of a prior update based on the estimated prediction error for a given speaker's voice characteristics served as an alternative model to the less complex population prior model. In sum, with the two models, we tested whether listeners apply a generic population prior for typical combinations of FF and F0 values when they cannot form specific FF and F0 priors for familiar speakers, e.g., from predictable speaker context.

The design and hypotheses were preregistered online.1

Analysis of a pilot dataset with a design similar to experiment 2 showed a context-dependent vowel shift of the central, ambiguous morph for both the formant frequency and pitch shift conditions. The differences between responses to the middle morph of the center speaker in high vs low frequency contexts for ten participants were analyzed with Wilcoxon's signed rank tests, both of which were significant with p < 0.05. Based on the means and standard deviations of the differences between the mixed and blocked conditions resulting from an F0 or FF manipulation, a Cohen's d of around 0.9 was calculated for both conditions. These values were then fed into G*Power to compute the required sample size for a standard two-tailed t-test against the null hypothesis of a shift difference of zero. For the estimated effect sizes around 0.9 with α = 0.05 and power = 0.95, this resulted in suggested sample sizes of 18 and 20 participants, respectively. Due to the inherent noise of the online experiment process, we decided to increase the number of participants to 30 for each of the three experiments.
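The G*Power calculation above can be approximated with a short standard-library sketch. This is a normal-approximation shortcut, not the exact noncentral-t computation G*Power performs, so it comes out slightly low:

```python
# Rough normal-approximation sample-size formula for a two-tailed,
# one-sample (paired) t-test; illustrative only.
from math import ceil
from statistics import NormalDist

def approx_n(effect_size, alpha=0.05, power=0.95):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed criterion
    z_beta = NormalDist().inv_cdf(power)           # power criterion
    return ((z_alpha + z_beta) / effect_size) ** 2

n = approx_n(0.9)  # ~16 before small-sample correction
print(ceil(n))
```

The usual t-based small-sample correction of one to two participants brings this estimate close to the 18 and 20 suggested by G*Power.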

We recruited participants on an online platform2 using the following inclusion criteria: age between 18 and 65 years old, first language German, no hearing difficulties, no participation in any previous experiments run with similar goal or stimulus set, no neurological or psychiatric illnesses, and no addiction to alcohol, medication or other substances.

We recruited complete datasets from 30 participants for each experiment. In experiment 1, there were 17 male and 13 female participants with ages ranging from 18 to 60 years old, with a median of 26 years old. In experiment 2, there were 19 male and 11 female participants with ages ranging from 18 to 57 years old, with a median of 28 years old. In experiment 3, there were 16 male and 14 female participants with ages ranging from 19 to 59 years old, with a median of 28 years old. During data collection, we inspected behavioral responses and excluded participants based on the following preregistered criteria. Participants were excluded if they missed more than 2% of trials, which was equal to 16 trials in experiments 1 and 2 and 6 trials in experiment 3. Additionally, participants who showed either no separation of vowel end points or otherwise flat or one-sided response patterns were excluded. Based on these criteria, four participants in the blocked condition and two participants each in the mixed and precursor conditions were excluded. Single trials were not considered in the analyses when no valid response was given or the reaction time was faster than 400 ms or exceeded 5000 ms, a time window derived from our pilot data. This resulted in 29 invalid trials in the blocked condition, 25 invalid trials in the mixed condition, and 15 invalid trials in the precursor condition, with no more than 2% of the trials being invalid per participant.

Raw stimuli were recorded by a female professional speaker (age 41 years old, height 1.60 m) who speaks German as her native language. For the recordings, we used an AKG C414 XLII microphone (Vienna, Austria) and a Focusrite Scarlett 2i2 audio interface (High Wycombe, UK). The stimulus material consisted of sentences and words. For the precursor sentences, we used the first 270 sentences from a German translation of “Alice in Wonderland.” Sentences beginning with the expression “Oh!” had this phrase trimmed off because it was too similar to the test stimulus, and some excessively long sentences were shortened.

For target words, we needed confusable pairs of words which phonetically only differed in their central vowel being /u/ or /o/. A pool of candidate words was first selected using the German Celex database3 with the query “number of syllables is 1, orthographic pattern contains consonant-vowel-consonant structure.” This pool was then filtered manually to exclude unusual and rare words. We recorded 15 word pairs, which were processed with the morphing technique described below. All precursors and target word morph continua were shifted in F0 and FF as described below and normalized to −15 dB root mean square (RMS).
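The −15 dB RMS normalization can be sketched as follows (a minimal mono-signal sketch assuming a full-scale reference of 1.0; the original processing pipeline is not specified beyond the target level):

```python
import math

def normalize_rms(samples, target_db=-15.0):
    """Scale a mono signal so its RMS level equals target_db re full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    gain = 10 ** (target_db / 20) / rms  # linear gain to reach the target level
    return [s * gain for s in samples]

out = normalize_rms([0.5, -0.5, 0.25, -0.25])
rms_out = math.sqrt(sum(s * s for s in out) / len(out))
print(round(20 * math.log10(rms_out), 2))  # -15.0
```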

We validated psychometric curves for the 15 word pairs in a pilot online experiment (n = 15). Participants rated stimuli from a seven-step morph continuum as /o/ or /u/. Afterward, we selected the nine word pairs whose end points were classified correctly most often. The following nine word pairs remained in the final selection: tut/tot, Ruhm/Rom, Zug/zog, Bug/bog, Kur/Chor, fuhr/vor, Schuh/Show, Brut/Brot, and Huhn/Hohn.

We chose different software tools for voice morphing and voice-feature manipulation (for details, see Secs. II B 1 and II B 2). Specifically, we used TandemSTRAIGHT (Kawahara et al., 2008) to create morphs between the /o/ and /u/ end points for each target word pair. We used Praat (Boersma and Weenink, 2001) to create the four perceptually distinct voice identities by shifting the fundamental frequency and formant frequencies (for a recent study using the same approach, see Lavan et al., 2019). We chose this procedure after comparing shifting and morphing with both software tools and, subjectively, we obtained the best results for naturalistic sounding stimuli by using Praat to change the vocal tract parameters and TandemSTRAIGHT to morph the vowels in a trajectory along their continuum.

1. Morphing

TandemSTRAIGHT separates speech waveforms into an F0 track, a spectral envelope, and an aperiodicity component. These parts can be transformed separately and recombined to create manipulated speech. It is also possible to interpolate between two end points on each of the component dimensions to morph between naturalistic stimuli. We morphed our target stimuli on the spectral envelope dimension only. The goal of TandemSTRAIGHT's morphing procedure is to create intermediate versions that are not just mixtures of the end points but points on a trajectory through formant space from end point A to end point B.

Each end point was loaded into TandemSTRAIGHT and analyzed with automatic F0 refinement. In the subsequent anchoring step, we identified peaks approximately representing the main formant frequencies in both end point spectra across multiple time points. These were then used to morph between the spectral envelopes. We selected the F0 tracks of the /u/ end point for the final morphs to increase the chance that the morphed end points would still be clearly identified as /u/. Pilot experiments had shown that there was a tendency for /u/ stimuli to sound more like /o/ after the STRAIGHT procedure.

2. Shifting

We adjusted the median F0 of our female base voice to 170 Hz, which is approximately the center between naturally occurring male and female distributions (Barreda and Nearey, 2012), and we assumed that it would be easier for shifted voices to remain within a plausible realm for naturally occurring voices if they originated right in the distribution center. We additionally shifted the base voice down in FF by a factor of 0.93 before creating the other shifted voices because formant measurements and listening suggested this as a good, natural-sounding value for the new median F0. To facilitate the learning of speaker F0 and avoid large overlaps of F0 due to natural prosodic variations, we flattened each F0 track around the median with a factor of 0.2.
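The F0-track flattening can be illustrated with a minimal sketch. We assume here that the compression acts directly on the F0 values in Hz; Praat's actual manipulation may operate in a different domain:

```python
def flatten_f0(track, median_f0, factor=0.2):
    """Compress each F0 sample's deviation from the median by `factor`."""
    return [median_f0 + factor * (f - median_f0) for f in track]

flat = flatten_f0([150.0, 170.0, 200.0], median_f0=170.0)
print(flat)  # deviations of -20 and +30 Hz shrink to -4 and +6 Hz
```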

Next, we shifted the base voice, i.e., the center speaker, to create four edge speakers that differed in F0 and FF. The F0 and FF shifting was executed in Praat with the “change gender” routine. We shifted FF by 2.36 semitones up and down and shifted F0 by 3.78 semitones up and down, similar to Lavan et al. (2019). These values were derived from Gaudrain et al. (2009), who found that FF shifts are 1.6 times more effective than F0 shifts for causing perception of different-speaker identities (3.78/2.36 ≈ 1.6). We wanted the outer four voices to be confusable with the central voice, so we set the shift values relative to the center speaker below the speaker identity thresholds of 45% for F0 and 25% for FF reported in Gaudrain et al. (2009).
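The semitone values above translate into frequency ratios via the standard 2^(st/12) relation; a quick sketch confirms that both shifts stay below the confusability thresholds just mentioned:

```python
def semitones_to_ratio(st):
    """Convert a shift in semitones to a multiplicative frequency ratio."""
    return 2 ** (st / 12)

ff_ratio = semitones_to_ratio(2.36)  # ~1.146, a ~14.6% FF shift (< 25%)
f0_ratio = semitones_to_ratio(3.78)  # ~1.244, a ~24.4% F0 shift (< 45%)
print(round(3.78 / 2.36, 2))         # 1.6, the FF:F0 effectiveness ratio
```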

We ran all three experiments online using the Pavlovia platform. Each experiment was written in JavaScript using the jsPsych framework (de Leeuw, 2015). The research was approved by the local ethics committee. After confirming the consent page, participants were asked to adjust their headphones to a loud but comfortable level while a test sentence was playing. Absolute volume control is not possible in online experiments, but we asked subjects not to vary their self-adjusted volume during the experiment. Then, we ran a headphone check task to screen for people without reliably working headphones (Woods et al., 2017). The task consists of six triplets of tones, in each of which the target sound is the one with the lowest volume. As the left and right channels are phase-inverted, playback over loudspeakers causes wave cancellations that make the wrong stimulus appear to have the lowest volume. We allowed participants to proceed if they answered four out of six items correctly in one of two attempts. The first experiment took, on average, 30.3 min (range 22.5–60.7 min), the second experiment took, on average, 29.7 min (range 23.8–53.3 min), and the third experiment took, on average, 38.5 min (range 34–56.2 min).

1. Experiment 1

Experiment 1 was the mixed condition, in which the center and edge speakers were presented in random order. There were 828 trials, resulting from the following factors: 4 edge speakers × 5 morph levels × 9 words × 4 repetitions + 1 center speaker × 3 morph levels (0.25, 0.5, and 0.75) × 9 words × 4 repetitions. We included different numbers of morph levels for the edge and center speakers because (1) we expected the strongest shift effects on the intermediate morph levels of the center speaker and, hence, omitted the morph end points for this speaker; and (2) we included the morph end points for the edge speakers to provide clear /o/ and /u/ stimuli for all of the F0 and FF manipulations. Overall, we aimed to include more trials with edge speakers than center speakers to induce context effects (see experiment 2 with the same number of trials per speaker). Trials were split into eight blocks in pseudo-randomized order, in which the same word or same speaker never appeared in two adjacent trials. So as not to draw attention to the manner in which the different voices were created, we did not want participants to be able to compare directly between the same source word shifted in two different ways. In each trial, first, the target word was played, and then the participants could press the “o” or “u” key on their keyboard to indicate which vowel they perceived. The maximum trial duration, including response time, was 5 s, giving participants enough time to respond without time pressure.

2. Experiment 2

Experiment 2 was the blocked condition. The combination of trials was the same as that in experiment 1, but the presentation order was different. Trials were ordered into 4 consecutive blocks, 1 for each edge speaker, with the same 27 center speaker trials randomly interspersed in each of the 4 larger blocks. We pseudo-randomized trial order within blocks such that the same word or center speaker never appeared in two adjacent trials. The reasoning for this was the same as that in experiment 1. Additionally, the first three trials in each block had to be from the associated edge speaker so that block context was established before the center speaker was presented.

3. Experiment 3

Experiment 3 was the precursor condition. This experiment consisted of 270 trials, corresponding to the available selection of sentences from “Alice in Wonderland.” There were 216 trials where the target word was spoken by an edge speaker, with 4 speakers × 9 words × 3 morph levels (0.25, 0.5, and 0.75) × 2 repetitions. In the remaining 54 trials, the target word was spoken by the center speaker, with 9 words × 1 morph level (0.5) × 6 repetitions. Because 54 is not divisible by 4, 2 randomly chosen precursor edge speakers per participant had 1 trial more than the other 2. Again, we included different numbers of morph levels for the edge and center speakers for the same reasons as were stated above. In addition, we included fewer morph levels to reduce the number of trials because the precursor experiment with whole sentences took longer overall. To that end, we only included the intermediate levels for which we anticipated the strongest context effects. The trial order was pseudo-randomized such that the same target speaker, precursor speaker, or word could not appear in two adjacent trials.

1. Mixed model 1: Perception of shifted vowels in speaker context

We tested the main hypotheses about FF and F0 shift effects on vowel perception across all three experiments with a generalized linear mixed model (GLMM) using the Julia library “MixedModels.jl” (Bates et al., 2021). The dependent variable was modelled as a Bernoulli variable. The first mixed model was specified with the following formula without a full interaction structure as we had no hypotheses about higher order interactions, and our experiment was not designed to test an interaction of FF and F0:

ResponseO ∼ Experiment × Morph + Experiment × FF of the voice + Experiment × F0 of the voice + (1 + Morph | Participant) + (1 + Morph | Wordpair).

We specified random intercepts and random slopes for both participants and word pairs as grouping variables. For this model, all trials except the center speaker trials in the blocked and precursor conditions were used, so that we compared only responses without context (in the mixed condition) or with same-speaker context (in the blocked and precursor conditions), not with different-speaker context. In this model, we used the default contrast coding system, “dummy coding,” to compare the mean of each dependent variable to the baseline condition (i.e., the mixed condition). This allows us to test differences between the context conditions and the mixed condition. Before feeding the data into the model, morph levels were normalized to a scale from −2 to 2 in steps of one so that the model coefficient for the morph level is interpretable as one step on the morph continuum. The F0 and FF factors were normalized to the numbers −1, 0, and 1 for shift down, no change, and shift up, respectively. Because sound frequencies are perceived logarithmically, shifts up and down constitute perceptually similar steps.
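The predictor coding described above can be made concrete with a small sketch (the level indices and direction labels are illustrative, not the actual data layout):

```python
def code_morph(step):
    """Map morph steps 1..5 onto -2..2 (one unit = one continuum step)."""
    return step - 3

def code_shift(direction):
    """Map a shift direction onto -1 (down), 0 (none), or 1 (up)."""
    return {"down": -1, "none": 0, "up": 1}[direction]

print([code_morph(s) for s in range(1, 6)])  # [-2, -1, 0, 1, 2]
print(code_shift("up"))                      # 1
```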

2. Mixed model 2: Perception of unshifted vowels in blocked speaker context

We tested the hypotheses about the effects of FF and F0 context on the perception of the center speaker in the blocked condition with a separate mixed model. In this model, FF and F0 refer to the values of the edge speaker contexts, in which the center speaker targets were embedded. Because there was no inter-experiment comparison, the model formula was simplified to ResponseO ∼ Morph + FF of the context + F0 of the context + (1 + Morph | Participant) + (1 + Morph | Wordpair). The factors were normalized in the same way as those in model 1.

3. Mixed model 3: Perception of unshifted vowels after precursor sentences providing speaker context

To test the same hypotheses as in model 2 with the precursor condition, we computed a third mixed model. This model was further reduced compared to model 2 because there is only one central morph step with which center speaker targets were tested. Therefore, the model formula was simplified to ResponseO ∼ FF of the context + F0 of the context + (1 | Participant) + (1 | Wordpair). The factors were normalized in the same way as those in models 1 and 2.

4. Bayesian model

We built two exploratory Bayesian models to test how well they could explain the behavioral patterns and context effects of the mixed and blocked conditions using priors for F0 and FF correspondence and variance in naturally occurring speakers. The idea is similar to the probabilistic sliding-template model by Nearey and Assmann (2007) in that it incorporates F0 information indirectly by influencing the estimated scale factor for a given speaker's formant frequencies. However, the present model is a vowel-pair discrimination model that estimates psychometric curves for one /o/-/u/ continuum with a specific context term. The model was built using the Turing.jl library (Ge et al., 2018).4

First, Bayesian linear regression coefficients (slope a, intercept b, and variance σ²) were fitted on the vowel formant dataset from Hillenbrand et al. (1995) to estimate a population prior. Each speaker's vowel exemplars from the dataset were averaged over F1, F2, and F3 after converting to Bark to obtain an FF estimate for each speaker, forming a point cloud in FF/F0 space that represented the rough population prior. The intercept, slope, and variance of the population prior were sampled with four chains, each drawing 4000 iterations. The model is based on the assumption that listeners, when they hear speech with a given F0, have a prior expectation of speaker FF that corresponds to a normal distribution, FF ∼ N(a·F0 + b, σ²_FF). Hence, the formant frequencies were modelled to depend on the slope, intercept, and respective fundamental frequency as a normal distribution with variance σ²_FF. This variance was given an uninformative prior, a normal distribution with mean 0 and variance 100, truncated to the positive value range. The prior for the intercept was drawn from a uniform distribution from 0 to 20 Bark, which corresponds to the critical range of hearing. The prior for the slope was set to vary as a normal distribution around 0 with a variance of 5. Under this population prior, the incoming sensory estimate is shifted toward the prior belief. As this shift is undesirable when the listener is more certain of a speaker's real FF value, a listener should be able to learn when the population prior is incorrect for a given speaker. Therefore, in our full model, the estimated FF is shifted by a fraction θ·ε, where θ is the strength of the prior correction and ε is a learned “prediction error,” i.e., the difference between the population-prior FF and the context FF. A full shift (θ = 1) would imply that the population prior is completely neutralized.
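The prior-correction step can be sketched as follows (the slope, intercept, and FF values are illustrative placeholders, not the fitted Hillenbrand coefficients):

```python
def corrected_ff(f0, context_ff, slope, intercept, theta):
    """Population-prior FF prediction, shifted by theta times the
    learned prediction error toward the context FF."""
    prior_ff = slope * f0 + intercept  # population prior: a * F0 + b
    epsilon = context_ff - prior_ff    # learned "prediction error"
    return prior_ff + theta * epsilon  # theta = 1 neutralizes the prior

ff_population = corrected_ff(170.0, 12.0, 0.01, 9.0, theta=0.0)  # pure prior
ff_context = corrected_ff(170.0, 12.0, 0.01, 9.0, theta=1.0)     # full shift
```

With θ = 0 the sketch reproduces the pure population prediction; with θ = 1 the context FF fully overrides it, matching the two limiting cases described above.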

Two independent scale factors differentiate between the experiment conditions. Each scale factor, θ, is sampled from an uninformative normal distribution centered on μscale = 0 with a variance of σscale² = 2. As no prediction error is expected to be relevant in the mixed condition, the estimate for the scale factor should be θmixed = 0.

The prior belief about speaker FF is generated by first predicting from the input F0 and then shifting by the scaled context-dependent prediction error. When combined with the prior variance, σprior², this gives the prior distribution, Dprior = N(FFpred, σprior²). The input FF is also estimated as a normal distribution, Dinput = N(FFinput, σinput²), with the variance, σinput², sampled from an uninformative normal distribution with mean 0 and variance 100, limited to the positive value range. These two distributions are combined analytically into a third normal distribution, Dspeaker, which is an estimate of the formant mean for the current speaker. The probability of responding /o/ for a given morph level, m, was then computed as p(o) = cdf(Dspeaker, FFstim + η·m), where η is a scaling factor for the impact of the morph steps on the perceptual decision. Here, η is assumed to be normally distributed with mean μmorphscale = 0 and variance σmorphscale² = 3. An alternative model was implemented that did not incorporate a prediction error and, hence, no local speaker-specific prior. Except for the two parameters, θ, scaling the prediction error, which were obsolete in this case, the parameters and their priors remained the same. In the final step, both Bayesian models were fitted on the number of /o/ responses given for each morph step, which were modelled as coming from a binomial distribution in which p(o) is the value from the previous equation.
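
The analytic combination of the two normal distributions is the standard precision-weighted product of Gaussians, and the psychometric probability is a normal cdf evaluated at the morph-scaled stimulus formant. A self-contained Python sketch of these two steps follows; the variable names and the illustrative numbers are ours, not values from the fitted model.

```python
import math

def normal_cdf(x, mu, sigma):
    # Cumulative density of N(mu, sigma^2) at x, via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def combine_normals(mu_prior, var_prior, mu_input, var_input):
    # Precision-weighted (conjugate) fusion of two normal estimates:
    # the analytic product of Gaussians used for D_speaker.
    precision = 1.0 / var_prior + 1.0 / var_input
    mu = (mu_prior / var_prior + mu_input / var_input) / precision
    return mu, 1.0 / precision

def p_o(morph, ff_stim, mu_speaker, var_speaker, eta):
    # p(o) = cdf(D_speaker, FF_stim + eta * m): the cdf of the speaker's
    # formant-mean estimate evaluated at the morph-shifted stimulus FF.
    return normal_cdf(ff_stim + eta * morph, mu_speaker, math.sqrt(var_speaker))

# Illustrative fusion: the combined mean lies between prior and input,
# closer to the lower-variance input estimate.
mu_spk, var_spk = combine_normals(10.5, 0.094, 11.0, 0.030)
```

Because the fused variance is always smaller than either component variance, a reliable sensory input can dominate the population prior, which is the mechanism that lets context override generic speaker expectations.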

Both models were sampled with 4 chains, each drawing 8000 iterations. The evidence for each model was calculated with the Julia interface5 for the Python package ArviZ (Kumar et al., 2019). Using Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO-CV), the pointwise log-likelihood estimating the prediction accuracy was calculated, and a model comparison evaluated which model fit the observed data best (Vehtari et al., 2017).

5. Exploratory analyses

We conducted some exploratory analyses for which we did not have a clear hypothesis based on the previous literature.

To directly compare the influence of blocked and precursor context, we computed an additional version of the first mixed model "perception of shifted vowels in speaker context" with sequential difference coding. This allowed us to directly test the difference between the mixed and blocked conditions and the difference between the blocked and precursor conditions. In addition, we merged mixed models 2 and 3 and added the interaction term with experiment [ResponseO ∼ Experiment × FF of the context + Experiment × F0 of the context + (1 | Participant) + (1 | Wordpair)] to compare the effect of blocked and precursor context on the shifts of the center speaker. The coefficient for each predictor becomes the coefficient for that variable only for the reference group, which is, here, the blocked experiment. The interaction term between experiment and each predictor represents the difference in the coefficients between the blocked and precursor experiments. In addition, we aimed to test whether the shifts observed in experiment 3 are smaller than those in experiment 2. This pattern should be observed if the full sentence precursor allows for more effective normalization. Finally, we were interested in how the context effects develop over time. In a descriptive approach, we plotted the development of mean responses in the blocked condition for each morph across four time bins.

1. Perception of shifted vowels depends on speaker context

We tested how FF and F0 shifts affect vowel perception across the three experiments with a GLMM (see Fig. 2 and coefficient estimates across the three experiments in Table I). In the mixed condition, high F0 of the voice led to more /u/ judgments and, vice versa, low F0 of the voice led to more /o/ judgments, as the estimate for F0 of the voice was significantly negative (−1.7009, p < 1e−99). Conversely, for FF shifts, high FF of the voice led to more /o/ judgments and, vice versa, low FF of the voice led to more /u/ judgments, as the estimate for FF of the voice was significantly positive (2.5684, p < 1e−99).

FIG. 2.

(Color online) An overview of the psychometric response curves in all three experiments [in (A), (B), and (C)], as well as a comparison of means [in (D)]. Individual curves are plotted in lighter colors in the background, and mean curves across participants are plotted with darker colors and scatter markers for each morph level. For visualization of the psychometric functions, we averaged across the nine individual words and repetitions per word per morph level.

TABLE I.

The results of mixed model 1, including shifted vowels in different contexts. The model formula was ResponseO ∼ Experiment × Morph + Experiment × FF of the voice + Experiment × F0 of the voice + (1 + Morph | Participant) + (1 + Morph | Wordpair). *p < 0.05, **p < 0.01, ***p < 0.001. SE, standard error. ffv, FF of the voice; f0v, F0 of the voice.

                                 Estimate      SE        z    p           σparticipant  σwordpair
(Intercept)                       −0.0259  0.2254    −0.11    0.9085      0.6473        0.5732
Experiment: Blocked                0.0231  0.1692     0.14    0.8914
Experiment: Precursors            −0.6616  0.1714    −3.86    0.0001***
Morphc                             1.2655  0.1238    10.22    <1e−23***   0.2486        0.3407
ffv                                2.5684  0.0382    67.20    <1e−99***
f0v                               −1.7009  0.0318   −53.52    <1e−99***
Experiment: Blocked × Morphc       0.0468  0.0690     0.68    0.4980
Experiment: Precursors × Morphc   −0.0183  0.0794    −0.23    0.8177
Experiment: Blocked × ffv         −1.4929  0.0470   −31.76    <1e−99***
Experiment: Precursors × ffv      −1.2186  0.0619   −19.68    <1e−85***
Experiment: Blocked × f0v          0.8299  0.0416    19.94    <1e−87***
Experiment: Precursors × f0v       1.0151  0.0543    18.69    <1e−77***

The estimate for the interaction factor of the blocked experiment with FF of the context was significantly negative (β = −1.4929, p < 1e−99) and with F0 of the context was significantly positive (β = 0.8299, p < 1e−87). Since the mixed estimates for FF were positive and those for F0 were negative, the estimates for the blocked condition indicate a reduction of these base values toward zero, i.e., a smaller shift than in the mixed condition, as hypothesized. The same effect was confirmed for the precursor condition with an estimate of β = −1.2186 (p < 1e−85) for the interaction of the precursor experiment with FF of the context and an estimate of β = 1.0151 (p < 1e−77) for F0 of the context.
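
As a quick arithmetic check, the net FF and F0 slopes per condition follow from adding each interaction term to the mixed-condition base estimate (values from Table I; treatment coding with mixed as the reference level):

```python
# Net context-shift slopes per condition, derived from Table I.
ffv_mixed, f0v_mixed = 2.5684, -1.7009

ffv_blocked = ffv_mixed + (-1.4929)      # interaction Blocked x ffv
f0v_blocked = f0v_mixed + 0.8299         # interaction Blocked x f0v
ffv_precursors = ffv_mixed + (-1.2186)   # interaction Precursors x ffv
f0v_precursors = f0v_mixed + 1.0151      # interaction Precursors x f0v

# In both context conditions, the slopes keep their sign but move toward
# zero, i.e., smaller shifts than in the mixed condition.
assert abs(ffv_blocked) < abs(ffv_mixed) and abs(f0v_blocked) < abs(f0v_mixed)
assert abs(ffv_precursors) < abs(ffv_mixed) and abs(f0v_precursors) < abs(f0v_mixed)
```
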

There was no significant tendency across subjects to respond with /o/ or /u/ in the mixed and blocked conditions, i.e., the intercept was not significantly different from zero (β = −0.0259, p = 0.9085 and β = 0.0231, p = 0.8914, respectively). There was, however, a significantly negative estimate of the intercept for the precursors condition (β = −0.6616, p = 0.0001), indicating that responses were shifted toward /u/.

2. Perception of unshifted vowels depends on blocked speaker context

The second mixed model (see Table II) confirmed that FF context in the blocked condition had a significant effect on the perception of the embedded central speaker by shifting responses toward /u/ for high FF contexts (β = −1.5476, p < 1e−98). F0 context also had a significant effect on central speaker perception, in the opposite direction from that of FF context, shifting responses toward /o/ for high F0 contexts (β = 1.2717, p < 1e−75). Both context effects are in the opposite direction of the shift effects in the corresponding matching-context edge speakers. Again, the intercept was not significantly different from zero (β = −0.1400, p = 0.5736), indicating that participants were overall not biased toward responding with /o/ or /u/ for the mismatching-context center speaker targets.

TABLE II.

The results of mixed model 2, including the blocked condition. The model formula was ResponseO ∼ Morph + FF of the context + F0 of the context + (1 + Morph | Participant) + (1 + Morph | Wordpair). *p < 0.05, **p < 0.01, ***p < 0.001. ffc, FF of the context; f0c, F0 of the context.

             Estimate      SE        z    p           σparticipant  σwordpair
(Intercept)   −0.1400  0.2487    −0.56    0.5736      0.6920        0.6270
Morphc         1.3631  0.2031     6.71    <1e−10***   0.2457        0.5610
ffc           −1.5476  0.0732   −21.13    <1e−98***
f0c            1.2717  0.0689    18.44    <1e−75***

3. Perception of unshifted vowels depends on precursor sentences

The third mixed model (see Table III) confirmed that, as in the blocked condition, FF and F0 contexts in the form of precursor sentences shifted perception of the center speaker targets. The directions were the same as those in the blocked condition: Higher FF contexts shifted perception toward /u/ (β = −0.5180, p < 1e−08) and higher F0 contexts shifted perception toward /o/ (β = 0.2436, p = 0.0039). Here, the intercept was significantly different from zero (β = −0.8867, p = 0.0040), indicating that participants were biased toward responding with /u/ for the mismatching-context center speaker targets. This matched the global bias toward /u/ responses in the precursor condition seen in the first model and Fig. 2.

TABLE III.

The results of mixed model 3, including the precursor condition. The model formula was ResponseO ∼ FF of the context + F0 of the context + (1 | Participant) + (1 | Wordpair). *p < 0.05, **p < 0.01, ***p < 0.001. ffc, FF of the context; f0c, F0 of the context.

             Estimate      SE       z    p          σparticipant  σwordpair
(Intercept)   −0.8867  0.3080   −2.88    0.0040**   1.2030        0.6172
ffc           −0.5180  0.0865   −5.99    <1e−08***
f0c            0.2436  0.0844    2.89    0.0039**

1. Comparison of blocked and precursor context

We calculated the direct comparison of the influence of blocked and precursor context in an additional version of the first mixed model "perception of shifted vowels in speaker context" by using sequential difference coding (see Table IV). This showed that the FF context effect was normalized less in the precursor condition than in the blocked condition, while the F0 context effect was normalized more strongly (β = 0.2742, p < 1e−05 and β = 0.1853, p = 0.0004, respectively). Since the mixed estimates for FF were positive and those for F0 were negative (β = 1.6645, p < 1e−99 and β = −1.0859, p < 1e−99, respectively), the negative interaction of FF context and the positive interaction of F0 context with the blocked condition indicate less bias than in the mixed condition, i.e., effective normalization. In contrast, the positive interactions of FF and F0 contexts with the precursor condition indicate less normalization in the FF context than in the blocked experiment but an even more effective normalization in the F0 context than in the blocked experiment.
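
For reference, sequential (backward) difference coding for the three-level experiment factor can be sketched as follows. With this coding, the first contrast estimates blocked − mixed and the second estimates precursors − blocked; the contrast values are the standard backward-difference codes, and the coefficients below are arbitrary illustrative numbers, not fitted values.

```python
# Backward (sequential) difference codes for a three-level factor.
contrasts = {
    "mixed":      (-2/3, -1/3),
    "blocked":    ( 1/3, -1/3),
    "precursors": ( 1/3,  2/3),
}

def cell_mean(level, intercept, b1, b2):
    # Linear predictor for one factor level under this coding.
    c1, c2 = contrasts[level]
    return intercept + b1 * c1 + b2 * c2

# With arbitrary illustrative coefficients, the coded model recovers the
# adjacent-level differences as its coefficients:
b0, b1, b2 = 0.5, -1.2, 0.8
m = {lvl: cell_mean(lvl, b0, b1, b2) for lvl in contrasts}
assert abs((m["blocked"] - m["mixed"]) - b1) < 1e-12
assert abs((m["precursors"] - m["blocked"]) - b2) < 1e-12
```

Because each contrast column sums to zero, the intercept remains the grand mean across conditions, which is why the intercepts in Tables I and IV are directly comparable.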

TABLE IV.

The results of mixed model 4, including the shifted vowels in different contexts with sequential difference coding. The model formula was ResponseO ∼ Experiment × Morph + Experiment × FF of the voice + Experiment × F0 of the voice + (1 + Morph | Participant) + (1 + Morph | Wordpair) with sequential difference coding. *p < 0.05, **p < 0.01, ***p < 0.001. ffv, FF of the voice; f0v, F0 of the voice.

                                 Estimate      SE        z    p           σparticipant  σwordpair
(Intercept)                       −0.2406  0.2033    −1.18    0.2367      0.6473        0.5729
Experiment: Blocked                0.0226  0.1692     0.13    0.8940
Experiment: Precursors            −0.6834  0.1715    −3.99    <1e−04***
Morphc                             1.2752  0.1178    10.83    <1e−26***   0.2486        0.3407
ffc                                1.6645  0.0230    72.35    <1e−99***
f0c                               −1.0859  0.0204   −53.21    <1e−99***
Experiment: Blocked × Morphc       0.0463  0.0690     0.67    0.5022
Experiment: Precursors × Morphc   −0.0657  0.0791    −0.83    0.4058
Experiment: Blocked × ffc         −1.4930  0.0470   −31.76    <1e−99***
Experiment: Precursors × ffc       0.2742  0.0564     4.86    <1e−05***
Experiment: Blocked × f0c          0.8299  0.0416    19.94    <1e−87***
Experiment: Precursors × f0c       0.1853  0.0519     3.57    0.0004***

A numerical comparison of the shift strength across the two mixed models on the center speaker revealed that the normalization in the precursor condition was smaller than in the blocked condition. Contrary to our exploratory hypothesis, this suggests that many short and similar context samples (i.e., words) induced stronger speaker normalization than longer and inherently more variable single sentences.

In an additional mixed model, we directly compared the shift effect of the center morph stimulus across the blocked and precursor experiments. As in the second mixed model, FF context in the blocked condition had a significant effect on the perception of the embedded central speaker, shifting responses toward /u/ for high FF contexts (β = −1.6098, p < 1e−41), and F0 context shifted responses toward /o/ for high F0 contexts (β = 1.3640, p < 1e−33). The coefficients for the precursor experiment can be obtained by adding the coefficient for the FF or F0 predictor alone and that predictor's interaction with the precursor experiment, resulting in corresponding significant but smaller effects for FF context (−1.60978 + 1.09945 = −0.51033) and F0 context (1.36396 − 1.12131 = 0.24265). The significant interaction terms indicate the significant differences between the experiments (see Table V).
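
The derivation of the precursor-condition slopes can be verified directly from the Table V coefficients:

```python
import math

# Blocked is the reference level; the precursor slopes are the sum of the
# base coefficient and its interaction with the precursor experiment.
ffc_blocked, f0c_blocked = -1.60978, 1.36396
ffc_interaction, f0c_interaction = 1.09945, -1.12131

ffc_precursors = ffc_blocked + ffc_interaction   # FF context slope, precursors
f0c_precursors = f0c_blocked + f0c_interaction   # F0 context slope, precursors

# Both precursor effects keep their sign but are markedly smaller.
assert math.isclose(ffc_precursors, -0.51033, abs_tol=1e-5)
assert math.isclose(f0c_precursors, 0.24265, abs_tol=1e-5)
```
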

TABLE V.

The results of mixed model 5, including the blocked vs precursor contexts. The model formula was ResponseO ∼ Experiment × FF of the context + Experiment × F0 of the context + (1 | Participant) + (1 | Wordpair). *p < 0.05, **p < 0.01, ***p < 0.001. ffc, FF of the context; f0c, F0 of the context.

                               Estimate      SE        z    p           σparticipant  σwordpair
(Intercept)                     −0.1616  0.2759    −0.59    0.5581      0.9616        0.5934
Experiment: Precursors          −0.7029  0.2681    −2.62    0.0088**
ffc                             −1.6098  0.1182   −13.62    <1e−41***
f0c                              1.3640  0.1110    12.29    <1e−33***
Experiment: Precursors × ffc     1.0994  0.1453     7.57    <1e−13***
Experiment: Precursors × f0c    −1.1213  0.1388    −8.08    <1e−15***

2. Discrimination over time

The development of mean responses over time in the blocked condition was visualized for each morph over four time bins (Fig. 3). In all of the panels, the range of p(o) between /o/ and /u/ end points was larger in the last bin than it was in the first bin with all of the morph levels “fanning out” over time. This observation supports a slowly establishing context effect, where discriminability of morph levels rises over time. If voice characteristics were learned over a couple of example trials, we would expect to see mostly flat lines after the second time bin.

FIG. 3.

(Color online) The development of mean responses in the blocked condition for each morph over time. The colors indicate the context speaker's voice properties. The 50% morph of the center speaker (in gray) shows the strong main effect of biased perception in each context condition. In the blocked condition, a complete normalization of the edge speaker would adjust the graphs of the morphed stimuli of the respective speaker to their veridical probability [e.g., 0.75 morph ends at p(o)=0.75]. Standard errors of the mean are plotted as a shaded area. In all four context panels, the range of p(o) between /o/ and /u/ end points was larger in the last bin than in the first bin, indicating that the discriminability of morph levels rises over time.


3. Bayesian model of speaker context on vowel perception

To explain the observed effect of speaker context on vowel perception, we implemented two Bayesian models. The first model predicted that only the general population prior was applied in the mixed and blocked conditions, whereas the second model used a general population prior of F0 and FF correspondence in the mixed condition and specific-speaker priors in the blocked condition.

In the first step, population prior parameters were fit to the vowel data from Hillenbrand et al. (1995). The chains converged sufficiently with R̂ ≈ 1 (see Table VI). The parameter estimates for the linear regression from F0 to FF in Bark were a slope of a = 0.886, an intercept of b = 9.614, and a variance of σ² = 0.094. These were then fed into both Bayesian models as described in Sec. II D 4 to fit the parameters that described the mean response in the mixed and blocked conditions. We left out the precursor condition because its smaller shift effects would have been less informative for modeling. For both models, R̂ values were below the recommended 1.01, indicating that the chains converged. Moreover, stable parameter estimates can be presumed, as all effective sample sizes (ESSs) exceeded the ESS > 400 that Vehtari et al. (2021) advocate when using four chains (see also Kruschke, 2021).

TABLE VI.

The results of the Bayesian model with prediction error. ESS, effective sample size; MCSE, Monte Carlo Standard Error.

Parameter         μ        σ    Naive SE    MCSE    ESS           R̂
σFF²         0.0297   0.0008      0.0000  0.0000    19 874.2726   1.0000
θmixed      −0.0026   2.0099      0.0112  0.0111    30 462.2169   1.0000
θblocked     0.5812   0.0090      0.0001  0.0000    28 922.0722   0.9999
log(η)      −1.6923   0.0164      0.0001  0.0001    20 003.3502   1.0000

With the estimated mean of the fitted parameters, we predicted the probability of responding /o/ for all of the speaker and morph combinations present in the mixed and blocked experiments for both models separately. A visualization of the predictions compared with the measured values is shown in Fig. 4.

FIG. 4.

(Color online) A comparison of the mean response psychometric curves in mixed and blocked conditions with the corresponding predictions from a Bayesian model, where solid lines show the measured data and dotted lines show the model prediction based on the two models. (Top) Bayesian model without prediction error (assuming the population prior). (Bottom) Bayesian model with calculation of prediction error based on context speaker. The model with local speaker priors predicts the data better than the model with only the population prior, and this is especially apparent for the context effect on the center speaker (right column).


Already from visual inspection, it can be inferred that the null model has a reduced model fit in comparison to the full model. Table VII shows the model comparison based on Pareto-smoothed importance sampling leave-one-out (PSIS-LOO) between the full model, which incorporates the prediction error and thereby a local speaker-specific prior in the blocked condition, and the null model, which assumes the general population prior independent of the condition. The evidence supports the full model, as it predicts the observed data more accurately in the leave-one-out (LOO) comparison [i.e., its expected log-pointwise predictive density (ELPD) of −3092 is greater than the ELPD of the null model, −8030; see Table VII]. The ELPD difference between the two models was larger than ten times the standard error of the difference in the ELPD, which constitutes strong evidence in favor of the full model. Since the predictive ability of our full model was higher than that of the null model, we only report the summary statistics of the estimated parameters of this model in Table VI. As expected, the estimate of the scale parameter, θmixed, revolves around zero for the mixed condition, in which the speaker properties vary from trial to trial and, therefore, no context-based prediction error is relevant. The scale parameter for the blocked condition, θblocked = 0.58, indicates the strength of the influence of the speaker-specific prior over the general population prior. Thus, the stimuli presented in the blocked condition were strongly shifted toward the local speaker-specific prior.
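
The decision rule applied here can be stated compactly: the model with the higher ELPD is preferred, and a difference exceeding several times its standard error counts as strong evidence (numbers from Table VII):

```python
# ELPD comparison values from Table VII (PSIS-LOO).
elpd_full, elpd_null = -3092.0, -8030.0
d_loo, dse = 4938.03, 435.56   # ELPD difference and its standard error

# The full model has the higher ELPD ...
assert elpd_full > elpd_null
# ... and the difference exceeds ten standard errors, i.e., strong
# evidence in favor of the full model.
assert d_loo / dse > 10
```
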

TABLE VII.

The model comparison using leave-one-out-cross-validation. LOO, leave-one-out cross-validation (expected log-pointwise predictive density); P_LOO, estimated effective number of parameters; D_LOO, relative difference between each PSIS-LOO; weight, relative weight for each model, and this can be loosely interpreted as the probability of each model (among the compared model) given the data; SE, standard error of the LOO estimate; DSE, standard error of the difference in IC between each model and the top-ranked model.

Model        Rank      LOO    p_LOO     d_LOO   Weight       SE      DSE
Full model      0    −3092  204.404      0.0    0.8373  207.302     0.0
Null model      1    −8030  531.325  4938.03    0.1627  461.017  435.56

The estimate for log(η) needs to be exponentiated to obtain η = 0.185 as a scaling factor on the scaled morph levels (centered on zero, the scaled morphs are one unit apart; thus, one morph equals one step of size η on the abscissa of the cumulative density function, which represents the probability of an /o/ response).
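
A one-line numeric check of the back-transformation (the small rounding gap between exp(−1.6923) ≈ 0.184 and the reported 0.185 presumably reflects posterior summarization):

```python
import math

# Posterior mean of log(eta) from Table VI, exponentiated to the
# morph-step scaling factor eta used in the text.
eta = math.exp(-1.6923)
assert abs(eta - 0.185) < 0.002
```
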

We examined how vowel perception is shaped by expectations about speakers' FF and F0 voice parameters in three online experiments. Our results show that FF and F0 deviations introduced strong shifts in the perception of an /u/ to /o/ vowel continuum when the current speaker characteristics were unpredictable (mixed condition). While higher FF values increased the probability of responding /o/, higher F0 values increased the probability of responding /u/, with opposite directions for decreased values. These shifts were reduced when speaker characteristics were predictable, i.e., in the blocked and precursor conditions. These findings suggest that participants were able to correct for some of the influence of the FF or F0 shift if they could learn or predict the current speaker's FF and F0 voice parameters. Additionally, we observed contrastive perceptual effects for unshifted ambiguous vowels in speaker context with FF or F0 shifts. These context effects on the perception of shifted and identical unshifted vowels contradict purely intrinsic normalization accounts, according to which external context should not affect vowel perception.

First, we demonstrated that perception of vowels shifted in FF or F0 depends on speaker context. Our results show that high F0 shifts induce more /u/ and low F0 shifts induce more /o/ judgments, whereas high FF shifts lead to more /o/ judgments and low FF shifts lead to more /u/ judgments, in agreement with previous studies investigating the influence of these parameters separately (Johnson, 1990a; Sjerps et al., 2019). These results support the assumption that given a speaker's F0, listeners anticipate a certain FF range for /o/ and /u/ such that speakers with relatively high FF (our high FF or low F0 edge speakers) appear to speak the vowel with higher FF, i.e., /o/, more often and speakers with low FF (our low FF or high F0 edge speakers) appear to speak the vowel with lower FF, i.e., /u/, more often. Accordingly, without anticipating a speaker with an unusual combination of FF or F0 (i.e., in the mixed condition), perception of the central speaker was unbiased as the expected FF given the speaker's F0 matched the actual FF. As hypothesized, the perceptual shift of the edge speakers toward /o/ or /u/ in context (i.e., the blocked and precursor conditions) was weaker than without context (i.e., in the mixed condition). This finding indicates that expectations based on current speaker context can override the general expectation of FF given F0.

Second, we showed that perception of identical vowel stimuli differed between voice contexts with unusual F0 and FF combinations. As hypothesized, a high FF context shifted perception of the center speaker toward more /u/ responses, whereas a low FF context shifted perception toward more /o/ responses (Ladefoged and Broadbent, 1957; Sjerps et al., 2019). Equivalently, in the F0 contexts, we observed a contrastive effect, in which the general tendency of the center speaker to sound more like /o/ than the high F0 speaker was amplified such that high F0 context shifted perception toward /o/ and the low F0 context shifted perception toward /u/. Hence, our observations are in contrast to the two alternatives from previous studies, suggesting either no F0 context effect, because context should establish the correct FF for the center speaker without further influence of F0 (Irino and Patterson, 2002), or context shifts in the same direction as the direct F0 effect, i.e., toward /u/ for the high F0 and toward /o/ for the low F0 context (Johnson, 1990b). Since the stimulus properties of the center speaker were identical, the observed perceptual differences in the four different voice contexts indicate that embedding voices within mismatching contexts of F0/FF combinations induced shifts in perception. The observed direct and indirect effects of F0 on perceived vowel quality argue against F0-free theories of speech perception, such as the Mellin transform (Irino and Patterson, 2002), which separates vocal tract size from shape information using F0 and, hence, should be agnostic to F0 shifts.

We formulated a Bayesian model to explain the observed behavioral effects of speaker context on vowel perception by implementing priors for F0 and FF correspondence and variance based on naturally occurring speaker characteristics. The Bayesian model is similar to the formulation in the probabilistic sliding-template model by Nearey and Assmann (2007) but with a focus on context effects through speaker expectations.

This model can fit the observed average response patterns in the mixed condition relatively well (Fig. 4). The largest discrepancies in model fit are observed for speakers that vary on F0, indicating that in our data, F0 changes cause even larger shifts than predicted by the relationship between F0 and FF derived from Hillenbrand et al. (1995).

The comparison of the model containing local speaker priors against the model with only a population prior shows that the model with speaker-specific priors fits the observed average response patterns better, especially for the context effects on the center speaker (Fig. 4). While the model without local speaker priors cannot explain the shifted perception of the identical center speaker items in different contexts, the model with local speaker priors predicts the perceptual shift toward /u/ for low F0 and high FF context and, vice versa, toward /o/ in high F0 and low FF context. These model predictions suggest that in informative speaker context (e.g., the blocked condition), listeners were able to correct the bias in vowel perception induced by F0 and FF shifts. The fits of the predictive model were overall better for the contexts with manipulated formant frequencies than for those with manipulated F0; as in the mixed condition, the largest discrepancies, also for the null model without local speaker priors, occur for speakers that vary on F0. Hence, even though the population prior based on the data from Hillenbrand et al. (1995) supports the idea behind these models that listeners have a generic population prior for typical combinations of F0 and FF values (approximated by a linear relationship from F0 to FF with an additional variance term), future work may explore how the population prior fits can be improved. For example, the shape of the prior might be centered on typical FF/F0 distributions for female, male, and child speakers. In addition, it has been shown that vowel formant frequencies differ between German and American populations (Strange et al., 2004). This suggests that future work could profit from extracting the population prior from the language under investigation, i.e., in this study, from a sample more similar to the speakers that our participant population listens to.

A future extension of this model could incorporate how participants become more certain about the context speaker's voice properties, which allows them to improve their performance over time (Fig. 3). The time course of the context effects suggests that the population prior takes time to be overwritten, while context evidence is accumulated and strengthened further over time.
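
One way such gradual overwriting of a broad population prior could be sketched is as conjugate Gaussian updating of a belief about a single voice parameter. All numbers below (prior width, observation noise, the speaker's F0 tokens) are illustrative assumptions, not fitted model values:

```python
# Minimal sketch: a listener's belief about a context speaker's mean F0,
# starting from a broad population prior and sharpening with each token.
def update_gaussian(prior_mean, prior_var, obs, obs_var):
    """One conjugate update of a Normal belief after observing `obs`."""
    precision = 1 / prior_var + 1 / obs_var
    post_var = 1 / precision
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var

mean, var = 180.0, 2500.0            # broad population prior over F0 (Hz)
observations = [120, 125, 118, 122]  # context tokens from a low-F0 speaker
for f0 in observations:
    mean, var = update_gaussian(mean, var, f0, obs_var=100.0)
# After a few tokens, the belief centers near the speaker's F0 and narrows.
```

The qualitative behavior matches the time course described above: early trials are dominated by the population prior, later trials by accumulated speaker-specific evidence.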

Our findings demonstrate that F0 and FF jointly influence vowel perception in speaker context. These observations can be explained by the application of general priors about the natural distribution of F0 and FF voice parameters when the speaker context is unpredictable, and of speaker-specific F0 and FF priors when the speaker context is known. This context-dependent use of speaker priors could help explain why listeners benefit from speaker familiarity when comprehending speech (Adank et al., 2009; Holmes et al., 2018; Kleinschmidt and Jaeger, 2015).

In general, it is difficult to generalize from a restricted set of speech stimuli to natural speech perception, as real speech has strong semantic constraints and high temporal continuity. It was nevertheless interesting that the precursor condition failed to produce better normalization than the blocked condition. Precursor sentences impose semantic constraints, so it should be possible to quickly infer the correct voice parameters for each speaker. Apparently, in the current study, participants did not apply the voice from the sentence to the subsequently presented test stimulus as efficiently as in the blocked condition. It is possible that sentences and test words were too loosely linked to be classified as belonging to the same voice, as they were only played sequentially and were not connected through sentence structure or intonation, an issue that was also explored by Johnson (1990a,b). Alternatively, or in addition, the stylistic differences between isolated words (in the blocked condition) and connected read speech (in the precursor condition) may entail differential use of sociolinguistic variance in production, such that speaker information extracted from sentences in one speech style may not be effectively applied to word stimuli presented in a different style (Labov, 1972).

The chosen /o/-/u/ vowel contrast is special in that it might be particularly sensitive to changes in FF and F0. If a vowel is represented as a combination of formants, and these formants are allowed to slide on a log-frequency axis (Turner et al., 2009), then vowel pairs that lie approximately on the main F1/F2 diagonal, such as /o/-/u/, should be prone to confusion when FF or F0 are shifted, whereas vowel pairs that lie orthogonally to the main F1/F2 diagonal should be less prone to confusion because they cannot be scaled linearly from one to the other. Therefore, unexpected F0 and FF values should have a smaller impact on vowel quality for such orthogonally positioned vowels.

We used a twofold approach to investigate the influence of speaker expectations on vowel perception. On the one hand, we showed how shifts in FF and F0 influenced vowel perception depending on context, i.e., when the FF and F0 voice characteristics of the speaker could or could not be anticipated. On the other hand, we observed that different expected speaker FF and F0 characteristics influenced perception of identical, unshifted vowels in a contrastive manner. In conclusion, our findings support the view that knowledge about a speaker's voice characteristics influences vowel perception.

This research was funded by the Emmy Noether program of the Deutsche Forschungsgemeinschaft to H.B. (German Research foundation; Grant No. DFG BL 1736/1-1).

1. See https://osf.io/kt5ge (Last viewed July 5, 2022).

2. See https://prolific.co (Last viewed July 5, 2022).

4. See https://turing.ml/stable/ (Last viewed July 5, 2022).

5. See https://julia.arviz.org/stable/ (Last viewed July 5, 2022).

1. Adank, P., Evans, B. G., Stuart-Smith, J., and Scott, S. K. (2009). "Comprehension of familiar and unfamiliar native accents under adverse listening conditions," J. Exp. Psychol.: Hum. Percept. Perform. 35(2), 520–529.

2. Assmann, P. F., and Nearey, T. M. (2008). "Identification of frequency-shifted vowels," J. Acoust. Soc. Am. 124(5), 3203–3212.

3. Barreda, S., and Nearey, T. M. (2012). "The direct and indirect roles of fundamental frequency in vowel perception," J. Acoust. Soc. Am. 131(1), 466–477.

4. Bates, D., Alday, P., Kleinschmidt, D., Bayoán Santiago Calderón, J., Zhan, L., Noack, A., Arslan, A., Bouchet-Valat, M., Kelman, T., Baldassari, A., Ehinger, B., Karrasch, D., Saba, E., Quinn, J., Hatherly, M., Piibeleht, M., Mogensen, P. K., Babayan, S., and Gagnon, Y. L. (2021). "JuliaStats/MixedModels.jl: v4.5.0," available at (Last viewed July 5, 2022).

5. Boersma, P., and Weenink, D. (2001). "Praat: Doing phonetics by computer [computer program]," Glot Int. 5(9/10), 341–345, available at http://www.praat.org (Last viewed July 5, 2022).

6. Chistovich, L. A., and Lublinskaya, V. V. (1979). "The 'center of gravity' effect in vowel spectra and critical distance between the formants: Psychoacoustical study of the perception of vowel-like stimuli," Hear. Res. 1(3), 185–195.

7. de Leeuw, J. R. (2015). "jsPsych: A JavaScript library for creating behavioral experiments in a web browser," Behav. Res. 47(1), 1–12.

8. Gaudrain, E., Li, S., Ban, V. S., and Patterson, R. (2009). "The role of glottal pulse rate and vocal tract length in the perception of speaker identity," Interspeech 2009 1(5), 152–155.

9. Ge, H., Xu, K., and Ghahramani, Z. (2018). "Turing: A language for flexible probabilistic inference," in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, edited by A. Storkey and F. Perez-Cruz (Playa Blanca, Lanzarote, Canary Islands), pp. 1682–1690.

10. Hillenbrand, J., Getty, L. A., Clark, M. J., and Wheeler, K. (1995). "Acoustic characteristics of American English vowels," J. Acoust. Soc. Am. 97(5), 3099–3111.

11. Holmes, E., Domingo, Y., and Johnsrude, I. S. (2018). "Familiar voices are more intelligible, even if they are not recognized as familiar," Psychol. Sci. 29(10), 1575–1583.

12. Irino, T., and Patterson, R. D. (2002). "Segregating information about the size and shape of the vocal tract using a time-domain auditory model: The stabilised wavelet-Mellin transform," Speech Commun. 36(3-4), 181–203.

13. Johnson, K. (1990a). "Contrast and normalization in vowel perception," J. Phonetics 18(2), 229–254.

14. Johnson, K. (1990b). "The role of perceived speaker identity in F0 normalization of vowels," J. Acoust. Soc. Am. 88(2), 642–654.

15. Johnson, K., and Sjerps, M. J. (2021). "Speaker normalization in speech perception," in The Handbook of Speech Perception, 2nd ed., edited by J. S. Pardo, L. C. Nygaard, R. E. Remez, and D. B. Pisoni (Wiley, New York), pp. 145–176.

16. Johnsrude, I. S., Mackey, A., Hakyemez, H., Alexander, E., Trang, H. P., and Carlyon, R. P. (2013). "Swinging at a cocktail party: Voice familiarity aids speech perception in the presence of a competing voice," Psychol. Sci. 24(10), 1995–2004.

17. Kawahara, H., Morise, M., Takahashi, T., Nisimura, R., Irino, T., and Banno, H. (2008). "Tandem-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation," in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Las Vegas, NV, pp. 3933–3936.

18. Kleinschmidt, D. F., and Jaeger, T. F. (2015). "Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel," Psychol. Rev. 122(2), 148–203.

19. Kreitewolf, J., Gaudrain, E., and von Kriegstein, K. (2014). "A neural mechanism for recognizing speech spoken by different speakers," NeuroImage 91, 375–385.

20. Kreitewolf, J., Mathias, S. R., Trapeau, R., Obleser, J., and Schönwiesner, M. (2018). "Perceptual grouping in the cocktail party: Contributions of voice-feature continuity," J. Acoust. Soc. Am. 144(4), 2178–2188.

21. Kruschke, J. K. (2021). "Bayesian analysis reporting guidelines," Nat. Hum. Behav. 5(10), 1282–1291.

22. Kumar, R., Carroll, C., Hartikainen, A., and Martin, O. (2019). "ArviZ a unified library for exploratory analysis of Bayesian models in PYTHON," J. Open Source Softw. 4(33), 1143.

23. Labov, W. (1972). Sociolinguistic Patterns (University of Pennsylvania Press, Philadelphia, PA).

24. Ladefoged, P., and Broadbent, D. E. (1957). "Information conveyed by vowels," J. Acoust. Soc. Am. 29(1), 98–104.

25. Lavan, N., Knight, S., and McGettigan, C. (2019). "Listeners form average-based representations of individual voice identities," Nat. Commun. 10(1), 2404.

26. Liberman, A. M. (1957). "Some results of research on speech perception," J. Acoust. Soc. Am. 29(1), 117–123.

27. Nearey, T. M., and Assmann, P. F. (2007). "Probabilistic 'sliding template' models for indirect vowel normalization," in Experimental Approaches to Phonology, edited by M. J. Sole, P. S. Beddor, and M. Ohala (Oxford University Press, Oxford), pp. 246–269.

28. Peterson, G. E., and Barney, H. L. (1952). "Control methods used in a study of the vowels," J. Acoust. Soc. Am. 24(2), 175–184.

29. Sjerps, M. J., Fox, N. P., Johnson, K., and Chang, E. F. (2019). "Speaker-normalized sound representations in the human auditory cortex," Nat. Commun. 10(1), 2465.

30. Sjerps, M. J., and Smiljanić, R. (2013). "Compensation for vocal tract characteristics across native and non-native languages," J. Phonetics 41(3-4), 145–155.

31. Sjerps, M. J., Zhang, C., and Peng, G. (2018). "Lexical tone is perceived relative to locally surrounding context, vowel quality to preceding context," J. Exp. Psychol.: Hum. Percept. Perform. 44(6), 914–924.

32. Smith, D. R. R., and Patterson, R. D. (2004). "The existence region for scaled vowels in Pitch-VTL space," in 18th International Conference on Acoustics, Kyoto, Japan, Vol. I, pp. 453–456.

33. Strange, W., Bohn, O.-S., Trent, S. A., and Nishi, K. (2004). "Acoustic and perceptual similarity of North German and American English vowels," J. Acoust. Soc. Am. 115(4), 1791–1807.

34. Turner, R. E., Walters, T. C., Monaghan, J. J. M., and Patterson, R. D. (2009). "A statistical, formant-pattern model for segregating vowel type and vocal-tract length in developmental formant data," J. Acoust. Soc. Am. 125(4), 2374–2386.

35. Vehtari, A., Gelman, A., and Gabry, J. (2017). "Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC," Stat. Comput. 27(5), 1413–1432.

36. Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., and Bürkner, P.-C. (2021). "Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC (with discussion)," Bayesian Anal. 16(2), 667–718.

37. Woods, K. J. P., Siegel, M. H., Traer, J., and McDermott, J. H. (2017). "Headphone screening to facilitate web-based auditory experiments," Atten. Percept. Psychophys. 79(7), 2064–2072.