The present investigation examined the extent to which asymmetries in vowel perception derive from a sensitivity to focalization (formant proximity), stimulus prototypicality, or both. English-speaking adults identified, rated, and discriminated a vowel series that spanned a less-focal/prototypic English /u/ and a more-focal/prototypic French /u/ exemplar. Discrimination pairs included one-step, two-step, and three-step intervals along the series. Asymmetries predicted by both focalization and prototype effects emerged when discrimination step-size was varied. The findings indicate that both generic/universal and language-specific biases shape vowel perception in adults; the latter are challenging to isolate without well-controlled stimuli and appropriately scaled discrimination tasks.
A central issue in the field of speech perception is how listeners map the input acoustic signal onto phonetic categories. Over the years, considerable research has focused on addressing how this mapping is modified by linguistic experience over the course of development, beginning in early infancy (e.g., Kuhl et al., 2008). This emphasis on exploring language-specific biases as opposed to more generic, language-universal speech processing biases stems in large part from research by Kuhl et al. (Kuhl, 1991; Kuhl et al., 1992). Their studies with human infants, human adults, and monkeys revealed that early linguistic experience profoundly alters phonetic perception by decreasing discrimination sensitivity near native-language phonetic category prototypes, and increasing sensitivity near boundaries between categories.
However, there is now a growing body of evidence that infants reared in diverse linguistic communities initially display generic biases or preferences that guide and constrain how infants perceive segmental elements in speech (Polka and Bohn, 2011), and that such biases continue to operate in adult language users independently of “language-specific” prototype categorization processes (Masapollo et al., 2017b). These generic biases are evident in studies showing that infants display directional asymmetries in discrimination tasks. In the domain of vowel perception, numerous studies report that, early in development, infants' discrimination of a vowel change presented in one direction is significantly better compared to when the same change is presented in the reverse direction (for review, see Polka and Bohn, 2011). These effects have been found with a wide range of vowel contrasts and occur independently of specific linguistic experience. By late infancy and adulthood, linguistic experience has altered perception; analogous asymmetries are observed for non-native contrasts but are mitigated for native contrasts which are typically perceived with near-perfect accuracy (e.g., Polka and Bohn, 2011; Tyler et al., 2014).
Over the last decade, Polka et al. formulated and experimentally tested a developmental model of vowel perception, termed the Natural Referent Vowel (NRV) framework, which has been applied to explicate the processes underlying directional asymmetries (Polka and Bohn, 2011; Masapollo et al., 2017a; Masapollo et al., 2017b; Masapollo et al., 2018). This model, which incorporates ideas from both quantal theory (Stevens, 1989) and Dispersion-Focalization Theory (Schwartz et al., 1997), proposes that asymmetries expose a perceptual bias favoring extreme vocalic constrictions which give rise to acoustic signals with well-defined spectral prominences due to the convergence of adjacent formant frequencies, referred to as “focal points.” In NRV, vowels with a high degree of formant convergence are argued to have salient and stable phonetic structures, making them easier for listeners to detect, encode, and retain in verbal working memory. Thus, differences in perceptual salience bias perception and give rise to the directional asymmetries observed in discrimination tasks. Consistent with this view, recent results show that asymmetries predicted by focalization are observed when adults discriminate auditory as well as visual-only variants of the same vowels (Masapollo et al., 2017a; Masapollo et al., 2017b; Masapollo et al., 2018). These findings established the focal vowel bias in adults, and demonstrated that bias reflects a sensitivity to how articulatory movements shape the speech signal across sensory modalities.
The focalization bias is argued to play an important role in developing and supporting vowel perception across the lifespan. Infants show asymmetries that point to a focalization bias in the first few months of life for both native and non-native vowel contrasts. In the first year, infants begin tuning to native vowel contrasts. This will enhance or reduce the initial focalization bias, depending on their native language repertoire. This generic bias provides a scaffold to support the acquisition of a more detailed vowel system. Accordingly, in the NRV account, both universal and language-specific biases operate to shape vowel perception in mature, adult language users.
An alternative account of asymmetries derives from the Native Language Magnet (NLM) model (Kuhl et al., 2008). This model, which combines principles from prototype theory and statistical learning theory, posits that listeners develop acoustic prototypes for native phonetic categories (i.e., adult-defined “best” instances of a category). Furthermore, these prototypes are argued to have “magnet-like” effects, in which the nearby perceptual space is “shrunk,” making it more difficult to discriminate variants around prototypes than around non-prototypes of the same category. NLM applies to both consonants and vowels, although many of the studies that led to its development investigated the shrinking and stretching of the underlying perceptual space for vowels, and how this “warping” of perceptual space relates to category goodness (Kuhl, 1991; Kuhl et al., 1992; Iverson and Kuhl, 1995; Lotto et al., 1998). Moreover, NLM assumes that speech perception involves general auditory mechanisms that process acoustic rather than specifically phonetic information. Nevertheless, from this theoretical perspective, directional asymmetries reflect an experience-dependent bias favoring native prototypes; asymmetries emerge because listeners display reduced sensitivity when discriminating a change from a more-prototypic to a less-prototypic vowel compared to the reverse. Compatible with this view, Kuhl (1991) reported a directional asymmetry, such that English-learning infants performed better at discriminating a change going in the direction from a non-prototype of the /i/ category to a prototype, compared to the same change presented in the reverse direction. Critically, however, in this case, the prototype was more focal (between F2 and F3) compared to the non-prototype. Thus, this finding could be attributed to prototypicality effects and/or focalization effects.
Masapollo et al. (2017b) attempted to assess NRV and NLM accounts of vowel perception asymmetries in a cross-language study comparing English- and French-speaking adults' vowel perception. They synthesized a set of vowels that were identified as /u/ by both language groups, but varied in stimulus goodness such that the best French /u/ exemplars were more focal compared to the best English /u/ exemplars. In subsequent AX discrimination tests, both English and French adults performed better at discriminating changes from a less to a more focal /u/ compared to the reverse, despite variation in prototypicality, which under NLM predicts an asymmetry in the reverse direction for English adults. These findings established the existence of a universal bias favoring vowels with greater formant convergence that operates independently of biases related to language-specific prototype categorization.
The failure to isolate a language-specific bias in Masapollo et al. (2017b) may have been due to stimulus limitations in that the psychoacoustic distance between stimulus pairs (which ranged between 56 and 128 mels) may have been too large to measure a prototype effect. In prior work, Kuhl (1991) found a “perceptual magnet effect” when adults discriminated small acoustic differences (30 mels) close to the prototype in psychoacoustic space, but not when discriminating larger acoustic differences (60, 90, or 120 mels) regardless of whether they were close to the prototype. One possibility is that, when the acoustic differences between stimulus pairs are more subtle, listeners correct for memory uncertainty by biasing all responses toward the location of the best exemplar of the category. This suggests that a more granular stimulus set that includes vowels varying in small psychoacoustic steps around the native-language prototype is required to isolate perceptual magnet effects as an asymmetry.
We tested this idea in the current study. We predicted that asymmetries in adult vowel perception would pattern as predicted by NRV when discriminating stimulus pairs with large acoustic intervals (either closer to or farther from the prototype), and as predicted by NLM when discriminating stimulus pairs with small acoustic intervals (closer to the prototype). Thus, in this view, the prevailing effect will be modulated by the degree of acoustic distance between the contrasting vowels such that prototype effects will be evident for stimuli close to the best exemplar of the category, and focalization effects will arise when formant proximity differences are larger, but still fall within the same vowel category.
2. Materials and methods
Nineteen undergraduate students [4 males, mean age 20 yr (SD = 1.5)] from McGill University participated in this experiment for pay. All were native, monolingual English speakers who reported no history of hearing, speech, language, or neurological deficits. Three additional participants were tested but excluded from the final sample due to failure to meet inclusion criteria (2) or to follow task directions (1). Subjects were screened for eligibility criteria before coming to the lab where they then completed the Language Experience and Proficiency Questionnaire (Marian et al., 2007) to verify language background. Participants had to meet the following inclusion criteria: (1) no prior linguistic or phonetics training, (2) raised in a monolingual home and educated in a monolingual school in their respective language, (3) no experience learning a second language before 10 yr of age, and (4) no experience conversing in a second language on a regular basis.
Masapollo et al. (2017b) initially synthesized a broad array of vowels that spanned the upper region of vowel space and ranged in F1 (from 275 to 330 Hz) and F2 (from 476 to 2303 Hz) in equal psychophysical steps on the bark scale (Zwicker and Terhardt, 1980). English and French listeners provided identification and goodness ratings that were used to select six vowel stimuli that were identified exclusively as [u] by both language groups, included “best exemplars” for each language group, and also varied in formant proximity (between F1 and F2). This set included three less-focal/English-prototypic /u/ tokens and three more-focal/French-prototypic /u/ tokens. Two /u/ vowels from this narrow stimulus set were used as the starting point to develop the stimuli for the current experiment: stimulus u1 (less-focal/prototypic English /u/) and u6 (more-focal/prototypic French /u/).
Emulating the design of Kuhl's (1991) stimulus set, we used these select English /u/ and French /u/ prototypes to create a new set of variants that surrounded each of the selected prototypes in F1/F2 acoustic space. As shown in Fig. 1, there were 40 variants forming five orbitals around each prototype (eight located on each orbital). The acoustic intervals between the stimuli in the five orbitals and the two prototypes were equated on the mel scale. The first orbital was 18.4 mels from the corresponding prototype, and the second through fifth orbitals were located at 36.8, 55.2, 73.6, and 92 mels, respectively. This resulted in equivalent perceptual gradients around each of the two prototypes with smaller step-intervals than those used by Kuhl (1991; i.e., 30 mels). Note that the stimuli along one vector were common to both the English /u/ prototype and French /u/ prototype sets. Twelve tokens along this vector (shaded in Fig. 1) were selected for use in the current study. On this vector, stimulus 6 matches the less-focal/English /u/ token (u1) and stimulus 11 matches the more-focal/French /u/ token (u6) in Masapollo et al. (2017b).
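The orbital layout described above can be sketched as follows. This is a minimal illustration, not the synthesis procedure used in the study: it works directly in mel coordinates, assumes the eight variants on each orbital are equally spaced in angle (the text does not state the angular spacing), and uses a hypothetical prototype location (`PROTO`) chosen only so the example runs.

```python
import math

STEP_MELS = 18.4     # radial spacing between orbitals (from the text)
N_ORBITALS = 5       # orbitals per prototype (from the text)
N_PER_ORBITAL = 8    # variants per orbital (from the text)


def orbital_variants(proto_f1_mel, proto_f2_mel):
    """Return (orbital, f1_mel, f2_mel) tuples for the variants around one prototype."""
    variants = []
    for orbital in range(1, N_ORBITALS + 1):
        radius = orbital * STEP_MELS
        for k in range(N_PER_ORBITAL):
            # Assumption: variants are spread at equal 45-degree angles.
            angle = 2 * math.pi * k / N_PER_ORBITAL
            variants.append((orbital,
                             proto_f1_mel + radius * math.cos(angle),
                             proto_f2_mel + radius * math.sin(angle)))
    return variants


# Hypothetical prototype position in mels (placeholder, not a value from the paper).
PROTO = (400.0, 900.0)
variants = orbital_variants(*PROTO)

# Distinct distances from the prototype, rounded to one decimal place.
radii = sorted({round(math.hypot(f1 - PROTO[0], f2 - PROTO[1]), 1)
                for _, f1, f2 in variants})
print(len(variants), radii)
```

Under these assumptions the sketch yields 40 variants per prototype at the five stated distances (18.4 to 92 mels).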
The stimuli were five-formant vowels synthesized using the Variable Linear Articulatory Model (VLAM) (see Masapollo et al., 2017b, for details). This articulatory-based synthesizer is constrained by current knowledge of the physiology of the articulators; the anatomical measurements of the vocal tracts generated by this model have previously been shown to be consistent with dynamic articulatory parameters obtained using magnetic resonance imaging, and the resulting acoustic signals are consistent with those reported for vowels produced by adult speakers. Thus, VLAM generates subjectively natural-sounding source and filter characteristics. The present vowel stimuli were created to emulate the articulatory and acoustic properties of an adult male vocal mechanism (supralaryngeal vocal tract geometry and vocal fold vibration). The variants were created by manipulating the values of F1 and F2; the values of F0, F3, F4, and F5 remained constant for all vowels at 120, 2522, 3410, and 4159 Hz, respectively. Each stimulus was 400 ms in duration and had the same intonation and intensity contour.
2.3 Procedure and design
Participants completed a phonetic identification and goodness rating task followed immediately by an AX discrimination task. However, rather than presenting one vowel stimulus per trial for identification and goodness rating as is typically done (e.g., Kuhl, 1991; Iverson and Kuhl, 1995; Tyler et al., 2014; Masapollo et al., 2017b), vowels were presented in pairs on each trial (as in Lotto et al., 1998). This was done to control for any potential effects of stimulus context across the identification and AX discrimination tasks. There were two blocks of trials. In one block, the less-focal/English /u/ (stimulus 6) was paired with itself or with one of the adjacent vowels located one, two, or three step-intervals away on either side of the stimulus vector (stimuli 3, 4, 5, 7, 8, 9). In the other block, the more-focal/French /u/ (stimulus 11) was paired with itself or with one of the adjacent vowels located one, two, or three step-intervals away on either side of the stimulus vector (stimuli 8, 9, 10, 12, 13, 14). In each block, there were 12 same trials containing the prototype (6–6 or 11–11), 12 same trials for each of the six neighboring stimuli (3–3, 4–4, 5–5, 7–7, etc.) and 12 different trials for each of the adjacent pairs (6–3, 6–4, 6–5 or 11–8, 11–9, 11–10). On different trials, stimulus order within pairs was counter-balanced; for half of the trials, the English prototypic /u/ (stimulus 6) or the French prototypic /u/ (stimulus 11) was presented first (e.g., 6–3 or 11–8), and for the other half the prototypic stimulus (English or French) was presented second (e.g., 3–6 or 8–11).
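The block structure just described can be sketched as a trial list. This is an illustrative reconstruction under one reading of the design (12 same trials for the prototype, 12 same trials for each of the six neighbors, and 12 different trials per prototype-neighbor pair, split 6/6 by order); the function name `build_block`, the tuple layout, and the shuffling step are assumptions, since the text does not describe trial randomization.

```python
import random


def build_block(prototype, reps=12, seed=0):
    """Build one block of (first, second, trial_type) trials around a prototype."""
    neighbors = [prototype + d for d in (-3, -2, -1, 1, 2, 3)]
    half = reps // 2
    trials = [(prototype, prototype, "same")] * reps
    for n in neighbors:
        trials += [(n, n, "same")] * reps
        # Counterbalanced order: prototype first on half, second on half.
        trials += [(prototype, n, "different")] * half
        trials += [(n, prototype, "different")] * half
    # Assumption: trials are presented in random order within a block.
    random.Random(seed).shuffle(trials)
    return trials


english_block = build_block(6)    # neighbors 3, 4, 5, 7, 8, 9
french_block = build_block(11)    # neighbors 8, 9, 10, 12, 13, 14
print(len(english_block), len(french_block))
```

On this reading each block contains 156 trials (84 same, 72 different); that total is derived from the sketch and is not a figure stated in the text.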
For the identification and rating task, on each trial, participants heard two stimuli separated by an inter-stimulus-interval (ISI) of 1500 ms (the same ISI used in Masapollo et al., 2017b). Half of the subjects were instructed to identify and rate only the first vowel on each trial, and half were told to do this only for the second vowel on each trial. To identify the vowel, they were instructed to indicate whether they perceived the vowel “oo” as in the word “boo” by pressing an appropriately labeled button on a response pad. If a vowel was not perceived as “oo,” they were instructed to press the “N” key. Immediately after making this choice they evaluated the “goodness” of the vowel as an example of the vowel in “boo” by pressing one of five buttons on a response box labeled from “very poor” (1) to “very good” (5). If participants chose “N” during identification, the rating was automatically scored as 0. Prior to the two test blocks, participants completed a short practice session to ensure that they understood the task instructions.
Following the identification and goodness rating task, the same participants completed an AX-discrimination task. Note that separate groups of participants completed the two perceptual tasks in Masapollo et al. (2017b). Prior NLM investigations have reported that the precise location of phonetic category prototypes differs across subjects (e.g., Lively and Pisoni, 1997). On the basis of such findings, one could argue that it was a methodological flaw that Masapollo et al. (2017b) administered the goodness and discrimination tasks to different groups of subjects. Therefore, the current study was designed to partially replicate the asymmetries reported by Masapollo et al. (2017b) when the same groups of subjects performed both tasks. The stimulus presentation protocol in the AX task was the same as in the identification task, except that participants indicated whether the stimulus pairings were the same or different by pressing one of two labeled keys on a keyboard. Participants were instructed to respond to any differences they heard between the stimuli. No feedback was provided in either the identification or discrimination task.
3.1 Identification and goodness rating task
All of the stimuli were identified as /u/ on more than 99% of all the trials,¹ demonstrating that the participants were highly consistent in their categorization of these tokens. The overall mean category goodness ratings for each vowel along the stimulus series are plotted in Fig. 2. These ratings are collapsed across stimulus pair type (same vs different) and stimulus presentation order as results showed consistent prototype biases across all levels of analysis. For example, the ratings of stimulus 5 in the English prototype block are averaged across 12 identifications of stimulus 5 paired with itself, 6 identifications of stimulus 5 followed by the English prototype (5–6), and 6 identifications of stimulus 5 preceded by the English prototype (6–5). Trials where the stimulus was not identified as a member of the /u/ category received a score of 0. Each circle in the figure represents one of the 12 vowel tokens along the series. The circle size is scaled to correspond to the /u/ goodness ratings for each item collapsed across all subjects; the median goodness rating for the group is shown in the center of each circle and below it is the number of subjects in the group for whom that token was rated as the best exemplar of the /u/ category. As in prior studies (Lively and Pisoni, 1997), we found that the precise location of the prototype differed across individual subjects.
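The scoring rule described above (non-/u/ identifications scored as 0, ratings averaged per stimulus regardless of pair type or order) can be sketched as follows. The response tuples are made-up values for illustration only, and the function names are hypothetical; this is not the authors' analysis code.

```python
from collections import defaultdict
from statistics import mean


def score(identified_as_u, rating):
    """A trial counts its rating only if the vowel was identified as /u/."""
    return rating if identified_as_u else 0


def mean_goodness(responses):
    """Average scored ratings per stimulus, collapsing over pair type and order."""
    by_stimulus = defaultdict(list)
    for stimulus, identified_as_u, rating in responses:
        by_stimulus[stimulus].append(score(identified_as_u, rating))
    return {stim: mean(vals) for stim, vals in by_stimulus.items()}


# Made-up responses for one listener: (stimulus, identified as /u/?, rating).
responses = [(5, True, 4), (5, True, 3), (5, False, 2),
             (6, True, 5), (6, True, 4)]
means = mean_goodness(responses)
best_exemplar = max(means, key=means.get)  # stimulus with the highest mean rating
print(means, best_exemplar)
```

Applying `max` to each listener's means in this way is one simple way to obtain a per-subject best exemplar of the kind tallied in Fig. 2.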
Overall, the goodness results roughly correspond to those reported in Masapollo et al. (2017b). However, in this data set our English adults assigned the highest goodness ratings to stimulus 8 instead of stimulus 6 (the English /u/ prototype in Masapollo et al., 2017b). Although it is unclear how to explain this finding, it aligns with previously reported context effects on vowel identification and goodness ratings (Lotto et al., 1998). Critically, although stimulus 8 (the best /u/ exemplar identified in the present experiment) is more focal than the expected English prototype (stimulus 6), it is still less focal and received lower category goodness ratings than stimulus 11, similar to the stimulus array used in Masapollo et al. (2017b).
3.2 Discrimination task
We employed a signal detection theory analysis to assess discrimination; the dependent measure was A-prime (Zhang and Mueller, 2005), which is an unbiased index of discrimination performance that ranges from 0.50 (chance) to 1.0 (perfect discrimination). To determine whether focalization or prototype effects were present, separate A' scores were computed for four stimulus regions defined around the two prototypes [A (left of stimulus 6) vs B (right of stimulus 6) vs C (left of stimulus 11) vs D (right of stimulus 11)] for each stimulus interval (1-step vs 2-step vs 3-step) and for each order of stimulus presentation (less to more focal vs more to less focal). The four regions and step sizes are shown in Fig. 2. The mean A' scores are presented in Fig. 3. Separate analyses of variance (ANOVAs) were performed on these scores for each stimulus interval (1-step, 2-step, 3-step) with acoustic region and stimulus presentation order as within-subject factors. Greenhouse-Geisser corrections were applied when appropriate, and partial eta-squared effect sizes were calculated for all main effects and interactions. Post hoc pairwise comparisons were reported as significant at the 0.05 level.
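For readers unfamiliar with this measure, the following sketch computes the classic non-parametric A' from a hit rate (proportion of "different" responses on different trials) and a false-alarm rate (proportion of "different" responses on same trials). It is offered as an assumption about the general form of the index; the exact computation used in the study, including any adjustment following Zhang and Mueller (2005) and the choice of same trials for each region, is not specified in the text.

```python
def a_prime(hit_rate, fa_rate):
    """Classic non-parametric A' sensitivity index (0.5 = chance, 1.0 = perfect)."""
    h, f = hit_rate, fa_rate
    if h == f:
        # Equal hit and false-alarm rates indicate no sensitivity.
        return 0.5
    if h > f:
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    # Below-chance case (more false alarms than hits).
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))


# Illustrative rates only, not data from the study.
print(a_prime(0.75, 0.25))  # moderate sensitivity
print(a_prime(1.0, 0.0))    # perfect discrimination
print(a_prime(0.5, 0.5))    # chance
```

With a hit rate of 0.75 and a false-alarm rate of 0.25, this formula gives A' of about 0.83.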
If differences in focalization alone drive asymmetries, then participants should display superior performance when discriminating changes in the less to more focal direction across all stimulus regions (i.e., A, B, C, and D). In this case, we expect a main effect of stimulus order (better in less-focal to more-focal direction) and no interaction with region due to reversal of this direction effect. If, however, differences in stimulus prototypicality also drive asymmetries, then the direction effect should reverse in the region of the English prototype. This reversal is expected because, for English adults, the direction of stimulus change that is easier based on prototypicality and focality go in opposite directions [e.g., 11–6 corresponds to a less to more prototypic vowel change (easier) but corresponds to a more to less focal vowel change (harder)]. This reversal of the focality effect is expected for small acoustic differences close to the English prototype (i.e., for smaller step size intervals). With the present stimuli, such a reversal in direction effect (regarding focality) is expected for stimulus pairs close to stimulus 8, which straddles regions B and C. Note that although the task was not ideally designed to target stimulus 8 as the prototype (as we assumed that stimulus 6 would be the English /u/ prototype), nevertheless, based on the observed prototype, we can predict that regions B and C provided favorable environments for a prototype effect to manifest. To summarize, if both focality effects and prototype effects emerge, we expect to observe directional asymmetries in discrimination of vowel pairs drawn from the 4 regions of the stimulus array, such that both directions of asymmetry are observed. More specifically, the asymmetry direction should reverse (from favoring the less to more focal direction to favoring the more to less focal direction) in the middle region (near the prototype) when the acoustic distance between vowels is small.
The ANOVA performed on the 1-step intervals (shown in Fig. 3, left panel) indicated that there was a significant main effect of region [F(3,54) = 9.179, p < 0.001, η2p = 0.338] but not of order of stimulus presentation [F(1,54) = 2.111, p = 0.163, η2p = 0.105]. As expected, the order of stimulus presentation × region interaction was significant [F(3,54) = 5.491, p = 0.004, η2p = 0.234]. Post hoc t-tests on the pairwise comparisons indicated that the effect of stimulus presentation order was significant in regions B [t(18) = 2.232, p = 0.039, d = 0.62], C [t(18) = −2.754, p = 0.013, d = 0.88], and D [t(18) = −2.361, p = 0.030, d = 0.22], but not A [t(18) = 0.879, p = 0.391]. Importantly, in region B performance is better in the less to more focal direction (and the same non-significant trend is noted for region A). However, the direction effect reverses in regions C and D, counter to predictions based on focalization. These results support NLM across the acoustic regions when subjects discriminate the small, 1-step (18.4 mels) stimulus pairs.
The ANOVA performed on the 2-step interval (shown in Fig. 3, center panel) indicated that there were significant main effects of region [F(3,54) = 11.990, p < 0.001, η2p = 0.400] and order of stimulus presentation [F(1,54) = 7.993, p = 0.011, η2p = 0.308]. There was also a significant interaction effect [F(3,54) = 4.550, p = 0.015, η2p = 0.202]. Post hoc t-tests on the pairwise comparisons indicated that the effect of stimulus presentation order was significant in region B [t(18) = 3.596, p = 0.002, d = 0.88], approached significance in region D [t(18) = 1.913, p = 0.072], and was not significant in either region A [t(18) = 1.569, p = 0.134] or region C [t(18) = −1.286, p = 0.215]. Here, asymmetries predicted by focalization are observed for regions B and D; a reversal is observed in region C but it fails to reach statistical significance. Thus, the 2-step data show effects of focalization, while effects of prototypicality are weak to absent.
Finally, the ANOVA performed on the 3-step interval (shown in Fig. 3, right panel) indicated that there were significant main effects of region [F(3,54) = 11.154, p < 0.001, η2p = 0.383] and stimulus presentation order [F(1,54)= 9.667, p = 0.006, η2p = 0.349]. There was also a significant interaction effect [F(3,54) = 5.911, p = 0.008, η2p = 0.247]. Post hoc t-tests on the pairwise comparisons indicated that the effect of stimulus presentation order was significant in regions B [t(18) = 2.974, p = 0.008, d = 0.65] and D [t(18) = 3.247, p = 0.004, d = 0.81], but not in regions A [t(18) = 0.537, p = 0.598] and C [t(18) = −0.415, p = 0.683]. Overall, discrimination performance is higher for these larger acoustic differences, and reliable asymmetries predicted by focalization are observed only for regions B and D. There is no evidence of a prototype effect.
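As a consistency check on the effect sizes reported above, partial eta-squared can be recovered from an F ratio and its degrees of freedom as η2p = (F × df_effect) / (F × df_effect + df_error). The sketch below applies this standard identity to the region main effects and the order × region interactions, taking 3 and 54 degrees of freedom for these terms in all three analyses (an assumption for the 1-step region effect, chosen to match the 2- and 3-step analyses). The dictionary simply restates values from the text.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta-squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (F value, reported partial eta-squared), copied from the three ANOVAs above.
reported = {
    "1-step region": (9.179, 0.338),
    "1-step order x region": (5.491, 0.234),
    "2-step region": (11.990, 0.400),
    "2-step order x region": (4.550, 0.202),
    "3-step region": (11.154, 0.383),
    "3-step order x region": (5.911, 0.247),
}

recovered = {label: partial_eta_squared(f, 3, 54)
             for label, (f, _) in reported.items()}

for label, (f, eta) in reported.items():
    print(f"{label}: reported {eta:.3f}, recovered {recovered[label]:.3f}")
```

Under these degrees of freedom, each recovered value agrees with the reported one to within rounding.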
As can be observed from the A′ scores plotted in Fig. 3, discrimination of the stimulus pairs in region B (which contained the English /u/ prototype) always showed a focalization effect. This is perhaps unsurprising given that both NRV and NLM predict an asymmetry in the same direction since the present prototype (stimulus 8) is also more focal than the reference stimulus (stimulus 6). Critically, however, this asymmetry reversed in region C (which also contains the English /u/ prototype) toward the prototypic location, demonstrating a prototype effect that operates independently of differences in focalization. The asymmetry then shifted back toward the more focal location in region D (which does not contain the English /u/ prototype) for the two- and the three-step intervals; task performance hovered around chance in the one-step condition. Thus, it appears that focalization and stimulus prototypicality both influence asymmetries in vowel discrimination because the effects are differentially influenced by the size of the acoustic interval and proximity to the location of the best exemplar.
In the present research, we investigated whether directional asymmetries in adult vowel perception reveal language-specific biases favoring acoustic prototypes for native vowel categories (explained by NLM), as well as language-general biases favoring more “focal” vowels (explained by NRV). Recent research by Masapollo et al. (Masapollo et al., 2017a; Masapollo et al., 2017b; Masapollo et al., 2018) has shown evidence for focalization effects alone. The present experiment extends this work by providing evidence that stimulus prototypicality also plays a role in shaping asymmetries. These findings have both methodological and theoretical implications.
Asymmetries favoring the best exemplar of the native /u/ category emerged most clearly when subjects were discriminating small stimulus differences (1-step/18.4 mels) close to the region of perceptual space centered on the prototype (region C). When stimulus differences were larger (2-step/36.8 mels or 3-step/55.2 mels), discrimination asymmetries favored more focal exemplars of /u/. Thus, stimulus prototypicality and formant proximity appear to differentially influence asymmetries depending on the acoustic distance of AX pairs and their location relative to the prototype stimulus.
The presence of a prototype effect for the 1-step (18.4 mels) but not the 2- (36.8 mels) or 3-step (55.2 mels) intervals suggests that focalization effects are more apparent and easier to isolate compared to prototype effects, whereas prototype effects can only be detected in very restricted acoustic regions close to the best exemplar of a native vowel category. Thus, in order to reliably measure a prototype effect, speech scientists need to identify the precise location in psychophysical space of the best exemplar for a particular stimulus array and testing context, and it is optimal to measure these effects within the same subjects as we have done in the present study. Although we do not have direct evidence to address a potential effect of listening during the identification task on the discrimination results, other studies indicate that the location of a vowel prototype does not vary throughout the duration of an identification task (Iverson and Kuhl, 1995), suggesting that additional listening experience does not affect perceptual performance in this type of task.
Collectively, the present results suggest that both NRV (Polka and Bohn, 2011) and NLM (Kuhl, 1991; Kuhl et al., 2008) are needed to account for directional asymmetries in vowel perception. Our findings contribute further evidence for the NRV account, stipulating that formant convergence gives rise to salience differentials across the vowel space, which can be measured using either natural or synthetic speech tokens. The current results also provide novel evidence in support of the experience-dependent predictions of NLM (Kuhl, 1991; Kuhl et al., 2008) and highlight the challenges of isolating these effects. Clearly, both theories outline factors that jointly influence vowel perception which can be clarified through ongoing investigation and integration.
This research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada (DG 105397) to L.P. We thank members of the McGill Speech Perception Lab for assistance with participant recruitment and data collection. Finally, this work benefited from helpful discussions with colleagues at the 178th Meeting of the Acoustical Society of America (San Diego, CA), including Catherine Best, T. Christina Zhao, Ratree Wayland, Shoju Tsuji, and Patricia K. Kuhl.
¹ Participants identified the tokens as members of a category other than /u/ on less than 0.06% of all test trials.