Building on previous observations of variability in speech research, we examine variability in speech perception study materials associated with the specific talker and contrast under examination. English-speaking listeners completed a web-based auditory AXB task involving Hindi dental-retroflex stop contrasts produced by four talkers. Main effects of talker and contrast, as well as the interaction of the two, were observed. Further, there was a great deal of individual listener variation. These findings complicate our ability to characterize the difficulty that Hindi dental-retroflex contrasts pose for English speakers, and lead to critical questions concerning the generalizability of speech perception study findings.
1. Introduction
Linguistics research is often aimed at identifying patterns that are shared within speech communities, treating individual differences as noise that obscures normative patterns (Yu and Zellou, 2019). This is evident in the use of statistical models that label individual differences as the “error term” (Yu and Zellou, 2019); in the typically limited descriptions provided about participants' linguistic backgrounds in favor of vague and essentialist categories like “native” or “nonnative” speaker (Cheng et al., 2021); in the application of experimental control measures like the exclusion of participants based on factors such as disability, handedness, or multilingualism; and in data cleaning methods that involve removal of data associated with “outliers” [see, e.g., Barrios and Hayes-Harb (2021), Darcy (2013), and Harnsberger (2000)].
However, variability is an inherent property of speech, and inter-talker differences in speech production are ubiquitous. Talkers differ in their productions on myriad dimensions, including, e.g., speech rate (Bradlow et al., 1996), vowel formant frequencies (Hillenbrand et al., 1995), voice onset time (Allen et al., 2003), and frication centroid and skewness (Newman et al., 2001). Such differences are observed among first language (L1) speakers [e.g., Allen et al. (2003)] as well as language learners (Munro, 2023), and in the production of “clear” (Ferguson, 2004), conversational (Bradlow et al., 1996), and infant-directed speech (Cristià, 2010). Talkers even differ in the amount of within-talker variability they produce (Newman et al., 2001; Smith et al., 2019).
Individual talker differences have well-documented consequences for listeners, as observed across several lines of inquiry. For example, listeners can use systematic differences between talkers' productions for the purpose of talker identification (Allen and Miller, 2004). Listeners also exhibit more accurate word identification for familiar than unfamiliar talkers (Nygaard and Pisoni, 1998), as well as when words are repeated by the same talker as opposed to different talkers (Clapp and Sumner, 2024). Moreover, individual talker differences also influence language learning. For example, language learners can track individual talker acoustics, exhibiting a preference for the speech patterns of specific talkers depending on the social characteristics attributed to them (Hayes-Harb et al., 2022). Studies of the differential effects of single- and multi-talker exposure on adults' perception of novel phonological contrasts provide additional evidence of the consequences of talker variability for language learning. In these studies, listeners hear either a single or multiple talkers during an exposure phase and are subsequently tested on their sensitivity to a novel contrast. Participants who are exposed to multiple talkers typically exhibit greater sensitivity to the targeted novel contrast—observed as an ability to generalize to new talkers and/or phonological environments—than those exposed to only one talker (Bradlow et al., 1997; Logan et al., 1991). The assumption behind these studies is that talker variability provides learners with the data they need to distinguish the aspects of the speech signal that systematically cue the new contrast from those that are “irrelevant,” such as individual talker differences (Logan et al., 1991). It is worth noting that these studies emphasize the linguistic relevance of the same inter-talker variability that is so often discarded as noise in other studies.
Similarly, studies of L1 listeners (Fuhrmeister and Myers, 2021; Ou and Law, 2017) and language learners (Kong, 2019; Mora and Darcy, 2023) show that listeners also differ from one another in their speech perception performance. Listeners vary in their use of particular acoustic dimensions (Schertz et al., 2015; Shport, 2015), as well as in their pre-training sensitivity to novel contrasts (Perrachione et al., 2011) and learning trajectories (Kim et al., 2018; Nagle, 2021). Some studies investigate listener factors that may underlie this variability, associating it with cognitive (Mora and Darcy, 2023; Ou and Law, 2017; Perrachione et al., 2011) and neural (Golestani et al., 2002) factors, musical ability (Jansen et al., 2023), exposure to an accent (Clarke and Garrett, 2004), or attitudes about the talker (Ingvalson et al., 2017). For example, Perrachione et al. (2011) found that performance on a sound blending task—a measure of phonological awareness—was a significant predictor of English speakers' ability to learn Mandarin lexical tone contrasts. Importantly, however, listener variability is observed even in studies where participants are characterized as forming relatively homogeneous groups (Díaz et al., 2012; Nagle et al., 2023).
The literature provides scattered reports of listener variability in cross-language speech perception [e.g., Cebrian et al. (2021) and Nagle et al. (2023)], as well as a smaller number of reports of variability among the talkers producing the study materials (Harnsberger, 2000). Given that individual variation is a property of both talkers and listeners, we might expect talker and listener variability to both be present in any given perception experiment (Zaar and Dau, 2015).
A less-studied aspect of speech variability concerns how the production and perception of a phonological contrast vary among its different instantiations, and it is not uncommon for researchers to select a single contrast to represent the group. For example, cross-language perception studies have often involved only the voiced unaspirated /d̪/-/ɖ/ contrast to represent Hindi's dental-retroflex contrasts (Fuhrmeister and Myers, 2020; Golestani and Zatorre, 2004; Werker and Tees, 1984), though Hindi's dental-retroflex contrasts also include voiceless unaspirated /t̪/-/ʈ/, voiceless aspirated /t̪ʰ/-/ʈʰ/, and voiced aspirated /d̪ʱ/-/ɖʱ/ instantiations. There is evidence that English speakers experience differential difficulty among these four contrasts (Hayes-Harb and Barrios, 2021; Pruitt et al., 2006), suggesting that they may in principle behave differently from one another.
To investigate the combined effects of talker, contrast, and listener variability, we investigated the cross-language perception of all four Hindi dental-retroflex contrasts by American English speakers using materials produced by four Hindi speakers. We ask the following research questions:
- Does cross-language discrimination accuracy of Hindi dental-retroflex stop contrasts by English speakers vary by the talker who produced the materials?
- Does discrimination accuracy differ by contrast?
- Does discrimination accuracy exhibit an interaction of talker and contrast?
- Does discrimination accuracy vary by listener? Is listener variability explained by a measure of phonological awareness?
2. Methods
Study materials, including stimuli and experiment presentation code, are available at http://doi.org/10.17605/OSF.IO/UT6ZN.
2.1 Participants
English speakers (n = 326) were recruited from the Linguistics study pool at the University of Utah and received course credit for participating. Of these, 146 met one or more exclusionary criteria: younger than 18 years old (4), not identifying English as either a childhood or “native” language (34), or indicating knowledge of a language with aspiration contrasts and/or phonemic or allophonic retroflex consonants (Ancient Greek, Bosnian, Cherokee, Danish, Fijian, German, Gujarati, Hindi, Hmong, Japanese, Korean, Laotian, Malagasy, Malayalam, Mandarin/Chinese, Navajo, Norwegian, Pashto, Polish, Portuguese, Quechua/Kichwa, Russian, Sanskrit, Scots Gaelic, Sichuanese, Sindhi, Swedish, Tagalog, Thai, Urdu, Vietnamese; 137). Five additional participants were excluded for failing to respond to any AXB trials. Ultimately, data from 169 participants who met all inclusionary criteria are presented in this manuscript (age: M = 21.87 years, SD = 6.30; gender: female = 100; male = 63; neither = 5; prefer not to answer = 1).
2.2 Materials
Auditory stimuli consisted of four tokens of each of the eight Hindi dental and retroflex stop phonemes /t̪ ʈ t̪ʰ ʈʰ d̪ ɖ d̪ʱ ɖʱ/ produced by four Hindi speakers who were recruited from the University of Utah community and recorded in a sound-attenuated booth. Two of these talkers identified as male (T1, T2) and two as female (T3, T4; see Table 1 for talker details). Tokens were elicited by asking the talkers to name the Devanagari letters corresponding to the eight phonemes six times each in random order, resulting in 48 C[ə] tokens from each talker. Using Praat scripts, the tokens were extracted from the long sound file (Lennes, 2003) and scaled for peak intensity (Yoon, 2024). The second through fifth of these tokens were used in the experiment (the first and/or sixth tokens replaced targeted tokens that were unusable due to extraneous noises, disfluencies, etc.).
Table 1. Talker details.

Talker | Age | Gender | “Native” languages and those spoken by parents/childhood caregivers | Additional languages
---|---|---|---|---
T1 | 24 | male | Telugu, Hindi, English | Sanskrit, Japanese
T2 | 20 | male | Hindi, Haryanvi (dialect), English | French, Punjabi
T3 | 24 | female | Hindi, English | none
T4 | 22 | female | Hindi, Marathi, English | none
2.3 Procedure
The online study lasted approximately 50 min and involved informed consent, a sound level adjustment task, three experimental tasks, and a background questionnaire. Participants first read a consent form (Qualtrics.com). Upon clicking a button labeled AGREE, they were redirected to the study tasks, which were built in PsychoPy (Peirce , 2019) and hosted by Pavlovia.org. Participants were instructed to adjust their volume to a comfortable listening level as audio recordings of high-frequency English words produced by a speaker of American English were presented (1 s ISI).
Next was the AXB task, which consisted of 128 trials (64 where X = A and 64 where X = B), with 32 trials for each of the four contrasts. Each of the 16 unique tokens (4 talkers × 4 tokens/talker) of each phoneme appeared once as X. Tokens that appeared as A and B were chosen randomly on each trial such that they were produced by the same talker but never the same (acoustically identical) token as X. A, X, and B were presented with an ISI of 1 s to prevent participants from relying on low-level differences between stimuli to perform the task. Trials were blocked by talker, with a participant-controlled break between blocks. Block order and trial order within blocks were randomized across participants. Participants pressed “1” on the keyboard if stimulus X was more like the first word of the triad, and “3” if X was more like the third. Participants had four seconds to respond to each trial before the program automatically continued. This task lasted approximately 15 min. Next was a perceptual assimilation task involving the same auditory stimuli (∼20 min; details and results of this task are not reported here) and a sound blending task (∼5 min). The sound blending task, which indexes individual variation in phonological awareness (Perrachione et al., 2011), involved 32 English nonword test trials in which the participant heard a sequence of phonemes produced in isolation (e.g., [v]…[ɚ]…[t]…[ɪ]…[n]), followed by a production that either matched (e.g., [vɚtɪn]) or mismatched (e.g., [vɚlɪn]) the nonword formed by connecting the phonemes in a continuous string. Participants registered match/mismatch responses by pressing the “f” or “j” key, respectively. The test trials were preceded by three practice trials involving real English words.
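The trial design described above can be sketched in code. The following is a minimal Python sketch, not the study's actual presentation code; the ASCII phoneme labels and token indices are hypothetical stand-ins for the stimulus files:

```python
import random

# Hypothetical ASCII stand-ins for the four Hindi dental-retroflex contrasts.
CONTRASTS = [("d_dental", "d_retroflex"),
             ("dh_dental", "dh_retroflex"),
             ("t_dental", "t_retroflex"),
             ("th_dental", "th_retroflex")]
TALKERS = ["T1", "T2", "T3", "T4"]
TOKENS_PER_PHONEME = 4  # four usable tokens of each phoneme per talker


def build_axb_trials(seed=0):
    """Build 128 AXB trials: each of the 16 tokens of each phoneme appears
    once as X; A and B come from the same talker as X; the same-category
    item is never the acoustically identical token as X."""
    rng = random.Random(seed)
    # 64 trials where X matches A and 64 where X matches B, randomly ordered.
    x_positions = ["A"] * 64 + ["B"] * 64
    rng.shuffle(x_positions)

    trials = []
    for dental, retroflex in CONTRASTS:
        for x_cat, other_cat in ((dental, retroflex), (retroflex, dental)):
            for talker in TALKERS:
                for x_tok in range(TOKENS_PER_PHONEME):
                    # Same-category token for the matching position (never X itself).
                    match_tok = rng.choice(
                        [t for t in range(TOKENS_PER_PHONEME) if t != x_tok])
                    other_tok = rng.randrange(TOKENS_PER_PHONEME)
                    pos = x_positions[len(trials)]
                    a, b = (((x_cat, match_tok), (other_cat, other_tok))
                            if pos == "A"
                            else ((other_cat, other_tok), (x_cat, match_tok)))
                    trials.append({"talker": talker,
                                   "contrast": (dental, retroflex),
                                   "X": (x_cat, x_tok), "A": a, "B": b,
                                   "answer": pos})
    return trials
```

In the actual experiment, trials were additionally blocked by talker with randomized block order; this sketch only enforces the counts and the matching constraints.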
3. Results and analysis
Blend task responses were converted to d-prime for each participant; a log-linear correction was applied to adjust for extreme proportions following Hautus (1995).
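This conversion can be sketched as follows (a minimal Python sketch, not the study's analysis code; the function name is ours, and we assume the standard equal-variance d′ computed from hit and false alarm rates, with the log-linear correction adding 0.5 to each count and 1 to each trial total):

```python
from statistics import NormalDist


def dprime_loglinear(hits, misses, false_alarms, correct_rejections):
    """d' with the Hautus (1995) log-linear correction, which keeps
    perfect or zero response rates from yielding infinite z-scores."""
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse standard normal CDF
    return z(hit_rate) - z(fa_rate)
```

For example, a participant with 30 hits, 2 misses, 5 false alarms, and 27 correct rejections on the 32+32 blend trials receives d′ ≈ 2.40, and even perfect performance yields a finite score.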
Figure 1 shows mean proportion correct in the AXB discrimination task by contrast and by talker. We employed model comparison using likelihood ratio tests to determine the best-fitting model that included the maximal fixed and random effects structure justified by our research questions. We fit generalized linear mixed effects models with a binomial linking function (accurate responses = 1, inaccurate = 0) using the lme4 package [version 1.1-15 (Bates et al., 2015)] of R (version 4.3.1) and the bobyqa optimizer for speed and valid convergence. The best-fitting model included the sum-coded fixed effects Talker (four levels: T1, T2, T3, T4), Contrast (four levels: /d̪/-/ɖ/, /d̪ʱ/-/ɖʱ/, /t̪/-/ʈ/, /t̪ʰ/-/ʈʰ/), and the interaction of Talker * Contrast, as well as a continuous covariate (dprimeBlend). We considered random slopes only when they involved five or more levels per random effect (Bolker et al., 2009); therefore the maximal model, which was also the best-fitting model, included random by-Item and by-Participant intercepts, and random by-Item slopes for dprimeBlend (see Table 2).
Table 2. Model comparison via likelihood ratio tests.

Model formula | AIC | Chisq | Df | Pr(>Chisq)
---|---|---|---|---
Acc ∼ (1|Item) + (1|Part_ID) | 26934 | | |
Acc ∼ Talker + (1|Item) + (1|Part_ID) | 26911 | 29.4472 | 3 | <0.001
Acc ∼ Talker + Contrast + (1|Item) + (1|Part_ID) | 26907 | 10.3607 | 3 | 0.016
Acc ∼ Talker * Contrast + (1|Item) + (1|Part_ID) | 26901 | 23.3176 | 9 | 0.006
Acc ∼ Talker * Contrast + dprimeBlend + (1|Item) + (1|Part_ID) | 26898 | 5.6993 | 1 | 0.017
Acc ∼ Talker * Contrast + dprimeBlend + (1 + dprimeBlend|Item) + (1|Part_ID) | 26875 | 26.9493 | 2 | <0.001
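Each likelihood ratio test in Table 2 compares a model to the next-simpler one: the statistic is twice the difference in log-likelihoods, referred to a chi-square distribution whose degrees of freedom equal the difference in parameter counts. A minimal sketch of this computation (in Python with scipy rather than R's anova(), purely for illustration; the log-likelihood values below are hypothetical):

```python
from scipy.stats import chi2


def likelihood_ratio_test(loglik_reduced, loglik_full, df_diff):
    """Compare two nested models given their maximized log-likelihoods.

    Returns (statistic, p_value), where statistic = 2 * (LL_full - LL_reduced)
    and the p-value comes from a chi-square distribution with df_diff
    degrees of freedom.
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2.sf(stat, df_diff)


# Hypothetical log-likelihoods for two nested models differing by 3 parameters:
stat, p = likelihood_ratio_test(-100.0, -85.0, 3)
```

Plugging in the Chisq and Df values reported in Table 2 reproduces its p-values; e.g., chi2.sf(10.3607, 3) rounds to the 0.016 reported for the Contrast row.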
Type III Wald chi-square analysis of the best-fitting model revealed significant main effects of Talker [χ2(3) = 26.793, p < 0.001], Contrast [χ2(3) = 12.685, p = 0.005], and dprimeBlend [χ2(1) = 6.650, p = 0.010], and a significant interaction of Talker and Contrast [χ2(9) = 17.492, p = 0.042].
An exploratory descriptive analysis of individual listener variability suggests that in addition to group differences by talker and contrast, individual listeners varied in ways that appear to interact with talker and contrast. For example, Fig. 2 shows that the participants with the single highest (black dots) and lowest (gray dots) overall AXB accuracy rates (averaged across talker and contrast) did not always pattern with their associated median-split accuracy groups (box plots; high-accuracy group n = 80; low n = 89). Indeed, the two individual listeners even swap relative accuracy for the voiceless unaspirated contrast when it is produced by Talker 4, and exhibit similar patterns for all four contrasts when they are produced by Talker 2.
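The median-split grouping used in this exploratory analysis can be sketched as follows (an illustrative Python sketch with hypothetical data shapes, not the study's analysis code; tie-breaking at the median is one of several defensible choices):

```python
def median_split(overall_accuracy):
    """Split participants into high/low groups at the median accuracy.

    overall_accuracy: dict mapping participant ID -> mean proportion correct
    (averaged across talker and contrast). Participants exactly at the
    median are assigned to the low group here.
    """
    values = sorted(overall_accuracy.values())
    n = len(values)
    median = (values[n // 2] if n % 2
              else (values[n // 2 - 1] + values[n // 2]) / 2)
    high = {p for p, acc in overall_accuracy.items() if acc > median}
    low = {p for p, acc in overall_accuracy.items() if acc <= median}
    return high, low, median
```

With an odd sample size and ties at the median, such a split yields unequal group sizes, consistent with the 80/89 split reported above.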
4. Discussion
In this study we explored the discrimination of Hindi dental-retroflex stop contrasts by American English speakers. As expected from previous research, these contrasts proved quite difficult for this listener group. With respect to our first research question, the main effect of Talker demonstrates that discrimination accuracy for these Hindi dental-retroflex contrasts depends on which of the four talkers produced the tokens. Future research might investigate the acoustic-phonetic properties of individual talker differences. Our second research question focused on the effect of the specific contrast under investigation. The main effect of Contrast suggests that there is variability among the four Hindi dental-retroflex contrasts such that the selection of only one of them for presentation in a study may not represent the entire set of dental-retroflex contrasts. Our third research question concerned the combined effects of talker and contrast variability. The interaction of talker and contrast suggests that discrimination accuracy depends both on the contrast and the talker who produced the stimuli. Our fourth research question concerned listener variability, as observed by a measure of phonological awareness. As dprimeBlend was a significant predictor of discrimination accuracy in our model, we find evidence that discrimination accuracy indeed varies by listener, and as our exploratory descriptive analysis reveals, individual listeners varied in ways that appear to interact with talker and contrast.
These findings have important theoretical and methodological implications for the study of speech perception. First, given the importance of who is producing materials, studies should include multiple talkers in order to avoid confounding properties of the language under investigation and properties of an individual talker's speech. Assuming that the objective is to elicit materials that are representative of a language, it could be argued that it is imperative to include multiple talkers, and to consider speech perception patterns by talker before drawing conclusions about the perception of speech contrasts in the language. Similarly, given the main effect and interaction involving the contrast variable, it is likewise not sufficient to select a single contrast to represent a set of contrasts. As noted above, the /d̪/-/ɖ/ contrast is often selected to represent Hindi dental-retroflex contrasts; however, as we have seen here, perception of the Hindi dental-retroflex contrasts under investigation differs by contrast and interacts with talker. Additionally, individual listeners differ more than the presentation of only group means suggests. Individual listener differences are often intentionally obscured using statistical techniques (for some types of research questions, this may be preferred), and the typically larger sample of listeners (relative to, e.g., talkers and contrasts) may justify interpreting aggregated results as representative of the behaviors of groups of listeners. However, the individual listener differences observed here lead to questions regarding the factors that control this variation, as well as the implications of this variation for our understanding of speech perception. Models of cross-language and learner speech perception aim to predict which novel segments and contrasts will be most difficult based on how novel phones map onto familiar speech categories using some measure of perceived similarity.
These models often struggle to capture aspects of learner performance (Cebrian et al., 2021), perhaps because they assume that groups of listeners with the same prior language experience will exhibit the same speech perception patterns (Kogan and Mora, 2022). Nevertheless, understanding variability is not only an important piece of understanding speech perception, but is also essential for providing individualized language instruction to adult learners (Munro, 2021).
Finally, the types of variability documented here may contribute to null findings in studies looking only for group-level effects. Given a traditional publication bias against studies reporting null—or unexpectedly complicated—results, data with the potential to elucidate this variability may be tucked away in “file drawers” (Franco et al., 2014). This may account for the relative lack of literature documenting sources of variability in cross-language speech perception. However, the present findings paint a picture of cross-language speech perception as varying on multiple dimensions, requiring careful attention to the contributions of individual listeners, individual talkers, and individual contrasts (and likely more factors).
Acknowledgments
This work was supported by the University of Utah's College of Humanities Kickstart Grant awarded to the authors. We are grateful to Saket Bahuguna, Ben Slade, Seung Kyung Kim, Jack Silcox, study participants, Speech Acquisition Lab members, as well as audiences at PSLLT 2021 and (F)ASAL 2022 and the JASA-EL editor and reviewer.
Author Declarations
Conflict of Interest
The authors have no conflicts to disclose.
Ethics Approval
The research was approved by the Institutional Review Board at the University of Utah. Informed consent was obtained from all participants.
Data Availability
The data that support the findings of this study are openly available on the Open Science Framework at http://doi.org/10.17605/OSF.IO/UT6ZN.