Speech recognition by both humans and machines frequently fails in non-optimal yet common situations. For example, word recognition error rates for second-language (L2) speech can be high, especially under conditions involving background noise. At the same time, both human and machine speech recognition sometimes shows remarkable robustness against signal- and noise-related degradation. Which acoustic features of speech explain this substantial variation in intelligibility? Current approaches align speech to text to extract a small set of pre-defined spectro-temporal properties from specific sounds in particular words. However, variation in these properties leaves much cross-talker variation in intelligibility unexplained. We examine an alternative approach that utilizes a perceptual similarity space acquired via self-supervised learning. This approach encodes distinctions between speech samples without requiring pre-defined acoustic features or speech-to-text alignment. We show that L2 English speech samples are less tightly clustered in the space than L1 samples, reflecting variability in English proficiency among L2 talkers. Critically, distances in this similarity space are perceptually meaningful: L1 English listeners have lower recognition accuracy for L2 speakers whose speech is more distant in the space from L1 speech. These results indicate that perceptual similarity may form the basis for an entirely new speech and language analysis approach.
I. INTRODUCTION
Over the past several decades, a combination of theoretical and technological advances in speech analysis and processing has led to major breakthroughs in digitally mediated human-human communication (e.g., digital hearing aids and cochlear implants) and human-machine speech communication (e.g., Amazon's Alexa, Google's Home, Apple's Siri). While the success of these speech-based devices is a testament to current approaches to speech analysis, word recognition error rates for these devices often skyrocket under conditions that are less than optimal (Gilbert et al., 2013; Ji et al., 2014; Mattys et al., 2013; McCrocklin et al., 2019; Tamati et al., 2022)—even though such situations are quite common. Most speech takes place in the presence of background noise, and many interactions involve second-language (L2) users as both talkers and hearers. Indeed, on a global scale, the majority of real-world interactions in English involve L2 talkers and/or hearers—yet digital speech processors frequently fail to correctly recognize L2 English, especially in suboptimal listening conditions (Harwell et al., 2018; Kim et al., 2024; McCrocklin and Edalatishams, 2020).
Currently, acoustic-phonetic analysis relies either on extraction of language- and theory-independent acoustic features (e.g., mel-frequency cepstral coefficients, MFCCs) or on measurement of a small number of hypothesis-driven acoustic cues associated with a sequence of discrete linguistic elements (words, phonemes/phones) in pre-specified temporal windows. These analytical frameworks have allowed for detailed modeling of speech acoustics and have identified some acoustic sources of variation in intelligibility (i.e., word recognition accuracy: Bradlow et al., 1996; Hazan and Markham, 2004). However, a large portion of observed inter-talker variation in speech intelligibility remains unexplained (Ferguson and Kewley-Port, 2007; Goldwater et al., 2010; McCloy et al., 2015; Neel, 2008; Pommée et al., 2021).
In this work, we utilize an alternative approach to characterizing acoustic-phonetic differences that focuses on the notion of a perceptual similarity space: a multidimensional encoding of distinctions between speech samples, reflecting general auditory properties and the prior experience of language users (see Atagi and Bent, 2013, and Bradlow et al., 2010, for discussion of this concept in the context of cross-language speech perception; for discussion in the context of first language acquisition, see Feldman et al., 2021; Matusevych et al., 2023; Schatz et al., 2021). We examine whether this general approach to acoustic-phonetic analysis can shed light on variation in L2 speech and its consequences for speech intelligibility.
The structure of the perceptual similarity space is estimated via acoustic-phonetic representations extracted from a self-supervised machine learning model that (much like L1 English listeners) has been trained on English speech. Using dimensionality reduction methods that preserve certain aspects of the original space (e.g., distance or variance), we project representations of utterances with the same linguistic content (e.g., two different speakers' productions of the same sentence) into a learned perceptual similarity space. Within this space, we can examine similarities and differences between L1 and L2 English talkers and analyze how variability across L2 talkers relates to variation in L2 speech intelligibility.
To provide an initial test of this general approach, we examine sentence-length speech recordings by L1 and L2 English talkers, accompanied by a clearly defined, objective measure of the intelligibility of the L2 materials (i.e., word recognition accuracy by L1 listeners). This is an ideal testing ground because the standard acoustic-phonetic approach has failed to characterize the full extent of variation in the acoustics of L1 and L2 speech and, relatedly, failed to account for the impact of the full range of variation on word recognition accuracy for both human and machine speech recognizers (Bent et al., 2007; Field, 2005; Winters and O'Brien, 2013).
To summarize, we propose a new method for analyzing acoustic-phonetic similarity based on a perceptual similarity space acquired via self-supervised learning. This method does not require alignment of the speech signal to a transcription at either training or test. That is, it does not require a priori identification of units of analysis or specific acoustic dimensions for measurement. Instead, this method analyzes projections of trajectories formed by whole utterances in the perceptual similarity space. Distances between whole-utterance trajectories are quantified and visualized in interpretable distance matrices. We apply this method to L1 and L2 English speech, and then demonstrate that this perceptual similarity space method captures two key characteristics of L1 versus L2 speech: (a) L2 talkers' productions are more loosely clustered in the perceptual similarity space than L1 talkers' productions, and (b) distance of an L2 speaker from L1 speakers in the perceptual similarity space (i.e., average distance across a set of trajectory projections in the space) can predict relative intelligibility (word recognition accuracy). Crucially, this metric of trajectory distance in the perceptual similarity space substantially outperforms standard acoustic-phonetic features used in intelligibility modeling. Our acoustic analysis code and statistical analyses are publicly available.1
II. OVERVIEW OF OUR APPROACH
Our work examines acoustic-phonetic variation in L1 and L2 speech. Consider the speech waveforms and spectrograms shown in Fig. 1. These show the same utterance (“The lady wore a…”) by a first-language (L1) English talker (A) and a second-language (L2) English (L1: Korean) talker (B). Several differences between L1 and L2 English can be readily identified. Overall, the L2 speech is much slower. Over and above this slow speech rate, the L2 production of the word “wore” (enclosed by a yellow dashed box) shows quite distinct spectral patterns from the L1 production (likely reflecting the difficulty of producing /r/ for L1 Korean talkers). Standard analysis techniques would quantify such differences by aligning the speech to the text and then taking measures associated with whole words (e.g., the duration of the portion of the sound signal identified as “wore”) or individual sounds (e.g., the spectral properties of the signal associated with /r/, such as the frequency drop for the third formant). As noted previously, this approach has yielded some success but has failed to explain much of the variability in speech intelligibility. We hypothesize that this shortcoming results from the failure of the standard approach to capture both salient and subtle acoustic variations that fall outside of the scope of the pre-selected acoustic parameters, particularly for connected speech.
(Color online) Top: Waveform (A) and spectrogram (B) of L1 English and L2 English speakers pronouncing the same utterance. Bottom: Trajectories of speech samples in the perceptual similarity space acquired via self-supervised learning: (C) the L1 talker from (A) and the L2 talker from (B); (D): the L2 talker from (B) with another L2 talker; (E) the L1 talker from (A) with another L1 talker. The yellow shaded region in (C) highlights the portion of the trajectories corresponding to the signal shown in yellow dashed boxes in (A) and (B), showing how trajectories reflect acoustic deviations between talkers. An interactive visualization of the trajectories can be accessed at http://tinyurl.com/perceptSim.
We propose an alternative approach that aims to address these shortcomings. Acoustic-phonetic similarity is measured not by pre-defined parameters, but by distance within a perceptual similarity space acquired via self-supervised learning. Rather than representing acoustic similarity in terms of language-independent phonetic properties, the similarity space is shaped by the distributional properties of the language (e.g., decreasing the similarity of sounds that correspond to meaningful distinctions in the language, enhancing their distinctiveness). Learning of the similarity space is self-supervised in that it does not require pre-determination of the units of analysis; acoustic-phonetic distinctions are not constrained to specific, pre-defined, discrete linguistic elements. Similarly, our analytic technique is not restricted to individual words or speech sound categories (cf. Martin et al., 2023; Matusevych et al., 2023; Schatz et al., 2021). We instead analyze the trajectories formed by whole utterances in the perceptual similarity space. This technique does not assume a particular structure of the signal, nor does it rely on the alignment of the speech to the textual content, in contrast to similar work that utilized unsupervised learning to account for the variability in speech production (cf. Bartelds et al., 2022). Furthermore, our method focuses on the dynamics of the speech signal as a whole rather than focusing on discrete temporal units. This approach has proved useful in more traditional acoustic analyses (Jin and Liu, 2013; Stone et al., 2010) and has been argued to form the basis for learning sequential dependencies in speech (Wang et al., 2019).
Figures 1(C)–1(E) depict the trajectories in the perceptual similarity space for three pairs of speakers: an L1 vs an L2 speaker [Fig. 1(C)], two L2 speakers [Fig. 1(D)], and two L1 speakers [Fig. 1(E)]. In these plots, the trajectory color transitions from lightest (earlier in the signal) to darkest (later in the signal). Additionally, text annotations along each trajectory indicate the time points of the corresponding portions of the utterance. For instance, Fig. 1(C) shows the trajectories in the perceptual similarity space for the L1 (blue) and L2 (red) utterances shown in Figs. 1(A) and 1(B), respectively. The yellow dotted line in Fig. 1(C) highlights the portions of the trajectories where they deviate most notably; these portions correspond to the signal portions shown in yellow boxes in Figs. 1(A) and 1(B). An interactive visualization illustrating the technique is available at https://tinyurl.com/percepsim.
We hypothesize that the distances between trajectories in perceptual similarity space index differences in how talkers plan and produce utterances and differences in how listeners perceive utterances. This makes two key predictions:
- Due to substantial variation in English proficiency amongst L2 talkers, trajectories for L2 talkers' productions are predicted to be more loosely clustered in perceptual similarity space than L1 talkers' trajectories.
- Given that accurate speech perception (by humans and machines) is grounded in prior exposure to the range of between-talker variability in the L1, trajectories for L2 speech lying outside of the typical range for L1 speech trajectories will be more difficult to understand than trajectories that are within or close to the typical L1 range.
The first prediction is based on the substantial variation between L2 talkers. This difference in variability reflects many factors: variation across L2 speakers in proficiency with the language being spoken (here, English); the dynamic interactions between the L1 and L2 (which differ across L1-L2 pairs); and a whole host of individual trait characteristics, including anatomy, physiology, memory, cognition, and personality, that affect L2 learning and production independently from their influences on L1 function (Chang, 2019). We can test this prediction by comparing the distances between trajectories within each group. This is illustrated in Figs. 1(D) and 1(E), which show trajectories in the perceptual similarity space for two L2 talkers [Fig. 1(D)] and two L1 talkers [Fig. 1(E)], respectively. Consistent with the prediction of looser clustering for L2 vs L1 talkers, the L2 trajectories in Fig. 1(D) deviate at several points, whereas the two L1 trajectories in Fig. 1(E) track each other very closely.
The second prediction can be tested by predicting intelligibility using the distance between L1 vs L2 trajectories for the same speech materials [as shown in Fig. 1(C)]. Specifically, the further an L2 trajectory is from the typical L1 trajectory, the lower its expected intelligibility for L1 listeners. Note that we focus on a clearly defined, objective measure of intelligibility, word recognition accuracy, rather than vague, subjective constructs such as L2 “accentedness” (the greater the distance, the “heavier” the “accent”; Bartelds et al., 2022) or “comprehensibility” (the greater the distance, the lower the “mean opinion score” or “sound quality”; Arehart et al., 2022; Levis, 2018; Munro and Derwing, 2020), which are typically measured through rating scales (although see recent efforts to assess listening effort through objective, physiological measures such as pupil dilation; Borghini and Hazan, 2020; McLaughlin and Van Engen, 2020). Importantly, prior work (Arehart et al., 2022; Levis, 2005, 2020) has established that intelligibility (proportion of words correctly recognized), comprehensibility (a rating from “extremely easy to understand” to “impossible to understand”), and degree of foreign accent (a rating from “no foreign accent” to “very strong foreign accent”) do not always converge. L2 speech samples that are equally intelligible in terms of words correctly recognized (intelligibility) may not be judged as similar on the other two dimensions (comprehensibility and foreign-accentedness). Thus, by assessing intelligibility in terms of recognition accuracy, we avoid confounds of perceived proficiency and experience with the target language and focus our analyses on a less-biased index of speech reception.
III. COMPUTATIONAL METHODS
Our goal is to perform a holistic comparison between two speech utterances that have identical linguistic content. To achieve this, we employ a representation learned via self-supervision to map each speech utterance into a space that retains perceptual distinctions. This transforms a speech utterance into a trajectory, so that comparing two utterances amounts to computing the distance between their trajectories. The trajectory conveys the information essential for distinguishing between speech productions. Additionally, we reduce the dimensionality of the represented speech, allowing for visualization. Our pipeline is summarized in Fig. 2.
(Color online) Analysis pipeline for comparing two speech samples. Raw waveforms (bottom) are passed through the SSL representation model (in this work, we utilize HuBERT; Hsu et al., 2021). The learned high-dimensional representations for each speech sample are then projected into a three-dimensional space. Finally, the distance between the trajectories is calculated using DTW.
We turn to a rigorous description of our method, followed by the actual implementation details in the subsequent section. Let x ∈ X* represent a finite-duration speech waveform, where X denotes the space of all possible samples and X* denotes the space of all finite-length sample sequences. Our objective is to determine a perceptual similarity distance, denoted d(x, x′), for comparing two speech waveforms x and x′ that are not necessarily of the same duration.
To accomplish our goal, we begin by feeding the speech utterance x into a self-supervised learning (SSL) model. This model converts the raw waveform into T frames and assigns a representation vector to each frame. Formally, let R^E be the space of E-dimensional vectors, and let (R^E)^T represent the domain of all fixed-length sequences of T such vectors. The SSL model f_θ: X* → (R^E)^T is a transformer-based deep neural network with a set of trained parameters θ. In the SSL paradigm, θ is obtained by predicting latent information from the untranscribed speech input. In particular, since the speech data are unlabeled, pseudo-labels are defined using pretext tasks that allow the discovery of inherent structure in the input signal. The output of the model is a sequence of E-dimensional vectors across T time stamps; each such sequence belongs to the domain (R^E)^T. We denote this embedded vector sequence by z = f_θ(x) = (z_1, …, z_T), where z_t ∈ R^E for all t = 1, …, T.
Subsequently, we work with pairs of utterances, x and x′, that represent the same content but are produced by different speakers. Following the procedure noted previously, we take their mappings z = f_θ(x) and z′ = f_θ(x′) and project them to a lower-dimensional space, which we refer to as the perceptual similarity space. Let g_φ be a function with parameters φ that projects each embedding vector z_t ∈ R^E to y_t = g_φ(z_t) ∈ R^e, where e ≪ E. The resulting sequence y = (y_1, …, y_T) is composed of T vectors, which forms a trajectory in the perceptual similarity space.
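As a bridge between this formal description and the implementation details given in Sec. IV, the following minimal Python sketch shows how the three components compose into a single distance function. The helper names (encode for f_θ, project for g_φ, and trajectory_distance for the DTW comparison) are hypothetical placeholders for the concrete pieces described below.

```python
def perceptual_distance(wav_a, wav_b, encode, project, trajectory_distance):
    """Compute d(x, x') for two waveforms with the same sentence content.

    encode              : f_theta, maps a waveform to a (T, E) embedding sequence
    project             : g_phi, maps a (T, E) sequence to a (T, e) trajectory
    trajectory_distance : length-normalized DTW cost between two trajectories
    """
    z_a, z_b = encode(wav_a), encode(wav_b)    # (T1, E) and (T2, E); T1 may differ from T2
    y_a, y_b = project(z_a), project(z_b)      # (T1, e) and (T2, e)
    return trajectory_distance(y_a, y_b)       # DTW accommodates unequal lengths
```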
IV. ACOUSTIC MATERIALS AND METHODS
A. Speech materials
We analyzed data from three groups of L2 English speakers: L1 Korean speakers (N = 10), L1 Mandarin speakers (N = 14), and L1 Spanish speakers (N = 11). Note that these L1 Spanish speakers are heritage speakers who learned Spanish at home and then transitioned to English, the dominant language of the United States. All recordings are freely available online via SpeechBox (https://speechbox.linguistics.northwestern.edu/#!/home). Each group of L2 English talkers was compared to a set of L1 English talkers who produced the same sentences. The L1 Korean talkers and matched L1 talkers (N = 10 in each group) were taken from an experimental study (Bradlow et al., 2018) and are available as the Korean-English Intelligibility project in the Scripted Speech Corpora section of SpeechBox. The L1 Mandarin (N = 14) and L1 Spanish (N = 11) sets and a set of matched L1 English talkers (N = 26) are from the ALLSSTAR Corpus section of SpeechBox (Bradlow, 2020).
All talkers were recorded in the Phonetics Laboratory of the Linguistics Department at Northwestern University (Evanston, IL). The L1 Mandarin talkers and L1 Korean talkers were all recruited from the graduate student population at Northwestern University. They were all educated through university (up to their undergraduate degree) in their L1 (Mandarin or Korean, respectively). The English language experiences of these L2 English talkers were primarily in their home countries before arriving in the USA for graduate study. The L1 Spanish talkers were recruited from the undergraduate student population at Northwestern University. All acquired Spanish at birth (age 0) and English between ages five and eight years old. They were all born in the USA and reported exclusive usage of Spanish at home during early childhood (before age five), were schooled entirely in English, and reported less than 20% Spanish usage during adulthood. The L1 English talkers were also recruited from the Northwestern University student population, which includes individuals from across the USA.
Previous studies included intelligibility data for each of the 35 L2 talkers. This intelligibility data is based on sentence recognition accuracy scores from L1 English listeners who were presented with sentences mixed with broad-band noise at two signal-to-noise ratios (SNRs). These listeners were L1 English speakers who were raised and educated in the USA. They were recruited from the undergraduate student population at Northwestern University, which includes individuals from across the USA. None of the listeners who participated in the intelligibility tests with speech by the Mandarin or Korean talkers had any extended previous experience with Mandarin or Korean, respectively. Some of the listeners who participated in the intelligibility tests with speech by Spanish Heritage Speakers (L1 Spanish, now dominant in their L2, English) had some exposure to Spanish (e.g., Spanish courses in high school), yet none reported acquiring any second language before age 8. (For additional information on both the talkers and listeners, please see Blasingame and Bradlow, 2021; Bradlow et al., 2018.)
B. Traditional acoustic analysis methods
To provide a baseline against which our novel method can be compared, we calculated several acoustic properties drawn from the intelligibility literature. This work has identified a relatively small set of acoustically and perceptually salient dimensions as potential predictors of speech intelligibility (e.g., Bradlow et al., 1996; Han et al., 2021; Hazan and Markham, 2004; Paulus et al., 2020; Pisoni and Remez, 2005). Decades of research have failed to establish a straightforward and consistent relationship between intelligibility and acoustic variation at either the global or segment level. Instead, available evidence indicates that high-intelligibility speech is associated with various combinations of traditional phonetic dimensions, including both global properties (e.g., speaking rate) and segmental properties (e.g., vowel space expansion; Bradlow et al., 1996; Han et al., 2021; Hazan and Markham, 2004; Paulus et al., 2020; Pisoni and Remez, 2005). In the present study, we focused primarily on global (i.e., sentence-level) parameters because measures of segment-level articulation (e.g., voice onset times, fricative spectra, etc.) are better addressed with specifically designed word-sized materials, where variation due to connected speech phenomena is eliminated or better controlled than in sentence-length utterances such as those in our dataset. We do, however, include a measure of vowel articulation (vowel space expansion), which can be assessed based on vowel formant frequencies across a given talker's full set of sentence productions.
Analyses were performed in Praat (Boersma and Weenink, 2022). After normalizing the files for intensity (as was done when testing intelligibility), we used version 2 of the Praat Syllable Nuclei script (De Jong and Wempe, 2009) to annotate all acoustic syllable nuclei (i.e., local intensity peaks) and the boundaries of all pauses (i.e., silences). Pauses were defined as silent regions of at least 0.02 s in duration. From these annotations, we calculated for each talker the total number of pauses, the syllable reduction rate (number of acoustic syllable nuclei/number of orthographic syllables in the target sentence text, calculated via a Praat script: Kendall, 2013), and the articulation rate (number of acoustic syllable nuclei/duration of speech excluding pauses). Using standard Praat functions, pitch measurements (in Hertz) were obtained for each production at each voiced acoustic syllable nucleus identified by the Praat Syllable Nuclei script. We then calculated for each talker the mean pitch across all sentences as well as the mean (across sentences) of the coefficient of variation of pitch (standard deviation of pitch/mean pitch). Finally, mean vowel dispersion was calculated for each talker as the Euclidean distance of each acoustic syllable nucleus's first (F1) and second (F2) formants (in Bark) from the talker's mean F1 and F2 across all sentences.
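The derived sentence-level measures can be computed directly from the script output. The sketch below (in Python, not the Praat scripts actually used) assumes the syllable-nucleus, pause, pitch, and formant annotations have already been exported; the function and argument names are hypothetical, and the Hz-to-Bark conversion shown is Traunmüller's (1990) formula, one common choice rather than necessarily the exact transform applied in Praat.

```python
import numpy as np

def bark(f_hz):
    """Hz-to-Bark conversion (Traunmueller, 1990); one common approximation."""
    f_hz = np.asarray(f_hz, dtype=float)
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def global_measures(nuclei_times, pause_durs, total_dur, n_ortho_syll,
                    pitch_hz, f1_hz, f2_hz):
    """Derive the sentence-level baseline predictors from exported annotations.

    nuclei_times : times (s) of acoustic syllable nuclei (Praat Syllable Nuclei script)
    pause_durs   : durations (s) of detected silent pauses
    total_dur    : total duration (s) of the recording
    n_ortho_syll : number of orthographic syllables in the target sentence text
    pitch_hz     : pitch (Hz) at each voiced nucleus
    f1_hz, f2_hz : first/second formant (Hz) at each nucleus
    """
    n_nuclei = len(nuclei_times)
    artic_rate = n_nuclei / (total_dur - np.sum(pause_durs))   # nuclei per second of speech
    syll_reduction = n_nuclei / n_ortho_syll                   # acoustic / orthographic syllables
    mean_pitch = np.mean(pitch_hz)
    cv_pitch = np.std(pitch_hz) / mean_pitch                   # coefficient of variation
    f1_b, f2_b = bark(f1_hz), bark(f2_hz)
    disp = np.sqrt((f1_b - f1_b.mean()) ** 2 + (f2_b - f2_b.mean()) ** 2)
    return dict(num_pauses=len(pause_durs), artic_rate=artic_rate,
                syll_reduction=syll_reduction, mean_pitch=mean_pitch,
                cv_pitch=cv_pitch, vowel_dispersion=disp.mean())
```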
Additionally, we extracted the MFCCs of the audio and then measured the distance between them with dynamic time warping (DTW; Abel and Babel, 2017). We denote this method as MFCC distance. This method works better when evaluated at the word level. For a fair comparison, we split the speech into words using the Montreal Forced Aligner (McAuliffe et al., 2017), computed MFCCs with a window of 20 ms and a hop length of 10 ms, and compared the distance between words for every pair of speakers. The MFCC distance used in the analyses below is the average distance for each talker, collapsing across all words in all sentences.
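For the MFCC baseline, a hedged sketch of the per-word computation is shown below, using librosa for both the MFCC extraction (20 ms window, 10 ms hop, as stated above) and the DTW alignment. The file paths are hypothetical, and the original implementation may differ in library choice and MFCC settings (e.g., number of coefficients).

```python
import librosa

def mfcc_dtw_distance(path_a, path_b, sr=16000, n_mfcc=13):
    """Length-normalized DTW distance between the MFCCs of two word tokens."""
    y_a, _ = librosa.load(path_a, sr=sr)
    y_b, _ = librosa.load(path_b, sr=sr)
    win, hop = int(0.020 * sr), int(0.010 * sr)      # 20 ms window, 10 ms hop
    m_a = librosa.feature.mfcc(y=y_a, sr=sr, n_mfcc=n_mfcc,
                               n_fft=win, hop_length=hop)
    m_b = librosa.feature.mfcc(y=y_b, sr=sr, n_mfcc=n_mfcc,
                               n_fft=win, hop_length=hop)
    D, _ = librosa.sequence.dtw(X=m_a, Y=m_b, metric="euclidean")
    return D[-1, -1] / (m_a.shape[1] + m_b.shape[1])  # normalize by combined frame count
```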
C. Implementation details
We now describe the implementation details of our method. For the self-supervised learning system, we utilize HuBERT (specifically HuBERT-base; Hsu et al., 2021), a deep learning model trained in a self-supervised fashion on 960 h of unannotated English speech. HuBERT-base uses a seven-layer convolutional neural network (CNN) waveform encoder followed by 12 bi-directional transformer layers. The training follows BERT-style models (Devlin et al., 2018); the transformer layers are trained to predict codebook labels for masked regions of the CNN output. The codebook is initialized with an ensemble of k-means clusters of MFCCs at various granularities. It is then iteratively refined during training using the representation of one of the transformer layers. The representation vectors are of size E = 768.
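A minimal sketch of the frame-level feature extraction is shown below, using the pretrained HuBERT-base checkpoint distributed with torchaudio; the audio file name is hypothetical, and the original study may have loaded the model through a different toolkit.

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE           # HuBERT-base, trained on 960 h of English
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("talker01_sent01.wav")   # hypothetical mono recording
if sr != bundle.sample_rate:                         # the model expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # extract_features returns one (1, T, 768) tensor per transformer layer
    layer_outputs, _ = model.extract_features(waveform)

z = layer_outputs[11].squeeze(0).numpy()             # layer-12 embeddings, shape (T, 768)
```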
There are a number of SSL speech representation models, such as HuBERT (Hsu et al., 2021), wav2vec (Baevski et al., 2020), Conformer (Gulati et al., 2020), and many others. Although these models exhibit similar levels of performance on many downstream tasks (Ji et al., 2022; Pasad et al., 2023a; Yang et al., 2021), we chose to focus on HuBERT as it is the best-performing model on most tasks (Hsu et al., 2021).
The mapping to a lower-dimensional space was accomplished using t-SNE (t-Distributed Stochastic Neighbor Embedding; Van Der Maaten and Hinton, 2008), here with e = 3. Given the substantial amount of data, it was not feasible to project the entire dataset simultaneously. Instead, we projected arbitrary subsets, ensuring that sentences with similar content remained within the same subset. In Appendix B, we present results for alternative state-of-the-art dimensionality reduction methods, including Kernel-PCA (principal component analysis; Schölkopf et al., 1998) and UMAP (uniform manifold approximation and projection; McInnes et al., 2018). It is worth noting that while t-SNE focuses on preserving local distances, UMAP prioritizes the preservation of global distances, and Kernel-PCA maintains variances in a kernel space. Interestingly, all of these methods yield nearly identical results.
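A sketch of the projection step with scikit-learn's TSNE is given below. It assumes the frame-level embeddings of all productions of a given subset of sentences have been collected in a list, so that matched content is projected jointly and then split back into per-utterance trajectories; the parameter values shown are illustrative (see Sec. V D 2 for the parameter search).

```python
import numpy as np
from sklearn.manifold import TSNE

def project_subset(embedding_sequences, seed=0):
    """Jointly project a list of (T_i, 768) HuBERT sequences into 3-D trajectories."""
    lengths = [seq.shape[0] for seq in embedding_sequences]
    stacked = np.vstack(embedding_sequences)             # (sum of T_i, 768)
    tsne = TSNE(n_components=3, perplexity=40, early_exaggeration=12.0,
                init="pca", random_state=seed)
    low_dim = tsne.fit_transform(stacked)                # (sum of T_i, 3)
    return np.split(low_dim, np.cumsum(lengths)[:-1])    # list of (T_i, 3) trajectories
```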
Finally, we compare the low-dimensional representation of the same sentence spoken by two talkers using DTW. Prior to performing DTW, we mutually normalize the sequences across each dimension (i.e., normalization based on the combined statistics of each dimension). Then, we assess the normalized trajectories derived from representations obtained through self-supervised learning. By doing so, we can compute a metric that quantifies the discrepancy between the productions of the same speech content by two speakers. Last, we also normalize the distance by the combined length of both sequences to account for potential differences in duration.
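The comparison step can be sketched as follows, assuming two (T, 3) trajectories for the same sentence: the dimensions are z-scored using the combined statistics of both sequences, and the accumulated DTW cost is divided by the summed lengths. librosa's DTW routine is used here for illustration; the original code may rely on a different DTW implementation.

```python
import numpy as np
import librosa

def trajectory_distance(traj_a, traj_b):
    """Length-normalized DTW distance between two (T, e) trajectories."""
    combined = np.vstack([traj_a, traj_b])
    mu, sigma = combined.mean(axis=0), combined.std(axis=0)
    a = (traj_a - mu) / sigma                    # mutual normalization per dimension
    b = (traj_b - mu) / sigma
    D, _ = librosa.sequence.dtw(X=a.T, Y=b.T, metric="euclidean")
    return D[-1, -1] / (len(a) + len(b))         # normalize by combined length
```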
V. RESULTS
In this section, we estimate the perceptual similarity space. In the experiments detailed in the following, we used the 12th transformer layer, as it is the closest to the training objective and performed well across various tests. Nevertheless, as shown in Appendix A, nearly any other layer could have been employed with comparable results.
Figures 3–5 show the average distance (collapsing across all sentences within the set) between the trajectories of talker pairs within each group of speech materials. Comparison across figures shows a clear global difference between groups; different L2 groups lie at different distances from their L1 counterparts. Looking at the upper left and lower right quadrants (L1–L2 comparisons), the L1 Chinese and L1 Korean talkers show clear differences from their L1 baseline (i.e., very dark shading) while the L1 Spanish talkers show the smallest difference (i.e., less dark shading). This is expected, as the L1 Spanish Heritage Speakers acquired English at a very young age, while the L1 Chinese and Korean talkers acquired English later in life.
Average distance of trajectories between each pair of talkers (collapsing across sentences), L1 Chinese–L2 English talkers and L1 English talkers. Distance from self (diagonal) not shown (= 0). L1 English talkers are ordered by mean distance across all pairs of L1 talkers. L2 talkers are ordered by intelligibility. Dotted lines show divisions between talker groups. Note: one L2 talker with very high average distance (>0.48) is excluded.
Average distance of trajectories between each pair of talkers (collapsing across sentences), L1 Korean–L2 English talkers, and L1 English talkers. Distance from self (diagonal) not shown (= 0). L1 English talkers are ordered by mean distance across all pairs of L1 talkers. L2 talkers are ordered by intelligibility. Dotted lines show divisions between talker groups.
Average distance of trajectories between each pair of talkers (collapsing across sentences), L1 Spanish–L2 English talkers, and L1 English talkers. Distance from self (diagonal) not shown (= 0). L1 English talkers are ordered by mean distance across all pairs of L1 talkers. L2 talkers are ordered by intelligibility. Dotted lines show divisions between talker groups.
Separate from these global differences between figures, we find highly similar patterns within each figure that speak to the predictions made by our approach; we turn to each of these in the following subsections.
A. Inter-talker variability in L2 vs L1
As discussed previously, if distances between trajectories in the perceptual similarity space index differences in how speakers plan and produce speech, then L2 talkers (with highly variable proficiency in English) should exhibit greater inter-talker variability than L1 talkers.
This difference can be seen in Figs. 3–5 by comparing the lower left quadrant (showing distances between pairs of L2 talkers) and the upper right quadrant (showing distances between pairs of L1 talkers). In each figure, the within-L1 quadrant is dominated by white (small distances) while the within-L2 quadrant shows more gray and black (greater distances); L1 talkers show, on average, smaller distances.
To quantify these observations, for each L2 group we set the L1 English group as a benchmark and then calculated each L2 group's average distance with respect to the L1 English benchmark (average L2 distance/average L1 distance). Confidence intervals for mean distances were calculated using 1000 bootstrap replicates. By-participant means were randomly sampled (with replacement) within each group; the distribution of the ratio of these bootstrap samples provided an estimate of the 95% confidence interval.
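A sketch of this bootstrap is given below, assuming one mean pairwise distance per talker has already been computed; the function and variable names are ours, not taken from the released analysis code.

```python
import numpy as np

def bootstrap_ratio_ci(l2_means, l1_means, n_boot=1000, seed=0):
    """Point estimate and percentile 95% CI for mean(L2) / mean(L1) distances."""
    rng = np.random.default_rng(seed)
    l2 = np.asarray(l2_means, dtype=float)
    l1 = np.asarray(l1_means, dtype=float)
    ratios = np.empty(n_boot)
    for i in range(n_boot):
        l2_sample = rng.choice(l2, size=len(l2), replace=True)   # resample by-talker means
        l1_sample = rng.choice(l1, size=len(l1), replace=True)
        ratios[i] = l2_sample.mean() / l1_sample.mean()
    return l2.mean() / l1.mean(), np.percentile(ratios, [2.5, 97.5])
```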
Consistent with our qualitative description of the figures, all within-L2: within-L1 distance ratios were significantly greater than 1, indicating that inter-talker variation is larger for L2 vs L1 English talkers. As shown in Fig. 6, the mean ratio of Mandarin L1 distances/English L1 distances was 2.25 [95% confidence interval (CI) 2.15, 2.35]. The Korean L1/English L1 mean ratio was 1.88 (95% CI 1.76, 2.01), and the Spanish L1/English L1 ratio was 1.14 (95% CI 1.08, 1.20). The smaller difference for Spanish L1 likely reflects their status as heritage language speakers who acquired English at a much earlier age than the Chinese and Korean L1 speakers.
Average distance within each group of L2 talkers, divided by average distance within the corresponding group of L1 English talkers. Error bars show bootstrapped 95% confidence intervals. The dotted line shows equal variability within L2 and L1 speakers; values higher than this line show more variability within L2 vs L1 speakers.
B. Prediction of intelligibility by distance in the perceptual similarity space
Having established that distances between sentence trajectories in the perceptual similarity space reflect expected group-level differences, we next asked if trajectory distances also index variation in overall intelligibility within the L2 groups. Specifically, are L2 talkers that are more vs less distant from L1 English talkers also less intelligible to L1 English listeners?
This difference can be seen in Figs. 3–5 examining the upper left quadrant (showing distances between pairs of L2 and L1 talkers; as the matrix is symmetric, the same information is shown in the lower right hand quadrant). The L2 talkers are ordered from least intelligible (leftmost on the x-axis) to most intelligible (the rightmost L2 talker on the x-axis). Moving from right to left, we can see that as intelligibility goes down, the upper left quadrant becomes progressively darker; less intelligible speakers tend to be more distant from L1 talkers.
To quantitatively assess this prediction, we calculated the mean distance-from-L1 for each L2 talker (collapsing across sentences and L1 talkers). The full set of intelligibility scores across all L2 groups was analyzed using a mixed-effects beta regression (Brooks et al., 2017). Control factors included SNR (perception is more difficult at lower SNR levels) and a random intercept to control for idiosyncratic differences between talkers. Distance and SNR were centered. As shown in Table I, as predicted, L2 talkers whose trajectories were farther away from L1 talkers were significantly less intelligible.
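A sketch of how the predictor table for this model can be assembled is shown below (the mixed-effects beta regression itself, fit with the tools described by Brooks et al., 2017, is not reproduced here); the column names are hypothetical.

```python
import pandas as pd

def regression_table(dist_long, intelligibility):
    """Per-talker mean distance-from-L1 merged with intelligibility scores.

    dist_long       : DataFrame with columns [l2_talker, l1_talker, sentence, distance]
    intelligibility : DataFrame with columns [l2_talker, snr, accuracy]
    """
    mean_dist = (dist_long.groupby("l2_talker", as_index=False)["distance"]
                 .mean()
                 .rename(columns={"distance": "ps_distance"}))
    df = intelligibility.merge(mean_dist, on="l2_talker")
    for col in ["ps_distance", "snr"]:           # center continuous predictors
        df[col + "_c"] = df[col] - df[col].mean()
    return df
```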
Beta regression results: Prediction of intelligibility by distance in the perceptual similarity space (at transformer layer 12), controlling for SNR (note: β, standard error, and χ² rounded to the 2nd decimal place; p-value rounded to the 5th decimal place).
| Factor | β | s.e. β | χ²(1) | p |
|---|---|---|---|---|
| SNR | 0.20 | 0.02 | 76.35 | <0.000 01 |
| Perceptual similarity space distance | −8.09 | 1.77 | 16.95 | <0.000 04 |
C. Prediction of intelligibility compared to pre-defined features
As points of comparison for the strength of our distance metric as a predictor of intelligibility, we performed a second set of regressions that included traditional acoustic analysis measures of the global properties of talkers. These factors were included along with our perceptual similarity space distance measure into a mixed-effects beta regression (again including SNR and a random by-subject intercept as controls). Table II shows the beta regression results using raw values of the predictors. We again find a significant effect of distance in the perceptual similarity space; L2 speakers that are more distant from L1 controls are less intelligible. As shown in Table III, a beta regression using standardized predictors shows that perceptual similarity space distance exhibits a larger effect size than any other significant predictor.
Beta regression results: Prediction of intelligibility by the full set of (unscaled) factors (note: β, standard error, and χ² rounded to the 2nd decimal place; p-value rounded to the 5th decimal place).
| Factor | β | s.e. β | χ²(1) | p |
|---|---|---|---|---|
| SNR | 0.21 | 0.02 | 44.54 | <0.000 01 |
| Perceptual Similarity Space Dist. | −10.69 | 2.01 | 22.14 | <0.000 01 |
| MFCC Dist. | −0.04 | 0.03 | 2.02 | 0.154 99 |
| Num. Pauses | 0.03 | 0.01 | 14.07 | 0.000 18 |
| Artic. Rate | −0.54 | 0.55 | 0.94 | 0.332 21 |
| Mean Pitch (Hz) | 0 | 0 | 0.43 | 0.513 82 |
| Mean Coeff. Var. Pitch | 0.03 | 2.29 | 0 | 0.988 14 |
| Vowel Disp. (Bark) | −0.29 | 0.42 | 0.46 | 0.495 41 |
| Syl. Reduct. Rate | 4.13 | 2.88 | 2.01 | 0.155 91 |
Beta regression results: Prediction of intelligibility by the full set of standardized predictors; all predictors are scaled to yield a mean of 0 and standard deviation of 1 (note: β, standard error, and χ² rounded to the 2nd decimal place; p-value rounded to the 5th decimal place).
| Factor | Standardized β | s.e. β | χ²(1) | p |
|---|---|---|---|---|
| SNR | 0.81 | 0.09 | 44.54 | <0.000 01 |
| Perceptual Similarity Space Dist. | −0.81 | 0.15 | 22.14 | <0.000 01 |
| MFCC Dist. | −0.32 | 0.22 | 2.02 | 0.154 99 |
| Num. Pauses | 0.65 | 0.16 | 14.07 | 0.000 18 |
| Artic. Rate | −0.15 | 0.16 | 0.94 | 0.332 21 |
| Mean Pitch (Hz) | 0.07 | 0.11 | 0.43 | 0.513 82 |
| Mean Coeff. Var. Pitch | 0 | 0.1 | 0 | 0.988 15 |
| Vowel Disp. (Bark) | −0.08 | 0.11 | 0.46 | 0.495 41 |
| Syl. Reduct. Rate | 0.34 | 0.24 | 2.01 | 0.155 91 |
Another means of assessing the degree to which perceptual similarity space distance contributes to our ability to account for talker variation is to examine the degree to which model predictions are correlated with the observed data. Cribari-Neto and Zeileis (2010) provide a pseudo-r2 measure for beta regression models without random effects. This quantifies the degree to which statistical model predictions are correlated with the observed data (in logits); as with standard r2 this varies from 0 (no correlation) to 1 (perfect correlation). (The measure is “pseudo” to distinguish it from the r2 statistic used to quantify the performance of linear regressions.) We calculated this measure for two models. Our baseline model included SNR and all of our acoustic measures; this characterizes the ability of these baseline measures to account for participant performance, without allowing for any talker-specific random variation in performance. This yielded a pseudo-r2 of 0.44. In contrast, a model that included all of the baseline predictors plus perceptual similarity space distance (calculated at transformer layer 12) yielded a much higher pseudo-r2, 0.61. This confirms that perceptual similarity space distance dramatically increases our ability to model variation in intelligibility.
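A minimal sketch of this pseudo-r2, following the description above (the squared correlation between the model's linear predictor and the logit of the observed accuracies), is shown below; the clipping constant is an added safeguard against infinite logits, not part of the original definition.

```python
import numpy as np

def pseudo_r2(linear_predictor, observed_proportion, eps=1e-6):
    """Squared correlation between the linear predictor and logit(observed)."""
    y = np.clip(np.asarray(observed_proportion, dtype=float), eps, 1 - eps)
    logit_y = np.log(y / (1 - y))
    return np.corrcoef(np.asarray(linear_predictor, dtype=float), logit_y)[0, 1] ** 2
```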
Finally, as shown in Fig. 7, our perceptual similarity space distance metric made the largest unique contribution to the increase in likelihood2 of the regression model (n.b. scaling of predictors has no impact on the contribution of a predictor to likelihood). Our approach substantially improves our ability to capture the perceptual consequences of variation in L2 speakers.
Change in log likelihood resulting from excluding each factor from the full beta regression model, relativized to the sum total of changes in log likelihood. Measures include: distance in the learned perceptual similarity space; total number of pauses; rate at which syllables are reduced (total number of syllables in actual speech/orthographic syllables); mean distance (measured by DTW) of sentences' MFCC representations; mean acoustic dispersion (measured in Bark) of vowels' first and second formants (F1/F2) at midpoint; articulation rate (speech rate excluding pauses); mean pitch (F0; Hertz); variation in pitch normalized to mean pitch.
We briefly note that the number of pauses made a substantial additional contribution to the model of intelligibility (albeit smaller in magnitude than perceptual similarity space). Inspection of the data revealed three clear outliers with pause rates substantially higher than other speakers in the dataset. When these speakers were excluded, the effect of the number of pauses was no longer significant (p < 0.11). Further research with a larger set of L2 speakers would allow a more careful examination of this effect. Note that excluding the four speakers with the three greatest distances in perceptual similarity space did not qualitatively change the results (in fact, the p-value decreased to less than 0.000 01).
D. Are the results limited to particular t-SNE parameters?
1. Does a three-dimensional projection preserve critical information?
Our analysis uses t-SNE to project the representation to a three-dimensional trajectory. This dimensionality was selected based on inspection of a small set of sample trajectories at two and three dimensions. Three dimensions appeared to better preserve critical distinctions between trajectories while remaining a maximally simple representation for visualization and computation. To further investigate this choice, we repeated the analyses, examining trajectories without any t-SNE projection (i.e., using the 768 dimensions in the original transformer embedding space). Without any dimensionality reduction, distance ratios qualitatively replicated the primary analyses, showing that within-L2 talker variation significantly exceeded within-L1 talker variation. However, the distance ratios were smaller than in the primary analysis. At transformer layer 12, the mean ratio of Mandarin L1 distances/English L1 distances was 1.26 (95% CI 1.24, 1.29), compared to 2.25 using t-SNE. The Korean L1/English L1 mean ratio was 1.18 (95% CI 1.14, 1.22) vs 1.88 using t-SNE, and the Spanish L1/English L1 ratio was 1.02 (95% CI 1.00, 1.04) vs 1.14 using t-SNE.
When distance was calculated without any dimensionality reduction, L2 talkers whose trajectories were farther away from L1 talkers were significantly less intelligible (mixed-effects beta regressions, structured as in the main analysis, all showed ps <0.05 except for transformer layer 2). We also repeated our analysis comparing our distance metric to traditional acoustic measures of the global properties of talkers. As in the main analysis, distance accounted for the greatest relative change in likelihood at each transformer layer. However, as shown in Table IV, the relative gains for the t-SNE projection vs full dimensionality show different patterns across layers. The t-SNE projection made a greater relative contribution to changes in likelihood at lower layers; at upper layers, full dimensionality showed greater contributions than the t-SNE projection.
Relative change in log likelihood when excluding perceptual similarity space distance from a model including traditional global acoustic features, calculated using either the full dimensionality of each transformer layer or the t-SNE projection of that layer.
| Transformer layer | Relative improvement in likelihood (full dimensionality) | Relative improvement in likelihood (t-SNE) |
|---|---|---|
| 1 | 0.30 | 0.42 |
| 2 | 0.29 | 0.45 |
| 3 | 0.35 | 0.51 |
| 4 | 0.38 | 0.52 |
| 5 | 0.42 | 0.51 |
| 6 | 0.46 | 0.51 |
| 7 | 0.49 | 0.50 |
| 8 | 0.53 | 0.48 |
| 9 | 0.56 | 0.48 |
| 10 | 0.58 | 0.48 |
| 11 | 0.53 | 0.53 |
| 12 | 0.53 | 0.53 |
The observed difference in performance can be explained by the type of information encoded in the different transformer layers. The lower layers, being closer to the convolutional acoustic encoder, tend to contain more low-level acoustic features of the input; higher layers capture more abstract, task-driven relationships between utterances (Hsu et al., 2021; Pasad et al., 2023b). By emphasizing relationships that hold between the utterances of the talkers in the analysis, while preserving the overall relationships between these utterances, t-SNE's dimensionality reduction increases performance at lower layers. At later layers, there is little to gain from reduction, as the transformer layer encodes the relevant abstract relationships between talkers.
Critically, these findings show that the three-dimensional t-SNE projection used in the main analysis captures the relevant distinctions between talkers that are encoded in the full dimensionality space. t-SNE allows us to efficiently capitalize on the insights HuBERT provides into the structure of the perceptual similarity space.
2. Results are robust when t-SNE parameters are set independently of test data
The t-SNE algorithm has a number of parameters and is stochastic (with the stochastic process driven by a particular seed value). The parameters and seed value used previously were determined based on the best fit to the data. The data were divided into development and test sets to examine whether the results were limited to these particular parameters.
We first fit parameters on the Korean data by maximizing the ability of our perceptual similarity space distance measure to account for variation in intelligibility. Specifically, we varied the following t-SNE parameters3: learning rate (100, 200, or "auto"), early exaggeration (cluster tightness; 2, 12), number of iterations (300, 1000, 2000), and initialization method (random or PCA). The performance of each combination of parameters was measured by the gain in model likelihood when adding perceptual similarity space distance to the beta regression predicting intelligibility from SNR. Three random seeds were used at each parameter combination; these did not affect the ranking of the relative fits of the parameter combinations. Furthermore, we tested different perplexity values (ranging from 5 to 50) and, using beta regressions, found that values above 35 resulted in clearly better fits to the Korean intelligibility data (performance above 35 was roughly equivalent). Because there were many possible parameter combinations, we began by fixing cluster tightness and initialization method; the best values were 12.0 for cluster tightness and PCA initialization. Hence, in this section, we report the impact of the remaining parameters.
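The development-set sweep can be sketched as follows with scikit-learn's TSNE; score_on_korean stands in for the (hypothetical here) routine that refits the intelligibility regression and returns the gain in model likelihood, and only two of the three random seeds are listed because only those appear in Table V. Recent scikit-learn releases rename n_iter to max_iter.

```python
import itertools
from sklearn.manifold import TSNE

LEARNING_RATES = [100, 200, "auto"]
EARLY_EXAGGERATIONS = [2, 12]            # "cluster tightness"
N_ITERS = [300, 1000, 2000]
INITS = ["random", "pca"]
SEEDS = [0, 992738]                      # two of the three seeds used

def sweep(stacked_embeddings, score_on_korean):
    """Rank t-SNE settings by likelihood gain on the Korean development data."""
    results = []
    for lr, ee, n_iter, init in itertools.product(
            LEARNING_RATES, EARLY_EXAGGERATIONS, N_ITERS, INITS):
        for seed in SEEDS:
            tsne = TSNE(n_components=3, perplexity=40, learning_rate=lr,
                        early_exaggeration=ee, n_iter=n_iter, init=init,
                        random_state=seed)
            trajectories = tsne.fit_transform(stacked_embeddings)
            results.append(((lr, ee, n_iter, init, seed),
                            score_on_korean(trajectories)))
    return sorted(results, key=lambda r: r[1], reverse=True)
```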
After selecting the best results for the Korean talkers, we examined performance on the independent test set: the Chinese and Spanish data. As shown in Table V, the inter-talker variability results were qualitatively similar to the primary analysis; both Chinese and Spanish talkers showed within-L2: within-L1 distance ratios significantly greater than 1. The values for Chinese talkers were slightly lower than in the primary analysis, while the Spanish values fell well within the 95% confidence interval for the primary analysis.
Ratio of within-L2 talker distances to within-L1 talker distances (transformer layer 12; 95% confidence interval shown in parentheses), separated by L1 language of L2 talkers, based on parameters optimized for the Korean data. Rows show results for different parameterizations of the t-SNE algorithm for comparison against the results for the primary analysis.
| Language | Random seed value | Iterations | L2:L1 inter-talker distance ratio |
|---|---|---|---|
| Chinese | 992 738 | 1000 | 2.18 (2.09, 2.26) |
| Chinese | 0 | 300 | 2.09 (2.01, 2.19) |
| Chinese | 992 738 | 300 | 2.08 (2, 2.17) |
| Chinese | Primary analysis | | 2.25 (2.15, 2.35) |
| Spanish | 992 738 | 1000 | 1.17 (1.11, 1.23) |
| Spanish | 0 | 300 | 1.16 (1.1, 1.21) |
| Spanish | 992 738 | 300 | 1.16 (1.1, 1.21) |
| Spanish | Primary analysis | | 1.14 (1.08, 1.20) |
As shown in Table VI, the intelligibility results were qualitatively similar to the primary analysis; distance in the perceptual similarity space significantly predicted intelligibility. The effect size estimates fell within the 95% confidence interval estimated using parameters in the primary analysis. The results reported in the primary analysis are replicated when there is a clear separation between data used to parameterize the t-SNE projection and data used to evaluate its performance.
Beta regression results: Prediction of intelligibility by perceptual similarity space distance at transformer layer 12, Chinese and Spanish talkers only, based on parameters optimized for the Korean data. Rows show results for different parameterizations of the t-SNE algorithm for comparison against the results using parameters from the primary analysis (note: β, standard error, and χ² rounded to the 2nd decimal place; p-value rounded to the 5th decimal place).
| Random seed value | Iterations | β (perceptual similarity space distance) | s.e. β | χ²(1) | p |
|---|---|---|---|---|---|
| 992 738 | 1000 | −11.98 | 2.49 | 16.53 | 0.000 05 |
| 0 | 300 | −12.75 | 2.65 | 16.51 | 0.000 05 |
| 992 738 | 300 | −12.75 | 2.7 | 16.05 | 0.000 06 |
| Primary analysis | | −8.09 | 1.77 | 16.95 | 0.000 04 |
VI. DISCUSSION
These results provide the first evidence that a perceptual similarity space based on self-supervised speech representations can provide new insight into phonetic differences between speech samples. Critically, in contrast to earlier work, it does so without requiring pre-defined acoustic features or cues, nor requiring a transcript of the sentence. Furthermore, the existence of a general method (self-supervised learning) for acquiring the structure of the perceptual similarity space means that, provided sufficient training data is available, this same approach could be applied to any language. The technique is also not limited to examining differences related to linguistic backgrounds; it may provide a sensitive measure of acoustic differences between speech samples from individuals with and without neurological impairments impacting speech motor control (Hitczenko et al., 2021; Martínez-Nicolás et al., 2021), and between speech samples presented to listeners through various signal delivery systems (e.g., hearing aids and broadcasting devices).
Recent studies have explored perceptual similarity spaces as a model of first language learning. Speech discrimination abilities change over the first year of life, such that infants show a decreased ability to discriminate sounds that are not contrastive in their first language and an increased ability to discriminate sounds that are contrastive. Previous work has modeled such effects as reflecting the acquisition of speech sound categories over traditionally defined acoustic cues (see Feldman et al., 2021, for a review). However, recent studies have provided evidence that these developmental data are better captured as reflecting the acquisition of a perceptual similarity space, shaped by the perceptual experiences of the infant (Feldman et al., 2021; Matusevych et al., 2023; Schatz et al., 2021). This suggests that both infant and adult speech perception may be better modeled via a perceptual similarity space.
In addition to the global distance between trajectories in the perceptual similarity space, in future work we plan to examine the particular regions of the sound signal where there are points of greatest divergence between signals (e.g., the pronunciation of “wore” in Fig. 1). Given more granular data concerning the location and nature of word errors (vs the whole-sentence accuracy data analyzed here), we can examine whether the divergence in the perceptual similarity space can also predict the location and nature of listener misperceptions. This more granular analysis will also allow a closer analysis of the signal itself, helping us determine what acoustic properties self-supervised learning has converged upon as functionally meaningful—specifically, the properties relevant for accurate speech recognition.
It is important to note that this approach is necessarily limited by the information in the self-supervised learning system's training data. For example, a system provided only with information about what was said but not the cultural context in which communication occurs will presumably be insensitive to the social meaning of acoustic variation (which can impact speech intelligibility and interpretation; Babel and Russell, 2015). Further testing of this approach with a varied sample of talkers will help reveal limitations of our perceptual similarity space and suggest areas for future development of our data sources and computational architecture.
By providing an automatic, unsupervised, computational approach to modeling speech variation and intelligibility, our perceptual similarity space approach has the potential to improve speech-based technologies. A precise understanding of which types of acoustic signals a speech recognition system is likely to fail on provides a clear path toward improving the technology (e.g., through targeted training on particular types of L2 talker errors). Additionally, the ability of our distance measure to predict the intelligibility of foreign-accented speech without requiring data from human listeners holds great promise for a wide range of applications, including the design of architectural spaces, assistive hearing devices, and public address systems, all of which could benefit from optimization of speech understanding for a diverse array of talkers and listeners.
ACKNOWLEDGMENTS
This work was supported by the National Science Foundation (Grant No. DHR2219843) and the Binational Science Foundation (Grant No. 2022618).
AUTHOR DECLARATIONS
Conflict of Interest
Aside from the funding noted above, the authors have no conflicts to disclose.
DATA AVAILABILITY
The data that support the findings of this study are freely available online via SpeechBox (https://speechbox.linguistics.northwestern.edu/#!/home).
APPENDIX A: PERFORMANCE ACROSS LAYERS
Similar performance across our key results (greater inter-talker variability for L2 vs L1 groups; prediction of intelligibility by distance in the perceptual similarity space) was observed across HuBERT transformer layers.
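For reference, the sketch below shows one way to obtain per-layer representations from a HuBERT model for a single utterance using the Hugging Face transformers library. The checkpoint name, the audio file, and the assumption of a 16 kHz mono recording are illustrative; this is not necessarily the exact extraction pipeline used in our analyses.

```python
# Sketch: extract per-layer HuBERT hidden states for one utterance, the kind
# of input to the layer-wise analyses in this appendix. Checkpoint and file
# names are assumptions for illustration.
import torch
import soundfile as sf
from transformers import AutoFeatureExtractor, HubertModel

checkpoint = "facebook/hubert-base-ls960"  # assumed 12-layer base model
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = HubertModel.from_pretrained(checkpoint).eval()

waveform, sr = sf.read("sentence.wav")  # hypothetical 16 kHz mono recording
inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states[0] is the CNN front-end output; indices 1..12 correspond
# to the transformer layers referenced in Tables VII and VIII.
layer_reps = {layer: out.hidden_states[layer].squeeze(0) for layer in range(1, 13)}
print({k: tuple(v.shape) for k, v in layer_reps.items()})  # (frames, 768) per layer
```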
1. Inter-talker variability
Table VII shows that the ratio of inter-talker variation (L2 relative to L1) is consistently greater than 1 for all groups at every layer. The ratio is greatest for the Chinese- and Korean-L1 talker data (as noted in the text, this likely reflects the higher variability of these groups relative to the Spanish Heritage Speakers).
Ratio of within-L2 talker distances to within-L1 talker distances, separated by L1 language of L2 talkers and the transformer layer. 95% confidence interval, estimated via bootstrap, shown in parentheses.
Layer | Chinese | Korean | Spanish |
---|---|---|---|
1 | 1.62 (1.55, 1.69) | 1.29 (1.17, 1.41) | 1.83 (1.77, 1.89) |
2 | 1.85 (1.76, 1.94) | 1.4 (1.31, 1.5) | 1.2 (1.14, 1.27) |
3 | 1.87 (1.79, 1.94) | 1.45 (1.37, 1.55) | 1.18 (1.12, 1.23) |
4 | 1.97 (1.89, 2.06) | 1.57 (1.47, 1.67) | 1.18 (1.11, 1.24) |
5 | 2.1 (2, 2.2) | 1.77 (1.65, 1.88) | 1.22 (1.14, 1.29) |
6 | 2.17 (2.04, 2.29) | 1.81 (1.69, 1.94) | 1.25 (1.17, 1.33) |
7 | 2.2 (2.07, 2.32) | 1.88 (1.75, 2.02) | 1.24 (1.16, 1.32) |
8 | 2.15 (2.02, 2.27) | 1.85 (1.71, 1.99) | 1.22 (1.15, 1.3) |
9 | 2.17 (2.04, 2.31) | 1.81 (1.67, 1.94) | 1.2 (1.13, 1.28) |
10 | 2.18 (2.05, 2.31) | 1.79 (1.67, 1.92) | 1.2 (1.13, 1.28) |
11 | 2.29 (2.17, 2.41) | 1.92 (1.8, 2.06) | 1.19 (1.13, 1.26) |
12 | 2.25 (2.15, 2.35) | 1.88 (1.76, 2.01) | 1.14 (1.08, 1.2) |
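The following sketch illustrates the kind of computation behind the ratios in Table VII: the mean pairwise distance among L2 talkers divided by the mean pairwise distance among L1 talkers, with a bootstrap confidence interval over talkers. Representing each talker by a single point in the similarity space (rather than a set of sentence trajectories) is a simplification made here for illustration.

```python
# Sketch: L2-to-L1 inter-talker variability ratio with a bootstrap CI.
# Talker coordinates below are hypothetical placeholders.
import numpy as np
from itertools import combinations

def mean_pairwise_distance(points: np.ndarray) -> float:
    return float(np.mean([np.linalg.norm(points[i] - points[j])
                          for i, j in combinations(range(len(points)), 2)]))

def variability_ratio(l2_points, l1_points, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    point_ratio = mean_pairwise_distance(l2_points) / mean_pairwise_distance(l1_points)
    boots = []
    for _ in range(n_boot):
        l2_sample = l2_points[rng.integers(0, len(l2_points), len(l2_points))]
        l1_sample = l1_points[rng.integers(0, len(l1_points), len(l1_points))]
        boots.append(mean_pairwise_distance(l2_sample) / mean_pairwise_distance(l1_sample))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point_ratio, (lo, hi)

# Hypothetical talker coordinates in a 2-D similarity space.
rng = np.random.default_rng(1)
l1 = rng.normal(scale=1.0, size=(20, 2))
l2 = rng.normal(scale=1.6, size=(20, 2))
print(variability_ratio(l2, l1))  # a ratio > 1 indicates greater L2 dispersion
```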
2. (Relative) prediction of intelligibility by perceptual similarity space distance at multiple layers
As shown in Table VIII, variation in L1 listeners' word recognition accuracy across L2 talkers was significantly predicted by the mean distance (in the perceptual similarity space) of each L2 talker's sentence trajectories from the respective L1 standards. L2 talkers whose speech was more distant from the L1 standards had significantly lower word recognition accuracy.
Beta regression results: Prediction of intelligibility (word recognition accuracy) by distance in the perceptual similarity space at each transformer layer. At each layer, more distant talkers show significantly lower intelligibility scores (note: β, standard error, and χ²(1) rounded to the 2nd decimal place; p-values rounded to the 5th decimal place).
Transformer layer | β (perceptual similarity space distance) | s.e. β | χ²(1) | p |
---|---|---|---|---|
1 | −10.24 | 2.64 | 11.85 | 0.000 58 |
2 | −9.56 | 2.51 | 11.49 | 0.000 7 |
3 | −11.02 | 2.49 | 14.55 | 0.000 14 |
4 | −10.82 | 2.54 | 13.75 | 0.000 21 |
5 | −11.21 | 2.75 | 12.8 | 0.000 35 |
6 | −11.93 | 3.24 | 10.88 | 0.000 97 |
7 | −12.72 | 3.58 | 10.27 | 0.001 35 |
8 | −13.7 | 4.47 | 8 | 0.004 69 |
9 | −13.1 | 4.45 | 7.47 | 0.006 26 |
10 | −13.17 | 4.22 | 8.25 | 0.004 09 |
11 | −12.29 | 2.94 | 13.36 | 0.000 26 |
12 | −10.09 | 2.1 | 16.48 | 0.000 05 |
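As an illustration of the per-layer analysis summarized in Table VIII, the sketch below fits a beta regression of word recognition accuracy (a proportion strictly between 0 and 1) on similarity-space distance and compares it to an intercept-only model with a χ²(1) likelihood-ratio test. It uses statsmodels' BetaModel on synthetic data with hypothetical column names, and it omits the additional covariates included in the full model reported in the paper.

```python
# Sketch: beta regression of intelligibility on similarity-space distance,
# plus a chi-squared(1) likelihood-ratio test. Data are synthetic.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.othermod.betareg import BetaModel

rng = np.random.default_rng(2)
df = pd.DataFrame({"distance": rng.uniform(0.2, 1.0, size=60)})  # similarity-space distance
df["accuracy"] = np.clip(0.9 - 0.4 * df["distance"] + rng.normal(0, 0.05, 60), 0.01, 0.99)

full = BetaModel.from_formula("accuracy ~ distance", data=df).fit()
reduced = BetaModel.from_formula("accuracy ~ 1", data=df).fit()

lr = 2 * (full.llf - reduced.llf)      # likelihood-ratio statistic
p = stats.chi2.sf(lr, df=1)
print(full.params["distance"], full.bse["distance"], lr, p)
```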
As shown in Table IX, distance within the perceptual similarity space accounted for the largest proportion of the change in likelihood at every layer. This proportion was greatest at the topmost layer and never dropped below 0.51. Note as well that the proportional contribution of each factor beyond the two dominant measures (perceptual similarity space distance and number of pauses) never exceeds 0.06 after the second layer.
Change in log likelihood resulting from excluding each factor from the full beta regression model, relativized to the sum total of changes in likelihood (note: rounded to 2nd decimal place).
Transformer layer | Perceptual dist. | Num. pauses | Syl. reduct. rate | MFCC dist. | Artic. rate | Vowel disp. (Bark) | Mean pitch (Hz) | Mean coeff. var. pitch |
---|---|---|---|---|---|---|---|---|
1 | 0.53 | 0.31 | 0.03 | 0.01 | 0.06 | 0.04 | 0.02 | 0.02 |
2 | 0.52 | 0.32 | 0.02 | 0.01 | 0.05 | 0.04 | 0.03 | 0.01 |
3 | 0.55 | 0.31 | 0.01 | 0.01 | 0.02 | 0.06 | 0.04 | 0.01 |
4 | 0.54 | 0.32 | 0 | 0 | 0.01 | 0.05 | 0.05 | 0.01 |
5 | 0.54 | 0.34 | 0 | 0.01 | 0.01 | 0.05 | 0.05 | 0 |
6 | 0.54 | 0.38 | 0.01 | 0.01 | 0 | 0.04 | 0.03 | 0 |
7 | 0.53 | 0.38 | 0 | 0 | 0 | 0.05 | 0.03 | 0 |
8 | 0.52 | 0.4 | 0.01 | 0.01 | 0 | 0.05 | 0.02 | 0 |
9 | 0.51 | 0.41 | 0.01 | 0.01 | 0 | 0.04 | 0.02 | 0 |
10 | 0.51 | 0.39 | 0 | 0 | 0 | 0.06 | 0.02 | 0 |
11 | 0.54 | 0.35 | 0 | 0 | 0 | 0.06 | 0.04 | 0.01 |
12 | 0.57 | 0.31 | 0.01 | 0.02 | 0.01 | 0.04 | 0.05 | 0 |
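The relativized contributions in Table IX can be computed by refitting the full model with each predictor removed and normalizing the resulting drops in log likelihood, as sketched below. The predictor names mirror the table columns; the function assumes a data frame with one row per L2 talker and an accuracy column, and it is an illustration rather than the exact procedure used for the published values.

```python
# Sketch: drop-one log-likelihood contributions, normalized across predictors.
# Assumes a data frame `df` with an "accuracy" column and the columns below.
from statsmodels.othermod.betareg import BetaModel

predictors = ["perceptual_dist", "num_pauses", "syl_reduction_rate", "mfcc_dist",
              "artic_rate", "vowel_dispersion", "mean_pitch", "pitch_coeff_var"]

def relative_contributions(df):
    full = BetaModel.from_formula("accuracy ~ " + " + ".join(predictors), data=df).fit()
    drops = {}
    for name in predictors:
        rest = [p for p in predictors if p != name]
        reduced = BetaModel.from_formula("accuracy ~ " + " + ".join(rest), data=df).fit()
        drops[name] = full.llf - reduced.llf   # loss in log likelihood without this factor
    total = sum(drops.values())
    return {name: drop / total for name, drop in drops.items()}
```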
APPENDIX B: RESULTS FROM ALTERNATIVE STATE-OF-THE-ART DIMENSIONALITY REDUCTION METHODS
1. UMAP
While t-SNE (reported in the main text) focuses on preserving local distances, UMAP (McInnes , 2018) prioritizes the preservation of global distances. This difference did not impact the results. In these experiments, we used the three best-performing combinations of parameters: (I) initialization: PCA, minimum distance: 0.1, number of neighbors: 15; (II) initialization: spectral, minimum distance: 0.001, number of neighbors: 15; (III) initialization: PCA, minimum distance: 0.001, number of neighbors: 50. As shown in Table X, and similar to t-SNE, the ratio of inter-talker variation is consistently greater than 1 for all groups. As shown in Table XI, L2 talkers whose speech was more distant from the L1 standards had significantly lower word recognition accuracy. Finally, Table XII shows that distance within the perceptual similarity space accounted for the largest proportion of the change in likelihood at each parameter setting.
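The sketch below shows how these three parameter groups might be set up with the umap-learn package, applied to a placeholder matrix of embeddings. Because older releases of umap-learn accept an array rather than the string "pca" as an initialization, the PCA initialization is passed here as a precomputed projection; the input matrix and variable names are illustrative.

```python
# Sketch: the three UMAP parameter groups described above, applied to a
# placeholder embedding matrix (rows = frames pooled across talkers).
import numpy as np
import umap
from sklearn.decomposition import PCA

X = np.random.default_rng(3).normal(size=(2000, 768))   # placeholder embeddings
pca_init = PCA(n_components=2).fit_transform(X)          # stands in for "initialization: PCA"

param_groups = {
    "I":   dict(init=pca_init,   min_dist=0.1,   n_neighbors=15),
    "II":  dict(init="spectral", min_dist=0.001, n_neighbors=15),
    "III": dict(init=pca_init,   min_dist=0.001, n_neighbors=50),
}
embeddings = {name: umap.UMAP(n_components=2, random_state=0, **params).fit_transform(X)
              for name, params in param_groups.items()}
print({name: emb.shape for name, emb in embeddings.items()})
```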
Ratio of within-L2 talker distances to within-L1 talker distances, estimated by UMAP. Values are separated by L1 language of L2 talkers and UMAP parameters. 95% confidence interval, estimated via bootstrap, shown in parentheses.
Language | UMAP parameter group | L2:L1 inter-talker distance ratio |
---|---|---|
Chinese | I | 2.64 (2.49, 2.8) |
Chinese | II | 2.72 (2.57, 2.89) |
Chinese | III | 2.65 (2.49, 2.81) |
Korean | I | 1.88 (1.75, 2.01) |
Korean | II | 1.9 (1.77, 2.03) |
Korean | III | 1.88 (1.75, 2.01) |
Spanish | I | 1.3 (1.19, 1.39) |
Spanish | II | 1.31 (1.21, 1.42) |
Spanish | III | 1.3 (1.21, 1.4) |
Beta regression results: Prediction of intelligibility (word recognition accuracy) by distance in the perceptual similarity space (as estimated by UMAP at several parameter settings). More distant talkers show significantly lower intelligibility scores (note: β, standard error, and χ²(1) rounded to the 2nd decimal place; p-values rounded to the 5th decimal place).
UMAP parameter group | β (perceptual similarity space distance) | s.e. β | χ²(1) | p |
---|---|---|---|---|
I | −11.72 | 2.97 | 12.18 | 0.000 48 |
II | −11.42 | 2.97 | 11.66 | 0.000 64 |
III | −11.26 | 2.92 | 11.7 | 0.000 63 |
Change in log-likelihood resulting from excluding each factor from the full beta regression model, relativized to the sum total of changes in likelihood, distance estimated by UMAP (note: rounded to 2nd decimal place).
UMAP parameter group | Perceptual dist. | Num. pauses | Syl. reduct. rate | MFCC dist. | Artic. rate | Vowel disp. (Bark) | Mean pitch (Hz) | Mean coeff. var. pitch |
---|---|---|---|---|---|---|---|---|
I | 0.53 | 0.01 | 0.35 | 0.01 | 0.05 | 0 | 0.04 | 0 |
II | 0.52 | 0.01 | 0.36 | 0 | 0.05 | 0 | 0.05 | 0 |
III | 0.53 | 0.01 | 0.36 | 0 | 0.04 | 0 | 0.04 | 0 |
2. Kernel-PCA
Kernel-PCA (Schölkopf , 1998) represents another method for dimensionality reduction; it preserves variance in a kernel-induced feature space. With one exception, this difference did not impact the results. As shown in Table XIII, and similar to t-SNE and UMAP, the ratio of inter-talker variation is consistently greater than 1 for the Chinese- and Korean-L1 talker groups. However, for the Spanish Heritage talkers, the ratio was not significantly different from 1. Recall that in all other analyses this group showed ratios closest to 1, reflecting the early age of acquisition of English among Spanish Heritage Speakers. The other results were qualitatively similar to those observed with the other dimensionality reduction techniques (see Tables XIV and XV).
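A minimal sketch of this alternative using scikit-learn's KernelPCA is shown below, again on a placeholder embedding matrix. The choice of an RBF kernel is an assumption made for illustration; the analysis here only requires that the projection preserve variance in some kernel-induced space.

```python
# Sketch: Kernel-PCA projection of a placeholder embedding matrix to the
# low-dimensional coordinates used for the distance analyses.
import numpy as np
from sklearn.decomposition import KernelPCA

X = np.random.default_rng(4).normal(size=(2000, 768))   # placeholder embeddings
kpca = KernelPCA(n_components=2, kernel="rbf")           # RBF kernel assumed for illustration
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # (2000, 2)
```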
Ratio of within-L2 talker distances to within-L1 talker distances, estimated by Kernel-PCA. Values are separated by L1 language of L2 talkers. 95% confidence interval, estimated via bootstrap, shown in parentheses.
Language | L2:L1 inter-talker distance ratio |
---|---|
Chinese | 1.54 (1.5, 1.58) |
Korean | 1.41 (1.35, 1.47) |
Spanish | 1 (0.98, 1.03) |
Beta regression results: Prediction of intelligibility (word recognition accuracy) by distance in the perceptual similarity space (as estimated by Kernel-PCA). More distant talkers show significantly lower intelligibility scores (note: β, standard error, and χ²(1) rounded to the 2nd decimal place; p-value rounded to the 5th decimal place).
β (perceptual similarity space distance) | s.e. β | χ²(1) | p |
---|---|---|---|
−15.37 | 2.46 | 23.74 | <0.000 01 |
Change in log likelihood resulting from excluding each factor from the full beta regression model, relativized to the sum total of changes in likelihood, distance estimated by Kernel-PCA (note: rounded to 2nd decimal place).
Perceptual dist. | Num. pauses | Syl. reduct. rate | MFCC dist. | Artic. rate | Vowel disp. (Bark) | Mean pitch (Hz) | Mean coeff. var. pitch |
---|---|---|---|---|---|---|---|
0.66 | 0.01 | 0.2 | 0.01 | 0.06 | 0.01 | 0.04 | 0 |
Note that this measure is not “variance explained” but rather the degree to which each predictor makes a unique contribution to the model likelihood.
More information on the parameters can be found in the package documentation: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html.