This study investigates the performance of Whisper's automatic speech recognition (ASR) system across diverse native and non-native English accents. Results reveal superior recognition of American English relative to British and Australian accents, with comparable performance for Canadian English. Overall, native English accents are recognised more accurately than non-native accents. Exploring connections between speaker traits [sex, native language (L1) typology, and second language (L2) proficiency] and word error rate uncovers notable associations. Furthermore, Whisper performs better on read speech than on conversational speech, with this effect moderated by speaker sex. The implications of these findings are discussed.
1. Introduction
The widespread adoption of video conferencing and virtual meeting platforms in recent years, particularly during the COVID-19 pandemic, has amplified the need for inclusive and accessible communication systems. To facilitate comprehension for diverse audiences, automatic speech recognition (ASR) technology has been leveraged to provide live transcription services offered by multiple providers. Although some studies have reported advancements in state-of-the-art ASR models (e.g., Xiong et al., 2018), research has consistently shown that the performance of ASR systems varies as a function of speaker characteristics such as dialect (Wheatley and Picone, 1991; Meyer et al., 2020), gender (Tatman and Kasten, 2017; Tatman, 2017), racial background (Koenecke et al., 2020; Martin and Tang, 2020), and the linguistic structure of the speaker's native language (L1; Chan et al., 2022).
For instance, Koenecke et al. (2020) evaluated ASR systems sold by Amazon, Google, IBM, Apple, and Microsoft and highlighted vast disparities in word error rate (WER), the standard measure of ASR performance, between black and white speakers of American English. Moreover, recognition rates for non-native speech are generally poor (Bouselmi et al., 2005; Baykal and Erdogan, 2013; Knill et al., 2018), largely because speech recognisers are trained on predominantly native, standard varieties of the target language. The evaluation of ASR performance on non-native English accents remains a critical yet understudied area. It is of paramount importance, considering the difficulty of generating accurate transcriptions in the face of signal degradation or unfamiliarity with specific English variants. Markl (2022) examined commercial ASR systems and suggested that they may "reproduce structural oppression when they perform worse for marginalised language communities," noting that speakers of stigmatised varieties of English spoken in the British Isles are particularly vulnerable to such representational harms. Furthermore, the impact of speakers' L1 on the comprehensibility of their second language (L2), English, combined with variables such as proficiency, adds a further layer of complexity to the understanding of speakers from diverse language backgrounds.
It is clear from previous studies that the challenges ASR systems face in handling accent diversity are intricate and multifaceted. Although complete elimination of errors in ASR systems may not be attainable, understanding why those errors happen will allow us to create better mitigation strategies and improve system performance (Knill et al., 2018). This is particularly relevant in safety-critical contexts, such as telemedicine and air traffic control, among many other ASR applications.
The current study aims to address these challenges by examining the performance of Whisper (Radford et al., 2022) on various English accents, with a primary focus on non-native accents. Whisper is an advanced ASR system developed by OpenAI that transcribes audio into text with high accuracy. Trained on an extensive 680 000 h of multilingual and multitask data collected from the web, Whisper is claimed by its developers to demonstrate improved robustness to accents, background noise, and technical language. Whisper was chosen as the focal point of evaluation because it is open-source, widely used in ASR research, and applicable to live transcription and related ASR services. Moreover, evaluating Whisper allows us to explore the generalisability of the results to other deep learning-based ASR systems and provides insights into the broader landscape of non-native accent recognition.
This study set out to achieve three primary research objectives (ROs). RO1a compares Whisper's performance across the native English accents of four countries: the US, Canada, Britain, and Australia. The dependent variable (match error rate, MER) was also regressed against speakers' age and sex, with the goal of establishing a performance baseline for these phonetically distinct varieties of native English. In RO1b, we report the median scores for all accent variants represented in the dataset and rank them in order of increasing MER.
Research shows a link between L2 speech intelligibility and the characteristics of the L1, including the vowel system (e.g., Bohn and Flege, 1990; Munro and Derwing, 2008) and the prosodic system (see Graham et al., 2006, for an overview). Research also shows that the vowel inventory size (i.e., the number of vowels) of a speaker's L1 may influence their performance on vowel perception tasks in English: Iverson and Evans (2009) found that the larger the vowel inventory, the better the performance. RO2, therefore, explores the association between Whisper's performance and the speakers' English proficiency and the linguistic structure of their L1 [prosodic typology (stress-accent vs pitch-accent vs tone) and vowel inventory size (the number of vowels in the L1)]. We also evaluate the influence of demographic information, namely the age and biological sex of the speaker.
RO3 evaluates Whisper's performance using a conversational speech corpus, characterised by its spontaneity compared to the read speech corpus within the Speech Accent Archive (SAA). This assessment aims to provide a more authentic representation of Whisper's capabilities in real-time transcription scenarios.
2. Method
2.1 Dataset
Two data sources were used in this study: the Speech Accent Archive and an in-house Cambridge English corpus of non-native English.
2.1.1 Speech Accent Archive (SAA)
SAA (Weinberger, 2015) is an online corpus consisting of recordings of native and non-native speakers of English with different accents. The recordings are roughly 30 s in length, each an individual reading of the same English paragraph (see the supplementary material). Because all speakers read the same text, the acoustic model can be evaluated without grappling with variability in lexical choice and syntactic structure. There was a roughly even split between male and female speakers.
Available demographic information, including the speakers' age, L1, and age of onset of English learning, makes it possible to conduct a detailed analysis of the influence of speaker background on the performance of the ASR system. The number of speakers for each L1 is given in parentheses: American English (442), British English (61), Canadian English (48), Australian English (45), Arabic (102), Russian (48), Dutch (46), French (26), German (28), Italian (32), Japanese (27), Korean (52), Mandarin (65), Polish (34), Spanish (16), Swedish (20), Vietnamese (22), and Thai (15).
2.1.2 Cambridge English Corpus (CEC)
This is a relatively small corpus of semi-spontaneous English, recorded in scenarios in which an English exam candidate is prompted by an examiner to discuss various topics. A total of 133 speakers met the conditions for analysis (e.g., an approximately equal distribution of speakers across target L1s): Arabic (21 speakers), Dutch (24), French (21), Polish (24), Thai (21), and Vietnamese (22). This matches six of the languages analysed in the SAA. The age range was 20–67 years old, with a roughly equal split between males and females. The proficiency scores of these non-native speakers were determined by trained assessors on the CEFR scale (Council of Europe, 2001). The dataset was manually annotated by two trained phonetic transcribers with 95% agreement; disagreements were resolved by C.G., a trained phonetician. The annotations and data processing were conducted using Praat (Boersma and Weenink, 2010), a software tool for speech analysis.
2.2 Data processing
2.2.1 SAA
For each selected file, the audio (in MP3 format) was downloaded from the SAA online corpus. The BeautifulSoup package (Richardson, 2007) was used to parse the HTML of the corpus pages. We aggregated the sound files from the SAA and ran the most recent version of Whisper (version 3) on an NVIDIA A100 graphics processing unit (GPU) to transcribe each audio file automatically.
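The scraping step can be illustrated with a short Python sketch. This is a minimal example under assumed page structure; the function name, tag selectors, and attribute below are hypothetical, and the actual SAA pages may differ:

```python
# Minimal scraping sketch; the tag structure is an assumption.
import requests
from bs4 import BeautifulSoup

def fetch_mp3_urls(page_url: str) -> list[str]:
    """Parse one archive page and return any MP3 links it contains."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [src["src"] for src in soup.find_all("source")
            if src.get("src", "").endswith(".mp3")]
```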
We created a function that takes raw audio as input and outputs Whisper's transcription, then iterated it over all the audio files, saving the output as a list. From that list, we created a column of a Pandas dataframe (Reback et al., 2020). Next, we used the Jiwer package (Vaessen, 2022) to calculate the WER and MER, although only the latter is reported in this study.
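A sketch of this transcription and scoring loop is given below, assuming the openai-whisper and jiwer packages; `REFERENCE_TEXT` stands in for the SAA elicitation paragraph and `audio_files` for the list of downloaded files (both placeholders):

```python
import whisper   # openai-whisper
import pandas as pd
import jiwer

model = whisper.load_model("large-v3")  # Whisper version 3

def transcribe(path: str) -> str:
    """Return Whisper's text transcription for one audio file."""
    return model.transcribe(path)["text"]

# Iterate over all audio files, saving the hypotheses as a list.
hypotheses = [transcribe(f) for f in audio_files]

df = pd.DataFrame({"file": audio_files, "hypothesis": hypotheses})
df["wer"] = [jiwer.wer(REFERENCE_TEXT, h) for h in df["hypothesis"]]
df["mer"] = [jiwer.mer(REFERENCE_TEXT, h) for h in df["hypothesis"]]
```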
2.2.2 CEC
The procedure was the same as for the SAA, with the main exception that the sound files were in .wav format and the accompanying Praat TextGrid transcription differed for each recording and had to be read into the parser each time.
2.3 Quantifying English experience
Assessing the English proficiency of speakers using dedicated proficiency metrics was deemed infeasible because of the large number of speakers and the limited amount of speech data available for each individual. Additionally, all speakers read the same set text, which further constrained the ability to gauge proficiency on a standard scale such as the CEFR (Council of Europe, 2001). We therefore devised an algorithm to estimate the "English experience" (i.e., proficiency) of speakers in the SAA, comprising the following steps (a minimal implementation sketch follows the list):
- Each speaker was initially assigned a score of 100, from which we subtracted their age of English learning onset. For example, after this step, native speakers were assigned a score of 100 (100 − 0 = 100), whereas a speaker who began learning English at the age of 20 received a score of 80 (100 − 20 = 80); and
- to account for the impact of age of onset on language proficiency, we applied penalties: −5 for an age of onset between 11 and 20 years old, −10 for an age of onset between 21 and 30, −15 for an age of onset between 31 and 40, and so on, with no penalty for the first decade (0–10 years old). This accords with findings in second language acquisition research, which establish an inverse relationship between age of learning onset and attained proficiency. For instance, a speaker who commenced English study at 20 years old with an initial score of 80 would incur a penalty of −10, leading to a final score of 70.
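As referenced above, the scoring rule can be summarised in a few lines of Python. The handling of exact boundary ages is an assumption: the bands are implemented as stated (no penalty up to age 10, −5 for 11–20, and so on), although the worked example for an onset of 20 suggests boundary ages may instead fall into the next higher band.

```python
import math

def english_experience(age_of_onset: int) -> int:
    """Estimate "English experience" from the age of English learning onset."""
    base = 100 - age_of_onset                              # step 1
    # Step 2: decade-band penalty (0 for onset 0-10, 5 for 11-20, ...).
    penalty = 5 * max(0, math.ceil((age_of_onset - 10) / 10))
    return base - penalty

# english_experience(0)  -> 100 (native speaker)
# english_experience(25) -> 75 - 10 = 65
```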
2.4 Variables and coding
In the analyses discussed in this study, MER is the dependent variable. The predictors include English experience (a numerical variable) and various speaker characteristics: L1, L1 typology, sex, age, and the number of vowels in the L1. Categorical variables, such as typology, were encoded as binary indicator variables (dummy coding), where the presence of a feature is denoted as one and its absence as zero. For instance, with stress-accent as the reference level, stress-accent is represented as (0, 0); pitch-accent as (1, 0); and tone as (0, 1).
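In pandas, this reference-level coding can be produced as in the sketch below (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"typology": ["stress-accent", "pitch-accent", "tone"]})

# Indicator columns for typology, with stress-accent as the reference
# level: stress-accent -> (0, 0), pitch-accent -> (1, 0), tone -> (0, 1).
dummies = pd.get_dummies(df["typology"], prefix="typ", dtype=int)
dummies = dummies.drop(columns=["typ_stress-accent"])
df = pd.concat([df, dummies], axis=1)
```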
3. Results
3.1 RO1: MER of native and non-native English accents
RO1a examines whether there are any statistically significant differences in Whisper's performance among four primary English accent groupings: American (reference level), Australian, British, and Canadian. We fitted a linear mixed effects model to investigate the impact of native English accent, speaker sex, and age on Whisper's performance, measured by MER, with speakers treated as a random effect, i.e., MER ∼ accent * age * sex + (1 | speaker).
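This specification can be fitted, for example, with statsmodels in Python; the sketch below assumes illustrative column names (`mer`, `accent`, `age`, `sex`, `speaker`):

```python
import statsmodels.formula.api as smf

# Linear mixed effects model with a by-speaker random intercept,
# equivalent to MER ~ accent * age * sex + (1 | speaker).
model = smf.mixedlm("mer ~ accent * age * sex", data=df, groups=df["speaker"])
result = model.fit()
print(result.summary())
```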
We found that compared to American speakers, MER was significantly higher for Australian English [β = 0.013, standard error (SE) = 0.005, z = 2.401, p = 0.016] and British English speakers (β = 0.031, SE = 0.007, z = 4.571, p < 0.001). Notably, Canadian speakers did not demonstrate a statistically significant difference in MER compared to the American speakers (β = 0.005, SE = 0.009, z = 0.596, p = 0.551). Additionally, age did not exhibit a statistically significant association with MER (β = 0.000, SE = 0.000, z = −1.139, p = 0.255), nor did sex (β = −0.003, SE = 0.004, z = −0.825, p = 0.409). There were no significant interactions between any of the variables. The results are displayed in Fig. 1.
RO1b reports Whisper's median MER across the 18 accent variants of English. The ranking of the accents by median MER is shown in Fig. 2.
3.2 RO2: MER and speaker characteristics
In RO2, we conducted an ordinary least squares (OLS) regression analysis to examine the relationship between the response variable, MER, and a combination of continuous and categorical predictor variables, based on the SAA dataset.
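A sketch of this OLS specification follows (column names are illustrative; stress-accent is set as the reference level for typology):

```python
import statsmodels.formula.api as smf

# OLS regression of MER on English experience, vowel inventory size,
# age, sex, and L1 typology (stress-accent as reference level).
ols = smf.ols(
    "mer ~ experience + n_vowels + age"
    " + C(sex) + C(typology, Treatment('stress-accent'))",
    data=df,
).fit()
print(ols.summary())
```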
The analysis revealed the following key findings:
- English experience showed a statistically significant negative relationship with MER (β = −0.0036, SE = 0.001, z = −5.391, p < 0.001): an increase in English experience was associated with a decrease in MER. This result is depicted in Fig. 3;
- vowel inventory size exhibited a statistically significant negative relationship with MER (β = −0.0054, SE = 0.002, z = −3.564, p < 0.001): speakers of languages with fewer vowels tended to have higher MERs;
- female speakers exhibited a statistically significantly lower MER than male speakers (β = −0.0310, SE = 0.009, z = −3.289, p = 0.001), indicating that Whisper was better at recognising English spoken by female speakers than by male speakers; and
- in relation to the L1 typology of speakers, the MER of pitch-accent languages did not differ significantly from the reference level (stress-accent; β = 0.040, SE = 0.035, z = 1.156, p = 0.248). In contrast, speakers of tone languages showed a significantly higher MER than speakers of stress-accent languages (β = 0.031, SE = 0.014, z = 2.189, p = 0.029). This relationship is visually supported by Fig. 2, where Mandarin, Vietnamese, and Thai (all tone languages) cluster together as the languages with the highest median MER values. This trend persisted when native English speakers were excluded from the dataset, leaving only non-native speakers: the comparison between stress-accent and pitch-accent remained nonsignificant (β = 0.037, SE = 0.036, z = 1.033, p = 0.302), whereas the difference between stress-accent and tone remained statistically significant (β = 0.0316, SE = 0.014, z = 3.938, p < 0.001).
The results for vowels, speaker sex, and L1 typology are summarised in Table 1.
- Other predictor variables did not show statistically significant relationships with MER (p > 0.05).
Table 1. MER by speaker sex, L1 typology, and L1 vowel inventory size.

| | | MER |
|---|---|---|
| Sex | Female | 0.115 |
| | Male | 0.140 |
| Typology | Pitch-accent | 0.180 |
| | Stress-accent | 0.120 |
| | Tone | 0.161 |
| Number of vowels | 0–5 | 0.257 |
| | 5–10 | 0.175 |
| | 10–15 | 0.085 |
| | 15–20 | 0.110 |
3.3 RO3: Whisper's performance on conversational data
We examined the performance of Whisper on the conversational CEC, which covers six languages and two typologies: stress-accent languages (Arabic, Dutch, French, and Polish) and tone languages (Thai and Vietnamese). A linear regression model was fitted to examine the relationship between MER and the predictor variables: L1 typology, sex, age, and (proficiency) score. As in previous analyses, the categorical predictors were coded as binary indicator variables.
Among the individual predictor variables, proficiency score (β = −0.011, SE = 0.003, z = −4.144, p < 0.001) and age (β = −0.0045, SE = 0.002, z = −2.711, p = 0.007) showed significant associations with MER. Specifically, a one-unit increase in the score was associated with a decrease of approximately 0.011 units in MER, whereas a 1-yr increase in age was associated with a decrease of approximately 0.0045 units in MER. The difference between tone languages and stress-accent languages approached, but did not reach, the 5% significance level (β = −0.058, SE = 0.033, z = −1.781, p = 0.075). There were no significant interactions among any of the variables.
Finally, we conducted a linear mixed model analysis to investigate the relationship between MER, speech type (read or conversational), and speaker sex, with speakers included as a random effect to account for between-speaker variability. Overall, conversational speech had a significantly higher MER than read speech (β = 0.020, SE = 0.042, t = 4.723, p < 0.001), and MER was significantly higher for male speakers than for female speakers (β = 0.107, SE = 0.024, t = 4.371, p < 0.001). In addition, there was a statistically significant interaction between sex and speech type (β = −0.139, SE = 0.047, t = −2.949, p = 0.003). The estimated variance of the random intercepts for speakers was 0.004, indicating some variability in MER among speakers. The results are shown in Fig. 4.
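For completeness, a minimal sketch of this final model, assuming the SAA and CEC results are stacked in one dataframe `combined` with illustrative columns `speech_type` ("read" vs "conversational"), `sex`, and `speaker`:

```python
import statsmodels.formula.api as smf

# Speech type, sex, and their interaction, with a by-speaker
# random intercept.
m = smf.mixedlm("mer ~ speech_type * sex", data=combined,
                groups=combined["speaker"])
print(m.fit().summary())
```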
4. Discussion and conclusion
RO1a examined Whisper's performance across major native English accents (American, Australian, British, and Canadian), revealing superior performance for American English compared to British and Australian accents, with Canadian English not differing significantly from American. These findings emphasise the impact of regional accent variation on Whisper's ASR performance, aligning with prior research (Markl, 2022). Notably, the finding that several non-native English accents (Swedish and German) surpassed British English is likely linked to the diverse nature of British accents in contrast to the more consistent pronunciation patterns among speakers of the same L1 background; aggregating over British accents obscures large recognition disparities among them. For instance, native English speakers with a Leeds accent had the highest MER (close to 100%). This underscores the importance of considering regional accent diversity when developing and testing ASR systems for optimal communication experiences across regions.
RO1b computed the median MER across 18 English variants using Whisper. Notably, North American English (American and Canadian) exhibited the lowest MERs, indicating superior ASR performance, whereas Vietnamese and Thai had the highest, suggesting lower accuracy. These variations underscore the challenges in accurately recognising non-native accents and emphasise the need for inclusive ASR technologies. Addressing these performance gaps is vital, given the scarcity of English speech data from certain L1 backgrounds. Strategies such as accent-agnostic meta-learning (Winata et al., 2020) and transfer learning (Meng et al., 2020; Viglino et al., 2019) show promise in adapting to unseen accents.
RO2 explored in greater detail the impact of speaker characteristics on Whisper's performance. The analysis revealed several significant relationships between speaker characteristics and MER. English experience demonstrated a significant negative relationship with MER, indicating that higher English proficiency was associated with lower MER values. Although this finding is unsurprising, it confirms that speakers' proficiency influences the intelligibility of their speech, which, in turn, affects the performance of an ASR system. This corroborates earlier findings establishing a link between pronunciation accuracy and intelligibility in L2 speech (e.g., Deshmukh et al., 1996; Loukina et al., 2015; Raymond et al., 2002). More broadly, it underscores the need for other sources of speech variation, including atypical or dysarthric speech, to be considered in the deployment of ASR systems (e.g., Christensen, 2021; Xue et al., 2023).
The analysis revealed a significant negative association between the vowel variable and MER, indicating that speakers of languages with smaller vowel inventories tend to produce higher MERs. This aligns with findings reported by Iverson and Evans (2009) in the context of L2 vowel perception. It is plausible to speculate that individuals whose native languages feature a limited vowel system, in contrast to English with its relatively larger set of approximately 14 vowels, may encounter increased challenges when learning a more extensive and perceptually complex vowel system.
The study identified a significant effect of speaker sex on MER, with higher MERs for male speakers than for female speakers. This aligns with prior research indicating that various ASR systems exhibit lower accuracy in recognising male speech, a difference attributed to male speakers' increased disfluencies and longer filled pauses (Adda-Decker and Lamel, 2005; Goldwater et al., 2010; Shariah and Sawalha, 2013). Notably, these challenges persist in a large pre-trained system like Whisper, a point we revisit in the discussion of the results for RO3.
The study found that the L1 typology of non-native speakers was significantly associated with MER: stress-accent languages exhibited the lowest MER, whereas tone languages exhibited the highest MER values. This finding suggests that Whisper may struggle to process the speech of speakers whose native prosodic structure is typologically distant from that of the target language. Previous studies, such as Graham and Post (2018), have reported an influence of prosodic typology on the intelligibility of non-native speech to other ASR systems. Taken together, these findings underscore the importance of examining the rich interplay among different variables rather than a few in isolation. An understanding of the relationships between speaker characteristics and MER is crucial for helping developers improve ASR models, tailor them to specific user groups, and reduce biases inherent in this technology. However, it is important to note that, owing to the limited transparency regarding OpenAI's training datasets, the SAA might have been included in the training of the Whisper model. Although this is not a drawback of the present paper, it is a relevant factor when assessing the results: if used in training, the SAA might contribute to the model's ability to handle the very accents evaluated here, obscuring our understanding of the factors shaping its performance on diverse accents.
RO3 compared Whisper's performance on conversational vs read speech for selected L1s, aligning with prior studies highlighting ASR systems' challenges with spontaneous speech (Horii et al., 2022; Gabler et al., 2023). The disfluencies inherent in spontaneous speech contribute to ASR errors, a phenomenon less prevalent in read speech. The significant interaction between sex and speech type indicates that the effect of being male on MER is reduced in conversational speech relative to read speech. Understanding these dynamics is pivotal for optimising ASR technologies in realistic, conversational settings and ensuring effective communication across diverse user populations.
Future research could leverage advanced machine learning techniques, such as neural networks, to enhance ASR robustness across diverse accents and speech types. Investigating the effectiveness of different approaches to ASR training could yield more accurate and inclusive technologies. Examining contextual factors, such as environmental noise (Dua et al., 2023; Kumalija and Nakamoto, 2022), can offer insights into real-world challenges and inform strategies for noise-robust speech recognition. A comprehensive examination of the interplay among speaker characteristics, linguistic features, and ASR outcomes using large speech corpora may reveal complex relationships, paving the way for personalised and context-aware systems. Future work would also benefit from datasets in which the distribution of speakers across native languages and accents is more balanced. Last, exploring biases in ASR technologies can guide efforts to address disparities and ensure equitable human-machine communication experiences.
Acknowledgment
We wish to acknowledge the Leverhulme Trust for a Research Fellowship awarded to C.G. and Cambridge Language Sciences for an incubator grant funding aspects of this research.
Author Declarations
Conflict of Interest
The authors have no conflicts of interest to disclose.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request. The code, datasets, and workflows are available at GitHub https://github.com/crgraham/ASR_Whisper.