Measuring how well human listeners recognize speech under varying environmental conditions (speech intelligibility) is a challenge for theoretical, technological, and clinical approaches to speech communication. The current gold standard—human transcription—is time- and resource-intensive. Recent advances in automatic speech recognition (ASR) raise the possibility of automating intelligibility measurement. This study tested four state-of-the-art ASR systems on second language speech-in-noise and found that one, whisper, performed at or above human listener accuracy. However, the content of whisper's responses diverged substantially from human responses, especially at lower signal-to-noise ratios, suggesting both opportunities and limitations for ASR-based speech intelligibility modeling.

Speech intelligibility modelling and prediction remain a fundamental challenge for advancing theories, technologies, and clinical instruments aimed at enhancing and/or understanding the basic processes of speech communication. Critical research questions in this area range from understanding the neurocognitive mechanisms of speech perception to developing technologies that support hearing under adverse conditions. While there have been great advances in developing algorithms for predicting speech intelligibility, a substantial gap remains between predicted speech intelligibility and empirically determined human speech recognition accuracy (see Edraki et al., 2023, and references therein), particularly for second language talkers and listeners (e.g., van Wijngaarden et al., 2004). Therefore, the current gold standard measurement of speech intelligibility relies on experimental studies of speech transcription by human listeners. Crowd-sourcing platforms (e.g., Amazon Mechanical Turk1 and Prolific2) have substantially reduced the burdens of gathering such data, but these data collection resources still demand a substantial investment of time, effort, and cost from researchers.

The current study tests an alternative way of obtaining speech intelligibility measures: Automatic speech recognition (ASR). The goal of ASR technology is to convert spoken language into text—precisely the task human participants perform in intelligibility studies. If ASR-based intelligibility scores adequately approximate those of human listeners, then the possibility of using ASR to expedite intelligibility measurement can be realized. Our goal is thus to test whether the recognition performance of ASR parallels that of human listeners under some of the most challenging conditions: second language (L2) speech embedded in continuous broad-band noise.

Earlier studies that compared the performance of humans and ASR systems found a large gap between the two (see Scharenborg, 2007, for a review). However, with recent developments in technology, the disparity between human and ASR performance has significantly diminished. For instance, comparable word error rates between humans and ASR systems have been reported under challenging listening conditions, such as conversational speech (e.g., Xiong et al., 2017) or in the presence of speech-shaped noise or multi-talker babble noise (e.g., Spille et al., 2018; Tu et al., 2022).

Here, we focus on L2 speech, which is known to be notoriously difficult for ASR systems despite its ubiquity in real-world settings. Although earlier studies reported a large difference between humans and ASR in L2 speech recognition (e.g., Derwing et al., 2000), the difference has been reduced in more recent studies (e.g., Inceoglu et al., 2023; Mulholland et al., 2016). For example, Inceoglu et al. (2023) compared the google assistant ASR and L1 English listeners in their recognition of Taiwanese-accented English sentences and isolated words. While human performance was higher in the sentence recognition task, the ASR system performed better in the word recognition task, albeit with substantial variation across talkers and items. This suggests that ASR systems may be utilized even to measure L2 intelligibility. However, most of these studies examined L2 speech without the presence of background noise.

In this study, we compare ASR and L1 English human listeners on recognition accuracy of L2 English produced by L1 Mandarin speakers. Speech materials are presented at eight different levels of signal-to-noise ratio (SNR). Four off-the-shelf ASR systems are compared against humans: hubert and wav2vec 2.0, developed by Meta AI Research; whisper, developed by OpenAI; and the commercial google ASR, google speech-to-text. The human and ASR systems are compared on various parameters: the psychometric function of intelligibility across SNRs (i.e., the relationship between human/ASR responses [accuracy] and magnitude of a stimulus [SNR]), talker speech recognition threshold (SRT [the SNR at which 50% accuracy or higher is attained]), and minimum and maximum accuracy of talkers. We then compare the contents of the responses of humans and the best-performing ASR system.

Testing utilized 120 Hearing in Noise Test (HINT) English sentences (Soli and Wong, 2008)—short simple sentences that are widely used in audiological and clinical settings (e.g., “A boy fell from the window”). Fourteen L1 Mandarin Chinese talkers produced each of these sentences; the recordings were drawn from the Archive of L1 and L2 Scripted and Spontaneous Transcripts and Recordings (ALLSSTAR; Bradlow, 2023).

For each talker, the 120 sentences were evenly divided into eight blocks. The orders within and across the blocks were fixed across all talkers. Within each block, sentences were mixed with speech-shaped noise at a fixed SNR (i.e., all 15 sentences in the block had the same SNR). Across blocks, SNR increased from –4 dB to 8 dB in steps of 2 dB (–4, –2, 0, 2, 4, 6, and 8 dB); in the final block, the recordings were presented without noise (quiet [Q]).
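
The mixing procedure itself is not detailed beyond the SNR levels; as a rough illustration, noise can be scaled to a target SNR along the following lines (a minimal Python sketch; the file names and the use of the numpy and soundfile libraries are assumptions, not details from the study).

    import numpy as np
    import soundfile as sf

    def mix_at_snr(speech, noise, snr_db):
        """Scale the noise so that the speech-to-noise power ratio equals snr_db, then add."""
        noise = noise[:len(speech)]                    # match noise length to the utterance
        p_speech = np.mean(speech ** 2)                # average speech power
        p_noise = np.mean(noise ** 2)                  # average noise power
        scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + scale * noise

    # Hypothetical file names, for illustration only.
    speech, fs = sf.read("talker01_sentence001.wav")
    noise, _ = sf.read("speech_shaped_noise.wav")
    sf.write("talker01_sentence001_snr-4.wav", mix_at_snr(speech, noise, snr_db=-4), fs)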

For each of the 14 talkers, data from 10 human participants were collected through Prolific.2 Participants were L1 English speakers, US residents, over the age of 18, and had no history of hearing, speech, or language impairments. They were instructed to wear headphones/earbuds and to be in a quiet space during the experiment. Participants who did not meet these criteria, failed to perform the transcription task in the Q condition (e.g., describing the stimuli instead of transcribing them), or had technical issues were excluded from the analysis (15 out of 155 [9.7%]). Participants received $13/h as compensation for participation.

Participants heard each sentence only once and were instructed to transcribe what they heard as best they could. Before beginning the experiment, participants performed eight practice trials, with one sentence per SNR. These had no overlap with the experimental materials (short sentences from the story The Little Prince read by an L1 English speaker). The overall experiment took approximately 30 min.

Note that speech materials were presented in order from lowest to highest SNR (i.e., from –4 dB to Q). This is one means of eliciting a full range of intelligibility measures for each talker. (N.B.: This choice was motivated by the broader study of adaptation of which the current study is a part, which examines the effects of stimulus presentation order on intelligibility measures.) In the first block (–4 dB), we expect to obtain the lowest intelligibility score, as sentences are mixed with loud background noise and participants have no experience with the talker and task. In the final block (Q), on the other hand, we expect to obtain the highest intelligibility score, as sentences are not mixed with noise and participants have full practice with the talker and task.

Using the same procedure, the recordings were also provided as input to four off-the-shelf ASR systems.

google speech-to-text: Google's commercial ASR (Google, 2023) provides a black-box input-output service (i.e., we do not have access to the model and its training process). In the current study, we used the default “en-US” ASR model.
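
google speech-to-text is accessed as a cloud service; for orientation, a minimal sketch of querying the default “en-US” model through the Google Cloud Python client is shown below. The client library, audio encoding, and sampling rate are assumptions; the paper does not state how the service was accessed.

    from google.cloud import speech

    client = speech.SpeechClient()
    with open("talker01_sentence001_snr-4.wav", "rb") as f:   # hypothetical file name
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,  # assumes 16-bit PCM WAV
        sample_rate_hertz=16000,                                   # assumed sampling rate
        language_code="en-US",                                     # default US English model
    )
    response = client.recognize(config=config, audio=audio)
    transcript = " ".join(r.alternatives[0].transcript for r in response.results)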

wav2vec 2.0: wav2vec 2.0, released by Meta AI's research team (Baevski et al., 2020), is based on a unified end-to-end pipeline for training a self-supervised model operating on the raw speech waveform without specific labels—in our case, without transcriptions. Instead of labels, wav2vec 2.0's pre-training task relies on a “codebook”: a mapping from a continuous representation (a learned set of vectors) to a discrete one. This discrete mapping is optimized jointly with the rest of the model during pre-training. After pre-training, a standard fine-tuning process is conducted, fixing the pre-trained model and then training one or more layers to predict the text of a given audio input.

hubert: hubert, released by Meta AI after wav2vec 2.0 (Hsu et al., 2021), utilizes an architecture similar to wav2vec 2.0. However, unlike wav2vec 2.0, hubert learns the codebooks separately and uses an iterative process to refine them during training. During the first iteration, hubert clusters frames of the Mel-frequency cepstral coefficient representation of the audio signal to yield the codebook required for training. The clusters are then refined iteratively, using the output of an intermediate layer of the model as the new representation of the signal.

Here, we used versions of wav2vec 2.03 and hubert4 pre-trained on 60,000 h of Libri-Light (Kahn et al., 2020) and fine-tuned on 960 h of the Librispeech corpus (Panayotov et al., 2015); both corpora consist of English read speech from audiobooks.

whisper: whisper, released by OpenAI (Radford et al., 2023), differs from the previous models in that it is trained on a much larger amount of data (680,000 h of speech, including multilingual speech) and has many more parameters. This model uses log-Mel spectrogram representations and was trained on several tasks, such as multilingual speech recognition, speech translation, and voice activity detection. Additionally, while wav2vec 2.0 and hubert were trained on speech in quiet, whisper5 was trained on speech embedded in various background sounds as well as on background sounds with no speech.
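
All three open models can be run with the Hugging Face transformers pipeline; the sketch below uses publicly available checkpoints whose training regimes match the descriptions above, but the exact checkpoints used in this study (see footnotes 3–5) may differ.

    from transformers import pipeline

    # Publicly available checkpoints chosen to match the descriptions above;
    # treat the exact model versions as assumptions.
    checkpoints = {
        "wav2vec2": "facebook/wav2vec2-large-960h-lv60-self",
        "hubert": "facebook/hubert-large-ls960-ft",
        "whisper": "openai/whisper-large",
    }

    recognizers = {name: pipeline("automatic-speech-recognition", model=ckpt)
                   for name, ckpt in checkpoints.items()}

    # Hypothetical file name, for illustration only.
    for name, asr in recognizers.items():
        print(name, asr("talker01_sentence001_snr-4.wav")["text"])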

The transcriptions collected from human participants and the four ASR systems were analyzed using word recognition accuracy,6 which was obtained from autoscore (Borrie et al., 2019). We required all words (content and function) to be completely identical to the target, except that we allowed spaces to be present or absent in compounds (e.g., “rain coat” vs “raincoat”) and numbers to be written as digits or spelled out (e.g., “3” vs “three”).
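
Scoring in the study was done with autoscore; purely as an illustration of the rule just described (exact word matches, with compound spacing and digit/spelled-out numbers treated as equivalent), a simplified stand-in could look as follows. This is not the autoscore implementation, and the equivalence lists are hypothetical examples.

    import re

    NUMBERS = {"3": "three"}               # illustrative subset of digit/word equivalences
    COMPOUNDS = {"rain coat": "raincoat"}  # illustrative subset of compound equivalences

    def normalize(text):
        text = text.lower()
        for spaced, joined in COMPOUNDS.items():   # treat spaced and joined compounds alike
            text = text.replace(spaced, joined)
        words = re.findall(r"[a-z']+|\d+", text)
        return [NUMBERS.get(w, w) for w in words]  # spell out digits

    def word_accuracy(target, response):
        t, r = normalize(target), normalize(response)
        hits = 0
        for w in t:                                # credit each target word found in the response
            if w in r:
                hits += 1
                r.remove(w)                        # do not reuse a response word twice
        return hits / len(t)

    word_accuracy("Somebody stole the money", "somebody stole money")  # -> 0.75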

Accuracy was statistically assessed using a beta regression, predicting mean word recognition accuracy of each block from listener type, SNR, and the interaction between listener type and SNR. Talker was included as a control factor. Listener type was coded as a categorical variable with five levels (i.e., the four ASR systems plus humans), with humans as the baseline. SNR was coded as an ordinal variable with eight levels (i.e., –4 to 8 dB in steps of 2 dB, plus Q). To meet the assumptions of the beta regression (i.e., values must be greater than 0 and less than 1), talker accuracy values were transformed to lie strictly within 0 and 1 (for the transformation formula, see Smithson and Verkuilen, 2006).
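
For reference, the transformation cited from Smithson and Verkuilen (2006) is commonly written as y' = (y(n − 1) + 0.5)/n, where n is the sample size, so that accuracies of exactly 0 or 1 become admissible under the beta likelihood. A one-line sketch (function and variable names are our own) is:

    def squeeze_unit_interval(y, n):
        """Map proportions in [0, 1] into the open interval (0, 1) (Smithson and Verkuilen, 2006)."""
        return (y * (n - 1) + 0.5) / n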

Between-talker differences were further examined via SRTs (the first SNR tested in the experiment, out of the eight SNRs, at which accuracy was equal to or greater than 50%) and minimum/maximum accuracy values. The linearly weighted Kappa coefficient was used to measure the reliability of SRT classification by each ASR system relative to the human gold standard. We also examined the match between human and ASR levels of minimum and maximum accuracy across SNRs7 for each talker using Pearson correlations and root-mean-square deviations (RMSDs).
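
These agreement statistics are standard; a minimal sketch of how they could be computed across talkers, with made-up illustrative values and assuming scipy and scikit-learn, is:

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical per-talker values, for illustration only.
    human_srt = [1, 2, 0, 3, 2, 4, 1]      # SRT as block index (0 = -4 dB, ..., 7 = Q)
    asr_srt   = [1, 1, 0, 2, 2, 3, 1]
    kappa = cohen_kappa_score(human_srt, asr_srt, weights="linear")

    human_min = np.array([12.0, 5.0, 20.0, 3.0, 15.0, 8.0, 11.0])   # minimum accuracy (%) per talker
    asr_min   = np.array([10.0, 6.0, 18.0, 5.0, 13.0, 7.0, 12.0])
    r, _ = pearsonr(human_min, asr_min)
    rmsd = np.sqrt(np.mean((human_min - asr_min) ** 2))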

The content of the best-performing ASR system identified from the measures above was then further examined to assess its similarities to (and differences from) human listener transcriptions. In particular, we focused on a phenomenon that has recently received a great deal of attention in natural language generation: hallucinations, where an artificial system generates output that is unfaithful to the source input/training data (see Ji et al., 2023, for a review). We defined hallucinations as words in the listener's response that are completely absent from the source signal (i.e., words that do not appear in any position in the target sentence) and compared the hallucination rates of humans and the best-performing ASR system.
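
Under this definition, a hallucination rate can be computed directly from the word lists. The sketch below scores the proportion of response words that are absent from the target; the normalization (per response word) and the tokenization details are assumptions, as the paper does not specify them.

    import re

    def words(text):
        return re.findall(r"[a-z']+", text.lower())

    def hallucination_rate(target, response):
        """Proportion of response words that appear nowhere in the target sentence."""
        target_words = set(words(target))
        response_words = words(response)
        if not response_words:
            return 0.0
        return sum(1 for w in response_words if w not in target_words) / len(response_words)

    # Example from Table 1: every response word is absent from the target, so the rate is 1.0.
    hallucination_rate("Somebody stole the money.",
                       "I do not know if I want to stay or go back.")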

As shown in Fig. 1, the psychometric functions of intelligibility by SNR show a high degree of similarity between whisper and human listeners, whereas the other systems clearly diverge from humans. In panels (a)–(c), accuracy values of the ASR systems (colored) are below those of the human listeners (gray), although the accuracy in the Q condition matches that of humans in panels (b) and (c). In contrast, the accuracy of whisper is similar to or higher than that of humans, as shown in panel (d). Moreover, in panels (a)–(c), the slopes of the psychometric functions differ between ASR and humans, whereas in panel (d), they are similar.

Fig. 1. Psychometric functions of intelligibility by SNR in humans (gray) and four ASR systems (colored). Each dot is the mean recognition accuracy at a given SNR (averaged over talkers, listeners, and sentences), and bands show bootstrapped 95% confidence intervals. The red dotted line marks 50% accuracy.

Statistical analysis confirmed all these observations. As expected, the model showed a main effect of SNR [β = 0.46, standard error (s.e.) β = 0.02, χ2(1) = 449.86, p < 0.001], with lower performance at lower SNR levels. google speech-to-text, hubert, and wav2vec 2.0 showed, on average, significantly worse performance than humans [google speech-to-text, β = –1.22, s.e. β = 0.06, χ2(1) = 330.54, p < 0.001; hubert, β = –1.31, s.e. β = 0.06, χ2(1) = 340.91, p < 0.001; wav2vec 2.0, β = –1.26, s.e. β = 0.06, χ2(1) = 331.68, p < 0.001]. The slope of this function also significantly differed between these three ASR systems and humans [google speech-to-text, β = 0.11, s.e. β = 0.03, χ2(1) = 14.50, p < 0.001; hubert, β = 0.37, s.e. β = 0.03, χ2(1) = 136.84, p < 0.001; wav2vec 2.0, β = 0.23, s.e. β = 0.03, χ2(1) = 63.85, p < 0.001].

In contrast, whisper was significantly more accurate than human listeners [β = 0.17, s.e. β = 0.06, χ2(1) = 7.88, p < 0.01], and the slope of the SNR function showed no significant difference across humans and whisper [β = 0.01, s.e. β = 0.03, χ2(1) = 0.16, p = 0.69].

As shown in the top four panels in Fig. 2, analysis of SRT demonstrated that whisper performs similarly to or better than humans. (N.B.: The lower the SRT, the better the intelligibility; this means that the listener could achieve 50% accuracy with louder background noise.) For google speech-to-text, hubert, and wav2vec 2.0, SRT was always higher than for humans, with linearly weighted Kappa values indicating anti-correlation or at-chance performance relative to humans. In contrast, whisper SRTs were either similar to or lower than those of humans (especially for talkers that humans had difficulty with), yielding good agreement (κ = 0.58).8

Fig. 2. Speech recognition threshold (SRT [top]) and minimum (middle) and maximum (bottom) accuracy of each talker in humans and four ASR systems. Talkers are ordered by human ranking of best-to-worst performance within each row. Gray dots/lines show human listener measures; colored dots/lines show ASR systems. Statistics for comparison of ASR systems and humans are shown within each graph: (top) linearly weighted Kappa values (κ), (middle and bottom) correlation coefficients (r), and root-mean-square deviations (RMSDs). The highest Kappa values and coefficients and lowest RMSDs within each row are marked in boldface.

Analysis of the minimum accuracy of each talker (Fig. 2, middle) also found that whisper showed outstanding performance and high similarity to humans. Overall, whisper showed values similar to humans' (RMSD = 5.05% [note that for some talkers, its accuracy was even higher than humans']), and whisper's overall patterning of minimum intelligibility across talkers was similar to that of humans (r = 0.97). While google speech-to-text and wav2vec 2.0 showed relatively high correlations (rs > 0.7), each system underestimated minimum values (RMSDs > 20%). hubert estimated minimum accuracies near 0% for almost all talkers (yielding the highest RMSD and lowest r).

In contrast to the measures above, for maximum accuracy, all models exhibited a good match to human data (rs > 0.6). wav2vec 2.0 showed the most human-like performance, although the other models were not far behind (Fig. 2, bottom). With respect to accuracy values, whisper stood out from the other models in that its values were higher than humans' for most talkers; this was particularly noticeable among low-intelligibility talkers.

Measures of overall accuracy and between-talker variation suggest that whisper is the most human-like. To further assess its performance, we examined the content of its responses, using hallucination rates. As shown in Fig. 3, on average whisper hallucinates at much greater rates than humans, especially at lower SNR levels (20 percentage points more hallucinations at −4 and −2 dB SNR).9

Fig. 3. Mean hallucination rates for human (gray) and whisper (purple) transcribers (averaged over talkers, listeners, and sentences). Bands indicate bootstrapped 95% confidence intervals.

Table 1 shows a sample of whisper's responses on sentences with a high hallucination rate. As these responses show, whisper often generated seemingly random sentences that had little connection to the targets. The first several rows make this problem concrete: after hearing the same target sentence at the same SNR produced by different talkers, whisper produced completely distinct responses.

TABLE 1. Examples of hallucinations in whisper.

Target: Somebody stole the money.
    Talker a: I do not know if I want to stay or go back.
    Talker b: I am glad that we sit down and talk.
    Talker c: That is not what I was thinking man.
    Talker d: Some of the things do not seem to work.
Target: Mother read the instructions.
    whisper: I do not really care about the park you can see.
Target: Swimmers can hold their breath.
    whisper: Three months and I am already done with it.
Target: They're shopping for school clothes.
    whisper: Yeah I will show you what it looks like.

In contrast, at low SNRs, humans tended to transcribe only the words they recognized or simply responded with phrases such as “I don't know,” “I couldn't make out anything,” “unintelligible,” or “n/a.” (N.B.: Human listeners in the current experiment were not allowed to leave their responses blank.) Despite their similarities in accuracy, humans and whisper are thus qualitatively different: whisper hallucinates substantially, producing sentences totally unrelated to the target.

Automatic intelligibility measurement that is quantitatively and qualitatively similar to human intelligibility would accelerate and improve the science of speech acoustics and its application by eliminating the time and expense required for human data collection. To assess the feasibility of an automated approach, four state-of-the-art ASR systems and human listeners were compared on their recognition of Mandarin-accented English embedded in various levels of background noise.

The results suggest that whisper is a plausible alternative. In particular, the slope of the psychometric function and the SRT classification of whisper were similar to humans', and whisper's minimum accuracy levels were well correlated with humans'. In contrast, google speech-to-text, hubert, and wav2vec 2.0 showed significantly different psychometric functions overall, showed anti-correlated or uncorrelated SRT classifications, and grossly underestimated minimum accuracies. There was, however, little difference in performance across systems for capturing maximum accuracies.

We suspect that the difference in ASR performance reflects how the systems are trained. As mentioned in Sec. 2.3, whisper was trained on various tasks and materials, which presumably made it robust to L2 speech in noisy environments. This contrasts with wav2vec 2.0 and hubert, which were pre-trained on read English speech in quiet. Thus, the critical reason why whisper outperforms the other models may be the similarity between its training materials and the test materials. Additionally, architectural differences between the ASR systems may also have played a role.

Despite its overall good performance, there are caveats: whisper exhibits clear differences from human listeners. In general, it showed higher accuracy than humans, particularly for talkers that humans had difficulty with. Moreover, unlike humans, whisper was prone to hallucinating words that were not present in the target signals. Consistent with recent results in other domains of cognition and language, this suggests that the potential ease of using artificial systems (Dillion et al., 2023) must be carefully balanced against their problems and limitations (Crockett and Messeri, 2023).

Based on our findings, we suggest that ASR systems could be utilized to approximate intelligibility measures obtained from L1 English listeners, for similar types of talkers and speech materials. Yet, depending on the goal and purpose of the study, the choice between human listeners and ASR, or among ASR systems, should vary. For measurement of intelligibility with little or no background noise, one can use whisper, hubert, or wav2vec 2.0 instead of human listeners: analysis of the maximum accuracy demonstrated a high correlation as well as a low RMSD between humans and these ASR systems. google speech-to-text, however, cannot replace human listeners, as it consistently underestimated human speech recognition accuracy. On the other hand, if one is interested in obtaining the psychometric function of intelligibility across a wide range of SNRs, whisper may be utilized; out of the systems tested, it was the only one that closely captured average human performance and between-talker variation in intelligibility across a range of SNRs.

Although whisper showed a surprising similarity to humans, the finding that whisper regularly exceeds human performance warrants caution. Utilizing whisper is not appropriate when a researcher is interested in specific levels of accuracy or highly precise rankings among various talkers. whisper also fails to model the content of human responses; it hallucinates much more than humans, especially at lower SNRs.

To gain a deeper understanding of their strengths and limitations, further investigation of ASR systems is clearly warranted. Promising lines of investigation include the following: L2 English talkers from a much wider range of L1 backgrounds, languages other than English, semantically anomalous sentences instead of meaningful materials, and multi-talker babble from varying background languages. Our results also point toward directions for future development of ASR technology, especially considering whisper's tendency to hallucinate.

In sum, the current study provides a first step toward accelerating speech science through ASR and also toward advancing ASR technology through rigorous comparison against human listeners. We note that with additional fine-tuning of the ASR systems, one may obtain intelligibility measures that are either more accurate than humans or more similar to human measures. When used with full awareness of its limitations, the overall performance of the off-the-shelf ASR systems provides a promising opportunity to automatically estimate intelligibility, particularly for researchers with limited resources and technical expertise.

See the supplementary material for the analyses of word error rate (supplementary material A) and whisperx (Bain et al., 2023, supplementary material B).

This work was supported by NSF DRL Grant No. 2219843 and BSF Grant No. 2022618. Thanks to Chun Chan for assistance with human data collection.

The authors have no conflicts to disclose.

The Northwestern University Institutional Review Board approved the human intelligibility experiment. Informed consent was obtained from all participants.

The speech data used in this study are available on SpeechBox as part of the ALLSSTAR corpus at https://speechbox.linguistics.northwestern.edu/#!/?goto=allsstar. Transcription data from human listeners and ASRs are available from the Open Science Foundation at https://osf.io/bzde6/.

1. See the Amazon Mechanical Turk site (https://www.mturk.com) for more information.
2. See the Prolific site (https://www.prolific.com/) for more information.
3. See the wav2vec large model posted on GitHub (https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec).
4. See the hubert large model posted on GitHub (https://github.com/facebookresearch/fairseq/tree/main/examples/hubert).
5. Further information about whisper-large is available on the Hugging Face site (https://huggingface.co/openai/whisper-large).
6. Similar results were observed using the word error rate (WER) (see supplementary material A).
7. The minimum and maximum accuracy values were typically found at the lowest (68/70 cases) and highest (59/70 cases) SNR.
8. We obtained qualitatively similar results when SRT was determined using a beta regression fitted to the empirical accuracy values.
9. We also examined whisperx (Bain et al., 2023), a speech recognition system incorporating whisper that reduces hallucination rates on L1 English speech. As shown in supplementary material B, while this system hallucinates at a lower rate than whisper, it still hallucinates more than human listeners. Analysis of the metrics in the previous sections shows that whisperx exhibits overall higher performance than whisper, resulting in an overall worse fit to the human data.

1. Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). "wav2vec 2.0: A framework for self-supervised learning of speech representations," Adv. Neur. Info. Proc. Syst. 33, 12449–12460.
2. Bain, M., Huh, J., Han, T., and Zisserman, A. (2023). "WhisperX: Time-accurate speech transcription of long-form audio," in Proceedings of INTERSPEECH 2023, Dublin, Ireland (ISCA, Baixas, France), pp. 4489–4493.
3. Borrie, S. A., Barrett, T. S., and Yoho, S. E. (2019). "Autoscore: An open-source automated tool for scoring listener perception of speech," J. Acoust. Soc. Am. 145, 392–399.
4. Bradlow, A. (2023). "ALLSSTAR: Archive of L1 and L2 Scripted and Spontaneous Transcripts and Recordings," https://speechbox.linguistics.northwestern.edu/#@!/?goto=allsstar (Last viewed September 2023).
5. Crockett, M., and Messeri, L. (2023). "Should large language models replace human participants?," PsyArXiv 4zdx9, https://osf.io/preprints/psyarxiv/4zdx9.
6. Derwing, T. M., Munro, M. J., and Carbonaro, M. (2000). "Does popular speech recognition software work with ESL speech?," TESOL Quart. 34, 592–603.
7. Dillion, D., Tandon, N., Gu, Y., and Gray, K. (2023). "Can AI language models replace human participants?," Trends Cogn. Sci. 27, 597–600.
8. Edraki, A., Chan, W.-Y., Fogerty, D., and Jensen, J. (2023). "Modeling the effect of linguistic predictability on speech intelligibility prediction," JASA Express Lett. 3(3), 035207.
9. Google (2023). "Speech-to-Text: Automatic speech recognition | Google Cloud," https://cloud.google.com/speech-to-text (Last viewed September 2023).
10. Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. (2021). "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Trans. Audio Speech Lang. Process. 29, 3451–3460.
11. Inceoglu, S., Chen, W.-H., and Lim, H. (2023). "Assessment of L2 intelligibility: Comparing L1 listeners and automatic speech recognition," ReCALL 35, 89–104.
12. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P. (2023). "Survey of hallucination in natural language generation," ACM Comput. Surv. 55, 1–38.
13. Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P.-E., Karadayi, J., Liptchinsky, V., Collobert, R., Fuegen, C., Likhomanenko, T., Synnaeve, G., Joulin, A., Mohamed, A., and Dupoux, E. (2020). "Libri-light: A benchmark for ASR with limited or no supervision," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain (IEEE, New York), pp. 7669–7673.
14. Mulholland, M., Lopez, M., Evanini, K., Loukina, A., and Qian, Y. (2016). "A comparison of ASR and human errors for transcription of non-native spontaneous speech," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, New York), pp. 5855–5859.
15. Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015). "Librispeech: An ASR corpus based on public domain audio books," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia (IEEE, New York), pp. 5206–5210.
16. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023). "Robust speech recognition via large-scale weak supervision," Proc. Int. Conf. Mach. Learn. 202, 28492–28518.
17. Scharenborg, O. (2007). "Reaching over the gap: A review of efforts to link human and automatic speech recognition research," Speech Commun. 49, 336–347.
18. Smithson, M., and Verkuilen, J. (2006). "A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables," Psychol. Methods 11, 54–71.
19. Soli, S. D., and Wong, L. L. N. (2008). "Assessment of speech intelligibility in noise with the hearing in noise test," Int. J. Audiol. 47(6), 356–361.
20. Spille, C., Ewert, S. D., Kollmeier, B., and Meyer, B. T. (2018). "Predicting speech intelligibility with deep neural networks," Comput. Speech Lang. 48, 51–66.
21. Tu, Z., Ma, N., and Barker, J. (2022). "Unsupervised uncertainty measures of automatic speech recognition for non-intrusive speech intelligibility prediction," in Proceedings of INTERSPEECH 2022, Incheon, Korea (ISCA, Baixas, France), pp. 3493–3497.
22. van Wijngaarden, S. J., Bronkhorst, A. W., Houtgast, T., and Steeneken, H. J. M. (2004). "Using the speech transmission index for predicting non-native speech intelligibility," J. Acoust. Soc. Am. 115(3), 1281–1291.
23. Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D., and Zweig, G. (2017). "Achieving human parity in conversational speech recognition," IEEE/ACM Trans. Audio Speech Lang. Process. 25(12), 2410–2423.
