A test is proposed to characterize the performance of speech recognition systems. The QuickSIN test is used by audiologists to measure the ability of humans to recognize continuous speech in noise. This test yields the signal-to-noise ratio at which individuals can correctly recognize 50% of the keywords in low-context sentences. It is argued that applying this metric to automatic speech recognizers will ground the performance of automatic speech-in-noise recognizers to human abilities. Here, it is demonstrated that the performance of modern recognizers, built using millions of hours of unsupervised training data, ranges from normal to mildly impaired in noise when compared to human participants.
1. Introduction
We are interested in comparing human and machine recognition in the face of noise. For many years, automatic speech recognizers (ASR) have performed better than humans in the identification of clean speech (Xiong et al., 2018). Commercial speech recognition efforts now use millions of hours of training data to achieve state-of-the-art performance (Zhang et al., 2023). While some of that training data is certainly noisy, there is no explicit measure of noise tolerance. Speech enhancement networks, on the other hand, successfully pick out the target speaker and thus reduce the noise; yet "using enhanced speech as a preprocessor for ASR often degrades recognition accuracy" (O'Shaughnessy, 2024). Here, we wish to ground the performance of machine speech recognition by putting it on a human scale.
This Letter proposes using QuickSIN (Etymotic Research, Inc., 2001; Killion et al., 2004) to measure the performance of automatic speech recognition systems. The QuickSIN test is widely used by audiologists and researchers to measure a subject's ability to recognize speech in the presence of background noise. Performance on QuickSIN is influenced by the degree of hearing loss (Fitzgerald et al., 2023; Wilson et al., 2007) and auditory pathology (Qian et al., 2023; Smith et al., 2024), and is related to perceived auditory handicap (Fitzgerald et al., 2024). Because QuickSIN is widely used in audiological practices and well characterized in humans, we suggest that this test is a good metric to quantify the performance of ASR in the presence of background noise. This test is not useful as part of a train/eval/test cycle because there is limited speech material for this test; here, we only suggest using QuickSIN to quantify machine performance on a human scale.
2. Methods
The QuickSIN test material consists of a foreground speech signal mixed with background speech babble. The IEEE sentences (IEEE, 1969) were spoken by a female speaker, with four-speaker babble noise added in the background (Killion et al., 2004). The monaural mixture is played to human subjects and provided as direct input to automatic speech recognition systems. Humans and machines alike are judged based on the number of keywords, a maximum of five per sentence, that are recognized correctly. We use either an approximate counting procedure or, as we describe in this Letter, logistic regression to estimate the signal-to-noise ratio (SNR) at which 50% of the words are correctly recognized.
QuickSIN defines 12 lists of sentences in noise. We only use the seven lists that are judged most homogeneous: 1, 2, 6, 8, 10, 11, and 12 (Killion et al., 2006; McArdle and Wilson, 2006). Each QuickSIN list is distributed as an audio file that is 60 s long, consisting of six sentences at six different SNRs (25, 20, 15, 10, 5, and 0 dB). Each sentence has five designated keywords that are used for scoring. For our automated tests, the entire file is passed to the recognizer, which returns all the words that are recognized and their start and stop times. Automated scoring requires care because computers are literal, and "four" and "for" are equally good answers on the QuickSIN test. We, thus, use a table of homonyms and other normalizations to match audiologist behavior. We use a strict scoring protocol, where all "phonemes" must be recognized correctly, as any error indicates that the speech was heard incorrectly. Hence, "4" and "four" are the same word, and "Tara" is taken to be equal to "Tear a," while we score "sheet" versus "sheep" or "chart" versus "charts" as incorrect responses.
For example, the transcript of QuickSIN list 1, sentence 1, contains five keywords (Sklaney, 2006): "A white silk jacket goes with any shoes," where the keywords are white, silk, jacket, goes, and shoes. We match the recognized words with the expected keywords and count the number of matches. This gives us the number of words correctly recognized at each SNR. We average the scores from multiple lists to reduce the measurement noise. Note that these test sentences are continuous speech, but only the five keywords in each sentence are used for scoring.
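To make the scoring concrete, here is a minimal sketch of this strict matching, assuming a toy two-entry homonym table; the table and helper names are illustrative and stand in for the fuller normalization list in our repository.

```python
import re

# Toy excerpt of a homonym table; the full normalization list in our
# repository is longer and matches audiologist scoring conventions.
HOMONYMS = {"4": "four", "for": "four"}

def normalize(text):
    """Lowercase, strip punctuation, and map homonyms to a canonical form."""
    words = re.sub(r"[^a-z0-9' ]", " ", text.lower()).split()
    return [HOMONYMS.get(w, w) for w in words]

def score_sentence(keywords, transcript):
    """Count how many of the five keywords appear in the transcript."""
    recognized = set(normalize(transcript))
    return sum(1 for k in normalize(" ".join(keywords)) if k in recognized)

# List 1, sentence 1: strict matching, so "shoe" does not match "shoes".
print(score_sentence(["white", "silk", "jacket", "goes", "shoes"],
                     "A white silk jacket goes with any shoe."))  # -> 4
```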
The QuickSIN test is manually scored by audiologists, who count the number of correctly recognized keywords and then use an approximation that computes the SNR that gives 50% recognition. We describe this counting algorithm first because it is the conventional approach, and then describe logistic regression as a more exact alternative. The counting process overestimates the SNR that achieves the 50% mark; therefore, we apply a correction factor to the more principled regression approach so that the resulting regression scores match those in the literature.
The ad hoc counting approach is the clinical standard and is based on an approximation (Killion and Fikret-Pasa, 1993) that is easy for an audiologist to implement with paper and pencil. It consists of the following steps (Etymotic Research, Inc., 2001, 2006), where the 2.5 dB factor is derived by Tillman and Olsen (1973):
The QuickSIN has five words per step and 5 dB per step. Our highest SNR is 25 dB, hence, we take 25 + 2.5 = 27.5 minus the total number of words repeated correctly. This gives what we call SNR-50, the SNR required for the patient to repeat 50% of the words correctly.
Furthermore, this is converted into the SNR Loss (compared to normal human listeners):
Since SNR-50 for normal-hearing persons is 2 dB, we subtract 2 dB to derive the formula for a patient's SNR LOSS: 25.5 – (total words correct in six sentences).
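In code, the counting rule is a direct transcription of these two formulas (a minimal sketch; the function names are ours):

```python
def snr50_counting(total_words_correct):
    """Clinical counting rule: top SNR of 25 dB plus the 2.5 dB half-step
    factor, minus one dB for each word repeated correctly (30 maximum)."""
    return 27.5 - total_words_correct

def snr_loss_counting(total_words_correct):
    """SNR Loss: SNR-50 minus the 2 dB SNR-50 of normal-hearing listeners."""
    return 25.5 - total_words_correct

print(snr50_counting(24), snr_loss_counting(24))  # 3.5 dB and 1.5 dB
```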
We can also fit the data to a logistic curve by converting the number of correctly recognized keywords at each SNR into a fraction and then fitting a logistic regression curve to it (Nunez-Iglesias et al., 2017). This gives us a curve from which we can estimate the SNR that produces 50% accuracy.
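A sketch of this regression approach, using SciPy's curve_fit on per-SNR scores; the fractions shown are made up for illustration, not measured data:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(snr, midpoint, slope):
    """Fraction of keywords recognized as a function of SNR in dB."""
    return 1.0 / (1.0 + np.exp(-slope * (snr - midpoint)))

snrs = np.array([25.0, 20.0, 15.0, 10.0, 5.0, 0.0])
fraction_correct = np.array([1.0, 1.0, 0.8, 0.6, 0.2, 0.0])  # hypothetical

(midpoint, slope), _ = curve_fit(logistic, snrs, fraction_correct, p0=[10.0, 1.0])
print(f"SNR-50 estimate: {midpoint:.1f} dB")  # curve crosses 50% at its midpoint
```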
We tested two families of modern, large-scale, commercial recognizers using the QuickSIN test. OpenAI (San Francisco, CA) built their Whisper system to recognize speech with 680 000 h of speech data (Radford et al., 2022). We benchmarked the performance of the base (74 × 10⁶ parameters), small (244 × 10⁶), and medium (769 × 10⁶) models, all dated September 2022, as well as the large model (1.55 × 10⁹ parameters), dated November 2023. We also tested three cloud recognizers offered by Google (Mountain View, CA). The largest, called USM or Chirp (Google, 2023), was trained on 12 × 10⁶ h of speech and uses over 2 × 10⁹ parameters to efficiently represent speech sounds across more than 100 languages (Zhang et al., 2023). The Chirp recognizer, as well as two older recognizers of unknown pedigree, are used in this test.¹ It is worth noting that these large, successful recognizers use significantly less training data than was previously suggested would be needed to reach human performance (Moore, 2003).² Most importantly, our goal is not to define the state of the art; we want to demonstrate current abilities and make the QuickSIN tools available to others.
Here, we evaluated the performance of each recognizer on the seven standard QuickSIN lists (Killion et al., 2006; McArdle and Wilson, 2006). We sent the unmodified sentences, in WAV format, to the automatic speech recognition systems and scored the results by counting the number of correct keywords in the recognition output. We then plotted the psychometric function for each recognizer and compared performances to clinical norms.
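The full evaluation loop is then short. In this sketch, recognize_wav is a hypothetical stand-in for a call to Whisper or a Google Cloud recognizer, keywords_for is assumed to return the five keywords for a given list and sentence, and score_sentence is the strict matcher sketched above; the file name is a placeholder.

```python
import numpy as np

SNRS = [25, 20, 15, 10, 5, 0]     # sentence order within each 60 s list file
LISTS = [1, 2, 6, 8, 10, 11, 12]  # the seven most homogeneous lists

def evaluate(recognize_wav, keywords_for, score_sentence):
    """Return the mean fraction of keywords correct at each SNR."""
    fractions = np.zeros((len(LISTS), len(SNRS)))
    for i, list_num in enumerate(LISTS):
        # One transcript per sentence; the path is a hypothetical placeholder.
        transcripts = recognize_wav(f"quicksin_list_{list_num}.wav")
        for j, transcript in enumerate(transcripts):
            correct = score_sentence(keywords_for(list_num, j), transcript)
            fractions[i, j] = correct / 5.0
    return fractions.mean(axis=0)  # average over lists to reduce noise
```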
3. Results
Figure 1 shows the speech recognition performance for each of the seven recognizers across the six different SNR values. Performance improved for all recognizers with increasing SNR but was rarely perfect. These curves look similar: each recognizer, like most humans, recognizes all the words at 25 dB SNR and misses most words at 0 dB. What counts is the SNR where the recognition score crosses the 50% line. Human listeners average 2 dB for this threshold (Fitzgerald et al., 2023), but with impaired hearing it can go as high as 25 dB. QuickSIN scores are relative to this 2 dB threshold; thus, the average human listener has a QuickSIN SNR Loss of 0 dB.
Our summary statistic is the speech reception threshold (SRT), which is the SNR at which a subject can recognize 50% of the words correctly (Plomp and Mimpen, 1979). We scored each recognizer's performance using the original counting method and logistic regression as they produce different scores for the SRT. Figure 2 compares these recognizers using this metric.
The SNR-50 score differs depending on whether it is calculated by the conventional counting approach or via logistic regression. This difference appears to be systematic and is summarized in the scatterplot of Fig. 3. Using linear regression, we find the logistic regression results are 0.94 dB lower (more optimistic) than the standard counting approach. Although the regression method has a firmer statistical basis, we compare the ASR results to human results using the counting method because that is what conventionally defines the normal, mild, moderate, and severely impaired limits (Etymotic Research, Inc., 2006).
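The offset itself can be estimated with an ordinary least-squares fit over the paired scores; a sketch, with placeholder numbers rather than our measured values:

```python
import numpy as np

# Paired SNR-50 scores in dB, one per recognizer (placeholder values).
counting = np.array([5.5, 4.5, 4.0, 3.5, 6.5, 5.0, 4.8])
regression = np.array([4.6, 3.5, 3.1, 2.5, 5.6, 4.1, 3.9])

slope, intercept = np.polyfit(counting, regression, 1)  # linear fit
print(f"mean offset: {np.mean(regression - counting):+.2f} dB")  # -0.93 here
```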
We suggest the following procedure to match the regression approach to the counting procedure used clinically, which defines the clinical boundaries: first, use the logistic regression method to find the SNR-50; then add 0.94 dB to account for the difference between the regression and counting approaches; and finally subtract 2 dB to account for expected human performance. By this method, we arrive at a QuickSIN SNR Loss that we can match to human expectations. Human listeners show wide variation depending on the state of their auditory system, usually quantified as the average threshold elevation in dB (Fitzgerald et al., 2023; Smith et al., 2024). The defined impairment categories for SNR Loss are normal (<3 dB), mildly impaired (3 to 7 dB), and severely impaired (>7 dB; Etymotic Research, Inc., 2006). The best recognizer, the large Whisper model, has a QuickSIN SNR Loss of 2.1 dB and, thus, achieves normal human performance.
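Putting the pieces together (a sketch; the 3.16 dB input is a made-up regression score chosen to back-compute to the 2.1 dB loss quoted above for the large Whisper model):

```python
def quicksin_snr_loss(snr50_regression):
    """Map a regression SNR-50 onto the clinical counting scale, then
    subtract the 2 dB SNR-50 expected of normal-hearing listeners."""
    return snr50_regression + 0.94 - 2.0

def classify(snr_loss):
    """Clinical categories for QuickSIN SNR Loss (Etymotic, 2006)."""
    if snr_loss < 3.0:
        return "normal"
    if snr_loss < 7.0:
        return "mildly impaired"
    return "severely impaired"

loss = quicksin_snr_loss(3.16)               # hypothetical regression SNR-50
print(f"{loss:.1f} dB -> {classify(loss)}")  # 2.1 dB -> normal
```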
4. Discussion
The recognizers tested here flirt with the boundary between normal and mild SNR Loss. The very large USM/Chirp recognizer, which performs better than the Whisper recognizers in Google's tests, performs worse on QuickSIN's cocktail-party conversations. Except for the very small base recognizer (74 × 10⁶ parameters), the Whisper recognizers show SNR performance in the upper part of the normal range. There are many different ways to optimize a recognizer, but one common goal is to match human performance and, thus, expectations. Perhaps the biggest systems emphasize recognizing all of the audible words at the expense of attending to just the foreground speech. We would love to know more about how large, semi-supervised recognition systems model the foreground and background sounds.
We believe that QuickSIN is a simple and effective way to characterize the performance of an ASR system when recognizing noisy speech. While there are many types of noise and ways to measure noise robustness, we believe it is important to ground the results in human performance. The QuickSIN test is widely used in audiology clinics and research laboratories, and its performance has been characterized in thousands of individuals with varying degrees of hearing loss (Fitzgerald et al., 2023). Thus, in our example, the performance of a state-of-the-art recognizer rates as normal to mildly impaired when scored on this human test.
This approach has several caveats. First and most importantly, these speech recognition systems have no sense of speaker identity or other aspects of auditory scene analysis (Bregman, 1990). Second, a human listener might attend more closely to the foreground speech and, therefore, more easily ignore the background noise. Third, the speech recognition systems assume a much wider vocabulary than one might expect in an audiology booth. (In some cases, we had to look up the recognized token to see that it really was a word.) Fourth, there are a number of factors, including language abilities, audio-visual cues, and binaural cues, that are not part of this study but might be important in other tasks. Finally, we took an especially firm stance on similar words, which is good for reproducibility but is likely stricter than how audiologists score a real-time test.
At this point, however, the state of the art in automatic speech recognition engines trained with millions of hours of speech sits at the boundary between normal and mildly impaired (∼3 dB SNR Loss on the counting scale) compared to normal human performance in noise. We hope that the QuickSIN test that we propose here will allow speech recognition engineers to better connect their work to human results and expectations.
Author Declarations
Conflict of Interest
The authors have no conflicts to disclose.
Data Availability
The code and data that support the findings of this study are available online at https://github.com/MalcolmSlaney/QuickSIN_Benchmark (Slaney, 2024).
¹Except for the Chirp/USM recognizer, there is no public information about the other Google recognizers, save for their names, that describes their underlying technology, expected use, or performance.
²Moore (2003) suggests that a 50-year-old human has heard ∼100 000 h of speech. Moore extrapolates from several papers to conclude that between 600 000 and 10 000 000 h of training data would be needed to reach zero errors, and argues that this is a "fantastic amount of speech" and, thus, "simply demanding more and more training data is not going to provide a satisfactory solution to approaching human levels of speech recognition performance. What is needed is a change in approach that would alter the slope of the data shown…." Evidently, that much data is now possible and, in concert with new deep neural networks that have a sufficient number of parameters, achieves these results.