Humans can modify their speech to improve intelligibility in noisy environments. With the advancement of speech synthesis technology, machines may also synthesize voices that remain highly intelligible in noisy conditions. This study evaluates both the subjective and objective intelligibility of synthesized speech in speech-shaped noise from three major speech synthesis platforms. Synthesized voices had an intelligibility range similar to that of human voices, and some synthesized voices were more intelligible than the human voices. In addition, two modern automatic speech recognition systems recognized about 10 percentage points more words than human listeners.
1. Introduction
Speech synthesis dates back centuries to when pioneers tried to build “speaking machines” that mimicked the way humans speak (Schroeder, 1993). Modern speech synthesis started with the vocoder, invented by Homer Dudley at Bell Labs in 1938, which reconstructs speech from the pitch and spectral information of recorded speech (Dudley, 1939). By 1980, several speech synthesis systems had become publicly available. One of the most widely adopted was DECtalk, based on the work of Dennis Klatt (Klatt, 1980). At that time, synthesized speech sounded robotic and was not highly intelligible. Over the following decades, advances in acoustics, linguistics, signal processing, and artificial intelligence gradually improved the intelligibility and quality of synthesized speech (Cambre et al., 2020). Nowadays, the dominant form of speech synthesis is text-to-speech (TTS), which converts written text to spoken words. Text-to-speech systems have been integrated into daily life through applications such as public announcements, voice assistants, language learning, and audiobooks. Leading commercial platforms such as Amazon, Apple, Google, IBM, and Microsoft offer a variety of TTS services, making voice-based applications easily accessible to the public.
With recent advances in large-scale generative models (Le et al., 2023), TTS technology has been improving rapidly (Ren et al., 2022; Tan et al., 2024), making synthesized speech almost indistinguishable from real human speech. A potential direction for TTS technology is to synthesize speech with higher intelligibility by leveraging speech data that remain highly intelligible in noise. Noise is one of the most significant factors affecting speech intelligibility: with decreasing signal-to-noise ratio (SNR), intelligibility drops rapidly even for listeners with normal hearing (Friesen et al., 2001). An interesting and important aspect of natural speech recognition in noise is that, at the same SNR, speech from certain talkers is easier to understand than speech from others (Barker and Cooke, 2007; Bradlow et al., 1996; Hood and Poole, 1980). People can also deliberately modify their speech to improve intelligibility when talking to hearing-impaired listeners or in noisy environments (Bond and Moore, 1994; Picheny et al., 1985). Synthesized voices with high intelligibility could be deployed in hearing-assistive applications or in high-noise scenarios such as public announcements in train stations or airports.
Subjective listening is the gold standard for evaluating speech intelligibility in noise, but it is time-consuming and difficult to scale up. To overcome these limitations, objective intelligibility metrics (OIMs) have been proposed to predict speech intelligibility in noise. These OIMs include the speech transmission index (STI) (Houtgast and Steeneken, 1971; Goldsworthy and Greenberg, 2004; Payton and Shrestha, 2013), glimpsing proportion (GP) (Tang and Cooke, 2016; Edraki et al., 2022), short-time objective intelligibility (STOI) (Taal et al., 2010, 2011), and hearing-aid speech perception index (HASPI) (Kates and Arehart, 2014, 2021). Except for the GP method (Barker and Cooke, 2007; Tang and Cooke, 2016), most OIMs have not been fully evaluated for predicting the intelligibility of synthesized speech in noise.
Automatic speech recognition (ASR) offers a viable alternative for predicting speech intelligibility in noise (Feng and Chen, 2022; Karbasi and Kolossa, 2022). Recently, Slaney and Fitzgerald (2024) found that ASR achieved performance comparable to human listeners on the QuickSIN test. One advantage of using ASR to predict speech intelligibility is that it can provide detailed word-level or even phoneme-level intelligibility. Such additional information can be used to train models in other speech applications, such as speech enhancement and speech style conversion.
The goal of the present study is twofold: (1) to compare the intelligibility of natural and synthesized speech and (2) to compare OIM and ASR performance against gold-standard human performance. These two comparisons reveal the current state of speech synthesis and speech recognition technology and suggest a potential way to synthesize highly intelligible voices by utilizing an ASR system.
2. Human and synthesized speech
2.1 Speech materials
The speech materials consisted of 250 sentences grouped into 25 lists of ten sentences each. The sentences were selected from the Bamford–Kowal–Bench sentences used in the hearing in noise test (HINT) by Nilsson et al. (1994). These sentences describe common events in daily life with high context cues. The speech recordings were produced by four human talkers (two females and two males) and twelve synthesized voices (six females and six males).
One male talker (MaleA) was the same as in the original HINT study (Nilsson et al., 1994). Recordings from three additional talkers (FemaleM, FemaleN, MaleD) were made inside a double-wall sound room in the Hearing and Speech Laboratory, University of California Irvine. During the recording, talkers were instructed to speak normally, as if they were talking to their friends. A Shure SM58 microphone (Shure, Chicago, IL) was placed approximately 20 cm from the talker to record the speech. All recordings were made at 48 000 Hz and downsampled to 16 000 Hz for further processing, during which silence before and after each sentence was trimmed and the root-mean-square (RMS) level was normalized across all sentences and all talkers.
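A minimal sketch of this preprocessing step, assuming librosa and soundfile are available (the silence-trim threshold and target RMS level are illustrative assumptions; the paper does not report the exact values):

```python
import librosa
import numpy as np
import soundfile as sf

TARGET_SR = 16000
TARGET_RMS = 0.05  # illustrative target level; not reported in the paper

def preprocess(path_in, path_out):
    # Load the 48 kHz recording and resample to 16 kHz.
    y, sr = librosa.load(path_in, sr=TARGET_SR)
    # Trim leading/trailing silence (top_db threshold is an assumption).
    y, _ = librosa.effects.trim(y, top_db=30)
    # Normalize to a common RMS level across all sentences and talkers.
    rms = np.sqrt(np.mean(y ** 2))
    y = y * (TARGET_RMS / rms)
    sf.write(path_out, y, TARGET_SR)
```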
Artificially synthesized speech was generated using three commercially available TTS platforms: Amazon Polly, Microsoft Azure TTS, and Google TTS. Four voices (two females and two males) were selected from each system, as follows: Joanna, Ruth, Joey, and Stephen from Amazon Polly; Amber, Jenny, Brandon, and Eric from Microsoft Azure; and C, G, I, and J from Google. All speech was generated in June 2023 at a sample rate of 16 000 Hz. Similarly, silence periods before and after each sentence were trimmed, and the RMS level was normalized across all sentences and platforms.
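As an illustration, speech of this kind can be generated programmatically through the Amazon Polly API via boto3 (the voice name and neural engine match those used in the paper; the example sentence, output path, and PCM-to-WAV wrapping are assumptions rather than the authors' exact pipeline):

```python
import boto3
import wave

# Credentials and region are assumed to be configured for boto3.
polly = boto3.client("polly")

def synthesize(text, voice_id="Joanna", out_path="sentence.wav"):
    # Request 16-bit PCM at 16 kHz from the neural engine.
    response = polly.synthesize_speech(
        Text=text,
        VoiceId=voice_id,
        Engine="neural",
        OutputFormat="pcm",
        SampleRate="16000",
    )
    pcm = response["AudioStream"].read()
    # Wrap the raw PCM in a WAV container for later processing.
    with wave.open(out_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)       # 16-bit samples
        w.setframerate(16000)
        w.writeframes(pcm)

synthesize("The boy fell from the window.")
```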
2.2 Talker characteristics
The acoustic characteristics of both natural and synthesized talkers are listed in Table 1. For each talker, the mean, minimum, and maximum fundamental frequency (F0 mean, F0 min, and F0 max), words per minute, and syllable rate were averaged across all 250 sentences. The F0 range was calculated as F0 max minus F0 min. Specifically, F0 for each sentence was estimated using the pYIN algorithm (Mauch and Dixon, 2014), with a frame size of 2048 samples and a step size of 512 samples. The F0 search range was set to 65–500 Hz. The average, minimum, and maximum F0 values were calculated from the estimated F0 values of the voiced frames in each sentence. Words per minute was calculated by dividing the number of words in each sentence by the sentence duration in minutes. Syllable rate was calculated by dividing the number of syllables in each sentence by the sentence duration in seconds (Yang and Zeng, 2024). The human recognition in quiet values were obtained by averaging the intelligibility scores from two human listeners in the quiet condition (see Sec. 3).
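A minimal sketch of these per-sentence measurements using librosa's pyin implementation (the frame size, hop size, and F0 search range follow the paper; the word and syllable counts are assumed to come from the sentence transcripts):

```python
import librosa
import numpy as np

def talker_features(wav_path, n_words, n_syllables, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    # pYIN F0 estimate: 65-500 Hz search range, frame 2048, hop 512 (as in the paper).
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=65, fmax=500, sr=sr, frame_length=2048, hop_length=512
    )
    # Keep only voiced frames with a valid F0 estimate.
    f0_voiced = f0[voiced_flag & ~np.isnan(f0)]
    duration_s = len(y) / sr
    return {
        "f0_mean": np.mean(f0_voiced),
        "f0_min": np.min(f0_voiced),
        "f0_max": np.max(f0_voiced),
        "f0_range": np.max(f0_voiced) - np.min(f0_voiced),
        "words_per_minute": n_words / (duration_s / 60),
        "syllable_rate": n_syllables / duration_s,
    }
```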
Table 1. Talker characteristics and human recognition in quiet.

| System | Talker | Gender | F0 mean (Hz) | F0 min (Hz) | F0 max (Hz) | F0 range (Hz) | Words per minute | Syllable rate (1/s) | Human recognition in quiet |
|---|---|---|---|---|---|---|---|---|---|
| Human | FemaleM | Female | 210.4 | 173.8 | 260.9 | 87.1 | 224 | 4.6 | 0.961 |
| Human | FemaleN | Female | 254.4 | 198.7 | 349.8 | 151.1 | 161 | 3.3 | 0.978 |
| Human | MaleA | Male | 100.8 | 71.4 | 143.1 | 71.7 | 162 | 3.3 | 0.973 |
| Human | MaleD | Male | 111.8 | 90.9 | 144.3 | 53.4 | 151 | 3.1 | 0.992 |
| Amazon | Joanna | Female | 159.4 | 119.0 | 230.2 | 111.1 | 204 | 4.2 | 0.969 |
| Amazon | Ruth | Female | 193.0 | 130.5 | 292.4 | 161.9 | 194 | 4.0 | 1.000 |
| Amazon | Joey | Male | 102.5 | 70.2 | 130.9 | 60.7 | 156 | 3.2 | 0.965 |
| Amazon | Stephen | Male | 136.2 | 82.1 | 198.8 | 116.7 | 208 | 4.3 | 1.000 |
| Azure | Amber | Female | 231.2 | 146.1 | 307.3 | 161.2 | 197 | 4.1 | 0.993 |
| Azure | Jenny | Female | 187.7 | 135.4 | 263.2 | 127.8 | 202 | 4.2 | 0.962 |
| Azure | Brandon | Male | 151.7 | 86.2 | 207.4 | 121.2 | 201 | 4.2 | 0.973 |
| Azure | Eric | Male | 112.6 | 71.8 | 158.0 | 86.2 | 190 | 3.9 | 0.982 |
| Google | C | Female | 161.3 | 104.8 | 231.8 | 126.9 | 190 | 3.9 | 1.000 |
| Google | G | Female | 213.5 | 141.7 | 284.3 | 142.7 | 199 | 4.1 | 0.955 |
| Google | I | Male | 151.5 | 102.6 | 210.1 | 107.5 | 206 | 4.3 | 0.958 |
| Google | J | Male | 143.4 | 93.5 | 212.9 | 119.4 | 196 | 4.1 | 0.971 |
3. Speech intelligibility evaluation
3.1 Speech recognition
Ten participants listened to the sentences in the noise condition, in which sentences were mixed with speech-spectrum-shaped noise at −5 dB SNR. Two other participants listened to the sentences in the quiet condition (average scores are shown in Table 1). All participants were screened for normal hearing [<15 dB hearing level (HL) from 0.25 to 8 kHz in octave steps].
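A minimal sketch of mixing a sentence with noise at the target SNR (this assumes the noise has already been shaped to the long-term speech spectrum and is longer than the sentence; RMS-based scaling is a common convention and not necessarily the authors' exact procedure):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db=-5.0):
    """Scale the noise so the speech-to-noise RMS ratio equals snr_db, then mix."""
    # Pick a random noise segment the same length as the speech.
    start = np.random.randint(0, len(noise) - len(speech))
    noise_seg = noise[start:start + len(speech)]
    rms_s = np.sqrt(np.mean(speech ** 2))
    rms_n = np.sqrt(np.mean(noise_seg ** 2))
    # Gain that brings the noise to the desired level relative to the speech.
    gain = rms_s / (rms_n * 10 ** (snr_db / 20))
    return speech + gain * noise_seg
```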
The speech recognition test was conducted inside a double-wall sound room. All sentences were presented binaurally over Sennheiser HDA-200 headphones (Sennheiser, Wedemark, Germany) at 70 dB sound pressure level. Participants were instructed to listen and type the words they could recognize. Spelling was checked automatically, and participants confirmed the intended word whenever a misspelling occurred.
To familiarize the participants with the test materials and procedure, sentences from list 1 were used as practice. To avoid order effects, the sixteen talkers were presented in random order. To minimize learning effects and differences between sentence lists, lists 2 to 25 were randomly assigned to the talkers for each participant so that no list was repeated.
3.2 Automatic speech recognition
We performed ASR using five systems, including three commercial cloud-based platforms (Amazon Transcribe, Microsoft Azure speech-to-text, and Google speech-to-text) and two recent open-source models [wav2vec2 (Baevski et al., 2020) and Whisper (Radford et al., 2022)]. The three commercial systems were evaluated in February 2024 with all default settings. The official implementations of wav2vec2 and Whisper on Hugging Face were used; the pretrained models tested were “large-960h” for wav2vec2 and “large-v2” for Whisper. The five ASR systems recognized all 250 sentences in both the quiet and −5 dB SNR conditions.
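A minimal sketch of running the two open-source models through the Hugging Face transformers pipeline (the model identifiers correspond to the checkpoints named in the text; the audio file name is illustrative, and decoding options are left at their defaults, mirroring the default settings used for the commercial systems):

```python
from transformers import pipeline

# Pretrained checkpoints corresponding to "large-960h" and "large-v2".
wav2vec2_asr = pipeline("automatic-speech-recognition",
                        model="facebook/wav2vec2-large-960h")
whisper_asr = pipeline("automatic-speech-recognition",
                       model="openai/whisper-large-v2")

# Transcribe one noisy sentence with each system.
for asr in (wav2vec2_asr, whisper_asr):
    result = asr("sentence_noisy_minus5dB.wav")
    print(result["text"])
```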
3.3 Intelligibility scoring
A listener can be either a human listener or an ASR system. Intelligibility for each sentence was scored as the proportion of words recognized correctly. Human recognition for talker T was calculated as the average intelligibility score across all human listeners. ASR recognition for talker T was calculated as the average intelligibility score across the same sentences that the human participants listened to. Two human graders scored and verified the intelligibility scores independently, and the first author made the final decision when discrepancies occurred.
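A simplified word-accuracy scoring sketch (lower-casing and punctuation stripping are assumptions; the authors' graders additionally verified the scores by hand):

```python
import re

def word_accuracy(reference: str, response: str) -> float:
    """Fraction of reference words found in the response (simplified scoring)."""
    def words(s):
        # Lower-case and strip punctuation before splitting into words.
        return re.sub(r"[^a-z' ]", " ", s.lower()).split()

    ref, resp = words(reference), words(response)
    pool = list(resp)
    correct = 0
    for w in ref:
        if w in pool:
            pool.remove(w)  # each response word can match only one reference word
            correct += 1
    return correct / len(ref)

# Example from the paper's error analysis: the d/g confusion costs one of six words.
print(word_accuracy("The bag fell off the shelf.", "The bad fell off the shelf"))  # ~0.83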
3.4 Objective speech intelligibility metrics
We evaluated the intelligibility of the human and synthesized speech using five OIMs: STOI, the weighted spectro-temporal modulation index (wSTMI), the spectro-temporal glimpsing index (STGI), the envelope-regression speech transmission index (ER-STI), and the hearing-aid speech perception index version 2 (HASPI-v2).
STOI calculates correlation coefficients between time-frequency representations of the degraded and the reference speech (Taal et al., 2011). The STOI values were calculated using a Python implementation with all default settings (https://github.com/mpariente/pystoi).
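Example usage of the pystoi package (file names are illustrative; the clean sentence serves as the reference and the noisy mixture as the degraded signal):

```python
import soundfile as sf
from pystoi import stoi

clean, fs = sf.read("sentence_clean.wav")
noisy, _ = sf.read("sentence_noisy_minus5dB.wav")

# Classic (non-extended) STOI with default settings.
score = stoi(clean, noisy, fs, extended=False)
print(f"STOI = {score:.3f}")
```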
Both wSTMI and STGI are based on spectro-temporal modulation (STM) analysis. Inspired by STOI, wSTMI calculates the normalized correlation coefficients between STM features of the degraded and reference speech (Edraki et al., 2021). Inspired by the original GP method, STGI extends glimpsing from the time-frequency domain to the STM domain and uses, as an intelligibility estimate, the fraction of STM bins in which the similarity between the degraded and reference signals exceeds a predefined threshold (Edraki et al., 2022). The STGI and wSTMI values were calculated using a Python implementation provided by the authors (https://github.com/aminEdraki/py-intelligibility).
The ER-STI determines linear regression coefficients between the normalized intensity envelopes of the degraded and reference speech in seven frequency bands. These regression coefficients are then translated into an STI value. The ER-STI values were calculated with our own Python implementation, which replicated the original method used to evaluate the intelligibility of different speech styles (Sec. 3D in Payton and Shrestha, 2013); the window length for calculating the regression coefficients was set to the full length of each audio file.
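Because the ER-STI implementation is the authors' own, only a simplified sketch of the envelope-regression idea is given here (band filtering and envelope extraction are assumed to have been done already; the apparent-SNR mapping and equal band weights follow common STI conventions rather than the exact settings of Payton and Shrestha, 2013):

```python
import numpy as np

def er_sti(ref_envs, deg_envs):
    """Simplified envelope-regression STI from per-band intensity envelopes.

    ref_envs, deg_envs: arrays of shape (n_bands, n_frames) holding the
    intensity envelopes of the reference and degraded speech.
    """
    tis = []
    for x, y in zip(ref_envs, deg_envs):
        # Normalize each envelope by its mean (envelope-regression convention).
        x, y = x / np.mean(x), y / np.mean(y)
        # Regression coefficient of the degraded envelope on the reference envelope.
        beta = np.sum(x * y) / np.sum(x * x)
        beta = np.clip(beta, 1e-6, 1 - 1e-6)
        # Convert to an apparent SNR, clamp to +/-15 dB, map to a transmission index.
        snr = np.clip(10 * np.log10(beta / (1 - beta)), -15, 15)
        tis.append((snr + 15) / 30)
    # Equal band weighting (an assumption; STI normally uses band-specific weights).
    return float(np.mean(tis))
```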
HASPI-v2 is based on a model of the peripheral auditory system that extracts signal envelopes in 32 frequency bands. Cepstral coefficients derived from these envelopes are cross-correlated between the degraded and reference speech in ten modulation frequency bands, and the resulting cross-correlation values are mapped to speech intelligibility by a trained neural network (Kates and Arehart, 2021). The HASPI-v2 values were calculated using the Python implementation of HASPI-v2 provided by the Clarity Project (https://github.com/claritychallenge/clarity), assuming hearing thresholds of 5 dB HL at all audiometric frequencies.
The OIM score for talker T was calculated in the same way as the ASR recognition, i.e., by averaging values across the same sentences as used for human recognition.
4. Results
Figure 1 shows the average human recognition results in the −5 dB SNR noise condition for all 16 talkers. The recognition accuracy ranged from 22.8% for the least intelligible talker (human FemaleM) to 78.8% for the most intelligible talker (Azure Jenny). In contrast, human recognition in quiet was 96.1% and 96.2% for FemaleM and Jenny, respectively (Table 1). This large range of intelligibility existed for both natural and synthesized speech. The range of natural speech intelligibility in noise was 48.9 percentage points (22.8%–71.7%), while that of the Azure synthesized speech was 45.6 percentage points (33.2%–78.8%). The synthesized speech from Amazon and Google had relatively small ranges of intelligibility.
Fig. 1. Average human recognition results for human talkers (orange), Amazon talkers (blue), Azure talkers (red), and Google talkers (brown). The results were obtained for speech mixed with speech-spectrum-shaped noise at −5 dB SNR.
Figure 2 shows the correlations between human recognition and the five ASR systems. Azure and Whisper produced the best correlations with human recognition (r2 = 0.87 and 0.84, respectively). Both systems also produced a dynamic range similar to that of the human recognition scores, indicated by a regression slope of 0.83. Importantly and interestingly, both systems also produced better-than-human recognition (all data points except one were above the diagonal line). On average, Azure recognized 12.7 percentage points more words than the human listeners (67.7% vs 55.0%, black dashed arrow lines), while Whisper recognized 10.7 percentage points more (65.7% vs 55.0%, black dashed arrow lines). The respective intercepts of 0.22 and 0.20 suggest that the two ASR systems recognized about 20% of the words even when the human listeners failed to understand any. The other three ASR systems performed worse than the human listeners, with wav2vec2 producing the poorest recognition (2.3% of words on average).
Fig. 2. Correlations between human recognition and five ASR systems: Microsoft Azure speech-to-text, Whisper, Amazon Transcribe, Google speech-to-text, and wav2vec2. The speech was mixed with speech-spectrum-shaped noise at −5 dB SNR. Human voices are plotted as solid orange dots; synthesized voices are plotted as open circles. The black dashed arrow lines represent average human recognition (vertical) and average ASR recognition (horizontal). The black solid line represents the linear regression between ASR recognition and human recognition. The regression coefficients are labeled in the upper left corner of each panel along with the r-squared value.
Figure 3 shows the correlations between human recognition and the five OIMs. Among them, STOI and wSTMI correlated the most with human recognition, with r-squared values of 0.70 and 0.68, respectively (top panels in Fig. 3). STGI and ER-STI explained about half of the variability in human recognition (r2 = 0.54 and 0.50, respectively), while HASPI-v2 explained only 24% of the variability (bottom three panels in Fig. 3). Note also the relatively flat regression functions for all objective metrics, showing that their values were confined to a narrow span (roughly 0.2–0.3 wide) despite a possible range of 0–1.
Fig. 3. Correlations between human recognition and five objective metrics: STOI, wSTMI, STGI, ER-STI, and HASPI-v2. Other representations are the same as in Fig. 2.
5. Discussion
This study evaluated the speech intelligibility of both human and synthesized voices in speech-shaped noise at −5 dB SNR. The subjective evaluation results indicate that synthesized voices have reached human-level intelligibility and that certain synthesized voices remain highly intelligible even in noisy environments. The objective evaluation results reveal that modern ASR systems are more effective than traditional OIMs in predicting speech intelligibility. Moreover, the ASR systems Azure and Whisper outperformed normal-hearing listeners when recognizing speech in speech-shaped noise.
There was a large range of intelligibility for both human and synthesized voices. Human listeners recognized 70%–80% of the words from the more intelligible voices but only 20%–40% of the words from the less intelligible voices. This is likely caused by a variety of global and local acoustic properties, including fundamental frequency range, vowel space size, and articulation precision (Hazan and Markham, 2004), as well as speech rate. Further investigation is needed to confirm these previous findings. However, the most intelligible talker identified in this study, Jenny, does not fully fit the profile described by Hazan and Markham (2004), who characterized an intelligible speaker as “a woman who produces sentences with a relatively wide range in fundamental frequency.” Contrary to this description, Jenny's speech does not exhibit a wide fundamental frequency range, and her speech tempo does not differ markedly from that of the other speakers (Table 1).
This study supports the assertion by Cambre et al. (2020) that modern TTS systems have achieved or possibly surpassed human-level intelligibility. The most intelligible of the sixteen tested voices was a synthesized voice from Microsoft Azure, followed by other highly intelligible voices from Amazon Polly and Google TTS (Fig. 1). This suggests that all three cloud-based TTS platforms can synthesize highly intelligible voices for applications in noisy conditions.
In addition to modern TTS systems having surpassed human talkers, modern ASR systems have also surpassed human listeners in recognizing words in noisy conditions. The two best-performing ASR systems, Azure and Whisper, recognized 12.7 and 10.7 percentage points more words than the human listeners (Fig. 2). The lower performance of the human listeners can be partly attributed to their tendency to make minor mistakes, such as typos and grammar errors, that ASR systems avoid. In the quiet condition, human listeners made one error in every twenty words on average (Table 1). One example of such an error is “The bad (bag) fell off the shelf,” which may reflect a d/g confusion or a typo. Another example involves homonyms: one listener responded, “They road (rode) their bicycles.” The response makes no semantic sense, and the listener failed to catch the mistake. Since this study rated speech intelligibility based on word accuracy, a single spelling error has a large impact on the accuracy score. Rating intelligibility based on phoneme accuracy would be more tolerant of spelling errors and would better reflect the phonetic properties of the speech. Moreover, the listening task at −5 dB SNR is very demanding; human listeners reported being exhausted after the task, which could also have affected their performance.
The OIM that aligned best with human recognition was STOI, closely followed by wSTMI. However, one major drawback of traditional OIMs is that they do not correlate linearly with subjective intelligibility over their full range. Subjective intelligibility changes rapidly with OIM value when intelligibility lies between 20% and 80% (Fig. 3), but changes slowly when it falls within 0%–20% or 80%–100%. In the original STOI study (Taal et al., 2011), when the intelligibility score increased from 20% to 80%, the STOI value increased only from 0.4 to 0.6, whereas when the STOI value increased from 0.6 to 1, the intelligibility score increased only from 80% to 100%. Therefore, a logistic transformation must be applied to translate OIM values into subjective intelligibility scores, but this transformation varies from dataset to dataset (Edraki et al., 2022; Taal et al., 2011). This makes it difficult to translate OIM values directly into intelligibility scores and to interpret their meaning; as a result, OIMs are not directly comparable to each other. For the same set of speech in this study, the predictions from different OIMs are vastly different (Fig. 3), even though their nominal ranges are all 0 to 1; for example, HASPI predictions are around 0.3 while STOI predictions are around 0.6. A recent study further raised doubts about the reliability of OIMs (Gelderblom et al., 2024), finding that improvements in OIM values did not correlate with improvements in subjective speech recognition for speech processed by advanced speech enhancement systems. Last but not least, OIMs provide only a single intelligibility prediction, whereas ASR systems provide the recognized transcripts, which allow further and more detailed investigation. Therefore, modern ASR systems may be a better alternative for objectively predicting speech intelligibility. However, OIMs are still useful, as they require little computational power and can be calculated faster than sophisticated ASR systems.
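A sketch of the kind of logistic mapping that is typically fitted between OIM values and intelligibility scores (the two-parameter form and scipy-based fit are illustrative, and the data below are hypothetical; the exact mappings in Taal et al., 2011 and Edraki et al., 2022 may differ):

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(d, a, b):
    # Maps an OIM value d to a predicted intelligibility in [0, 1].
    return 1.0 / (1.0 + np.exp(a * d + b))

# Hypothetical per-talker values; real data would come from Figs. 2 and 3.
oim_scores = np.array([0.55, 0.58, 0.60, 0.63, 0.66])    # e.g., STOI values
human_scores = np.array([0.25, 0.38, 0.55, 0.70, 0.80])  # e.g., human recognition

params, _ = curve_fit(logistic, oim_scores, human_scores, p0=(-30.0, 18.0))
predicted = logistic(oim_scores, *params)
```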
Although this study successfully identified “intrinsically clear” voices from three TTS services, it is not exhaustive because of the limited number of TTS voices evaluated. Owing to constraints on the speech material, we evaluated only four voices from each TTS service, a small fraction of the available voices: 13 from Amazon, 47 from Microsoft Azure, and 15 from Google. Despite this limitation, this study establishes a framework for efficiently testing and selecting the clearest TTS voices using ASR systems, reducing the reliance on human listeners.
A lingering question is whether it is possible to synthesize voices that are more intelligible than the best-performing voice in this study, Microsoft Azure's “Jenny.” To explore this, one could screen all available voices in noisy conditions with either Azure speech-to-text or Whisper and select the voice with the highest intelligibility score. If the currently available voices are not satisfactory, one could also record “deliberately clear” speech with the help of ASR systems so that the recorded speech achieves a high intelligibility score in noisy conditions.
6. Conclusions
The main findings of this work are as follows:
- Modern TTS systems can synthesize voices that are as intelligible in noise as human voices.
- Microsoft Azure speech-to-text and Whisper recognized speech in noise better than human listeners, with a similar dynamic range, while the other three ASR systems performed poorly.
- The best-performing ASR systems are better alternatives than traditional OIMs for predicting speech intelligibility in noise.
Acknowledgments
This work was supported by the Center for Hearing Research, University of California Irvine.
Author Declarations
Conflict of Interest
The authors have no conflicts to disclose.
Ethics Approval
The experiment was approved by the Institutional Review Board, University of California Irvine. Informed consent was obtained from all participants including listeners and talkers.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.