Human-generated measures of speech intelligibility are time-intensive. The present study seeks to automate the assessment of speech intelligibility by developing a deep neural network that estimates a standardized intelligibility score from acoustic input. Mel-frequency cepstral coefficients (MFCCs) were extracted from the UW/NU IEEE sentence corpus, which had been mixed with noise at three signal-to-noise ratios (-2, 0, and 2 dB). Listener transcriptions were obtained from the UAW speech intelligibility dataset, and the Levenshtein distance was calculated between each transcription and the corresponding speaker prompt. The neural network was trained to predict the Levenshtein distance from the MFCC representations of the sentences. Ten-fold cross-validation was used to verify the accuracy of the model and to assess the correlation of the model's predictions with the average human responses. The model's accuracy was also compared with Levenshtein distances computed from transcriptions produced by the DeepSpeech ASR model. This study investigates the reliability of deep neural networks as an alternative to human-based inference in quantifying the intelligibility of speech. The advantages and disadvantages of the different approaches to assessing speech intelligibility are discussed.
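The scoring step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word-level tokenization, lowercasing, and normalization by prompt length are assumptions, since the abstract does not specify how the Levenshtein distance was tokenized or scaled.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn the sequence ref into hyp (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution / match
                            ))
        prev = curr
    return prev[-1]

def intelligibility_target(prompt, transcription):
    """Hypothetical regression target: word-level edit distance between
    the speaker's prompt and a listener transcription, normalized by
    prompt length so that 0 means a perfect match."""
    ref = prompt.lower().split()
    hyp = transcription.lower().split()
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

A network trained on MFCC inputs would then regress toward `intelligibility_target(prompt, transcription)` averaged across listeners; the same function applied to DeepSpeech output yields the ASR-based comparison score.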
