Automatic accentedness rating has the potential to improve many human-computer interactions involving speech, for example by adapting automatic speech recognition or other artificial intelligence models to the speaker’s accent. Accent ratings may also serve as a metric by which language learners can quantify their progress. This study uses a neural network with bidirectional long short-term memory layers to predict human ratings of the accentedness of recorded speech. Speech data are extracted in five-second segments from recordings of over 2,000 first- and second-language English speakers drawn from multiple corpora. Human ratings are obtained in an online experiment in which participants rate the accentedness of a given speech recording on a 9-point Likert scale. Mel-frequency cepstral coefficients and mel-filterbank energy features are tested as speech input representations for the neural network. When evaluated on a held-out test set, the model’s predictions correlate with the average human ratings (r = 0.57). Previous methods that automatically compare transcribed speech, or that use accent-specific Gaussian mixture models to compare acoustic templates, perform better; however, the present model requires no transcription or template and can perform accent-general inference.
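The evaluation described in the abstract, correlating model predictions with average human ratings, can be illustrated with a minimal sketch. It assumes the reported r is a Pearson correlation computed per recording against the mean of that recording's Likert ratings; the abstract does not state this explicitly. The function names, clip identifiers, and toy numbers below are hypothetical and are not taken from the study.

```python
from math import sqrt
from statistics import mean


def mean_ratings(ratings_by_clip):
    """Average each clip's 9-point Likert ratings across raters."""
    return {clip: mean(ratings) for clip, ratings in ratings_by_clip.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical toy data: three clips, three raters each (1 = no accent, 9 = strong).
human = {"clip_a": [1, 2, 1], "clip_b": [5, 6, 4], "clip_c": [9, 8, 9]}
# Hypothetical model outputs for the same clips.
predicted = {"clip_a": 2.0, "clip_b": 4.5, "clip_c": 7.5}

avg = mean_ratings(human)
clips = sorted(avg)  # fixed clip order so the two sequences stay aligned
r = pearson_r([predicted[c] for c in clips], [avg[c] for c in clips])
print(f"r = {r:.2f}")
```

Averaging the ratings per recording before correlating means the score reflects agreement with the consensus rating rather than with any single rater.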
September 19 2022
Automatic accentedness rating using deep neural networks
Tyler T. Schnoor; Matthew C. Kelley; Benjamin V. Tucker
Tyler T. Schnoor, Matthew C. Kelley, Benjamin V. Tucker; Automatic accentedness rating using deep neural networks. Proc. Mtgs. Acoust. 29 November 2021; 45 (1): 060013. https://doi.org/10.1121/2.0001617