Automatic accentedness rating has the potential to improve many human-computer interactions involving speech, including the adaptation of automatic speech recognition or other artificial intelligence models to the speaker’s accent. Accent ratings may also serve as a metric by which language learners can quantify their progress. This study employs bidirectional long short-term memory layers in a neural network to predict human ratings of the accentedness of recorded speech. Five-second speech segments are extracted from recordings of over 2,000 first- and second-language English speakers drawn from multiple corpora. Human ratings are obtained in an online experiment in which participants rate the accentedness of a given speech recording on a 9-point Likert scale. Mel-frequency cepstral coefficients and mel-filterbank energy features are tested as speech input representations for the neural network. When evaluated on a held-out test set, the model’s predictions correlate with the average human ratings (r = 0.57). Previous methods that automatically compare transcribed speech, or that use accent-specific Gaussian mixture models to compare acoustic templates, perform better; however, the present model requires no transcription or template and can perform accent-general inference.
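
The evaluation described above, in which model predictions are compared against averaged human ratings, can be illustrated with a minimal sketch. It assumes the reported r is a Pearson correlation coefficient computed over test clips; the clip names, ratings, and predictions below are hypothetical values invented for illustration, not data from the study, and the helper names `mean_ratings` and `pearson_r` are likewise not taken from the paper.

```python
from math import sqrt
from statistics import mean


def mean_ratings(ratings_by_clip):
    """Average the individual Likert ratings collected for each clip."""
    return {clip: mean(ratings) for clip, ratings in ratings_by_clip.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: 9-point Likert ratings from three raters per clip,
# and one model prediction per clip. None of these values come from the study.
human = {
    "clip_a": [1, 2, 1],
    "clip_b": [5, 6, 4],
    "clip_c": [9, 8, 9],
    "clip_d": [3, 5, 4],
}
predicted = {"clip_a": 2.1, "clip_b": 4.8, "clip_c": 7.5, "clip_d": 5.2}

avg = mean_ratings(human)
clips = sorted(avg)  # fixed ordering so predictions and ratings line up
r = pearson_r([predicted[c] for c in clips], [avg[c] for c in clips])
print(f"r = {r:.2f}")
```

Averaging ratings per clip before correlating mirrors the abstract's wording ("average human ratings"); correlating against individual raters instead would measure something different and would generally give a lower value.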
