Speech technology promises to enhance security and assist in everyday tasks. Automatic speech recognition (ASR) converts spoken words into text, facilitating interaction with electronic devices. However, ASR may not work equally well for speakers from underrepresented accent groups. Several studies in recent years (Koenecke et al. 2020, Tatman 2017) have shown that ASR performs particularly poorly on African American English (AAE). This performance gap is likely due to imbalances in accent representation in training data. Here we assess manipulation of vocal tract properties, including exploratory manipulation of vocal fold harmonics, as a data augmentation method for improving ASR performance on AAE when adapting end-to-end systems.
