Phonological contrasts are usually signaled by multiple cues, and tonal languages typically use several acoustic dimensions to distinguish tones (e.g., duration, pitch contour, and voice quality). Although the topic has been studied extensively, most research has relied on small datasets. This study uses a deep neural network (DNN) based speech recognizer trained on the AISHELL-1 speech corpus (Bu et al., 2017; 178 hours of read speech) to explore the tone space of Mandarin Chinese. A recent study shows that DNN models learn linguistically interpretable information to distinguish vowels (Weber et al., 2016): from a low-dimensional bottleneck layer, the model learns features comparable to F1 and F2. In the current study, we propose a more complex Long Short-Term Memory (LSTM) model, with a bottleneck layer implemented in the hidden layers, to account for variable duration, an important cue for tone discrimination. By interpreting the features learned in the bottleneck layer, we explore which acoustic dimensions are involved in distinguishing tones. The large amount of data in the corpus also makes the results more convincing and provides insights not available from studies with more limited datasets.
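As a rough illustration of the kind of architecture described above, the following sketch runs an LSTM over a variable-length sequence of acoustic frames, projects the final hidden state onto a low-dimensional bottleneck, and classifies tones from that bottleneck. Everything specific here is an assumption made for illustration and is not taken from the study: the class name `BottleneckLSTM`, the layer sizes, the 40-dimensional frame features, the two-dimensional bottleneck, the four tone classes, and the random untrained weights. It only shows how such a model can accept syllables of different durations and expose a small, inspectable feature vector.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class BottleneckLSTM:
    """Hypothetical tone classifier: LSTM -> low-dim bottleneck -> softmax."""

    def __init__(self, n_feats=40, n_hidden=32, n_bottleneck=2, n_tones=4, seed=0):
        rng = np.random.default_rng(seed)
        # Stacked weights for the four LSTM gates (input, forget, candidate, output).
        self.W = rng.normal(0.0, 0.1, (4 * n_hidden, n_feats + n_hidden))
        self.b = np.zeros(4 * n_hidden)
        # Bottleneck projection and output layer (sizes are assumptions).
        self.W_bn = rng.normal(0.0, 0.1, (n_bottleneck, n_hidden))
        self.b_bn = np.zeros(n_bottleneck)
        self.W_out = rng.normal(0.0, 0.1, (n_tones, n_bottleneck))
        self.b_out = np.zeros(n_tones)
        self.n_hidden = n_hidden

    def forward(self, frames):
        """frames: array of shape (T, n_feats); T may differ per syllable."""
        h = np.zeros(self.n_hidden)
        c = np.zeros(self.n_hidden)
        for x in frames:  # one recurrent step per acoustic frame
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, g, o = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
        # Low-dimensional features that could later be inspected or plotted.
        bottleneck = np.tanh(self.W_bn @ h + self.b_bn)
        logits = self.W_out @ bottleneck + self.b_out
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return bottleneck, probs


# Two synthetic "syllables" of different durations (random data, not real speech).
model = BottleneckLSTM()
rng = np.random.default_rng(1)
short_syll = rng.normal(size=(12, 40))
long_syll = rng.normal(size=(30, 40))
bn_short, probs_short = model.forward(short_syll)
bn_long, probs_long = model.forward(long_syll)
print(bn_short.shape, probs_short.shape)
print(bn_long.shape, probs_long.shape)
```

In a trained model of this general shape, the bottleneck values for many syllables could be compared against measured acoustic properties such as pitch contour or duration; the weights here are random, so the outputs carry no linguistic meaning.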
Meeting abstract. No PDF available.
March 01 2019
A deep neural network approach to investigate tone space in languages
J. Acoust. Soc. Am. 145, 1913 (2019)
Bing'er Jiang, Tim O'Donnell, Meghan Clayards; A deep neural network approach to investigate tone space in languages. J. Acoust. Soc. Am. 1 March 2019; 145 (3_Supplement): 1913. https://doi.org/10.1121/1.5101949