Automatic inference of paralinguistic information from speech, such as age, is an important area of research with many technological applications. Speaker age estimation can help with age-appropriate curation of information content and personalized interactive experiences. However, automatic speaker age estimation in children is challenging due to the paucity of speech data representing the developmental spectrum, and the large signal variability including within a given age group. Most prior approaches in child speaker age estimation adopt methods directly drawn from research on adult speech. In this paper, we propose a novel technique that exploits temporal variability present in children's speech for estimation of children's age. We focus on phone durations as biomarker of children's age. Phone duration distributions are derived by forced-aligning children's speech with transcripts. Regression models are trained to predict speaker age among children studying in kindergarten up to grade 10. Experiments on two children's speech datasets are used to demonstrate the robustness and portability of proposed features over multiple domains of varying signal conditions. Phonemes contributing most to estimation of children speaker age are analyzed and presented. Experimental results suggest phone durations contain important development-related information of children. The proposed features are also suited for application under low data scenarios.
Phone duration modeling for speaker age estimation in children
Prashanth Gurunath Shivakumar, Somer Bishop, Catherine Lord, Shrikanth Narayanan; Phone duration modeling for speaker age estimation in children. J. Acoust. Soc. Am. 1 November 2022; 152 (5): 3000–3009. https://doi.org/10.1121/10.0015198
Download citation file: