Automatic inference of paralinguistic information from speech, such as age, is an important area of research with many technological applications. Speaker age estimation can support age-appropriate curation of content and personalized interactive experiences. However, automatic speaker age estimation for children is challenging because of the scarcity of speech data covering the developmental spectrum and the large signal variability, even within a given age group. Most prior approaches to child speaker age estimation adopt methods drawn directly from research on adult speech. In this paper, we propose a novel technique that exploits the temporal variability present in children's speech to estimate children's age. We focus on phone durations as a biomarker of children's age. Phone duration distributions are derived by forced-aligning children's speech with transcripts. Regression models are trained to predict speaker age among children from kindergarten through grade 10. Experiments on two children's speech datasets demonstrate the robustness and portability of the proposed features across multiple domains with varying signal conditions. The phonemes that contribute most to the estimation of child speaker age are analyzed and presented. Experimental results suggest that phone durations carry important development-related information about children. The proposed features are also suited to low-data scenarios.
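As a rough illustration of the pipeline outlined above, the following minimal Python sketch turns forced-alignment segments into per-phone duration statistics and fits a one-variable least-squares line to predict age. It is not the authors' implementation: the function names `phone_duration_features` and `fit_line`, the `(phone, start, end)` segment format, the choice of mean and standard deviation as summary statistics, and the plain least-squares regressor are all assumptions made for illustration, and the toy numbers are synthetic rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def phone_duration_features(segments, phone_set):
    """Summarize phone durations for one speaker.

    segments: iterable of (phone, start_sec, end_sec), e.g. parsed from a
              forced aligner's output (format assumed for this sketch).
    phone_set: ordered list of phone labels; fixes the feature layout.
    Returns [mean_p1, std_p1, mean_p2, std_p2, ...]; unseen phones get 0.0.
    """
    durations = defaultdict(list)
    for phone, start, end in segments:
        durations[phone].append(end - start)

    features = []
    for phone in phone_set:
        d = durations.get(phone, [])
        features.append(mean(d) if d else 0.0)
        features.append(pstdev(d) if d else 0.0)
    return features


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


# Synthetic toy data (hypothetical speakers, ages, and alignments).
PHONES = ["AE", "S"]
speakers = [
    (6, [("AE", 0.00, 0.20), ("S", 0.20, 0.38), ("AE", 0.50, 0.68)]),
    (10, [("AE", 0.00, 0.15), ("S", 0.15, 0.28), ("AE", 0.40, 0.54)]),
    (14, [("AE", 0.00, 0.11), ("S", 0.11, 0.21), ("AE", 0.30, 0.40)]),
]

# Use the mean duration of "AE" (feature index 0) as the single predictor.
ages = [age for age, _ in speakers]
ae_means = [phone_duration_features(segs, PHONES)[0] for _, segs in speakers]
slope, intercept = fit_line(ae_means, ages)


def predict_age(ae_mean_duration):
    """Predict age from mean "AE" duration using the fitted line."""
    return slope * ae_mean_duration + intercept
```

A practical system would likely use the full feature vector over all phones and a multivariate regressor; the single-predictor fit here only keeps the sketch dependency-free.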
