Vocal sound imitations provide a new challenge for understanding the coupling between articulatory mechanisms and the resulting audio. In this study, the classification of three articulatory categories (phonation, supraglottal myoelastic vibrations, and turbulence) was modeled from audio recordings. Two data sets were assembled, consisting of different vocal imitations by four professional imitators and four non-professional speakers in two different experiments. The audio data were manually annotated by two experienced phoneticians using a detailed articulatory description scheme. A separate set of audio features was developed specifically for each category using both time-domain and spectral methods. For all time-frequency transformations, and for some secondary processing, the recently developed Auditory Receptive Fields Toolbox was used. Three different machine learning methods were applied to predict the final articulatory categories. The best generalization was obtained with an ensemble of multilayer perceptrons. Using all 84 developed features, the cross-validated classification accuracy was 96.8% for phonation, 90.8% for supraglottal myoelastic vibrations, and 89.0% for turbulence. A final feature reduction to 22 features yielded similar results.
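The evaluation idea named above, an ensemble of multilayer perceptrons scored by cross-validated accuracy, can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the study: the synthetic two-class data, the network size, the training settings, the averaging of member outputs, and the interleaved fold assignment. The names `make_data`, `TinyMLP`, `train_ensemble`, `ensemble_predict`, and `cross_validated_accuracy` are hypothetical.

```python
import math
import random


def make_data(n, n_features, rng):
    """Synthetic stand-in for audio features with a binary category label."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Assumption: class 1 is shifted by +1.5 in every feature dimension.
        x = [rng.gauss(1.5 * label, 1.0) for _ in range(n_features)]
        data.append((x, label))
    return data


class TinyMLP:
    """One hidden tanh layer and a sigmoid output, trained with plain SGD."""

    def __init__(self, n_in, n_hidden, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return h, 1.0 / (1.0 + math.exp(-z))

    def fit(self, data, rng, epochs=20, lr=0.05):
        data = list(data)
        for _ in range(epochs):
            rng.shuffle(data)
            for x, y in data:
                h, p = self.forward(x)
                d_out = p - y  # cross-entropy gradient at the output pre-activation
                for j, hj in enumerate(h):
                    # Hidden-unit gradient uses w2[j] before it is updated.
                    d_hid = d_out * self.w2[j] * (1.0 - hj * hj)
                    self.w2[j] -= lr * d_out * hj
                    self.b1[j] -= lr * d_hid
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * d_hid * xi
                self.b2 -= lr * d_out

    def predict_proba(self, x):
        return self.forward(x)[1]


def train_ensemble(data, n_members, n_hidden, seed):
    """Train several MLPs that differ only in their random initialization."""
    members = []
    for m in range(n_members):
        rng = random.Random(seed + m)
        net = TinyMLP(len(data[0][0]), n_hidden, rng)
        net.fit(data, rng)
        members.append(net)
    return members


def ensemble_predict(members, x):
    """Average the member probabilities, then threshold at 0.5."""
    mean_p = sum(net.predict_proba(x) for net in members) / len(members)
    return 1 if mean_p >= 0.5 else 0


def cross_validated_accuracy(data, k=4, n_members=5, n_hidden=4, seed=0):
    """k-fold cross-validation: each sample is tested exactly once."""
    correct = 0
    for fold in range(k):
        test = data[fold::k]
        train = [s for i, s in enumerate(data) if i % k != fold]
        members = train_ensemble(train, n_members, n_hidden, seed)
        correct += sum(ensemble_predict(members, x) == y for x, y in test)
    return correct / len(data)


if __name__ == "__main__":
    samples = make_data(120, 4, random.Random(1))
    acc = cross_validated_accuracy(samples)
    print(f"cross-validated accuracy: {acc:.3f}")
```

In this sketch the accuracy is pooled over all held-out folds, so every sample contributes one prediction from an ensemble that never saw it during training. With recordings from only a few speakers, folds would more plausibly be split by speaker than by interleaved index; that choice is not specified in the text above.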
