Vocal sound imitations provide a new challenge for understanding the coupling between articulatory mechanisms and the resulting audio. In this study, the classification of three articulatory categories (phonation, supraglottal myoelastic vibrations, and turbulence) has been modeled from audio recordings. Two data sets were assembled, consisting of vocal imitations produced by four professional imitators and four non-professional speakers in two experiments. The audio data were manually annotated by two experienced phoneticians using a detailed articulatory description scheme. A separate set of audio features was developed specifically for each category using both time-domain and spectral methods. For all time-frequency transformations, and for some secondary processing, the recently developed Auditory Receptive Fields Toolbox was used. Three different machine learning methods were applied to predict the final articulatory categories. The best generalization was obtained with an ensemble of multilayer perceptrons. The cross-validated classification accuracy was 96.8% for phonation, 90.8% for supraglottal myoelastic vibrations, and 89.0% for turbulence using all 84 of the developed features. A final reduction to 22 features yielded similar results.
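As a rough illustration of the kind of pipeline the abstract describes (feature vectors fed to an ensemble of multilayer perceptrons, scored by cross-validation), the following Python sketch uses scikit-learn. It is a minimal sketch under stated assumptions, not the authors' implementation: the random placeholder matrix X stands in for the 84 hand-crafted features, the labels, layer sizes, ensemble size, and fold count are all illustrative choices.

```python
# Hypothetical sketch: bagged MLP ensemble with cross-validated accuracy.
# X stands in for the paper's 84 features; all hyperparameters are assumptions.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 84))    # placeholder: 500 examples x 84 features
y = rng.integers(0, 2, size=500)  # placeholder binary label, e.g. phonation

# Bag several small MLPs; each member trains on a bootstrap sample and the
# ensemble prediction is the majority vote of the members.
ensemble = BaggingClassifier(
    make_pipeline(StandardScaler(),
                  MLPClassifier(hidden_layer_sizes=(20,), max_iter=500)),
    n_estimators=10,
)

# 10-fold cross-validated accuracy, analogous in spirit to the percentages
# reported in the abstract.
scores = cross_val_score(ensemble, X, y, cv=10)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

With real features in place of X, repeating this per category (phonation, supraglottal myoelastic vibrations, turbulence) would yield one accuracy figure per category, as reported above.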
Prediction of three articulatory categories in vocal sound imitations using models for auditory receptive fields
Anders Friberg,1,a) Tony Lindeberg,2 Martin Hellwagner,1 Pétur Helgason,1 Gláucia Laís Salomão,1 Anders Elowsson,1 Guillaume Lemaitre,3 and Sten Ternström1

1 Speech, Music and Hearing, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Lindstedtsvägen 24, 10044 Stockholm, Sweden
2 Computational Brain Science Lab, Computational Science and Technology, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Lindstedtsvägen 5, 10044 Stockholm, Sweden
3 Institute for Research and Coordination in Acoustics and Music, 1 Place Igor Stravinsky, Paris 75004, France

a) Electronic mail: afriberg@kth.se
J. Acoust. Soc. Am. 144, 1467–1483 (2018)
Article history
Received: February 2, 2018
Accepted: August 16, 2018
Published: September 19, 2018
Citation
Anders Friberg, Tony Lindeberg, Martin Hellwagner, Pétur Helgason, Gláucia Laís Salomão, Anders Elowsson, Guillaume Lemaitre, Sten Ternström; Prediction of three articulatory categories in vocal sound imitations using models for auditory receptive fields. J. Acoust. Soc. Am. 1 September 2018; 144 (3): 1467–1483. https://doi.org/10.1121/1.5052438