This article presents a comparative study of deep convolutional neural network (CNN) architectures for recognizing basic emotions in human conversational speech from spectrograms of the speech audio signals. The paper considers several of the most common CNN architectures: AlexNet, VGG-13, ResNet-18, MobileNet-V2, and EfficientNet-B0. The experiments were carried out on the IEMOCAP dataset, labeled with four classes of basic emotions: Anger, Happiness, Neutral, and Sadness. Accuracy, precision, recall, and F1-score were chosen as quality metrics. The comparison showed that the highest accuracy was achieved by the simplest and shallowest architectures, AlexNet and VGG-13 (0.649 and 0.662, respectively). The study also showed that recognition results degrade as the depth and complexity of the architecture increase.
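As an illustration of the quality metrics named above, the following is a minimal sketch of how accuracy, precision, recall, and F1-score could be computed for the four emotion classes. The function name `classification_metrics`, the label strings, and the use of macro averaging are assumptions made for this example; the abstract does not state which averaging scheme the authors used.

```python
from collections import Counter

# The four classes named in the abstract; the exact label strings are an assumption.
LABELS = ["Anger", "Happiness", "Neutral", "Sadness"]


def classification_metrics(y_true, y_pred, labels=LABELS):
    """Return accuracy plus per-class and macro-averaged precision, recall, F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Count every (true label, predicted label) pair once.
    pairs = Counter(zip(y_true, y_pred))

    per_class = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(pairs[(label, label)] for label in labels) / len(y_true)

    # Macro averaging (an assumption): unweighted mean over the classes.
    macro = {
        key: sum(m[key] for m in per_class.values()) / len(labels)
        for key in ("precision", "recall", "f1")
    }
    return {"accuracy": accuracy, "per_class": per_class, "macro": macro}
```

Macro averaging weights every class equally, which matters when the classes are unevenly represented; a support-weighted average would be an equally plausible choice.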
22 June 2022
PROCEEDINGS OF THE II INTERNATIONAL CONFERENCE ON ADVANCES IN MATERIALS, SYSTEMS AND TECHNOLOGIES: (CAMSTech-II 2021)
29–31 July 2021
Krasnoyarsk, Russian Federation
Research Article
A comparison study of widespread CNN architectures for speech emotion recognition on spectrogram
Artem Ryabinov a)
Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 39, 14th Line, St. Petersburg, 199178, Russia
Mikhail Uzdiaev b)
Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 39, 14th Line, St. Petersburg, 199178, Russia
b) Corresponding author: [email protected]
AIP Conf. Proc. 2467, 050008 (2022)
Citation
Artem Ryabinov, Mikhail Uzdiaev; A comparison study of widespread CNN architectures for speech emotion recognition on spectrogram. AIP Conf. Proc. 22 June 2022; 2467 (1): 050008. https://doi.org/10.1063/5.0092612