This paper presents a comparative study of deep convolutional neural network (CNN) architectures for recognizing basic emotions in human conversational speech from spectrograms of the speech audio signals. It considers several of the most common CNN architectures: AlexNet, VGG-13, ResNet-18, MobileNet-V2, and EfficientNet-B0. The experiments were carried out on the IEMOCAP dataset, labeled with four classes of basic emotions: Anger, Happiness, Neutral, and Sadness. Accuracy, precision, recall, and F1-score were chosen as quality metrics. The comparison showed that the highest accuracy was achieved by the simplest and shallowest architectures, AlexNet and VGG-13 (0.649 and 0.662, respectively). The study also showed that recognition results decline as the depth and complexity of the architecture increase.
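As an illustration of the four quality metrics named above, the following minimal sketch computes accuracy together with per-class and macro-averaged precision, recall, and F1-score for the four emotion classes. The function name `classification_metrics`, the macro averaging, and the toy labels in the usage note are assumptions made for this example; the abstract does not state how the authors averaged the per-class scores.

```python
from collections import Counter

# The four emotion classes named in the abstract.
LABELS = ["Anger", "Happiness", "Neutral", "Sadness"]


def classification_metrics(y_true, y_pred, labels=LABELS):
    """Return accuracy plus per-class and macro-averaged precision, recall, F1.

    Hypothetical helper for illustration; macro averaging is an assumption.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    # Count every (true label, predicted label) pair once.
    pairs = Counter(zip(y_true, y_pred))

    per_class = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        # Undefined ratios (no predictions or no samples) are reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    # Accuracy is the share of samples whose prediction matches the true label.
    accuracy = sum(pairs[(label, label)] for label in labels) / len(y_true)

    # Macro average: unweighted mean of the per-class scores.
    macro = {
        key: sum(scores[key] for scores in per_class.values()) / len(labels)
        for key in ("precision", "recall", "f1")
    }
    return {"accuracy": accuracy, "per_class": per_class, "macro": macro}
```

For example, calling `classification_metrics(["Anger", "Neutral"], ["Anger", "Sadness"])` on two made-up utterances returns an accuracy of 0.5. Macro averaging weights every class equally, which matters when the classes are imbalanced; a sample-weighted average would be an equally plausible choice.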
