Speech plays an important role in human–computer emotional interaction. FaceNet, widely used in face recognition, owes much of its success to its strong feature-extraction ability. In this study, we adopt the FaceNet model and adapt it to speech emotion recognition. To apply the model to speech, signals are divided into segments at a fixed time interval, and each segment is converted into a discrete waveform diagram and a spectrogram. The waveform and spectrogram images are then fed separately into FaceNet for end-to-end training. Our empirical study shows that pretraining FaceNet on spectrograms is effective; we therefore pretrain the network on the CASIA dataset and fine-tune it on the IEMOCAP dataset with waveforms. Because recognition accuracy on CASIA is high, presumably owing to its clean signals, the network derives the maximum transfer-learning benefit from this dataset. Our preliminary experiments reach accuracies of 68.96% on IEMOCAP and 90% on CASIA. We then conduct cross-training on the two datasets and perform comprehensive experiments. The results indicate that the proposed approach outperforms state-of-the-art single-modal methods on the IEMOCAP dataset.
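The preprocessing described above can be illustrated with a short sketch. The following Python snippet is a minimal example, not the authors' code: it splits an utterance into fixed-length segments and renders each segment as a waveform diagram and a log-mel spectrogram image of the kind a FaceNet-style CNN could ingest. The file name utterance.wav, the 2 s segment length, the 16 kHz sampling rate, and the mel parameters are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (assumed parameters): segment an utterance and save, for each
# segment, a waveform diagram and a log-mel spectrogram image.
import numpy as np
import librosa
import matplotlib.pyplot as plt


def segment_signal(y, sr, seg_seconds=2.0):
    """Split a 1-D signal into non-overlapping fixed-length segments (remainder dropped)."""
    seg_len = int(seg_seconds * sr)
    n_segs = len(y) // seg_len
    return [y[i * seg_len:(i + 1) * seg_len] for i in range(n_segs)]


def save_segment_images(seg, sr, prefix):
    """Render one segment as a waveform PNG and a log-mel spectrogram PNG."""
    # Waveform diagram
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)
    ax.plot(np.arange(len(seg)) / sr, seg, linewidth=0.5)
    ax.axis("off")
    fig.savefig(f"{prefix}_wave.png", bbox_inches="tight", pad_inches=0)
    plt.close(fig)

    # Log-mel spectrogram image
    mel = librosa.feature.melspectrogram(y=seg, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=128)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)
    ax.imshow(log_mel, origin="lower", aspect="auto", cmap="magma")
    ax.axis("off")
    fig.savefig(f"{prefix}_spec.png", bbox_inches="tight", pad_inches=0)
    plt.close(fig)


y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
for k, seg in enumerate(segment_signal(y, sr)):
    save_segment_images(seg, sr, prefix=f"utt_seg{k}")
```

The resulting image pairs can then be used as two separate input streams for pretraining on CASIA and fine-tuning on IEMOCAP, as described above.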

1. W. Q. Zheng, J. S. Yu, and Y. X. Zou, "An experimental study of speech emotion recognition based on deep convolutional neural networks," in Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), Xi'an, China (September 21–24, 2015), pp. 827–831.
2. F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA (June 7–12, 2015), pp. 815–823.
3. C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Lang. Resour. Eval. 42(4), 335–359 (2008).
4. CASIA Chinese Emotion Corpus, 2008, http://www.chineseldc.org/resource_info.php?rid=76.
5. M. B. Akçay and K. Oğuz, "Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers," Speech Commun. 116, 56–76 (2020).
6. Information on the Interspeech conference, available at http://www.interspeech2020.org/ (Last viewed 10/29/2020).
7. A. Satt, S. Rozenberg, and R. Hoory, "Efficient emotion recognition from speech using deep learning on spectrograms," in Proceedings of Interspeech 2017, Stockholm, Sweden (August 20–24, 2017), pp. 1089–1093.
8. X. Ma, Z. Wu, J. Jia, M. Xu, H. Meng, and L. Cai, "Emotion recognition from variable-length speech segments using deep learning on spectrograms," in Proceedings of Interspeech 2018, Hyderabad, India (September 2–6, 2018), pp. 3683–3687.
9. D. Dai, Z. Wu, R. Li, X. Wu, J. Jia, and H. M. Meng, "Learning discriminative features from spectrograms using center loss for speech emotion recognition," in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK (May 12–17, 2019), pp. 7405–7409.
10. M. Neumann and N. T. Vu, "Improving speech emotion recognition with unsupervised representation learning on unlabeled speech," in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK (May 12–17, 2019), pp. 7390–7394.
11. S. Sabour, N. Frosst, and G. Hinton, "Dynamic routing between capsules," in Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA (December 4–9, 2017), pp. 3856–3866.
12. X. Wu, S. Liu, Y. Cao, X. Li, J. Yu, D. Dai, X. Ma, S. Hu, Z. Wu, X. Liu, and H. Meng, "Speech emotion recognition using capsule networks," in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK (May 12–17, 2019), pp. 6695–6699.
13. H. Meng, T. Yan, F. Yuan, and H. Wei, "Speech emotion recognition from 3D log-mel spectrograms with deep learning network," IEEE Access 7, 125868–125881 (2019).
14. G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China (March 20–25, 2016), pp. 5200–5204.
15. C. W. Huang and S. S. Narayanan, "Attention assisted discovery of sub-utterance structure in speech emotion recognition," in Proceedings of Interspeech 2016, San Francisco, CA (September 8–12, 2016), pp. 1387–1391.
16. M. Neumann and N. T. Vu, "Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech," in Proceedings of Interspeech 2017, Stockholm, Sweden (August 20–24, 2017), pp. 1263–1267.
17. S. Mirsamadi, E. Barsoum, and C. Zhang, "Automatic speech emotion recognition using recurrent neural networks with local attention," in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA (March 5–9, 2017), pp. 2227–2231.
18. C. W. Huang and S. Narayanan, "Shaking acoustic spectral sub-bands can better regularize learning in affective computing," in Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada (April 15–20, 2018), pp. 6827–6831.
19. Z. Zhang, B. Wu, and B. Schuller, "Attention-augmented end-to-end multi-task learning for emotion prediction from speech," in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK (May 12–17, 2019), pp. 6705–6709.
20. Z. Aldeneh and E. M. Provost, "Using regional saliency for speech emotion recognition," in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA (March 5–9, 2017), pp. 2741–2745.
21. K. Mangalam and T. Guha, "Learning spontaneity to improve emotion recognition in speech," in Proceedings of Interspeech 2018, Hyderabad, India (September 2–6, 2018), pp. 946–950.
22. R. Li, Z. Wu, J. Jia, S. Zhao, and H. Meng, "Dilated residual network with multi-head self-attention for speech emotion recognition," in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK (May 12–17, 2019), pp. 6675–6679.
23. J. Huang, Y. Li, J. Tao, and Z. Lian, "Speech emotion recognition from variable-length inputs with triplet loss function," in Proceedings of Interspeech 2018, Hyderabad, India (September 2–6, 2018), pp. 3673–3677.
24. V. Chernykh and P. Prikhodko, "Emotion recognition from speech with recurrent neural networks," in Proceedings of Interspeech 2017, Stockholm, Sweden (August 20–24, 2017), pp. 3673–3677.
25. W. Han, H. Ruan, X. Chen, Z. Wang, H. Li, and B. Schuller, "Towards temporal modelling of categorical speech emotion recognition," in Proceedings of Interspeech 2018, Hyderabad, India (September 2–6, 2018), pp. 932–936.
26. C. Szegedy, W. Liu, Y. Q. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA (June 7–12, 2015), pp. 1–9.
27. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proceedings of the International Conference on Learning Representations (ICLR), pp. 1884–2021 (2015).
28. M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proceedings of Computer Vision–ECCV 2014, pp. 818–833 (2014).
29. B. E. D. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Commun. 25(1), 117–132 (1998).
30. H. M. Fayek, M. Lech, and L. Cavedon, "Evaluating deep learning architectures for speech emotion recognition," Neural Netw. 92, 60–68 (2017).
31. L. H. Sun, S. Fu, and F. Wang, "Decision tree SVM model with Fisher feature selection for speech emotion recognition," EURASIP J. Audio Speech Music Process. 2019, 1–14 (2019).
32. L. Chen, W. Su, M. Wu, W. Pedrycz, and K. Hirota, "A fuzzy deep neural network with sparse autoencoder for emotional intention understanding in human–robot interaction," IEEE Trans. Fuzzy Syst. 28(7), 1252–1264 (2020).
33. L.-F. Chen, W. Su, Y. Feng, and M. Wu, "Two-layer fuzzy multiple random forest for speech emotion recognition in human–robot interaction," Inf. Sci. 509, 150–163 (2020).
34. J. T. Liu, W. M. Zheng, Y. Zong, C. Lu, and C. Tang, "Cross-corpus speech emotion recognition based on deep domain-adaptive convolutional neural network," IEICE Trans. Inf. Syst. E103.D(2), 459–463 (2020).
35. Z. T. Liu, Q. Xie, M. Wu, and W. Cao, "Speech emotion recognition based on an improved brain emotion learning model," Neurocomputing 309, 145–156 (2018).
36. G. L. Li, Y. Tie, and L. Qi, "Multi-feature speech emotion recognition based on random forest classification optimization," Microelectronics Comput. 36(1), 70–73 (2019).
37. M. Gao, J. Dong, D. Zhou, Q. Zhang, and D. Yang, "End-to-end speech emotion recognition based on one-dimensional convolutional neural network," in Proceedings of the 3rd International Conference on Innovation in Artificial Intelligence, pp. 78–82 (2019).