Speech plays an important role in human–computer emotional interaction. FaceNet, originally developed for face recognition, has achieved great success owing to its excellent feature extraction ability. In this study, we adopt the FaceNet model and improve it for speech emotion recognition. To apply the model to our task, speech signals are divided into segments at a fixed time interval, and each segment is transformed into a discrete waveform diagram and a spectrogram. The waveforms and spectrograms are then separately fed into FaceNet for end-to-end training. Our empirical study shows that pretraining FaceNet on spectrograms is effective; hence, we pretrain the network on the CASIA dataset and then fine-tune it on the IEMOCAP dataset with waveforms. The CASIA dataset yields the maximum transferable knowledge because of the high accuracy achieved on it, which may be attributed to its clean signals. Our preliminary experimental results show accuracies of 68.96% and 90% on the emotion benchmark datasets IEMOCAP and CASIA, respectively. Cross-training is then conducted on the datasets, and comprehensive experiments are performed. The experimental results indicate that the proposed approach outperforms state-of-the-art single-modal methods on the IEMOCAP dataset.
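The preprocessing pipeline described in the abstract (fixed-interval segmentation followed by spectrogram conversion) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the segment length, sampling rate, and STFT parameters below are assumptions, since the abstract does not specify them.

```python
# Hypothetical sketch of the preprocessing in the abstract: split a speech
# signal into fixed-length segments, then compute a spectrogram per segment.
# All parameter values (segment length, sampling rate, nperseg) are assumed.
import numpy as np
from scipy.signal import spectrogram

def segment_signal(signal, sr, seg_seconds=2.0):
    """Split a 1-D signal into non-overlapping fixed-length segments."""
    seg_len = int(sr * seg_seconds)
    n_segs = len(signal) // seg_len
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_segs)]

def segment_spectrograms(signal, sr, seg_seconds=2.0, nperseg=512):
    """Return one log-magnitude spectrogram (freq x time) per segment."""
    specs = []
    for seg in segment_signal(signal, sr, seg_seconds):
        f, t, Sxx = spectrogram(seg, fs=sr, nperseg=nperseg)
        specs.append(np.log(Sxx + 1e-10))  # log scale compresses dynamic range
    return specs

sr = 16000                            # assumed sampling rate
x = np.random.randn(sr * 5)           # 5 s of noise as a stand-in for speech
specs = segment_spectrograms(x, sr)   # two full 2-s segments from 5 s of audio
print(len(specs), specs[0].shape)
```

In the paper's setup, each such spectrogram (and the corresponding waveform plot) would be rendered as an image and fed to FaceNet for end-to-end training.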
Published: February 25, 2021
Speech emotion recognition based on transfer learning from the FaceNet framework a)
Special Collection: Machine Learning in Acoustics
Shuhua Liu,1,b) Mengyu Zhang,1 Ming Fang,1 Jianwei Zhao,1 Kun Hou,1 and Chih-Cheng Hung2
1Northeast Normal University, Changchun, Jilin Province 130117, China
2College of Computing and Software Engineering, Kennesaw State University, Marietta, Georgia 30060, USA
a) This paper is part of a special issue on Machine Learning in Acoustics.
b) Electronic mail: [email protected]
J. Acoust. Soc. Am. 149, 1338–1345 (2021)
Article history: Received October 01, 2020; Accepted January 26, 2021
Citation
Shuhua Liu, Mengyu Zhang, Ming Fang, Jianwei Zhao, Kun Hou, Chih-Cheng Hung; Speech emotion recognition based on transfer learning from the FaceNet framework. J. Acoust. Soc. Am. 1 February 2021; 149 (2): 1338–1345. https://doi.org/10.1121/10.0003530