Emotion is a central component of verbal communication between humans. With advances in machine learning and affective computing, automatic emotion recognition is increasingly feasible and sought after. To examine the connection between emotional speech and perceptions of group dynamics, such as leadership and contribution, a new dataset (14 group meetings, 45 participants) is collected for analyzing collaborative group work based on the lunar survival task. To establish a training database, each participant's audio is manually annotated both categorically and along a three-dimensional scale with axes of activation, dominance, and valence; the audio is then converted to spectrograms. The performance of several neural network architectures for predicting speech emotion is compared on two tasks: categorical emotion classification and 3D emotion regression using multitask learning. Pretraining each architecture on the well-known IEMOCAP (Interactive Emotional Dyadic Motion Capture) corpus improves performance on the new group dynamics dataset. For both tasks, the two-dimensional convolutional long short-term memory (ConvLSTM) network achieves the highest overall performance. By regressing the annotated emotions against post-task questionnaire variables for each participant, it is shown that the emotional speech content of a meeting can predict 71% of perceived group leaders and 86% of major contributors.
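The spectrogram preprocessing summarized above can be sketched as follows. This is a minimal illustration assuming librosa; the sample rate, mel-band count, and window/hop sizes are illustrative assumptions, as the abstract does not state the exact parameters.

```python
# Minimal sketch of spectrogram preprocessing, assuming librosa.
# Sample rate, mel-band count, and window/hop sizes are assumptions;
# the abstract does not give the exact settings.
import librosa
import numpy as np

def audio_to_logmel(path: str, sr: int = 16000, n_mels: int = 64,
                    n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Load one participant utterance and return a log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)                 # resample to a fixed rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)      # shape: (n_mels, n_frames)
```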
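A 2D convolutional LSTM with two output heads matches the multitask setup described in the abstract: one head for categorical emotion classification and one regressing activation, dominance, and valence. The sketch below uses Keras; the chunking scheme, layer widths, class count, and loss weights are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch of a 2D ConvLSTM with multitask heads (classification +
# 3D activation/dominance/valence regression). All sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_convlstm(n_steps=10, n_mels=64, frames=32, n_classes=4):
    # Input: a spectrogram split into n_steps time chunks of size
    # (n_mels, frames), each treated as a single-channel "image".
    inp = layers.Input(shape=(n_steps, n_mels, frames, 1))
    x = layers.ConvLSTM2D(32, (3, 3), padding="same", return_sequences=True)(inp)
    x = layers.BatchNormalization()(x)
    x = layers.ConvLSTM2D(32, (3, 3), padding="same")(x)   # final state only
    x = layers.GlobalAveragePooling2D()(x)
    emotion = layers.Dense(n_classes, activation="softmax", name="emotion")(x)
    adv = layers.Dense(3, activation="linear", name="adv")(x)  # act/dom/val
    model = Model(inp, [emotion, adv])
    model.compile(optimizer="adam",
                  loss={"emotion": "sparse_categorical_crossentropy",
                        "adv": "mse"},
                  loss_weights={"emotion": 1.0, "adv": 1.0})
    return model
```

For the pretraining strategy in the abstract, one would fit this model on IEMOCAP first and then continue training on the group-meeting data with the same heads.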
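The final analysis step relates per-participant emotion content to post-task questionnaire outcomes such as perceived leadership. The sketch below uses a simple scikit-learn classifier on summary features; the feature choice (mean activation/dominance/valence per speaker) and the classifier are assumptions, and the arrays are random placeholders, not the study's data.

```python
# Hedged sketch: predicting perceived group leaders from per-participant
# emotion statistics. Features, classifier, and data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((45, 3))        # placeholder: mean act/dom/val per participant
y = rng.integers(0, 2, 45)     # placeholder: 1 = rated a leader post-task

clf = LogisticRegression()
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```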
Classifying the emotional speech content of participants in group meetings using convolutional long short-term memory network
Special Collection: Machine Learning in Acoustics
Mallory M. Morgan,1 Indrani Bhattacharya,2 Richard J. Radke,2 and Jonas Braasch1
1School of Architecture, Rensselaer Polytechnic Institute, Troy, New York 12180, USA
2Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, New York 12180, USA
J. Acoust. Soc. Am. 149, 885–894 (2021)
Article history
Received: July 28 2020
Accepted: January 09 2021
Published: February 04 2021
Citation
Mallory M. Morgan, Indrani Bhattacharya, Richard J. Radke, Jonas Braasch; Classifying the emotional speech content of participants in group meetings using convolutional long short-term memory network. J. Acoust. Soc. Am. 1 February 2021; 149 (2): 885–894. https://doi.org/10.1121/10.0003433