Emotion is a central component of verbal communication between humans. Owing to advances in machine learning and the development of affective computing, automatic emotion recognition is increasingly possible and sought after. To examine the connection between emotional speech and perceptions of significant group dynamics, such as leadership and contribution, a new dataset (14 group meetings, 45 participants) is collected for analyzing collaborative group work based on the lunar survival task. To establish a training database, each participant's audio is manually annotated both categorically and along a three-dimensional scale with axes of activation, dominance, and valence, and the audio is then converted to spectrograms. The performance of several neural network architectures for predicting speech emotion is compared on two tasks: categorical emotion classification and 3D emotion regression using multitask learning. Pretraining each architecture on the well-known IEMOCAP (Interactive Emotional Dyadic Motion Capture) corpus improves performance on this new group dynamics dataset. For both tasks, the two-dimensional convolutional long short-term memory (conv-LSTM) network achieves the highest overall performance. By regressing the annotated emotions against post-task questionnaire variables for each participant, it is shown that the emotional speech content of a meeting can predict 71% of perceived group leaders and 86% of major contributors.
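To make the described pipeline concrete, below is a minimal Keras sketch of a two-headed 2D conv-LSTM of the kind the abstract names. The input shape (128 frames by 128 mel bands), layer sizes, four-category label set, and equal loss weighting are illustrative assumptions, not the authors' configuration; the point is the shared convolutional-recurrent trunk feeding both a categorical classification head and an activation/dominance/valence regression head (multitask learning).

```python
# Illustrative sketch only: input shape, layer sizes, and the
# four-category label set are assumptions, not the paper's settings.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_conv_lstm(n_frames=128, n_mels=128, n_classes=4):
    spec = layers.Input(shape=(n_frames, n_mels, 1), name="spectrogram")

    # 2D convolutions extract local time-frequency features from the
    # spectrogram; each pooling step halves both axes.
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(spec)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)

    # Flatten the frequency axis so the LSTM sees one feature vector
    # per downsampled time step.
    x = layers.Reshape((n_frames // 4, (n_mels // 4) * 64))(x)
    x = layers.LSTM(128)(x)

    # Two heads share the trunk (multitask learning): categorical
    # emotion classification and 3D regression of activation,
    # dominance, and valence (ADV).
    category = layers.Dense(n_classes, activation="softmax", name="category")(x)
    adv = layers.Dense(3, name="adv")(x)

    model = Model(spec, [category, adv])
    model.compile(
        optimizer="adam",
        loss={"category": "categorical_crossentropy", "adv": "mse"},
        loss_weights={"category": 1.0, "adv": 1.0},  # assumed equal weighting
    )
    return model
```

In this setup, the continuous ADV targets and the categorical labels regularize each other through the shared trunk, which is the usual motivation for multitask learning in speech emotion recognition; the pretraining step described above would amount to fitting such a model on IEMOCAP first and then fine-tuning it on the group-meeting recordings.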
