Motivated by the source-filter model of speech production, analysis of emotional speech based on inverse filtering has been conducted extensively. Even so, the relative contributions of glottal source and vocal tract cues to the perception of emotion in speech remain unclear, especially once the effects of the known dominant factors (e.g., F0, intensity, and duration) are removed. In the present study, glottal source and vocal tract parameters were estimated simultaneously, modified in a controlled way, and then used to resynthesize emotional Japanese vowels with a recently developed analysis-by-synthesis method. The resynthesized vowels were presented to native Japanese listeners with normal hearing, who rated the perceived emotions on the valence and arousal dimensions. The results showed that glottal source information played a dominant role in the perception of emotion in vowels, while vocal tract information still contributed to valence and arousal perception after the effects of the F0, intensity, and duration cues were neutralized.
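The source-filter decomposition underlying this kind of analysis-by-synthesis can be illustrated with a minimal sketch: a glottal pulse train (the source) is passed through a cascade of second-order resonators approximating vocal tract formants (the filter). This is not the method used in the study; it is a simplified Rosenberg-style pulse and textbook formant values for /a/, chosen only to show how source and filter parameters can be varied independently during resynthesis.

```python
import numpy as np
from scipy.signal import lfilter

def rosenberg_pulse(n_samples, open_quotient=0.6, speed_quotient=2.0):
    """One cycle of a simplified Rosenberg-style glottal flow pulse."""
    n_open = int(n_samples * open_quotient)
    n_rise = int(n_open * speed_quotient / (1.0 + speed_quotient))
    n_fall = n_open - n_rise
    pulse = np.zeros(n_samples)
    # Rising (opening) phase: raised-cosine increase of glottal flow
    t1 = np.arange(n_rise)
    pulse[:n_rise] = 0.5 * (1.0 - np.cos(np.pi * t1 / n_rise))
    # Falling (closing) phase: faster cosine decay back toward closure
    t2 = np.arange(n_fall)
    pulse[n_rise:n_open] = np.cos(np.pi * t2 / (2.0 * n_fall))
    return pulse

def formant_filter(x, formants, bandwidths, fs):
    """Cascade of second-order resonators approximating the vocal tract."""
    for f, bw in zip(formants, bandwidths):
        r = np.exp(-np.pi * bw / fs)          # pole radius from bandwidth
        theta = 2.0 * np.pi * f / fs          # pole angle from formant freq.
        a = [1.0, -2.0 * r * np.cos(theta), r * r]
        x = lfilter([1.0 - r], a, x)
    return x

def synthesize_vowel(f0=120.0, formants=(730.0, 1090.0, 2440.0),
                     bandwidths=(80.0, 90.0, 120.0), fs=16000, dur=0.3):
    """Source-filter synthesis of a vowel; formant defaults are rough /a/ values."""
    period = int(round(fs / f0))
    n_periods = int(dur * fs / period)
    source = np.tile(rosenberg_pulse(period), n_periods)
    # Differentiating approximates the lip-radiation effect on the flow
    source = np.diff(source, prepend=0.0)
    return formant_filter(source, formants, bandwidths, fs)

vowel = synthesize_vowel()
```

Changing `f0` or the pulse shape parameters modifies only the source, while changing `formants` modifies only the filter, which is the kind of controlled manipulation the study's perceptual experiment relies on.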
