Current intonation models for parametric speech synthesis usually disregard microprosodic variations of the fundamental frequency (f0), even when pitch is modelled explicitly. Although numerous studies have examined the nature and origin of microprosody, little research has addressed its audibility or its effect on the naturalness of synthetic speech. This work studied the influence of obstruent-related microprosodic variations on the perceived naturalness of articulatory speech synthesis. A small corpus of 20 German words and sentences was re-synthesized using the state-of-the-art articulatory synthesizer VocalTractLab. The pitch contours of the real utterances were extracted and fitted with the Target Approximation Model. After the real microprosodic variations had been removed from the resulting pitch contours, synthetic variations were applied based on a microprosody model. Multiple stimuli with different microprosody amplitudes were then synthesized and evaluated in a listening experiment. The results indicate that microprosodic variations are barely audible but can, in certain cases, increase the perceived naturalness of the synthesized speech.
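The step of applying synthetic variations at different amplitudes can be illustrated with a minimal sketch. It is not the microprosody model used in the study: the exponentially decaying shape, the 1-semitone peak, the 50 ms time constant, and the function name `apply_obstruent_perturbation` are all illustrative assumptions, and a flat 120 Hz contour stands in for a fitted smooth pitch contour.

```python
import math


def semitones_to_ratio(st):
    """Convert a pitch offset in semitones to a frequency ratio."""
    return 2.0 ** (st / 12.0)


def apply_obstruent_perturbation(times_ms, f0_hz, onset_ms, peak_st,
                                 decay_ms=50.0, amplitude=1.0):
    """Return a copy of f0_hz with a decaying offset added from onset_ms on.

    peak_st   -- offset in semitones at vowel onset (hypothetical value)
    decay_ms  -- time constant of the exponential decay (hypothetical value)
    amplitude -- scale factor; 0.0 leaves the smooth contour unchanged
    """
    out = []
    for t, f in zip(times_ms, f0_hz):
        if t < onset_ms:
            # Samples before the vowel onset are left untouched.
            out.append(f)
        else:
            offset_st = amplitude * peak_st * math.exp(-(t - onset_ms) / decay_ms)
            out.append(f * semitones_to_ratio(offset_st))
    return out


# Stand-in for a smooth, microprosody-free contour: flat 120 Hz, 10 ms steps.
times_ms = list(range(0, 300, 10))
smooth = [120.0] * len(times_ms)

# One contour per amplitude factor, mirroring the idea of several stimuli
# that differ only in how strongly the perturbation is applied.
stimuli = {
    a: apply_obstruent_perturbation(times_ms, smooth, onset_ms=100,
                                    peak_st=1.0, amplitude=a)
    for a in (0.0, 1.0, 2.0)
}
```

Working in semitones keeps the perturbation proportional to the underlying pitch, so the same amplitude factor has a comparable relative effect on low and high voices.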

See supplementary materials at https://github.com/TUD-STKS/Microprosody for the segment sequences, gestural scores, and the relevant code to produce the stimuli files.