A method is presented for synthesizing vocal-tract spectra from phoneme sequences by mimicking the human speech production process. The model consists of four main processes and is particularly characterized by an adaptive formation of articulatory movements. First, our model determines the time when each phoneme is articulated. Next, it generates the articulatory constraints that must be met for the production of each phoneme, and then it generates trajectories of the articulatory movements that satisfy these constraints. Finally, the time sequence of spectra is estimated from the produced articulatory trajectories. The articulatory constraint of each phoneme does not change with the phonemic context, but the contextual variability of speech is reproduced because of the dynamic articulatory model. The accuracy of the synthesis model was evaluated using data collected by the simultaneous measurement of speech and articulatory movements. The accuracy of the phonemic timing estimates was measured, and the synthesized results were compared with the measured results. Experimental results showed that the model captured the contextual variability of both the articulatory movements and speech acoustics.
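The four processes can be illustrated with a toy sketch. It is not the authors' implementation: the constraint table, the uniform phoneme spacing, the cosine-eased interpolation, and the linear spectral map are all assumptions chosen for brevity, and every name below is hypothetical. The sketch only aims to show how context-independent constraints can still yield context-dependent movements when a phoneme leaves some articulators unconstrained.

```python
import math

# Hypothetical, context-independent constraints (normalized 0..1 positions).
# Assumption: a phoneme may leave some articulators free (here /p/ frees the tongue).
CONSTRAINTS = {
    "a": {"tongue": 0.0, "lips": 1.0},
    "i": {"tongue": 1.0, "lips": 1.0},
    "p": {"lips": 0.0},
}

def phoneme_times(phonemes, step=0.1):
    """Process 1 (assumed uniform spacing): time at which each phoneme is articulated."""
    return [k * step for k in range(len(phonemes))]

def constraint_points(phonemes, times):
    """Process 2: collect (time, value) constraint points per articulator."""
    points = {}
    for phoneme, t in zip(phonemes, times):
        for articulator, value in CONSTRAINTS[phoneme].items():
            points.setdefault(articulator, []).append((t, value))
    return points

def interpolate(points, t):
    """Cosine-eased curve through the constraint points; constant outside them."""
    if t <= points[0][0]:
        return points[0][1]
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t <= t1:
            s = (t - t0) / (t1 - t0)
            w = 0.5 - 0.5 * math.cos(math.pi * s)
            return v0 + w * (v1 - v0)
    return points[-1][1]

def trajectories(points, end_time, dt=0.01):
    """Process 3 (assumed dynamics): smooth trajectories passing through every constraint."""
    n_frames = int(round(end_time / dt)) + 1
    return [
        {name: interpolate(pts, k * dt) for name, pts in points.items()}
        for k in range(n_frames)
    ]

def spectrum(frame):
    """Process 4 placeholder: a made-up linear map to two spectral parameters."""
    return (1.0 - frame["tongue"] + 0.2 * frame["lips"],
            frame["tongue"] + 0.2 * frame["lips"])

def synthesize(phonemes, dt=0.01):
    """Run all four processes; each articulator must be constrained at least once."""
    times = phoneme_times(phonemes)
    frames = trajectories(constraint_points(phonemes, times), times[-1], dt)
    return frames, [spectrum(f) for f in frames]

# The lips are closed for /p/ in both contexts, but the tongue differs.
apa_frames, apa_spectra = synthesize(["a", "p", "a"])
ipi_frames, ipi_spectra = synthesize(["i", "p", "i"])
```

In this sketch, frame 10 corresponds to the articulation time of /p/. The lip constraint is met in both `apa_frames[10]` and `ipi_frames[10]`, while the tongue position and the placeholder spectrum follow the surrounding vowels.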
