A transformation function (TF) that reconstructs neutral speech articulatory trajectories (NATs) from whispered speech articulatory trajectories (WATs) is investigated, such that the dynamic time warped (DTW) distance between the transformed whispered and the original neutral articulatory movements is minimized. Three candidate TFs are considered: an affine function with a diagonal matrix (Ad), which reconstructs each NAT from the corresponding WAT alone, an affine function with a full matrix (Af), and a deep neural network (DNN) based nonlinear function, the latter two reconstructing each NAT from all WATs. Experiments reveal that the transformation is approximated well by Af, since it generalizes better across subjects and achieves the lowest average DTW distance of 5.20 (±1.27) mm, a relative improvement of 7.47%, 4.76%, and 7.64% over Ad, the DNN, and the best baseline scheme, respectively. Further analysis of the differences between neutral and whispered articulation reveals that the articulators exhibit exaggerated movements during whispering in order to reconstruct the lip movements of neutral speech. It is also observed that, among the articulators considered in the study, the tongue exhibits higher precision and stability while whispering, implying that subjects control their tongue movements carefully in order to produce intelligible whispered speech.
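The full-matrix affine mapping (Af) and the DTW objective described above can be sketched on synthetic data. This is an illustrative reconstruction, not the authors' implementation: the function names (`dtw_distance`, `fit_affine_full`), the 12-dimensional toy trajectories, and the use of frame-aligned least squares are assumptions for demonstration; the actual study fits the map on DTW-aligned whispered/neutral recordings from parallel data.

```python
import numpy as np

def dtw_distance(X, Y):
    """Path-normalized DTW distance between trajectories X (Tx, d) and
    Y (Ty, d), using Euclidean local cost between frames."""
    Tx, Ty = len(X), len(Y)
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],
                                               D[i, j - 1],
                                               D[i - 1, j - 1])
    return D[Tx, Ty] / (Tx + Ty)  # normalize by an upper bound on path length

def fit_affine_full(W, N):
    """Least-squares fit of a full-matrix affine map N ~ W @ A + b on
    frame-aligned whispered (W) and neutral (N) trajectories."""
    Wa = np.hstack([W, np.ones((len(W), 1))])   # append bias column
    M, *_ = np.linalg.lstsq(Wa, N, rcond=None)  # M is (d+1, d); last row is b
    return M

# Toy example: 12-dim trajectories (e.g., x/y positions of 6 sensors).
rng = np.random.default_rng(0)
W = rng.standard_normal((200, 12))        # whispered frames x dims
A_true = rng.standard_normal((12, 12))
b_true = rng.standard_normal(12)
N = W @ A_true + b_true                   # synthetic "neutral" counterpart
M = fit_affine_full(W, N)
N_hat = np.hstack([W, np.ones((200, 1))]) @ M  # reconstructed NATs
print(dtw_distance(N_hat, N))             # near zero for this exactly affine toy
```

Because each output dimension here depends on all input dimensions, the fitted matrix plays the role of Af; constraining `A` to be diagonal would give the Ad variant, which maps each trajectory only from its own whispered counterpart.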

1. 3D Electromagnetic Articulograph (1979), http://www.articulograph.de/ (Last viewed September 14, 2017).
2. Ahmadi, F., McLoughlin, I. V., and Sharifzadeh, H. R. (2008). "Analysis-by-synthesis method for whisper-speech reconstruction," in IEEE Asia Pacific Conference on Circuits and Systems, APCCAS, pp. 1280–1283.
3. Aryal, S., and Gutierrez-Osuna, R. (2016). "Data driven articulatory synthesis with deep neural networks," Comput. Speech Lang. 36(C), 260–273.
4. Beskow, J. (2003). "Talking heads-models and applications for multimodal speech synthesis," Ph.D. thesis, Institutionen för Talöverföring och Musikakustik, Stockholm, Sweden.
5. Chollet, F. (2015). "keras," https://github.com/fchollet/keras (Last viewed September 14, 2017).
6. Coleman, J., Grabe, E., and Braun, B. (2002). "Larynx movements and intonation in whispered speech," Summary of research supported by the British Academy.
7. Curry, R. (1937). "The mechanism of pitch change in the voice," J. Physiol. 91(3), 254–258.
8. Denby, B., Schultz, T., Honda, K., Hueber, T., Gilbert, J. M., and Brumberg, J. S. (2010). "Silent speech interfaces," Speech Commun. 52(4), 270–287.
9. Fagan, M., Ell, S., Gilbert, J., Sarrazin, E., and Chapman, P. (2008). "Development of a (silent) speech recognition system for patients following laryngectomy," Med. Eng. Phys. 30(4), 419–425.
10. Fagel, S., and Clemens, C. (2004). "An articulation model for audiovisual speech synthesis—determination, adjustment, evaluation," Speech Commun. 44(1), 141–154.
11. Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S. (1993). "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report No. 93.
12. Ghosh, P. K., and Narayanan, S. (2010). "A generalized smoothness criterion for acoustic-to-articulatory inversion," J. Acoust. Soc. Am. 128(4), 2162–2172.
13. Gilchrist, A. G. (1973). "Rehabilitation after laryngectomy," Acta Oto-Laryngologica 75(2-6), 511–518.
14. Gonzalez, J. A., Cheah, L. A., Gilbert, J. M., Bai, J., Ell, S. R., Green, P. D., and Moore, R. K. (2016). "A silent speech system based on permanent magnet articulography and direct synthesis," Comput. Speech Lang. 39, 67–87.
15. Higashikawa, M., Green, J., Moore, C., and Minifie, F. (2003). "Lip kinematics for /p/ and /b/ production during whispered and voiced speech," Folia Phoniatr. Logop. 55, 1–9.
16. Jackson, P. J., and Singampalli, V. D. (2008). "Statistical identification of critical, dependent and redundant articulators," J. Acoust. Soc. Am. 123(5), 3321.
17. Janke, M., Wand, M., Heistermann, T., Schultz, T., and Prahallad, K. (2014). "Fundamental frequency generation for whisper-to-audible speech conversion," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2579–2583.
18. Jovičić, S. T., and Šarić, Z. (2008). "Acoustic analysis of consonants in whispered speech," J. Voice 22(3), 263–274.
19. Kingma, D. P., and Ba, J. (2014). "Adam: A method for stochastic optimization," arXiv:1412.6980.
20. Lee, K. F., and Hon, H. W. (1989). "Speaker-independent phone recognition using hidden Markov models," IEEE Trans. Acoust. Speech Signal Process. 37(11), 1641–1648.
21. Ludlow, C. L. (2005). "Central nervous system control of the laryngeal muscles in humans," Respir. Physiol. Neurobiol. 147(2), 205–222.
22. McLoughlin, I. V., Sharifzadeh, H. R., Tan, S. L., Li, J., and Song, Y. (2015). "Reconstruction of phonated speech from whispers using formant-derived plausible pitch modulation," ACM Trans. Access. Comput. (TACCESS) 6(4), 12.
23. Morris, R. W., and Clements, M. A. (2002). "Reconstruction of speech from whispers," Med. Eng. Phys. 24(7), 515–520.
24. Müller, M. (2007). "Dynamic time warping," in Information Retrieval for Music and Motion, pp. 69–84.
25. Osfar, M. J. (2011). "Articulation of whispered alveolar consonants," Master's thesis, University of Illinois at Urbana-Champaign, Champaign, IL.
26. Parnell, M., Amerman, J. D., and Wells, G. B. (1977). "Closure and constriction duration for alveolar consonants during voiced and whispered speaking conditions," J. Acoust. Soc. Am. 61, 612–613.
27. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., and Vesely, K. (2011). "The Kaldi speech recognition toolkit," in IEEE Workshop on Automatic Speech Recognition and Understanding.
28. Qiao, Y., and Yasuhara, M. (2006). "Affine invariant dynamic time warping and its application to online rotated handwriting recognition," in 18th International Conference on Pattern Recognition (ICPR'06), Vol. 2, pp. 905–908.
29. Scanlon, P., Ellis, D. P. W., and Reilly, R. B. (2007). "Using broad phonetic group experts for improved speech recognition," IEEE Trans. Audio Speech Lang. Process. 15(3), 803–812.
30. Schönle, P. W., Gräbe, K., Wenig, P., Höhne, J., Schrader, J., and Conrad, B. (1987). "Electromagnetic articulography: Use of alternating magnetic fields for tracking movements of multiple points inside and outside the vocal tract," Brain Lang. 31(1), 26–35.
31. Schwartz, M. F. (1972). "Bilabial closure durations for /p/, /b/, and /m/ in voiced and whispered vowel environments," J. Acoust. Soc. Am. 51, 2025–2029.
32. Sharifzadeh, H. R., McLoughlin, I. V., and Ahmadi, F. (2010). "Reconstruction of normal sounding speech for laryngectomy patients through a modified CELP codec," IEEE Trans. Biomed. Eng. 57(10), 2448–2458.
33. Tartter, V. C. (1989). "What's in a whisper?," J. Acoust. Soc. Am. 86, 1678–1683.
34. The Theano Development Team, Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A., et al. (2016). "Theano: A Python framework for fast computation of mathematical expressions," arXiv:1605.02688.
35. Toda, T., and Shikano, K. (2005). "NAM-to-speech conversion with Gaussian mixture models," in INTERSPEECH, pp. 1957–1960.
36. Toutios, A., and Maeda, S. (2012). "Articulatory VCV synthesis from EMA data," in INTERSPEECH, pp. 2566–2569.
37. Toutios, A., and Narayanan, S. (2013). "Articulatory synthesis of French connected speech from EMA data," in INTERSPEECH, pp. 2738–2742.
38. Wang, J., Hahm, S., and Mau, T. (2015). "Determining an optimal set of flesh points on tongue, lips, and jaw for continuous silent speech recognition," in Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies, Association for Computational Linguistics, Dresden, Germany, pp. 79–85.
39. Wang, J., Samal, A., and Green, J. R. (2014). "Preliminary test of a real-time, interactive silent speech interface based on electromagnetic articulograph," in SLPAT@ACL, Association for Computational Linguistics, pp. 38–45.
40. Wang, J., Samal, A., Green, J. R., and Rudzicz, F. (2012a). "Sentence recognition from articulatory movements for silent speech interfaces," in ICASSP, IEEE, pp. 4985–4988.
41. Wang, J., Samal, A., Green, J. R., and Rudzicz, F. (2012b). "Whole-word recognition from articulatory movements for silent speech interfaces," in INTERSPEECH, ISCA, pp. 1327–1330.
42. Wrench, A. (1999). "MOCHA-TIMIT," speech database.
43. Wszołek, W., Modrzejewski, M., and Przysiezny, M. (2014). "Acoustic analysis of esophageal speech in patients after total laryngectomy," Arch. Acoust. 32(4), 151–158.
44. Yoshioka, H. (2008). "The role of tongue articulation for /s/ and /z/ production in whispered speech," in Proceedings of Acoustics, pp. 2335–2338.
