This paper presents a comprehensive quantitative study of the lip movements of a given speaker in different speech/nonspeech contexts, with a particular focus on silences (i.e., when the speaker produces no sound). The aim is to characterize the relationship between “lip activity” and “speech activity” and then to use visual speech information as a voice activity detector (VAD). To this end, an original audiovisual corpus was recorded with two speakers engaged in a spontaneous face-to-face dialog while located in separate rooms. Each speaker communicated with the other through a microphone, a camera, a screen, and headphones. This setup captured a separate audio signal for each speaker while synchronously monitoring that speaker’s lip movements. A comprehensive analysis was carried out on the lip shapes and lip movements in both silence and nonsilence (i.e., speech plus nonspeech audible events). A single visual parameter, defined to characterize the lip movements, was shown to detect silence sections efficiently. The result is a visual VAD that can be used in any kind of noise environment, including intricate and highly nonstationary noise, e.g., multiple and/or moving noise sources or competing speech signals.
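The general idea of such a visual VAD can be illustrated with a minimal sketch: compute a lip-activity parameter from per-frame lip geometry (here, the smoothed frame-to-frame velocity of inner-lip width and height — a hypothetical parameter chosen for illustration, not the exact parameter defined in the paper) and label frames with low activity as silence. The function names and the threshold value are assumptions.

```python
import numpy as np


def lip_activity(width, height, win=5):
    """Smoothed absolute frame-to-frame velocity of lip width and height.

    This is an illustrative lip-activity parameter, not the paper's
    exact definition; `width` and `height` are per-frame inner-lip
    measurements (1-D arrays in arbitrary units).
    """
    # Absolute first differences of both lip dimensions (velocity proxy).
    v = np.abs(np.diff(width, prepend=width[0])) \
        + np.abs(np.diff(height, prepend=height[0]))
    # Moving-average smoothing over a short window of frames.
    kernel = np.ones(win) / win
    return np.convolve(v, kernel, mode="same")


def visual_vad(width, height, threshold=0.1):
    """Label each video frame as speech (True) or silence (False)
    by thresholding the lip-activity parameter."""
    return lip_activity(width, height) > threshold
```

A synthetic trajectory — lips held still for the first half of the sequence, oscillating in the second half — would then be labeled mostly silence, then mostly speech; in practice the threshold would be tuned on labeled audiovisual data such as the corpus described above.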

1. Abrard, F., and Deville, Y. (2003). “Blind separation of dependent sources using the ‘time-frequency ratio of mixture’ approach,” in Proceedings of the International Symposium on Signal Processing and Its Applications (ISSPA), Paris, France, pp. 81–84.
2. Abry, C., and Boë, L. J. (1986). “Laws for lips,” Speech Commun. 5, 97–104.
3. Aubrey, A., Rivet, B., Hicks, Y., Girin, L., Chambers, J., and Jutten, C. (2007). “Comparison of appearance models and retinal filtering for visual voice activity detection,” in Proceedings of the European Signal Processing Conference (EUSIPCO), Poznan, Poland.
4. Bailly, G., and Badin, P. (2002). “Seeing tongue movements from outside,” in Proceedings of the International Conference on Spoken Language Processing (ICSLP), Denver, CO, pp. 1913–1916.
5. Bailly, G., Berard, M., Elisei, F., and Odisio, M. (2003). “Audiovisual speech synthesis,” Speech Technol. 6, 331–346.
6. Barker, J. P., and Berthommier, F. (1999). “Estimation of speech acoustics from visual speech features: A comparison of linear and non-linear models,” in Proceedings of the Conference on Audio-Visual Speech Processing (AVSP), Santa Cruz, CA, pp. 112–117.
7. Benoît, C., Guiard-Marigny, T., Le Goff, B., and Adjoudani, A. (1996). “Which components of the face do humans and machines best speechread?” in Speechreading by Man and Machine: Models, Systems and Applications, NATO Advanced Studies Institute, Series F: Computer and System Sciences, edited by D. G. Stork and M. E. Hennecke (Springer, New York), pp. 315–328.
8. Benoît, C., Lallouache, T., Mohamadi, T., and Abry, C. (1992). “A set of French visemes for visual speech synthesis,” in Talking Machines: Theories, Models, and Designs, edited by G. Bailly, C. Benoît, and T. R. Sawallis (North-Holland, Amsterdam), pp. 485–504.
9. Benoît, C., Mohamadi, T., and Kandel, S. (1994). “Effects of phonetic context on audio-visual intelligibility of French,” J. Speech Hear. Res. 37, 1195–1293.
10. Bernstein, L. E., Takayanagi, S., and Auer, E. T., Jr. (2004). “Auditory speech detection in noise enhanced by lipreading,” Speech Commun. 44, 5–18.
11. Bertelson, P. (1999). “Ventriloquism: A case of crossmodal perceptual grouping,” in Cognitive Contributions to the Perception of Spatial and Temporal Events, edited by G. Aschersleben, T. Bachmann, and J. Müsseler (Elsevier, Amsterdam), pp. 347–362.
12. Calvert, G. A., and Campbell, R. (2003). “Reading speech from still and moving faces: The neural substrates of visible speech,” J. Cogn. Neurosci. 15, 57–70.
13. Campbell, N. (2007). “Approaches to conversational speech rhythm: Speech activity in two-person telephone dialogues,” in Proceedings of the International Congress of Phonetic Sciences (ICPhS), Saarbrücken, Germany, pp. 343–348.
14. Cosi, P., Fusaro, A., and Tisato, G. (2003). “LUCIA: A new Italian talking-head based on a modified Cohen-Massaro’s labial coarticulation model,” in Proceedings of the European Conference on Speech Communication and Technology (EuroSpeech), Geneva, Switzerland, pp. 2269–2272.
15. De Cueto, P., Neti, C., and Senior, A. W. (2000). “Audio-visual intent-to-speak detection in human-computer interaction,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Istanbul, Turkey, pp. 2373–2376.
16. Deligne, S., Potamianos, G., and Neti, C. (2002). “Audio-visual speech enhancement with AVCDCN (audiovisual codebook dependent cepstral normalization),” in Proceedings of the International Conference on Spoken Language Processing (ICSLP), Denver, CO, pp. 1449–1452.
17. Ephraim, Y., and Malah, D. (1984). “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process. 32, 1109–1121.
18. Erber, N. P. (1975). “Auditory-visual perception of speech,” J. Speech Hear. Disord. 40, 481–492.
19. Gibert, G., Bailly, G., Beautemps, D., Elisei, F., and Brun, R. (2005). “Analysis and synthesis of three-dimensional movements of the head, face, and hand of a speaker using cued speech,” J. Acoust. Soc. Am. 118, 1144–1153.
20. Girin, L. (2004). “Joint matrix quantization of face parameters and LPC coefficients for low bit rate audiovisual speech coding,” IEEE Trans. Speech Audio Process. 12, 265–276.
21. Girin, L., Schwartz, J.-L., and Feng, G. (2001). “Audio-visual enhancement of speech in noise,” J. Acoust. Soc. Am. 109, 3007–3020.
22. Goecke, R., and Millar, J. B. (2003). “Statistical analysis of the relationship between audio and video speech parameters for Australian English,” in Proceedings of the Conference on Audio-Visual Speech Processing (AVSP), Saint-Jorioz, France, pp. 133–138.
23. Grant, K. W., and Seitz, P. (2000). “The use of visible speech cues for improving auditory detection of spoken sentences,” J. Acoust. Soc. Am. 108, 1197–1208.
24. Huang, J., Liu, Z., Wang, Y., Chen, Y., and Wong, E. (1999). “Integration of multimodal features for video scene classification based on HMM,” in Proceedings of the Workshop on Multimedia Signal Processing (MMSP), Copenhagen, Denmark, pp. 53–58.
25. Iyengar, G., and Neti, C. (2001). “A vision-based microphone switch for speech intent detection,” in Proceedings of the ICCV Workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-Time Systems (RATFG-RTS), Vancouver, Canada, pp. 101–105.
26. Jiang, J., Alwan, A., Keating, P. A., Auer, E. T., and Bernstein, L. E. (2002). “On the relationship between face movements, tongue movements and speech acoustics,” EURASIP J. Appl. Signal Process. 11, 1174–1188.
27. Kim, J., and Davis, C. (2004). “Investigating the audio-visual speech detection advantage,” Speech Commun. 44, 19–30.
28. Lallouache, T. (1990). “Un poste visage-parole. Acquisition et traitement des contours labiaux (A device for the capture and processing of lip contours),” in Proceedings of the XVIII Journées d’Étude sur la Parole (JEP), Montréal, Canada, pp. 282–286 (in French).
29. Lane, H., and Tranel, B. (1971). “The Lombard sign and the role of hearing in speech,” J. Speech Hear. Res. 14, 677–709.
30. Le Bouquin-Jeannès, R., and Faucon, G. (1995). “Study of a voice activity detector and its influence on a noise reduction system,” Speech Commun. 16, 245–254.
31. Liu, P., and Wang, Z. (2004). “Voice activity detection using visual information,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Montreal, Canada, pp. 609–612.
32. Lombard, E. (1911). “Le signe de l’élévation de la voix (The sign of the raising of the voice),” Annales des maladies de l’oreille et du larynx 37, 101–119 (in French).
33. Macho, D., Padrell, J., Abad, A., Nadeu, C., Hernando, J., McDonough, J., Wölfel, M., Klee, U., Omologo, M., Brutti, A., Svaizer, P., Potamianos, G., and Chu, S. M. (2005). “Automatic speech activity detection, source localization and speech recognition on the CHIL seminar corpus,” in Proceedings of the International Conference on Multimedia and Expo (ICME), Amsterdam, The Netherlands, pp. 876–879.
34. McGurk, H., and MacDonald, J. (1976). “Hearing lips and seeing voices,” Nature (London) 264, 746–748.
35. Munhall, K. G., Gribble, P., Sacco, L., and Ward, M. (1996). “Temporal constraints on the McGurk effect,” Percept. Psychophys. 58, 351–362.
36. Munhall, K. G., Servos, P., Santi, A., and Goodale, M. (2002). “Dynamic visual speech perception in a patient with visual form agnosia,” NeuroReport 13, 1793–1796.
37. Munhall, K. G., and Vatikiotis-Bateson, E. (1998). “The moving face during speech communication,” in Hearing by Eye II: Advances in the Psychology of Speechreading and Auditory-Visual Speech, edited by R. Campbell, B. Dodd, and D. Burnham (Psychology Press, London), pp. 123–139.
38. Petajan, E. D. (1984). “Automatic lipreading to enhance speech recognition,” in Proceedings of the Global Telecommunications Conference (GLOBECOM), Atlanta, GA, pp. 265–272.
39. Potamianos, G., Neti, C., and Deligne, S. (2003a). “Joint audio-visual speech processing for recognition and enhancement,” in Proceedings of the Conference on Audio-Visual Speech Processing (AVSP), Saint-Jorioz, France, pp. 95–104.
40. Potamianos, G., Neti, C., and Gravier, G. (2003b). “Recent advances in the automatic recognition of visual speech,” Proc. IEEE 91, 1306–1326.
41. Ramírez, J., Segura, J. C., Benítez, C., de la Torre, A., and Rubio, A. (2004). “Efficient voice activity detection algorithms using long-term speech information,” Speech Commun. 42, 271–287.
42. Ramírez, J., Segura, J. C., Benítez, C., García, L., and Rubio, A. (2005). “Statistical voice activity detection using a multiple observation likelihood ratio test,” IEEE Signal Process. Lett. 12, 689–692.
43. Rao, R., and Chen, T. (1996). “Cross-modal predictive coding for talking head sequences,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Atlanta, GA, pp. 2058–2061.
44. Rivet, B., Girin, L., and Jutten, C. (2007a). “Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures,” IEEE Trans. Audio, Speech, Lang. Process. 15, 96–108.
45. Rivet, B., Girin, L., and Jutten, C. (2007b). “Visual voice activity detection as a help for speech source separation from convolutive mixtures,” Speech Commun. 49, 667–677.
46. Robert-Ribes, J., Schwartz, J. L., Lallouache, T., and Escudier, P. (1998). “Complementarity and synergy in bimodal speech: Auditory, visual, and audio-visual identification of French oral vowels in noise,” J. Acoust. Soc. Am. 103, 3677–3689.
47. Rosenblum, L. D., Johnson, J. A., and Saldana, H. M. (1996). “Visual kinematic information for embellishing speech in noise,” J. Speech Hear. Res. 39, 1159–1170.
48. Rosenblum, L. D., and Saldana, H. M. (1996). “An audiovisual test of kinematic primitives for visual speech perception,” J. Exp. Psychol. Hum. Percept. Perform. 22, 318–331.
49. Schwartz, J. L., Berthommier, F., and Savariaux, C. (2004). “Seeing to hear better: Evidence for early audio-visual interactions in speech identification,” Cognition 93, 69–78.
50. Sodoyer, D., Girin, L., Jutten, C., and Schwartz, J. L. (2002). “Separation of audio-visual speech sources: A new approach exploiting the audiovisual coherence of speech stimuli,” EURASIP J. Appl. Signal Process. 11, 1165–1173.
51. Sodoyer, D., Girin, L., Jutten, C., and Schwartz, J. L. (2004). “Further experiments on audio-visual speech source separation,” Speech Commun. 44, 113–125.
52. Sodoyer, D., Rivet, B., Girin, L., Jutten, C., and Schwartz, J. L. (2006). “An analysis of visual speech information applied to voice activity detection,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, pp. 601–604.
53. Sohn, J., Kim, N. S., and Sung, W. (1999). “A statistical model-based voice activity detection,” IEEE Signal Process. Lett. 6, 1–3.
54. Sumby, W. H., and Pollack, I. (1954). “Visual contribution to speech intelligibility in noise,” J. Acoust. Soc. Am. 26, 212–215.
55. Summerfield, Q. (1979). “Use of visual information for phonetic perception,” Phonetica 36, 314–331.
56. Summerfield, Q. (1987). “Some preliminaries to a comprehensive account of audio-visual speech perception,” in Hearing by Eye: The Psychology of Lip-Reading, edited by B. Dodd and R. Campbell (Erlbaum, London), pp. 3–51.
57. Tanyer, S. G., and Ozer, H. (2000). “Voice activity detection in nonstationary noise,” IEEE Trans. Speech Audio Process. 8, 478–482.
58. Thomas, S. M., and Jordan, T. R. (2004). “Contributions of oral and extraoral facial movement to visual and audiovisual speech perception,” J. Exp. Psychol. Hum. Percept. Perform. 30, 873–888.
59. Wang, W., Cosker, D., Hicks, Y., Sanei, S., and Chambers, J. A. (2005). “Video assisted speech source separation,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, pp. 425–428.
60. Yehia, H., Kuratate, T., and Vatikiotis-Bateson, E. (2000). “Facial animation and head motion driven by speech acoustics,” in Proceedings of the Seminar on Speech Production: Models and Data and CREST Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany, pp. 265–268.
61. Yehia, H., Rubin, P., and Vatikiotis-Bateson, E. (1998). “Quantitative association of vocal-tract and facial behavior,” Speech Commun. 26, 23–43.