The efficacy of audio-visual interactions in speech perception stems from two kinds of factors. First, at the information level, audition and vision are to some extent complementary: speech features mainly concerned with manner of articulation appear to be best transmitted by the audio channel, while features mostly describing place of articulation are best transmitted by the video channel. Second, at the information-processing level, audition and vision are synergistic: audio-visual identification scores in a variety of tasks involving acoustic noise are generally greater than both the auditory-alone and the visual-alone scores. Until now, however, these two properties have generally been demonstrated only in rather global terms. In the present work, audio-visual interactions are studied at the feature level for French oral vowels, which contrast three series: front unrounded, front rounded, and back rounded vowels. A set of experiments on the auditory, visual, and audio-visual identification of vowels embedded in various amounts of noise demonstrates that complementarity and synergy in bimodal speech hold for a bundle of individual phonetic features describing place contrasts in oral vowels. At the information level (complementarity), the height feature is the most robust in the audio channel, backness the second most robust, and rounding the least robust, while in the video channel rounding is better transmitted than height, and backness is almost invisible. At the information-processing level (synergy), transmitted-information scores show that every individual feature is better transmitted with ear and eye together than with either sensor alone.
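The transmitted-information scores mentioned above are conventionally computed as the mutual information between stimulus and response, estimated from a confusion matrix, in the style of Miller and Nicely (1955). The sketch below is not the authors' own analysis code; it is a minimal illustration of that standard measure, with a hypothetical two-category confusion matrix for a binary feature such as rounding.

```python
import numpy as np

def transmitted_information(confusion):
    """Mutual information (in bits) between stimulus and response,
    estimated from a confusion matrix of counts
    (rows = stimuli presented, columns = responses given)."""
    p = confusion / confusion.sum()       # joint probabilities p(x, y)
    px = p.sum(axis=1, keepdims=True)     # stimulus marginals p(x)
    py = p.sum(axis=0, keepdims=True)     # response marginals p(y)
    nz = p > 0                            # skip zero cells to avoid log(0)
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

# Hypothetical confusion matrices for a binary feature (e.g., [+/- rounded]):
# perfect transmission of a balanced binary feature yields 1 bit,
# chance-level responding yields 0 bits.
perfect = np.array([[50, 0], [0, 50]])
chance = np.array([[25, 25], [25, 25]])
print(transmitted_information(perfect))  # 1.0
print(transmitted_information(chance))   # 0.0
```

In feature-level analyses this raw score is often normalized by the stimulus entropy, so that each feature's transmission is reported on a common 0-to-1 scale regardless of how many categories the feature distinguishes.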
