Speech perception is a crucial function of the human auditory system, but speech is not only an acoustic signal: visual cues from a talker’s face and articulators (lips, teeth, and tongue) carry considerable linguistic information. These cues offer substantial improvements in speech comprehension when the acoustic signal is degraded by background noise or when hearing is impaired. However, useful visual cues are not always available, such as when talking on the phone or listening to a podcast. We are developing a system for generating a realistic speaking face from speech audio input. The system uses novel deep neural networks trained on a large audio-visual speech corpus. It is designed to run in real time so that it can be used as an assistive listening device. Previous systems have shown improvements in speech perception only for the most degraded speech. Our design differs notably from earlier ones in that it does not use a language model; instead, it makes a direct transformation from speech audio to face video. This preserves the temporal coherence between the acoustic and visual modalities, which has been shown to be crucial to cross-modal perceptual binding.
Meeting abstract. No PDF available.
March 01 2018
Toward a visual assistive listening device: Real-time synthesis of a virtual talking face from acoustic speech using deep neural networks
Lele Chen, Dept. of Comput. Sci., Univ. of Rochester, Rochester, NY
Emre Eskimez, Dept. of Elec. and Comput. Eng., Univ. of Rochester, Rochester, NY
Zhiheng Li, Dept. of Comput. Sci., Univ. of Rochester, Rochester, NY
Zhiyao Duan, Dept. of Elec. and Comput. Eng., Univ. of Rochester, Rochester, NY
Chenliang Xu, Dept. of Comput. Sci., Univ. of Rochester, Rochester, NY
Ross K. Maddox, Departments of Biomedical Eng. and Neurosci., Univ. of Rochester, 601 Elmwood Ave., Box 603, Rm. 5.7425, Rochester, NY 14642, [email protected]
J. Acoust. Soc. Am. 143, 1813 (2018)
Citation
Lele Chen, Emre Eskimez, Zhiheng Li, Zhiyao Duan, Chenliang Xu, Ross K. Maddox; Toward a visual assistive listening device: Real-time synthesis of a virtual talking face from acoustic speech using deep neural networks. J. Acoust. Soc. Am. 1 March 2018; 143 (3_Supplement): 1813. https://doi.org/10.1121/1.5035944
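The abstract does not describe the network architecture, so the sketch below is only a hedged illustration of the general idea it states: a direct, frame-synchronous mapping from speech audio features to face video (here, face-landmark positions) with no language model, using a causal recurrent network so it could in principle run in real time. The module names, the log-mel input features, the LSTM design, and all dimensions are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed design, not the published system): map each acoustic
# feature frame directly to face-landmark coordinates, one output frame per
# input frame, so audio-visual timing is preserved by construction.
import torch
import torch.nn as nn

class AudioToFaceLandmarks(nn.Module):
    def __init__(self, n_mels=80, hidden_size=256, n_landmarks=68):
        super().__init__()
        # Encode each log-mel frame independently.
        self.frame_encoder = nn.Sequential(
            nn.Linear(n_mels, hidden_size),
            nn.ReLU(),
        )
        # Unidirectional LSTM: temporal context without looking ahead,
        # which keeps the mapping causal (usable for real-time synthesis).
        self.temporal = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        # Predict (x, y) coordinates for every landmark at every frame.
        self.head = nn.Linear(hidden_size, n_landmarks * 2)

    def forward(self, mel_frames):
        # mel_frames: (batch, time, n_mels) log-mel spectrogram
        h = self.frame_encoder(mel_frames)
        h, _ = self.temporal(h)
        out = self.head(h)  # (batch, time, n_landmarks * 2)
        return out.view(out.shape[0], out.shape[1], -1, 2)

# Example: 100 feature frames (about 1 s of audio) -> 100 frames of landmarks.
model = AudioToFaceLandmarks()
mel = torch.randn(1, 100, 80)
landmarks = model(mel)
print(landmarks.shape)  # torch.Size([1, 100, 68, 2])
```

Because the output is produced frame by frame from the audio alone, with no intermediate linguistic representation, the acoustic and visual streams stay temporally aligned, which is the property the abstract identifies as crucial for cross-modal binding.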