Speech perception is a crucial function of the human auditory system, but speech is not only an acoustic signal: visual cues from a talker’s face and articulators (lips, teeth, and tongue) carry considerable linguistic information. These cues substantially improve speech comprehension when the acoustic signal is degraded, for example by background noise or hearing impairment. However, useful visual cues are not always available, such as when talking on the phone or listening to a podcast. We are developing a system that generates a realistic speaking face from speech audio input. The system uses novel deep neural networks trained on a large audio-visual speech corpus and is designed to run in real time so that it can serve as an assistive listening device. Previous systems have shown improvements in speech perception only for the most severely degraded speech. Our design differs notably from earlier ones in that it does not use a language model; instead, it performs a direct transformation from speech audio to face video. This preserves the temporal coherence between the acoustic and visual modalities, which has been shown to be crucial for cross-modal perceptual binding.
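The direct audio-to-video mapping described above can be sketched in skeletal form. The sketch below is a minimal illustration, not the actual model: the function names, the per-frame feature dimension, and the landmark-based output are all illustrative assumptions. The point it demonstrates is the frame-synchronous structure, where each audio frame yields exactly one video frame, so the acoustic-visual timing is preserved by construction rather than re-imposed after a language-model decoding step.

```python
# Hypothetical sketch of a frame-synchronous audio-to-face mapping.
# All names and shapes are illustrative assumptions, not the authors' model.

def audio_to_face_frames(audio_features, predict_frame):
    """Map each audio feature frame directly to one face frame.

    Because the mapping is one-to-one per frame, the output video is
    temporally aligned with the input audio by construction; no language
    model or resynchronization step intervenes.
    """
    return [predict_frame(frame) for frame in audio_features]

def dummy_predictor(feature_frame):
    # Stand-in for a trained deep network; returns placeholder
    # mouth-landmark coordinates (20 values, purely illustrative).
    return [0.0] * 20

# 100 frames of 13-dimensional acoustic features (MFCC-like, assumed).
audio = [[0.1] * 13 for _ in range(100)]
video = audio_to_face_frames(audio, dummy_predictor)

# Temporal coherence: one face frame per audio frame.
assert len(video) == len(audio)
```

In a real-time assistive setting, `predict_frame` would be a low-latency neural network invoked as each audio frame arrives, so the same one-frame-in, one-frame-out structure applies to streaming input.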