Auditory and visual perceptual processes interact during the identification of speech sounds. Some evaluations of this interaction have compared performance on audio‐alone and audiovisual word recognition tasks. A measure derived from these data, R, can be used as an index of the perceptual gain due to multisensory stimulation relative to unimodal stimulation. Recent evidence has indicated that cross‐modal relationships between the acoustic and optical forms of speech stimuli exist. Furthermore, this cross‐modal information may be used by the perceptual mechanisms responsible for integrating disparate sensory signals. However, little is known about the ways in which acoustic and optic signals carry cross‐modal information. The present experiment manipulated the acoustic form of speech in systematic ways that selectively disrupted candidate sources of cross‐modal information in the acoustic signal. Participants were then asked to perform a simple word recognition task with the transformed words in either auditory‐alone or audiovisual presentation conditions. It was predicted that audiovisual gain would be relatively high for transformations that preserved the relative spacing of formants but absent for transformations that destroyed it. The results are discussed in terms of existing theories of audiovisual speech perception.
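The abstract does not give the formula for R, but a common formulation of relative audiovisual gain in this literature (following Sumby and Pollack) normalizes the audiovisual improvement by the room left for improvement in the audio‐alone condition. The sketch below assumes that definition, R = (AV − A) / (1 − A), where A and AV are proportions of words correctly recognized in the audio‐alone and audiovisual conditions; treat it as an illustration, not the measure used in this study.

```python
def relative_gain(audio_correct: float, av_correct: float) -> float:
    """Relative audiovisual gain R.

    Assumes the common definition R = (AV - A) / (1 - A), where
    A (audio_correct) and AV (av_correct) are proportions correct
    in audio-alone and audiovisual conditions. This formula is an
    assumption; the abstract does not specify how R is computed.
    """
    if not (0.0 <= audio_correct < 1.0 and 0.0 <= av_correct <= 1.0):
        raise ValueError("proportions must lie in [0, 1], with A < 1")
    return (av_correct - audio_correct) / (1.0 - audio_correct)

# Hypothetical example: 40% correct audio-alone, 70% correct audiovisual.
# The visual signal recovers half of the available headroom, so R is 0.5.
print(relative_gain(0.40, 0.70))
```

A value of R near 0 would correspond to the predicted absence of audiovisual gain for transformations that destroy the relative spacing of formants, while larger R values would correspond to transformations that preserve it.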