Speech perception involves multiple input modalities. Research has indicated that perceivers may establish a cross-modal association between auditory and visual-spatial events to aid perception. Such intermodal relations can be particularly beneficial for non-native perceivers, who need additional resources to process challenging new sounds. This study examines how co-speech hand gestures mimicking pitch contours in space affect non-native Mandarin tone perception. Native English and native Mandarin perceivers identified tones with either congruent or incongruent auditory-facial and gestural (AF/G) input; they also identified congruent and incongruent auditory-facial (A/F) stimuli. Native Mandarin results showed the expected ceiling-level performance in the congruent A/F and AF/G conditions. In the incongruent conditions, A/F identification was primarily auditory-based, whereas AF/G identification was partially gesture-based, demonstrating that gestures serve as valid cues in tone identification. The English perceivers performed poorly in the congruent A/F condition but improved significantly in the AF/G condition. While their incongruent A/F identification showed some reliance on facial information, their incongruent AF/G identification relied more on gestural than on auditory-facial information. These results indicate positive effects of facial and especially gestural input on non-native tone perception, suggesting that cross-modal (visual-spatial) resources can be recruited to aid auditory perception when phonetic demands are high.