A major determinant of success when separating a speech signal from a noisy environment is the intelligibility of the extracted speech. The fraction of words correctly recognized by listeners is often used as the “gold standard,” but reliable computational metrics would be preferable because they are far less labor-intensive than collecting listener judgments. We compare listener intelligibility to three acoustically derived metrics: (a) speech-to-interference ratios (SIRs) estimated from the processed mixtures (ESIR), (b) average coherence (Coh), and (c) the speech-based Speech Transmission Index (sSTI). Sentences were recorded by four microphones against restaurant babble, white Gaussian noise, and nonstationary noise at SNRs ranging from +6 dB to −8 dB. Treatments included (1) the original mixture; (2) the mixture processed by a critically determined 4-channel blind source separation (BSS) algorithm; (3) the speech component extracted from the mixture using a least-mean-squares (LMS) algorithm; (4) an LMS algorithm used to remove the two noises, followed by 2-channel BSS to separate the sentences from the babble; and (5) pristine speech recorded with no noise. The metrics are compared to intelligibility results from listening tests. The Coh and sSTI metrics show the best fit to listener intelligibility across talkers.
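As a rough illustration of the coherence-type metric, an average magnitude-squared coherence between a clean reference and a processed signal can be sketched as below. The Welch-style segment averaging, Hanning window, segment length, and test signals are assumptions for illustration only, not the paper's exact procedure.

```python
import numpy as np

def avg_coherence(x, y, nperseg=256):
    """Average magnitude-squared coherence between signals x and y,
    estimated by Welch-style averaging over non-overlapping segments.
    (Sketch only; parameter choices are illustrative assumptions.)"""
    nseg = min(len(x), len(y)) // nperseg
    win = np.hanning(nperseg)
    Pxx = Pyy = 0.0
    Pxy = 0.0 + 0.0j
    for k in range(nseg):
        xs = np.fft.rfft(win * x[k * nperseg:(k + 1) * nperseg])
        ys = np.fft.rfft(win * y[k * nperseg:(k + 1) * nperseg])
        Pxx = Pxx + np.abs(xs) ** 2   # auto-spectrum of x, accumulated
        Pyy = Pyy + np.abs(ys) ** 2   # auto-spectrum of y, accumulated
        Pxy = Pxy + xs * np.conj(ys)  # cross-spectrum, accumulated
    # Magnitude-squared coherence per frequency bin, then averaged
    return float(np.mean(np.abs(Pxy) ** 2 / (Pxx * Pyy)))

# Demo with synthetic stand-ins for clean and noise-corrupted speech
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
processed = clean + 0.5 * rng.standard_normal(16000)
print(round(avg_coherence(clean, processed), 3))
```

A signal compared with itself yields a coherence of 1.0, and additive uncorrelated noise pulls the average below 1, which is what makes a scalar summary of this kind usable as an intelligibility proxy.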
