Evaluation of scene analysis using real and simulated acoustic mixtures: Lessons learnt from the CHiME speech recognition challenges

Jon P. Barker
Comput. Sci., Univ. of Sheffield, Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom, [email protected]

J. Acoust. Soc. Am. 141 (5_Supplement), 3693 (1 May 2017). https://doi.org/10.1121/1.4988044
Meeting abstract. No PDF available.

Computational auditory scene analysis is increasingly presented in the literature as a set of auditory-inspired techniques for estimating “Ideal Binary Masks” (IBMs), i.e., time-frequency segregations of the attended source from the acoustic background based on a local signal-to-noise ratio objective (Wang and Brown, 2006). This talk argues that although IBMs may be a useful stand-in when evaluating signal-processing systems, they can provide a misleading perspective when considering models of auditory cognition. First, there is no evidence that human cognition computes or requires an explicit binary mask representation (ideal or otherwise). Second, evaluating an estimated IBM requires artificially mixed acoustic scenes, since only these provide access to the ground-truth mask. Systems that work well on artificially mixed acoustic scenes may therefore fail to generalize to real data. The danger of predicting real-world performance from results obtained on artificial mixtures is illustrated by an analysis of systems submitted to the recent CHiME distant-microphone speech recognition challenges, which evaluate systems on both types of data (http://spandh.dcs.shef.ac.uk/chime). It is argued that, rather than presuming specific internal representations, auditory scene analysis systems are best evaluated by direct comparison of human and machine percepts, e.g., for a speech recognition task, by comparing human and machine transcriptions at the phonetic level.
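For reference, the IBM definition the abstract alludes to can be written down in a few lines. The sketch below is illustrative only: it assumes precomputed short-time spectra `target_stft` and `noise_stft` (the names and the 0 dB local criterion are conventional choices, not from the abstract). Note that the oracle target and noise spectra it takes as input are exactly the ground truth that, as the abstract points out, only artificially mixed scenes can supply.

```python
import numpy as np

def ideal_binary_mask(target_stft, noise_stft, lc_db=0.0):
    """Ideal binary mask: 1 for each time-frequency cell whose local
    SNR exceeds the local criterion lc_db, 0 otherwise (after the
    definition in Wang and Brown, 2006)."""
    eps = np.finfo(float).eps  # guard against log of zero
    # Local SNR in dB per time-frequency cell, from oracle spectra
    local_snr = 10.0 * np.log10(
        (np.abs(target_stft) ** 2 + eps) / (np.abs(noise_stft) ** 2 + eps)
    )
    return (local_snr > lc_db).astype(float)
```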
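The closing proposal, comparing human and machine percepts directly, reduces in the speech case to aligning phone-level transcriptions. A minimal sketch of such a comparison using standard edit distance; the phone sequences in the example are invented for illustration:

```python
def phone_error_rate(reference, hypothesis):
    """Levenshtein (edit) distance between two phone sequences,
    normalized by the reference length."""
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / max(m, 1)

# Hypothetical human vs. machine phone transcriptions of "she had"
human = ["sh", "iy", "hh", "ae", "d"]
machine = ["sh", "iy", "ae", "d"]
print(phone_error_rate(human, machine))  # 0.2 (one phone deleted)
```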