Deep learning models have become potential candidates for auditory neuroscience research, thanks to their recent successes in a variety of auditory tasks, yet these models often lack the interpretability needed to fully understand the exact computations they perform. Here, we propose a parametrized neural network layer that computes specific spectro-temporal modulations based on Gabor filters [learnable spectro-temporal filters (STRFs)] and is fully interpretable. We evaluated this layer on speech activity detection, speaker verification, urban sound classification, and zebra finch call type classification. We found that models based on learnable STRFs are on par with the state of the art on all tasks and obtain the best performance for speech activity detection. Because this layer remains a set of Gabor filters, it stays fully interpretable, and we used quantitative measures to describe the distribution of the learned spectro-temporal modulations. The filters adapted to each task and focused mostly on low temporal and spectral modulations. The analyses show that the filters learned on human speech have spectro-temporal parameters similar to those measured directly in the human auditory cortex. Finally, we observed that the tasks organized themselves in a meaningful way: the human vocalization tasks clustered close to each other, while the bird vocalization task lay far from both the human vocalization and urban sound tasks.
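For readers who want a concrete picture of such a layer, the sketch below is a minimal illustration, not the authors' implementation: the module name, kernel sizes, initial parameter ranges, and input shape are all illustrative assumptions. It shows a PyTorch convolutional layer whose kernels are 2-D Gabor functions over time and frequency, with learnable temporal modulation rates, spectral scales, envelope widths, and phases, applied to a spectrogram-like input.

```python
# Minimal sketch of a learnable Gabor STRF layer (illustrative, not the paper's code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableSTRF(nn.Module):
    def __init__(self, n_filters=32, kernel_t=25, kernel_f=9):
        super().__init__()
        self.kernel_t, self.kernel_f = kernel_t, kernel_f
        # Trainable, interpretable Gabor parameters (ranges are arbitrary assumptions):
        self.rate = nn.Parameter(torch.empty(n_filters).uniform_(-0.25, 0.25))   # temporal modulation, cycles/frame
        self.scale = nn.Parameter(torch.empty(n_filters).uniform_(-0.25, 0.25))  # spectral modulation, cycles/channel
        self.sigma_t = nn.Parameter(torch.full((n_filters,), kernel_t / 4.0))    # temporal envelope width
        self.sigma_f = nn.Parameter(torch.full((n_filters,), kernel_f / 4.0))    # spectral envelope width
        self.phase = nn.Parameter(torch.zeros(n_filters))

    def gabor_kernels(self):
        # Build one 2-D Gabor kernel per filter from the current parameter values.
        dev = self.rate.device
        t = torch.arange(self.kernel_t, dtype=torch.float32, device=dev) - self.kernel_t // 2
        f = torch.arange(self.kernel_f, dtype=torch.float32, device=dev) - self.kernel_f // 2
        tt, ff = torch.meshgrid(t, f, indexing="ij")      # (kernel_t, kernel_f)
        tt, ff = tt[None], ff[None]                       # broadcast over filters
        envelope = torch.exp(-tt**2 / (2 * self.sigma_t[:, None, None] ** 2)
                             - ff**2 / (2 * self.sigma_f[:, None, None] ** 2))
        carrier = torch.cos(2 * math.pi * (self.rate[:, None, None] * tt
                                           + self.scale[:, None, None] * ff)
                            + self.phase[:, None, None])
        return (envelope * carrier).unsqueeze(1)          # (n_filters, 1, kernel_t, kernel_f)

    def forward(self, spectrogram):
        # spectrogram: (batch, 1, time, freq), e.g., a log-mel representation
        return F.conv2d(spectrogram, self.gabor_kernels(), padding="same")

# Example: 32 learned STRFs applied to a batch of 200-frame, 64-band spectrograms.
x = torch.randn(4, 1, 200, 64)
print(LearnableSTRF()(x).shape)  # torch.Size([4, 32, 200, 64])
```

Because the kernels are rebuilt from the Gabor parameters at every forward pass, gradients flow back into the rate, scale, width, and phase values rather than into free-form convolution weights, which is what keeps the learned filters directly readable as spectro-temporal modulations.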
Learning spectro-temporal representations of complex sounds with parameterized neural networks a)
Special Collection: Machine Learning in Acoustics
Rachid Riad,1,b) Julien Karadayi,1 Anne-Catherine Bachoud-Lévi,2 and Emmanuel Dupoux1,c)

1 Ecole des Hautes Etudes en Sciences Sociales, CNRS, Institut National de Recherche informatique et Automatique, Département d'Études Cognitives, Ecole Normale Supérieure-Paris Sciences et Lettres University, 29 Rue d'Ulm, 75005 Paris, France
2 NeuroPsychologie Interventionnelle, Département d'Études Cognitives, Ecole Normale Supérieure, Institut National de la Santé et de la Recherche Médicale, Institut Mondor de Recherche Biomédicale, Neuratris, Université Paris-Est Créteil, Paris Sciences et Lettres University, 29 Rue d'Ulm, 75005 Paris, France
a) This paper is part of a special issue on Machine Learning in Acoustics.
b) Also at: NeuroPsychologie Interventionnelle, Ecole Normale Supérieure, 75005 Paris, France. Electronic mail: [email protected]; ORCID: 0000-0002-7753-1219.
c) Also at: Facebook AI Research, Paris, France.
J. Acoust. Soc. Am. 150, 353–366 (2021)
Article history
Received: February 18 2021
Accepted: June 08 2021
Published Online: July 14 2021
Citation
Rachid Riad, Julien Karadayi, Anne-Catherine Bachoud-Lévi, Emmanuel Dupoux; Learning spectro-temporal representations of complex sounds with parameterized neural networks. J. Acoust. Soc. Am. 1 July 2021; 150 (1): 353–366. https://doi.org/10.1121/10.0005482