Deep learning models have become potential candidates for auditory neuroscience research, thanks to their recent successes in a variety of auditory tasks, yet these models often lack the interpretability needed to fully understand the exact computations they perform. Here, we proposed a parametrized neural network layer that computes specific spectro-temporal modulations based on Gabor filters [learnable spectro-temporal filters (STRFs)] and is fully interpretable. We evaluated this layer on speech activity detection, speaker verification, urban sound classification, and zebra finch call type classification. We found that models based on learnable STRFs are on par with the state of the art for all tasks and obtain the best performance for speech activity detection. Because this layer remains a Gabor filter, it is fully interpretable, so we used quantitative measures to describe the distribution of the learned spectro-temporal modulations. The filters adapted to each task and focused mostly on low temporal and spectral modulations. The analyses show that the filters learned on human speech have spectro-temporal parameters similar to those measured directly in the human auditory cortex. Finally, we observed that the tasks were organized in a meaningful way: the human vocalization tasks lay close to each other, and the bird vocalization task lay far from both the human vocalization and urban sound tasks.
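As an illustration only, a single spectro-temporal Gabor filter of the kind referred to above can be sketched as a Gaussian envelope multiplied by a sinusoidal carrier on a (frequency, time) grid. The function name `gabor_kernel`, the parameter names, and the default values are assumptions made for this sketch; they are not taken from the article and do not necessarily match its parametrization. In a learnable layer, the envelope widths and modulation frequencies would be the trainable parameters.

```python
import math


def gabor_kernel(n_freq=9, n_time=9, sigma_f=2.0, sigma_t=2.0,
                 scale=0.1, rate=0.1):
    """Real-valued 2D Gabor filter on a (frequency, time) grid.

    All names and defaults are hypothetical, chosen for illustration:
      sigma_f, sigma_t : envelope widths, in frequency and time bins
      scale            : spectral modulation, in cycles per frequency bin
      rate             : temporal modulation, in cycles per time bin
    Returns a list of n_freq rows, each holding n_time floats.
    """
    kernel = []
    for i in range(n_freq):
        f = i - (n_freq - 1) / 2  # frequency offset from the filter centre
        row = []
        for j in range(n_time):
            t = j - (n_time - 1) / 2  # time offset from the filter centre
            # Gaussian envelope localizes the filter in time and frequency.
            envelope = math.exp(-0.5 * ((f / sigma_f) ** 2 + (t / sigma_t) ** 2))
            # Cosine carrier selects one spectro-temporal modulation.
            carrier = math.cos(2 * math.pi * (scale * f + rate * t))
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


kernel = gabor_kernel()
```

Because each filter is described by a handful of such parameters, the learned values can be read off and compared across tasks, which is the sense of interpretability used in the abstract.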
July 2021
July 14 2021
Learning spectro-temporal representations of complex sounds with parameterized neural networks a)
Special Collection: Machine Learning in Acoustics
Rachid Riad 1,b)
1 Ecole des Hautes Etudes en Sciences Sociales, CNRS, Institut National de Recherche informatique et Automatique, Département d'Études Cognitives, Ecole Normale Supérieure-Paris Sciences et Lettres University, 29 Rue d'Ulm, 75005 Paris, France
Julien Karadayi 1
Anne-Catherine Bachoud-Lévi 2
2 NeuroPsychologie Interventionnelle, Département d'Études Cognitives, Ecole Normale Supérieure, Institut National de la Santé et de la Recherche Médicale, Institut Mondor de Recherche Biomédicale, Neuratris, Université Paris-Est Créteil, Paris Sciences et Lettres University, 29 Rue d'Ulm, 75005 Paris, France
Emmanuel Dupoux 1,c)
a) This paper is part of a special issue on Machine Learning in Acoustics.
b) Also at: NeuroPsychologie Interventionnelle, Ecole Normale Supérieure, 75005 Paris, France. Electronic mail: riadrachid3@gmail.com, ORCID: 0000-0002-7753-1219.
c) Also at: Facebook AI Research, Paris, France.
J. Acoust. Soc. Am. 150, 353–366 (2021)
Article history
Received: February 18 2021
Accepted: June 08 2021
Citation
Rachid Riad, Julien Karadayi, Anne-Catherine Bachoud-Lévi, Emmanuel Dupoux; Learning spectro-temporal representations of complex sounds with parameterized neural networks. J. Acoust. Soc. Am. 1 July 2021; 150 (1): 353–366. https://doi.org/10.1121/10.0005482