Deep learning models have become promising candidates for auditory neuroscience research, thanks to their recent successes on a variety of auditory tasks, yet these models often lack the interpretability needed to fully understand the exact computations they perform. Here, we proposed a parametrized neural network layer that computes specific spectro-temporal modulations based on Gabor filters [learnable spectro-temporal filters (STRFs)] and is fully interpretable. We evaluated this layer on speech activity detection, speaker verification, urban sound classification, and zebra finch call type classification. We found that models based on learnable STRFs are on par with the state of the art on all tasks and obtain the best performance for speech activity detection. Because the layer remains a Gabor filter, it is fully interpretable, and we used quantitative measures to describe the distribution of the learned spectro-temporal modulations. The filters adapted to each task and focused mostly on low temporal and spectral modulations. The analyses show that the filters learned on human speech have spectro-temporal parameters similar to those measured directly in the human auditory cortex. Finally, we observed that the tasks organized themselves in a meaningful way: the human vocalization tasks were close to each other, while bird vocalizations were far from both the human vocalization and urban sound tasks.
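To make the idea of such a layer concrete, here is a minimal sketch, assuming a PyTorch implementation, of a 2D convolution over a time-frequency spectrogram whose kernels are real Gabor functions with learnable temporal and spectral modulation rates and envelope widths. The class name, filter count, kernel size, and initialization are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a "learnable STRF" layer: each kernel is a 2D Gabor whose
# modulation rates and envelope widths are trained by backpropagation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableSTRF(nn.Module):
    def __init__(self, n_filters=64, kernel_size=(25, 9)):
        super().__init__()
        kt, kf = kernel_size
        # Learnable Gabor parameters (one set per filter); values are illustrative.
        self.omega_t = nn.Parameter(torch.empty(n_filters).uniform_(-0.5, 0.5))  # temporal modulation rate
        self.omega_f = nn.Parameter(torch.empty(n_filters).uniform_(-0.5, 0.5))  # spectral modulation rate
        self.sigma_t = nn.Parameter(torch.full((n_filters,), kt / 6.0))          # temporal envelope width
        self.sigma_f = nn.Parameter(torch.full((n_filters,), kf / 6.0))          # spectral envelope width
        # Fixed sampling grids centered on the kernel.
        t = torch.arange(kt, dtype=torch.float32) - (kt - 1) / 2
        f = torch.arange(kf, dtype=torch.float32) - (kf - 1) / 2
        self.register_buffer("grid_t", t.view(1, kt, 1))
        self.register_buffer("grid_f", f.view(1, 1, kf))

    def kernels(self):
        # Gaussian envelope times a cosine carrier: one real 2D Gabor per filter.
        st = self.sigma_t.abs().view(-1, 1, 1) + 1e-4
        sf = self.sigma_f.abs().view(-1, 1, 1) + 1e-4
        envelope = torch.exp(-0.5 * ((self.grid_t / st) ** 2 + (self.grid_f / sf) ** 2))
        carrier = torch.cos(2 * math.pi * (self.omega_t.view(-1, 1, 1) * self.grid_t
                                           + self.omega_f.view(-1, 1, 1) * self.grid_f))
        return (envelope * carrier).unsqueeze(1)  # (n_filters, 1, kt, kf)

    def forward(self, spectrogram):
        # spectrogram: (batch, 1, time, frequency)
        return F.conv2d(spectrogram, self.kernels(), padding="same")


# Usage: filter a batch of log-mel spectrograms.
x = torch.randn(4, 1, 200, 64)       # (batch, channel, time frames, mel bins)
layer = LearnableSTRF(n_filters=64)
y = layer(x)                          # (4, 64, 200, 64)
print(y.shape)
```

Because the trained parameters are the Gabor modulation rates and bandwidths themselves, the fitted filters can be read off directly and summarized or compared with neurophysiological measurements, which is the interpretability property the abstract emphasizes.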
