Machine listening systems for environmental acoustic monitoring face a shortage of expert annotations to use as training data. To circumvent this issue, the emerging paradigm of self-supervised learning proposes to pre-train audio classifiers on a task whose ground truth is trivially available. Alternatively, training set synthesis consists of annotating a small corpus of acoustic events of interest, which are then automatically mixed at random to form a larger corpus of polyphonic scenes. Prior studies have considered these two paradigms in isolation but rarely in conjunction. Furthermore, the impact of data curation in training set synthesis remains unclear. To fill this research gap, this article proposes a two-stage approach. In the self-supervised stage, we formulate a pretext task (Audio2Vec skip-gram inpainting) on unlabeled spectrograms from an acoustic sensor network. Then, in the supervised stage, we formulate a downstream task of multilabel urban sound classification on synthetic scenes. We find that training set synthesis benefits overall performance more than self-supervised learning does. Interestingly, the geographical origin of the acoustic events used in training set synthesis appears to have a decisive impact.
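As a rough illustration of training set synthesis as described above, the following sketch mixes a few isolated, annotated events at random onsets into one polyphonic scene and derives its multilabel target. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `synthesize_scene`, the gain range, and the toy noise "events" are all hypothetical.

```python
import numpy as np


def synthesize_scene(events, labels, n_classes, scene_len, max_events, rng):
    """Mix randomly chosen isolated events into one polyphonic scene.

    events: list of 1-D float arrays (isolated, annotated events)
    labels: list of integer class indices, one per event
    Returns the mixed waveform and its multi-hot (multilabel) target.
    """
    scene = np.zeros(scene_len, dtype=np.float64)
    target = np.zeros(n_classes, dtype=np.int64)
    n_events = int(rng.integers(1, max_events + 1))  # at least one event per scene
    for _ in range(n_events):
        idx = int(rng.integers(len(events)))
        event = events[idx][:scene_len]  # truncate events longer than the scene
        onset = int(rng.integers(0, scene_len - len(event) + 1))
        gain = rng.uniform(0.25, 1.0)  # hypothetical gain range, chosen for illustration
        scene[onset:onset + len(event)] += gain * event  # overlaps create polyphony
        target[labels[idx]] = 1  # a class is active if any of its events was mixed in
    return scene, target


# Toy usage: three noise bursts stand in for annotated events of classes 0, 2, and 1.
rng = np.random.default_rng(0)
events = [rng.standard_normal(800), rng.standard_normal(1600), rng.standard_normal(400)]
labels = [0, 2, 1]
scene, target = synthesize_scene(
    events, labels, n_classes=3, scene_len=4000, max_events=3, rng=rng
)
```

Because the labels follow directly from which events were drawn, a small annotated corpus can yield an arbitrarily large set of labeled scenes. Which events go into that corpus is the data curation question the abstract raises.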
