Machine listening systems for environmental acoustic monitoring face a shortage of expert annotations to use as training data. To circumvent this issue, the emerging paradigm of self-supervised learning proposes to pre-train audio classifiers on a pretext task whose ground truth is trivially available. Alternatively, training set synthesis consists of annotating a small corpus of acoustic events of interest, which are then automatically mixed at random to form a larger corpus of polyphonic scenes. Prior studies have considered these two paradigms in isolation but rarely, if ever, in conjunction. Furthermore, the impact of data curation in training set synthesis remains unclear. To fill this gap in research, this article proposes a two-stage approach. In the self-supervised stage, we formulate a pretext task (Audio2Vec skip-gram inpainting) on unlabeled spectrograms from an acoustic sensor network. Then, in the supervised stage, we formulate a downstream task of multilabel urban sound classification on synthetic scenes. We find that training set synthesis benefits overall performance more than self-supervised learning does. Interestingly, the geographical origin of the acoustic events used in training set synthesis appears to have a decisive impact.
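To make the training-set-synthesis idea concrete, the sketch below mixes randomly chosen annotated event clips at random onsets and gains onto a background, yielding a polyphonic scene and its multilabel target. This is a minimal illustration, not the article's actual pipeline: the corpus, sample rate, class count, and gain range are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical annotated corpus: two isolated 1-s clips (16 kHz) per class.
SR = 16_000
N_CLASSES = 3
events = {c: [rng.standard_normal(SR) * 0.1 for _ in range(2)]
          for c in range(N_CLASSES)}

def synthesize_scene(duration_s=4, max_events=3):
    """Return a synthetic polyphonic scene and its multilabel target."""
    n = duration_s * SR
    scene = rng.standard_normal(n) * 0.01        # quiet background bed
    labels = np.zeros(N_CLASSES, dtype=int)
    for _ in range(rng.integers(1, max_events + 1)):
        c = int(rng.integers(N_CLASSES))          # draw a class at random
        clip = events[c][rng.integers(len(events[c]))]
        start = int(rng.integers(0, n - len(clip)))  # random onset
        gain = rng.uniform(0.5, 1.0)              # random event level
        scene[start:start + len(clip)] += gain * clip
        labels[c] = 1                             # mark class as present
    return scene, labels

scene, labels = synthesize_scene()
```

Because the mixing is automatic, a small annotated event corpus can be expanded into an arbitrarily large supervised training set of scenes, each paired with an exact multilabel annotation.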
Published online: June 16, 2021 (June 2021 issue)
Polyphonic training set synthesis improves self-supervised urban sound classification
Special Collection: Machine Learning in Acoustics
Félix Gontier,1 Vincent Lostanlen,1 Mathieu Lagrange, Nicolas Fortin,2 Catherine Lavandier,3 Jean-François Petiot4

1 CNRS, LS2N, F-44322 Nantes, France
2 Unité Mixte de Recherche en Acoustique Environnementale, Université Gustave Eiffel, Centre d'Etudes et d'Expertise sur les Risques, l'Environnement, la Mobilité et l'Aménagement, F-44344 Bouguenais, France
3 CY Cergy Paris Université, École Nationale Supérieure de l'Électronique et de ses Applications (ENSEA), CNRS, ETIS, F-95000 Cergy, France
4 École Centrale de Nantes, LS2N, F-44322 Nantes, France
a) This paper is part of a special issue on Machine Learning in Acoustics.
b) Electronic mail: [email protected], ORCID: 0000-0002-1253-4427.
J. Acoust. Soc. Am. 149, 4309–4326 (2021)
Article history — Received: February 2, 2021; Accepted: May 25, 2021.
Citation
Félix Gontier, Vincent Lostanlen, Mathieu Lagrange, Nicolas Fortin, Catherine Lavandier, Jean-François Petiot; Polyphonic training set synthesis improves self-supervised urban sound classification. J. Acoust. Soc. Am. 1 June 2021; 149 (6): 4309–4326. https://doi.org/10.1121/10.0005277