In acoustic scene classification (ASC), acoustic features play a crucial role in extracting scene information, which may be embedded over different time scales. Moreover, the limited size of the dataset can lead to a biased model that performs poorly on recordings from unseen cities and on easily confused scene classes. This paper proposes a long-term wavelet feature that captures discriminative long-term scene information. The extracted scalogram requires less storage and can be classified faster and more accurately than classic Mel filter bank coefficients (FBank). Furthermore, a data augmentation scheme is adopted to improve the generalization of ASC systems: the database is extended iteratively with auxiliary classifier generative adversarial networks (ACGANs) and a deep-learning-based sample filter. Experiments were conducted on datasets from the Detection and Classification of Acoustic Scenes and Events (DCASE) challenges. On the DCASE17 and DCASE19 datasets, the proposed techniques outperformed the FBank classifier. Moreover, the ACGAN-based data augmentation scheme achieved an absolute accuracy improvement of 6.10% on recordings from unseen cities, far exceeding classic augmentation methods.
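The paper's exact long-term wavelet design is not given in this abstract, but the general idea of a scalogram feature can be illustrated with a minimal continuous wavelet transform (CWT) sketch in NumPy. Everything here (the Morlet mother wavelet, the scale range, the toy signal) is an illustrative assumption, not the authors' configuration.

```python
import numpy as np

def morlet(t, w=5.0):
    # Approximate complex Morlet mother wavelet (admissibility correction dropped)
    return np.exp(1j * w * t) * np.exp(-0.5 * t**2) * np.pi**-0.25

def scalogram(x, scales, w=5.0):
    """CWT magnitude: one row per scale, one column per sample (illustrative sketch)."""
    out = np.empty((len(scales), len(x)))
    for i, s in enumerate(scales):
        # Sample the dilated wavelet over a support proportional to the scale
        m = min(10 * int(s), len(x))
        t = np.arange(-(m // 2), m // 2) / s
        psi = morlet(t, w) / np.sqrt(s)
        # Convolution with the time-reversed conjugate implements the wavelet inner product
        out[i] = np.abs(np.convolve(x, np.conj(psi)[::-1], mode="same"))
    return out

# Toy example: 1 s of a 50 Hz tone sampled at 1 kHz
fs = 1000
sig = np.sin(2 * np.pi * 50 * np.arange(fs) / fs)
S = scalogram(sig, scales=np.arange(2, 32))
print(S.shape)  # (30, 1000): scales x time, the 2-D map a CNN classifier would consume
```

Larger scales respond to slower structure in the signal, which is what makes a scalogram a natural carrier of the long-term scene information the abstract refers to; in practice, a library such as PyWavelets would replace this hand-rolled transform.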
