Bioacoustic classification often suffers from a lack of labeled data, which hinders the effective use of state-of-the-art deep learning models in bioacoustics. To overcome this problem, the authors propose a deep metric learning-based framework that provides effective classification even when only a small number of per-class training examples are available. The framework combines a multiscale convolutional neural network with a proposed dynamic variant of the triplet loss to learn a transformation space in which intra-class separation is minimized and inter-class separation is maximized by a dynamically increasing margin. The process of learning this transformation is known as deep metric learning. The triplet loss analyzes three examples (referred to as a triplet) at a time to perform deep metric learning. Because the number of possible triplets increases cubically with dataset size, the triplet loss is better suited than the cross-entropy loss to data-scarce conditions. Experiments on three publicly available datasets show that the proposed framework outperforms existing bioacoustic classification methods. The results also demonstrate the superiority of the dynamic triplet loss over the cross-entropy loss in data-scarce conditions. Furthermore, unlike existing bioacoustic classification methods, the proposed framework has been extended to provide open-set classification.
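The two quantitative claims in the abstract can be illustrated with a short sketch. The first function counts valid (anchor, positive, negative) triplets, showing the cubic growth in training signal; the other two give a minimal triplet hinge loss with a margin that grows during training. The margin schedule (`base`, `step`, `cap`) and the linear-growth form are illustrative assumptions, not the paper's actual parameters.

```python
import math

def num_triplets(class_sizes):
    # Count (anchor, positive, negative) triplets: anchor and positive
    # share a class (and differ), negative comes from any other class.
    # Grows roughly cubically with the total number of examples.
    total = sum(class_sizes)
    return sum(n * (n - 1) * (total - n) for n in class_sizes)

def euclidean(x, y):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def triplet_loss(anchor, positive, negative, margin):
    # Triplet hinge loss: push the positive closer to the anchor than
    # the negative is, by at least `margin` in the embedding space.
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

def dynamic_margin(epoch, base=0.2, step=0.05, cap=1.0):
    # Hypothetical schedule for the dynamic variant: the margin grows
    # with the training epoch, demanding ever larger inter-class
    # separation, up to a cap.
    return min(cap, base + step * epoch)
```

For example, `num_triplets([2, 2])` yields 8, while `num_triplets([4, 4])` yields 96: doubling the data multiplies the available triplets twelvefold, which is why triplet-based training extracts more supervision from small datasets than per-example cross-entropy.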

