Deep learning is an established tool for carrying out classification tasks on complex, multi-dimensional data, and because audio recordings contain both frequency and temporal structure, these computational frameworks make long-term monitoring of bioacoustic recordings far more feasible. Unfortunately, these neural networks are rarely designed for the task of open set classification, in which examples belonging to the training classes must not only be correctly classified but also, crucially, be separated from any spurious or unknown classes. To combat this reliance on closed set classifiers, which are singularly inappropriate for monitoring applications in which many non-relevant sounds are likely to be encountered, the performance of several open set classification frameworks is compared on environmental audio datasets recorded and published within this work, containing both biological and anthropogenic sounds. The inference-based open set classification techniques include prediction score thresholding, distance-based thresholding, and OpenMax. Each technique is evaluated under multi-, single-, and cross-corpus scenarios for two different types of unknown data, configured to highlight common challenges inherent to real-world classification tasks. The performance of each method is highly dependent upon the degree of similarity between the training, testing, and unknown domains.

Many human activities directly or indirectly disrupt animal communication,1,2 and passive acoustic monitoring can record many of these disruptions over long spatial and time scales, particularly because many indicator species engage in distinctive acoustic communication.3 Though collecting large quantities of bioacoustic data has become relatively inexpensive, manually processing these large datasets is usually impractical.4–6 

Automatic species classifiers, in particular deep neural networks, have been successfully used in recent decades to make such analysis possible.7–13 Bird species classification challenges such as BirdCLEF14,15 have only reinforced the dominance of convolutional neural network (CNN)-based architectures in particular in recent years. These neural networks fall into two general categories: (i) those designed to classify only a predefined set of N known classes and (ii) those designed to detect narrow target classes among otherwise unknown sounds.16 

While the N-class scenario occupies the vast majority of bioacoustics-related machine learning literature, it is singularly inappropriate for long-term monitoring applications.17 Indeed, for such a task it is not possible to have knowledge of the entire set of possible classes, particularly because new sounds will likely appear over time. Therefore, the critical challenge for machine learning-guided acoustic monitoring is accurately distinguishing relevant vocalizations from an audio stream that contains many non-relevant sounds.

Under a so-called open set classification (OSC) framework, the objective is to not only correctly classify each example belonging to a known class, but also to separate those known examples from those which are unknown. In this case, an unknown denotes an example that does not belong to any of the training classes that a given closed set classifier has been trained to distinguish. When processing continuous field recordings automatically, framing the data as belonging to an open, rather than closed set most closely reflects real-world conditions.

For bioacoustics classification applications specifically, however, such systems have rarely been implemented, and even then often only for test scenarios with a small number of species.18,19 The 2019 DCASE (Detection and Classification of Acoustic Scenes and Events)20 challenge introduced an open set acoustic scene classification task, and the vast majority of entries relied on placing a threshold on the closed set model prediction scores to determine whether a particular example would be considered an unknown,21–23 though a few other approaches included SVMs and autoencoder-based methods.24,25 Variational autoencoders were recently used to carry out successful avian species and individual detection in a leave-one-out open set approach.26 If these methods prove successful in environmental monitoring contexts, they have the potential to allow large amounts of unlabeled data to be processed automatically while mitigating misclassification risks.

Here, we primarily offer a performance comparison of several standard open set classification schemes for a variety of classification tasks, designed to be as authentic to long-term environmental monitoring concerns as possible, in contrast to the above papers, which largely focus on developing novel open set classification schemes to carry out a standard classification task. In an extension of an ongoing long-term environmental monitoring project,27 this work offers one of a few systematic comparisons of the following inference-based OSC strategies28 for bioacoustics applications:29 

  • AC: Added-classes

  • HA: Hierarchical added-classes

  • PT: Prediction score-based threshold

  • DT: Distance metric-based threshold

  • OM: OpenMax

The AC strategy29 and its hierarchical extension HA30–32 are distinct from the other approaches in that they are not technically open set classification approaches at all. Indeed, with both techniques, a form of unknown data is introduced during training itself, representing an expansion of the closed set approach to include one or multiple catch-all training categories corresponding to otherwise unknown data, which is assumed to be similar to that encountered during testing. The HA approach differs only in that it incorporates parent-child relationships into the neural network architecture instead of keeping the classification flat.

In some cases, training on unknown data is not possible because an adequate representation cannot be obtained for a particular monitoring site, particularly over broad geographic ranges and long time scales. Therefore, two threshold-based strategies are also pursued, one which relies on thresholding the final output prediction confidence scores themselves (PT)21–23,33 and one which relies on a distance metric calculated using the network embedding space (DT).19,34–36 Finally, the popular OpenMax (OM)37–40 approach is employed, which incorporates a similar distance metric calculated from the embedding space and extreme value theory.

Since these computational frameworks are employed to introduce a higher degree of realism into the automation of long-term acoustic monitoring projects, the choice of testbed is critical. Therefore, in lieu of an off-the-shelf audio dataset, this study is carried out on two different long-term field-recorded databases, published here (DOI: 10.5281/zenodo.6456604). Denoted S1 and S2, respectively, these datasets were collected continuously over a period of one year and contain a variety of both shared and unique geo-, bio-, and anthropophony, allowing for an examination of the following mixed-corpus classification scenarios:

  • Single: Single-corpus training and testing

  • Multi: Multi-corpus training and testing

  • Cross: Cross-corpus training and testing

In the single-corpus scenario, either S1 or S2 is used for both training and testing. In the multi-corpus scenario, the complete set of data and classes from S1 and S2 are combined to form one larger database for both training and testing. Finally, in the cross-corpus scenario, S1 is used for training, while S2 is used for testing or vice versa.

However, the degree of similarity between the training and unknown domains is just as crucial as that between the training and testing domains. During realistic environmental monitoring scenarios, assuming supervised learning frameworks, the unknown data is most likely to consist (i) of sound classes with such a late onset that they are not detected during initial training labeling efforts or (ii) of examples from sound classes that occur so infrequently that training on them is not possible. For this reason, we generate results for two different authentic compositions of the unknown data, corresponding to these two likely scenarios:

  • UU: Uniform unknowns

  • IU: Irregular unknowns

In the case of the UU data, three of the nine available high frequency bird call types (AMRO, AMCR, BLJA, NOCA, BCCH, EAPH, BAWW, WBNU, and RBNU from Table I) are randomly selected, excluded from training, and used instead as the unknown class in a leave-some-out approach. This construction of unknowns represents the aforementioned case of late onset sound classes, which do not appear in the soundscape record until after neural network training has concluded. Meanwhile, the IU data represents the second case, in which the unknown set is constructed from classes for which an insufficient number of samples was detected for training. For example, the dataset contained vocalizations belonging to the American Goldfinch, but too few (below 300 examples) to use as one of the network training classes.

TABLE I.

The sound categories that comprise the database at both monitoring sites and their abundance, with 56 035 sound clips assembled in total.

Sound stimuli (scientific name) | Abbr. | S1 | S2
E. chipmunk “chuck” (Tamias striatus) | ECMK | 537 | 1161
E. chipmunk “chirp” (Tamias striatus) | ECMC | 101 | 2305
Dog bark (Canis familiaris) | DGBR | 1051 | —
E. gray squirrel “kuk” (Sciurus carolinensis) | EGSK | 453 | —
Downy woodpecker drum (Dryobates pubescens) | DOWO | 340 | —
American robin (Turdus migratorius) | AMRO | 1411 | 5554
Rainfall | RNFL | 2856 | 4518
American crow (Corvus brachyrhynchos) | AMCR | 221 | 1477
Engine | ENGE | 3406 | 4592
Bluejay (Cyanocitta cristata) | BLJA | 770 | 1002
RC car | RCCR | 2361 | —
Northern cardinal (Cardinalis cardinalis) | NOCA | 1057 | —
Back-up beeper | BUBP | — | 1934
Black-capped chickadee (Poecile atricapillus) | BCCH | — | 780
Siren | SREN | 401 | —
Eastern phoebe (Sayornis phoebe) | EAPH | — | 590
Fall field cricket (Gryllus pennsylvanicus) | FFCR | 5296 | 6101
Black-and-white warbler (Mniotilta varia) | BAWW | — | 434
Dog day cicada (Neotibicen canicularis) | DDCC | 519 | —
White-breasted nuthatch (Sitta carolinensis) | WBNU | 371 | —
Red-breasted nuthatch (Sitta canadensis) | RBNU | — | 341
Low frequency unk. | IU | 433 | 667
High frequency unk. | IU | 897 | 2118

For both UU and IU, in the case of AC and HA, the unknown training data is drawn only from the IU data, split into two distinct unknown training classes corresponding to the low and high frequency ranges. This is because the UU data is meant to represent the case in which a sound class was introduced to a recording site after neural network training began, meaning that any unknown data utilized for training under the AC and HA strategies would likely not be a close match in a real-world application of these techniques.

In summary, this work offers an examination of:

  1. The extent to which OSC can be used to classify unlabeled data as the testing data varies in similarity to the training data

    As audio data has become increasingly cheap to both record and store, the creation and use of large benchmark audio datasets15,20,41 means that data collected from numerous unspecified geographic and temporal contexts are being used to train and test neural networks. However, the degree of similarity between the training and testing domains will naturally have a large influence on the efficacy of OSC. Therefore, we systematically generate results for five different mixed-corpus configurations. Both datasets contain a variety of shared and unique geo-, bio-, and anthropophony, which in and of itself is highly unusual for currently available datasets but very typical of what is encountered during long-term environmental monitoring projects.

  2. The extent to which OSC can be used to classify unlabeled data that is highly similar to the training data

    To assemble an unknown class of data, bioacoustic data recorded not only in a different location and during a different time period, but of a very different semantic nature, is often used (e.g., training on the animal class Amphibia and using the class Aves as an unknown).19 The OSC challenge that comprised DCASE 2019, for example, used outdoor urban sounds (e.g., metro station, tram travel) for training, with quiet, indoor scenes (e.g., library, office) used for testing.42 In contrast, we use data collected from both recording sites to assemble two different types of highly realistic unknown data for the monitoring scenario.

  3. The extent to which OSC can be used to classify unlabeled data with no unified temporal or frequency structure

    Moving beyond the leave-one-out approach to OSC, we also construct a different form of unknown data designed to tackle a very common but less discussed monitoring scenario, represented by the IU data. In this study, the IU data comprises dozens of different sound sources captured at both monitoring sites, meaning that the data is generally similar to the training data. However, no two samples are inherently likely to be similar, ensuring that the IU data will not behave like a traditional class of sounds with strong intra-class uniformity.

The first soundscape under study is located in a lightly wooded suburban area north of Albany, NY, at about 100 meters above sea level (Fig. 1). The recording device was not tampered with or disrupted during the study. The surrounding vegetation consists of white pines, spruces, oaks, and maples.

FIG. 1. The monitoring site, located in the Capital Region of New York State, in Albany County.

The microphone was protected from above with a plastic lid and from the sides with a metal cage (see Fig. 2). The unit was secured to a pole at a height of 1.9 meters. A Moultrie P-180i 14MP trail camera (Panoramic 180i Game Camera, Moultrie Feeders, Calera, AL) was placed nearby, taking photos every 1 min, and an AcuRite weather station (5-in-1 Weather Sensor model 06004, AcuRite, Lake Geneva, WI) was also installed to record local weather variables every 5 min.

FIG. 2. (Color online) The microphone array, protected from the weather with a plastic lid and wire cage. The array is powered by an external battery housed inside a waterproof casing.

The acoustic data were collected using a Zoom H2n Handy Recorder (H2n Audio Recorder, Zoom Corp., Tokyo, Japan) in 4-channel mode, which records audio from two pairs of coincident microphone capsules. The front pair spans 120 degrees and the rear pair spans 90 degrees, and because they are all coincident, they can be added together for near-omnidirectional recording in the horizontal plane. Recording with this configuration offers the capability of a surround-sound playback effect, though for the purposes of this analysis, only data from a single microphone capsule was used. Recordings were binaural, typically made at 16 bits in Waveform Audio File Format (WAV) at a sampling rate of 44 100 Hz.

TABLE II.

Comparison of the total classification accuracy on the test dataset and the average number of training epochs for different convolutional neural network architectures pre-trained on the ImageNet database, for initial training databases of different sizes. The left column lists the maximum number of samples used for training per class; each cell gives accuracy/epochs.

No. samples | Inception | VGG16 | ResNet50
100 | 64.4%/18 | 87.2%/18 | 88.1%/14
300 | 69.7%/13 | 90.9%/17 | 89.7%/14
500 | 83.0%/9 | 93.8%/14 | 94.7%/10
1k | 83.2%/12 | 94.2%/10 | 94.7%/14

The dataset considered in this study was collected from August 2019 to August 2020, during all seasons and all weather events. Recordings were made nearly continuously for 24 h each day, all year, except during brief interludes to change the memory card, typically at 10 pm each day. The microphone was powered using a hard-wired supply of electricity. Over 8000 h of recordings are analyzed.

The second soundscape under study is located in a forested area in Lake George, New York, also at about 100 meters above sea level. The surrounding vegetation consists largely of white pines. Located near the southern-most end of the lake, this microphone array was positioned safely between several areas used for recreational activities including hiking, baseball, football, and snowmobiling. The recording device was not tampered with or disrupted during the study.

The microphone array consisted of two H2n Zoom recorders mounted perpendicularly to simulate four-channel recording conditions. The unconventional design was motivated by limited access to the site: four-channel recordings can only be made as WAV files, which would exceed the memory card capacity too quickly. The redundancy is nonetheless useful in the case of a single microphone or memory card failure. In fact, for the purposes of this analysis, only data from a single microphone capsule was used. Recordings were binaural, recorded at 256 kbps in MPEG-2 Audio Layer III (MP3) format.

The array was deployed from February 2020 until February 2021, during all seasons and all weather events. Recordings were made nearly continuously for 24 h each day, all year, except during a 24-h interlude, typically from 9 am to 9 am, to change the memory card and battery every 12 days.

As before, the array was protected by a metal cage with a plastic lid. The unit was secured to a tree at a height of 1.4 meters. A Moultrie P-180i 14MP trail camera was placed nearby, taking photos every 15 min. The microphones were both powered with a 168 Watt-hour Goal Zero Yeti 400 battery (Yeti 400 Portable Power Station, Goal Zero, Riverton, UT) housed inside of a waterproof box placed on the forest floor. Over 8000 h of recordings are analyzed.

Once all of the audio data were collected, they were segmented into 8 s-long clips to ensure that both biological and anthropogenic sound sources with longer call lengths were provided with sufficient temporal context, while still achieving reasonable computational efficiency. Each segment was then downsampled to 16 kHz for processing efficiency, allowing each to be represented as a 128 000-sample vector. Finally, these segments were converted into log-mel spectrograms using the Librosa Python package.43 These spectrograms were computed using common parameters: an FFT size of 2048 samples, a hop length of 512 samples, and 128 mel-scaled frequency bins.44 The Librosa function specshow was used to map each spectrogram to a color scale suitable for use as neural network inputs.
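
A minimal sketch of this preprocessing chain using the Librosa API; the file name and the image-export details (figure handling, colormap) are illustrative assumptions rather than the exact pipeline used in this work.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load one 8 s clip, resampled to 16 kHz (a 128 000-sample vector)
y, sr = librosa.load("clip.wav", sr=16000, duration=8.0)  # path is a placeholder

# Log-mel spectrogram with the parameters described above
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                   hop_length=512, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)

# Map the spectrogram to a color image suitable as a CNN input
fig, ax = plt.subplots()
librosa.display.specshow(S_db, sr=sr, hop_length=512, ax=ax)
ax.set_axis_off()
fig.savefig("clip.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```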

Since data collection started at Monitoring Site 1, these data points were assembled into a labeled database, S1, first. To do this, three days from each month of the calendar year were randomly selected for annotation. Relevant spectrogram images were manually labeled via visual inspection by graduate students into a set of sound categories that evolved as the year progressed. On average, around 60 images could be annotated per minute of effort, corresponding to 8 min of audio data. At this rate, annotating every spectrogram from a single day of recording (10 800 images) would take 3 h. Assembling the labeled database took less time in practice because only the most relevant 10%–15% of the spectrograms from each of the selected days were actually annotated.

In order to more expeditiously label the Monitoring Site 2 dataset, S2, a neural network pre-trained on the labeled Monitoring Site 1 dataset was used to provide an initial set of unvalidated labels. This step allowed each spectrogram image to be automatically sorted into folders suitable for manual label validation. While specific bird species and other sounds that did not overlap between the two datasets required extensive re-labeling efforts, other, more common sound categories needed far less manual intervention, shortening the labeling process by 30% overall. At Monitoring Site 2, seven bird calls were included in the database, and miscellaneous “high frequency” and “low frequency” categories were also established.

Table I summarizes the abundance and type of sound categories established for both monitoring sites, using abbreviations established from bird or alpha codes, when applicable. For both sites, any suspected bird calls for which an insufficient number of examples were detected were placed into a generic category for unknown “high frequency” calls, some of which were used for training under AC, or used solely for testing in the case of the other OSC strategies. An unknown “low frequency” category was created that encompasses any sounds that could not be sorted into any of the more specific anthropophony classes.

Transfer learning, where a model pre-trained on one dataset is re-trained to classify a similar dataset, is one convenient approach to effectively utilize the power of CNNs. Well-known architectures such as Inception, MobileNet, and ResNet50, often pre-trained on large image databases, have recently been used with great success for the classification of birds in soundscape recordings.44–50 

To construct the neural network used to classify the spectrogram database assembled from S1, several well-known two-dimensional CNNs were explored. The architectures selected include Inception v3,48 ResNet50,50 and VGG16,51 all pretrained on the ImageNet dataset,52 which contains over 14 million images from over 20 000 categories. Training was conducted using the Python package Keras with a TensorFlow backend.53,54

To fully compare each of the three base architectures, representative distributions of the sound categories present at S1 were assembled for various overall database sizes (100, 300, 500, and 1000 max images per sound category). In this initial experiment, ResNet50 performed with a slightly higher total accuracy for the test data as the dataset approached its final, largest size (see Table II). However, even with a maximum of 1000 samples per class, ResNet50 took one-third longer to train than VGG16, with a more uncertain training time trajectory as the number of samples per class increased. Since the performance difference between both models was smaller than one percentage point, VGG16 was selected as the base neural network architecture for each successive experiment. The full neural network architecture is outlined in the left detail panel of Fig. 3.

TABLE III.

Performance summary of each open-set classification approach for each corpus scenario and unknown data type, using the validation metric ACC. The approach AC and its extension HA are recorded separately, as these methods rely on using unknown data during network training.

Scenario | Unknown | PT | DT | OM | AC | HA
Multi | UU | 40.8 | 61.5 | 69.6 | 40.6 | —
Multi | IU | 57.5 | 40.0 | 50.2 | 83.7 | 83.4
Single | UU | 38.0 | 62.4 | 49.6 | 63.2 | —
Single | IU | 52.1 | 57.4 | 51.4 | 85.7 | 85.5
Cross | UU | 32.6 | 51.1 | 50.2 | 31.5 | —
Cross | IU | 42.4 | 36.3 | 45.1 | 46.2 | 59.8
FIG. 3. (Color online) Left detail panel: The convolutional neural network architecture. Each 256 × 128 spectrogram image is fed into the pre-trained VGG16 network, followed by a two-dimensional GAP layer and two FC layers, before a prediction is rendered. This fully connected block (FCB) is used for all flat classification tasks. Right schematic: Sequentially assembled FCBs form a hierarchical neural network, with one local FCB per parent node.

As depicted in Fig. 3, a global average pooling (GAP) layer was added after the VGG16 layers, followed by a fully-connected (FC) layer with 128 units (the embedding layer) and then an FC layer with a number of units equal to the number of training classes. The softmax activation function was applied to this final layer to allow for multi-class classification, with categorical cross-entropy used as the corresponding loss function. The Adam optimizer was selected with an initial learning rate of 1 × 10^−4 and a decay of 1 × 10^−7, and early stopping was used to prevent overfitting.55 Training occurred on an NVIDIA GeForce RTX 2080 Ti GPU.
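
A sketch of this architecture in the TF 2.x Keras API; the embedding layer's ReLU activation, the early-stopping patience, and the monitored quantity are assumptions not specified in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_flat_classifier(n_classes, input_shape=(256, 128, 3)):
    """VGG16 base + GAP + 128-unit embedding layer + softmax output."""
    base = tf.keras.applications.VGG16(weights="imagenet",
                                       include_top=False,
                                       input_shape=input_shape)
    x = layers.GlobalAveragePooling2D()(base.output)
    embedding = layers.Dense(128, activation="relu",  # activation assumed
                             name="embedding")(x)
    output = layers.Dense(n_classes, activation="softmax")(embedding)
    model = models.Model(inputs=base.input, outputs=output)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4,
                                                     decay=1e-7),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Early stopping on validation loss; patience value is assumed
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=3,
                                              restore_best_weights=True)
```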

From the assembled database of annotated clips, in each of the five cross-validation folds, 10% of the data were reserved for validation and a further 10% for testing. All results reported are the average after fivefold cross-validation. Classes with over 2000 samples were undersampled, and classes with fewer than half this amount were oversampled.
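
A small sketch of this class-balancing step; the 2000-sample undersampling ceiling and 1000-sample oversampling floor come from the text, while the exact post-resampling target sizes used below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def balance_class(indices, floor=1000, ceiling=2000):
    """Undersample classes above the ceiling and oversample those below
    the floor; classes in between are left untouched. Target sizes are
    assumed to be the floor/ceiling values themselves."""
    indices = np.asarray(indices)
    if len(indices) > ceiling:
        return rng.choice(indices, size=ceiling, replace=False)
    if len(indices) < floor:
        return rng.choice(indices, size=floor, replace=True)
    return indices
```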

The most straightforward approach to taxonomic classification is a flat classifier. With this approach, there is only one level of label, and implementation only requires a single classification architecture, in this case, a neural network (see left panel of Fig. 3). Despite the advantages afforded by this simplicity, with this method, information about any parent-child relationships between the classes is lost. There is, for example, no distinction between the classification of anthropogenic sounds, bird calls, and insect calls, despite large differences in the origin and structure of these acoustic signals.

In this work, to include the hierarchies inherent in the dataset, one local classifier is constructed for each parent node.56 This local classifier is the simple VGG16-based architecture captured in the left panel of Fig. 3, repeatedly invoked for increasingly fine divisions of the original dataset.

In this case, there is one coarse-grained binary classifier (distinguishing low versus high frequency sounds), two medium-grained binary classifiers (distinguishing vertebrates either from non-vertebrates or from non-living sound sources), and four multi-class fine-grained classifiers (species-specific classification). Each of these fine-grained classifiers is established in a way that allows open-set classification, regardless of whether any unknown data will actually be fed into that classifier. Classification begins at the top of the taxonomic structure and with each descending level, only the local classifier(s) that are directly below the given predicted class are considered.
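
A minimal sketch of this top-down routing, assuming a `classifiers` dictionary that maps each parent node to a trained local model with a label-returning `predict` method; the node names and interface are hypothetical stand-ins for the actual implementation.

```python
def hierarchical_predict(x, classifiers):
    """Descend the taxonomy, invoking one local classifier per parent node."""
    node = "root"  # coarse-grained low/high frequency split (name assumed)
    while node in classifiers:        # leaf classes have no local classifier
        label = classifiers[node].predict(x)
        if label == "unknown":
            return "unknown"          # reject at any level of the hierarchy
        node = label                  # descend into the predicted child node
    return node                       # a fine-grained (species-level) label
```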

With the front-end neural network architecture designed, the final classification stage must be altered to accommodate an open dataset. The simplest way to handle this added ambiguity is to introduce a training category meant to represent the collective of these unknowns. This method requires some domain-specific knowledge of the nature and structure of the unknown data for the given classification task, which is not always available with continuous monitoring applications. Under these conditions, the unknown samples must be detected solely during the testing phase, such as with a prediction score-based threshold.

Specifically, during network testing, after each test sample has been assigned to a class, the score associated with that prediction is examined. If the score exceeds a given threshold (typically 0.5) then the original prediction stands. If, however, the score falls below the threshold, then the prediction label is reassigned and the sample is deemed an unknown.

Specifically, if, for a given test sample, $\mathbf{y} = [y_1, y_2, \ldots, y_N]$ is the softmax output of a closed set classifier with $N$ classes, the test sample is classified as
$$\hat{y} = \begin{cases} \operatorname*{argmax}_{1 \le n \le N} \, y_n & \text{if } \max_{1 \le n \le N} y_n > \epsilon, \\ \text{unknown} & \text{otherwise,} \end{cases}$$
where $\epsilon \in (0, 1)$ is the decision threshold.
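
A minimal sketch of this decision rule applied to a batch of softmax outputs; the label used to mark unknowns is arbitrary.

```python
import numpy as np

def prediction_threshold(softmax_scores, epsilon=0.5, unknown=-1):
    """Keep the argmax prediction only if its top score exceeds epsilon."""
    preds = np.argmax(softmax_scores, axis=1)
    preds[np.max(softmax_scores, axis=1) <= epsilon] = unknown
    return preds

# Example: the second sample's top score (0.45) falls below the threshold
scores = np.array([[0.90, 0.05, 0.05],
                   [0.45, 0.30, 0.25]])
print(prediction_threshold(scores))  # [ 0 -1]
```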

This basic approach can be made more sophisticated by introducing an additional numeric value: the likelihood of the distance between the test sample and the average training sample from the class to which it was assigned during the prediction step. This distance is calculated based upon neural network embeddings, low-dimensional representations of the input data that the network has learned. In this work, the last FC layer of the neural network, with 128 units, serves as the embedding layer.

To carry out this process, inspired by Thakur et al.,19 the embeddings of the training samples are first averaged to obtain a mean embedding vector for each training class. The Euclidean distance between each of these average embeddings and every validation embedding of the same class is then calculated, with the resulting distances denoted $d_n$ for classes $n = 1, \ldots, N$.

Assuming that this collection of distances is generated by an underlying Gaussian process, the mean and variance can be estimated via maximum likelihood estimation (MLE). Finding the $\mu$ and $\sigma^2$ values that maximize the likelihood means that the unimodal Gaussian model is, by definition, the one most likely to have produced $d_n$. The likelihood of this distance with respect to each of the $N$ training classes is then calculated using the Gaussian probability density function
$$p(d_n) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left(-\frac{(d_n - \mu_n)^2}{2\sigma_n^2}\right). \tag{1}$$

As before, if the resulting probability for each training class is below the decision threshold, then that particular test sample can be considered an unknown.
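
A sketch of the DT pipeline under the definitions above: per-class centroids from the training embeddings, MLE Gaussian fits to the validation distances, and a likelihood threshold at test time. The function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def fit_distance_gaussians(train_emb, train_lab, val_emb, val_lab):
    """For each class: centroid of training embeddings plus an MLE Gaussian
    fit to the centroid distances of the validation embeddings."""
    params = {}
    for c in np.unique(train_lab):
        centroid = train_emb[train_lab == c].mean(axis=0)
        d = np.linalg.norm(val_emb[val_lab == c] - centroid, axis=1)
        params[c] = (centroid, d.mean(), d.std())  # MLE mu and sigma
    return params

def dt_is_unknown(embedding, params, epsilon):
    """Unknown if the distance likelihood falls below epsilon for every class."""
    likelihoods = [norm.pdf(np.linalg.norm(embedding - centroid), mu, sigma)
                   for centroid, mu, sigma in params.values()]
    return max(likelihoods) < epsilon
```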

With the final approach, OpenMax,40 an alternative final activation layer is applied to the neural network in lieu of softmax. Similar to the DT approach, a weighted sum of the Euclidean distance and cosine similarity is taken between the average embedding for each training class and a given test sample. An unknown is then determined to be a test sample whose embedding diverges sufficiently from the average of each of the N classes. To determine this threshold, N Weibull distributions are fitted to represent the maximum expected divergence from the average embedding for each class. If the divergence of the test example is sufficiently improbable under every class's Weibull model, then the sample can be considered an unknown.
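
The following is only a simplified rejection sketch in the spirit of OpenMax, omitting the full logit recalibration of Bendale and Boult;40 the tail size and the use of a pure Euclidean distance (rather than the weighted Euclidean/cosine combination described above) are simplifying assumptions.

```python
import numpy as np
from scipy.stats import weibull_min

def fit_weibull_tails(centroids, train_emb, train_lab, tail_size=20):
    """Fit a Weibull model to each class's largest centroid distances,
    following extreme value theory. `centroids` maps class -> mean embedding."""
    tails = {}
    for c, centroid in centroids.items():
        d = np.linalg.norm(train_emb[train_lab == c] - centroid, axis=1)
        tails[c] = weibull_min.fit(np.sort(d)[-tail_size:], floc=0)
    return tails

def openmax_reject(embedding, centroids, tails, threshold=0.9):
    """Unknown if the embedding sits in the extreme tail for every class."""
    tail_probs = [weibull_min.cdf(np.linalg.norm(embedding - centroids[c]),
                                  *tails[c])
                  for c in centroids]
    return min(tail_probs) > threshold
```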

In order to compare the flat classifier results of each of the four open-set classification schemes, a validation metric, ACC, derived from the DCASE 2019 challenge is used.42 In particular, Task 1C uses an equally weighted sum of the average test accuracy of samples from the known (ACC$_K$) and unknown (ACC$_U$) classes,
$$\mathrm{ACC} = 0.5 \cdot \mathrm{ACC}_K + 0.5 \cdot \mathrm{ACC}_U. \tag{2}$$
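
As a brief worked example, a classifier that correctly labels 85% of the known-class test samples but flags only 40% of the unknowns would score ACC = 0.5 · 0.85 + 0.5 · 0.40 = 0.625; the equal weighting ensures that neglecting either the known or the unknown half of the task is penalized.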

The success of each OSC technique is highly dependent upon both the dataset configuration and the type of unknown data that is fed into the neural network, as Table III illustrates. In Table III, the ACCU and ACCK metrics are averaged together to form the ACC metric. Meanwhile, the two sub-scenarios that comprise each of the single- and cross-corpus cases are also averaged together to allow a comparison between all methods. We use S1S1 to denote training and testing on S1, and S2S2 to denote the other single-corpus scenario. Similarly, we use S1S2 and S2S1 to denote both cross-corpus scenarios.

OM performs best with a multi-corpus dataset for the case of the UUs, but with IUs, PT actually achieves the highest overall accuracy of the methods that do not rely on unknown data during network training. Meanwhile, for all single-corpus scenarios, across both types of unknowns, DT achieves the highest ACC values, again excluding the AC and HA methods. For cross-corpus training and testing, the result is again split between two different methods, with DT performing best for UUs but OM performing best for IUs, a reverse of the trend seen for the multi-corpus case. For scenarios in which the test unknown data is most similar to the unknown training data provided to methods AC and HA, particularly the IU scenario for all data configurations, AC predictably outperforms all of the inference-based methods.

When it comes to the cross-corpus scenario, however, the AC method is far outperformed by the HA method. The hierarchical model, whose structure is outlined in Fig. 3, only performed effectively when the added-classes OSC strategy was applied, and only for the IU unknown type; therefore, only these results are shown. While HA performs roughly as well as AC for the multi- and single-corpus scenarios, that performance is not sufficient to justify the added implementation complexity. For the cross-corpus scenario, however, the HA technique far outperforms all other methods, including, crucially, AC.

A more detailed view of the flat classifier results can be seen in Fig. 4. In the case of AC, the results with the IUs far exceed those generated for the UUs. This is likely due to the fact that the training data provided for the IU case is far more similar to the testing data. For UU, meanwhile, the testing data is deliberately not reflected in the training data, as the UUs represent a scenario in which new sound classes appear in the record after training has begun. In fact, in this case, the method is almost always outperformed, indicating that the inclusion of unknown training data is much more of a hindrance than a benefit when it does not precisely capture the testing data. This method is also quite sensitive to the similarity of the testing data to the training data because, for both unknown types, the method performs notably better for both single-corpus scenarios than any others.

FIG. 4. (Color online) Flat classification performance for the four open-set classification techniques for both unknown data types and all five data configurations. Both the average unknown accuracy (ACCU) and the average known accuracy (ACCK) are displayed.

The PT approach, meanwhile, also performs best for the IU data type. This is driven by the far higher performance seen for the ACCK metric rather than by any other factor, indicating that more false positive unknown classifications occur for the UU data type: ACCU remains similar across both unknown data types while ACCK decreases. Interestingly, the method performs marginally better on the multi-corpus dataset than any other, particularly due to a boost in the ACCU metric.

With the DT approach, however, the opposite is largely true. Specifically, the method performs best for the UU rather than the IU. This is due to an increase in the ACCU metric, as ACCK levels between the two data types are far more similar. This is presumably caused by an increase in false negatives for the IU data type. The ACCU metric suffers noticeably for the multi-corpus scenario in particular, in comparison to the corresponding UU result.

For the OM approach, the largest performance discrepancy between the single-corpus scenarios (S1S1 and S2S2) and the cross-corpus scenarios (S1S2 and S2S1) can be seen, indicating an increased sensitivity to the particulars of the given dataset. In all cases, performance is lower when training occurs on the S2 dataset. There is little pattern as to whether the method achieves better results with one unknown type or the other: OM performs best overall for the multi-corpus UU scenario, but the only other configuration in which it outperforms every other method is the cross-corpus IU scenario.

In the case of the hierarchical classifier, only the added-classes open set approach is considered, as the uncertainty inherent to the other methods, in combination with the error that carries forward in the hierarchical approach, produced unsatisfactory results. Additionally, each fine-grained classifier required a different threshold for the DT and PT approaches, making for a difficult ad hoc determination.

To compare approach AC to HA for the IU unknown type in more detail, the validation metrics from all four fine-grained classifiers are averaged together to create the right side of Fig. 5. Meanwhile, the left side shows the AC results again, for a more rapid visual comparison. With the two single-corpus scenarios, HA performs fairly comparably to the flat classification strategy. The slight discrepancy in performance is due to a reduced recall score, caused by a disproportionate number of false negatives for sound classes that are erroneously sorted into the unknown class. This particularly occurs for the hierarchical model because each of the four fine-grained classifiers is built to carry out OSC, even though only two of these classifiers can expect to face unknown data (see Fig. 3).

FIG. 5. (Color online) Flat versus hierarchical classification performance for the added-classes OSC approach for all five data configurations and both unknown data types. The metrics ACCU and ACCK are shown.

With the cross-corpus training scenarios, however, the potential advantages of a hierarchical neural network architecture become more clear. For both S1S2 and S2S1, the hierarchical network significantly outperforms the flat network, with an average gain of 7% for ACCU and 20% for ACCK. This particular increase in ACCK is likely due to a decrease in the number of false positive unknown classifications, indicating that examples from sound classes that were misconstrued as unknowns by the flat classifier were correctly sorted by the coarse- or medium-grained hierarchical classifiers into fine-grained classification situations for which no unknown training data were present.

In this study, a variety of OSC frameworks are compared for a diverse set of training and testing data configurations. Two of these approaches, AC and the HA extension, rely on the construction of a general unknown training category. Since constructing this category may be undesirable for an under-labeled or temporally ongoing environmental acoustics dataset, two threshold-based approaches, PT and DT, are explored.

Finally, the OM approach involves modifying the final softmax layer of the neural network to incorporate distance-based metrics, where unknowns are identified by their greater distance in the embedding space from the samples that typify each training category. In this way, approach DT combines elements of approaches PT and OM because it offers a distance-based threshold metric.

Notably, with the multi-corpus use-case with UUs, all other OSC methods outperform AC. This indicates that relying on imperfect domain knowledge of the unknown class is worse than relying on no domain knowledge at all. In such a case, the OM approach performs far better.

For the IU type, however, the simplest thresholding method, PT, outperforms all other closed set training approaches by a notable margin. If implementation time is a concern, particularly for shorter-term monitoring projects for which there is not likely to be a change in the nature of the unknown data type over time, PT is an appropriate choice. Indeed, given its high performance for a relatively large and diverse acoustic dataset and ease of interpretability, the authors agree with others42 that this thresholding technique is a useful tool, if only as a starting point for an OSC feasibility study.

The single-corpus use-case offers an environmentally authentic scenario in which the training and testing data share a domain, but instances of unknowns in the training data are structurally very different from those found in the testing data. This data configuration most clearly demonstrates the utility of the DT approach. Therefore, this strategy is recommended as a first approach for long-term acoustic monitoring classification tasks in cases when the training and testing domains are similar and AC methods cannot or should not be used.

Notably, for cross-corpus scenarios with the UU type, both distance-based metrics, DT and OM, achieve a similar overall classification performance. OM performs better, meanwhile, for the IU unknown case, making this approach the most appropriate, overall, for use-cases that involve highly dissimilar training and testing domains. Indeed, the consistency of the OM technique across all data configurations and unknown data types makes this approach desirable for a user who requires consistent and predictable classification results even if neural network training strategies change or become more ambiguous. DT, meanwhile, does not achieve similar consistency across both unknown data types.

While the benefit of implementing a hierarchical classifier with one local classifier per parent node is not clear when the training and testing domains are similar, particularly given the inherent implementation complexities, this does not hold true as the distance between the training and testing domains increases. In fact, in cases where the AC approach is viable, a hierarchical network achieves noteworthy performance improvements for the highly realistic cross-corpus scenario without any additional costly data labeling efforts. While approaches DT and OM are other viable options for such cross-corpus training scenarios, when the general nature of the unknowns can be ascertained to be sufficiently uniform, hierarchical architectures can prove far more successful. However, the hierarchical classification scheme did not combine effectively with any of the other open-set classification approaches, limiting the potential utility of this highly inflexible OSC method.

This material is based upon work supported by the National Science Foundation under Grant Nos. 1631674 and 1909229, a Rensselaer Polytechnic Institute Humanities, Arts, and Social Sciences Fellowship, and their Cognitive and Immersive Systems Laboratory. Thank you to Vincent Moriarty and Mark Lucius for instrumentation assistance and Rick Relyea for consulting on this project.

1. A. S. Goudie, “Human influence on animals,” in Human Impact on the Natural Environment: Past, Present and Future, 8th ed. (Wiley Blackwell, Oxford, UK, 2018), pp. 70–102.
2. G. Shannon, M. F. McKenna, L. M. Angeloni, E. Brown, K. A. Warner, M. D. Nelson, C. White, J. Briggs, S. McFarland, K. R. Crooks, K. M. Fristrup, and G. Wittemyer, “A synthesis of two decades of research documenting the effects of noise on wildlife,” Biol. Rev. 91(4), 982–1005 (2016).
3. P. Duelli and M. K. Obrist, “Biodiversity indicators: The choice of values and measures,” Agricult. Ecosyst. Environ. 98(1), 87–98 (2003).
4. I. Potamitis, S. Ntalampiras, O. Jahn, and K. Riede, “Automatic bird sound detection in long real-field recordings: Applications and tools,” Appl. Acoust. 80, 1–9 (2014).
5. N. Priyadarshani, S. Marsland, and I. Castro, “Automated birdsong recognition in complex acoustic environments: A review,” J. Avian Biol. 49(5), jav-01447 (2018).
6. K. A. Swiston and D. J. Mennill, “Comparison of manual and automated methods for identifying target sounds in audio recordings of Pileated, Pale-billed, and putative Ivory-billed woodpeckers,” J. Field Ornithol. 80(1), 42–50 (2009).
7. A. L. McIlraith and H. C. Card, “Bird song identification using artificial neural networks and statistical analysis,” in Proceedings of the Canadian Conference on Electrical and Computer Engineering, St. Johns, Newfoundland, Canada (May 25–28, 1997), pp. 63–66.
8. S. O. Murray, E. Mercado, and H. L. Roitblat, “The neural network classification of false killer whale (Pseudorca crassidens) vocalizations,” J. Acoust. Soc. Am. 104(6), 3626–3633 (1998).
9. S. Parsons and G. Jones, “Acoustic identification of twelve species of echolocating bat by discriminant function analysis and artificial neural networks,” J. Exp. Biol. 203(17), 2641–2656 (2000).
10. G. S. Campbell, R. C. Gisiner, D. A. Helweg, and L. L. Milette, “Acoustic identification of female Steller sea lions (Eumetopias jubatus),” J. Acoust. Soc. Am. 111(6), 2920–2928 (2002).
11. M. Cowling and R. Sitte, “Comparison of techniques for environmental sound recognition,” Pattern Recogn. Lett. 24(15), 2895–2907 (2003).
12. C. M. Nickerson, L. L. Bloomfield, M. R. W. Dawson, and C. B. Sturdy, “Artificial neural network discrimination of black-capped chickadee (Poecile atricapillus) call notes,” J. Acoust. Soc. Am. 120(2), 1111–1117 (2006).
13. P. Khunarsal, C. Lursinsap, and T. Raicharoen, “Very short time environmental sound classification based on spectrogram pattern matching,” Inf. Sci. 243, 57–74 (2013).
14. S. Kahl, F.-R. Stöter, H. Goëau, H. Glotin, B. Planqué, W. Vellinga, and A. Joly, “Overview of BirdCLEF 2019: Large-scale bird recognition in soundscapes,” in Proceedings of CLEF 2019, Lugano, Switzerland (September 9–12, 2019).
15. S. Kahl, M. Clapp, W. Hopping, H. Goëau, H. Glotin, R. Planqué, W.-P. Vellinga, and A. Joly, “Overview of BirdCLEF 2020: Bird sound recognition in complex acoustic environments,” in Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum (CEUR-WS), Vol. 2696, p. 262 (2020).
16. L. Ptacek, L. Machlica, P. Linhart, P. Jaska, and L. Muller, “Automatic recognition of bird individuals on an open set using as-is recordings,” Bioacoustics 25(1), 55–73 (2016).
17. J. Salamon, J. P. Bello, A. Farnsworth, M. Robbins, S. Keen, H. Klinck, and S. Kelling, “Towards the automatic classification of avian flight calls for bioacoustic monitoring,” PLoS One 11(11), e0166866 (2016).
18. J. Salamon, J. P. Bello, A. Farnsworth, and S. Kelling, “Fusing shallow and deep learning for bioacoustic bird species classification,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA (March 5–9, 2017), pp. 141–145.
19. A. Thakur, D. Thapar, P. Rajan, and A. Nigam, “Deep metric learning for bioacoustic classification: Overcoming training data scarcity using dynamic triplet loss,” J. Acoust. Soc. Am. 146(1), 534–547 (2019).
20. D. Stowell, M. Wood, H. Pamuła, Y. Stylianou, and H. Glotin, “Automatic acoustic detection of birds through deep learning: The first Bird Audio Detection challenge,” Methods Ecol. Evol. 10(2), 368 (2019).
21. H. Zhu, C. Ren, J. Wang, L. Yang, S. Li, and L. Wang, “DCASE 2019 challenge task1 technical report,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), Beijing, China (March 4–June 30, 2019).
22. B. Lehner, K. Koutini, C. Schwarzlmuller, T. Gallien, and G. Widmer, “Acoustic scene classification with reject option based on ResNets,” Technical report, Silicon Austria Labs, Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria (2019).
23. A. Rakowski and M. Kosmider, “Frequency-aware CNN for open set acoustic scene classification,” Technical report, Samsung R&D Poland, Krakow, Poland (2019).
24. D. Battaglino, L. Lepauloux, and N. Evans, “The open-set problem in acoustic scene classification,” in Proceedings of the 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), Xi'an, China (September 13–16, 2016), pp. 1–5.
25. K. Wilkinghoff and F. Kurth, “Open-set acoustic scene classification with deep convolutional autoencoders,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), Beijing, China (March 4–June 30, 2019), pp. 258–262.
26. S. Ntalampiras and I. Potamitis, “Acoustic detection of unknown bird species and individuals,” CAAI Trans. Intell. Technol. 6(3), 291–300 (2021).
27. M. Morgan and J. Braasch, “Long-term deep learning-facilitated environmental acoustic monitoring in the Capital Region of New York State,” Ecol. Inf. 61, 101242 (2021).
28. R. Roady, T. L. Hayes, R. Kemker, A. Gonzales, and C. Kanan, “Are open set classification methods effective on large-scale datasets?,” PLoS One 15(9), e0238302 (2020).
29. Z. Kwiatkowska, B. Kalinowski, M. Kośmider, and K. Rykaczewski, “Deep learning based open set acoustic scene classification,” in Proceedings of Interspeech 2020, Shanghai, China (October 25–29, 2020), pp. 1216–1220.
30. J. Cramer, V. Lostanlen, A. Farnsworth, J. Salamon, and J. P. Bello, “Chirping up the right tree: Incorporating biological taxonomies into deep bioacoustic classifiers,” in Proceedings of the ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain (May 4–9, 2020), pp. 901–905.
31. J. G. Colonna, J. Gama, and E. F. Nakamura, “A comparison of hierarchical multi-output recognition approaches for anuran classification,” Mach. Learn. 107(11), 1651–1671 (2018).
32. F. Saki and N. Kehtarnavaz, “Real-time hierarchical classification of sound signals for hearing improvement devices,” Appl. Acoust. 132, 26–32 (2018).
33. D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” arXiv:1610.02136 (2018).
34. F. Angiulli and C. Pizzuti, “Fast unknown detection in high dimensional spaces,” in Principles of Data Mining and Knowledge Discovery, edited by G. Goos, J. Hartmanis, J. van Leeuwen, J. G. Carbonell, J. Siekmann, T. Elomaa, H. Mannila, and H. Toivonen (Springer, Berlin-Heidelberg, 2002), Vol. 2431, pp. 15–27.
35. S. D. Bay and M. Schwabacher, “Mining distance-based unknowns in near linear time with randomization and a simple pruning rule,” in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '03, Washington, DC (August 24–27, 2003), p. 29.
36. W. Qin and J. Qu, “VOD: A novel unknown detection algorithm based on Voronoi diagram,” in Proceedings of the 2010 WASE International Conference on Information Engineering, Beidaihe, Hebei (August 14–15, 2010), pp. 40–42.
37. W. J. Scheirer, A. Rocha, R. J. Micheals, and T. E. Boult, “Meta-recognition: The theory and practice of recognition score analysis,” IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1689–1695 (2011).
38. W. J. Scheirer, L. P. Jain, and T. E. Boult, “Probability models for open set recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 36(11), 2317–2324 (2014).
39. A. Bendale and T. Boult, “Towards open world recognition,” in Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA (June 7–12, 2015), pp. 1893–1902.
40. A. Bendale and T. E. Boult, “Towards open set deep networks,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV (June 27–30, 2016), pp. 1563–1572.
41. “Xeno-canto: Sharing bird sounds from around the world,” https://www.xeno-canto.org/ (Last viewed March 3, 2022).
42. M. Mandel, J. Salamon, and D. P. W. Ellis, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019) (2019).
43. B. McFee, V. Lostanlen, A. Metsai, M. McVicar, S. Balke, C. Thomé, C. Raffel, F. Zalkow, A. Malek, Dana, K. Lee, O. Nieto, J. Mason, D. Ellis, E. Battenberg, S. Seyfarth, R. Yamamoto, K. Choi, J. Moore, R. Bittner, S. Hidaka, Z. Wei, nullmightybofo, D. Hereñú, F.-R. Stöter, P. Friesch, A. Weiss, M. Vollrath, and T. Kim, “librosa/librosa: 0.8.0,” https://zenodo.org/record/3955228 (Last viewed October 22, 2021).
44. J. LeBien, M. Zhong, M. Campos-Cerqueira, J. P. Velev, R. Dodhia, J. L. Ferres, and T. M. Aide, “A pipeline for identification of bird and frog species in tropical soundscape recordings using a convolutional neural network,” Ecol. Inf. 59, 101113 (2020).
45. A. Sevilla and H. Glotin, “Audio bird classification with inception-v4 extended with time and time-frequency attention mechanisms,” in Proceedings of CLEF 2017, Dublin, Ireland (September 11–14, 2017).
46. A. Incze, H.-B. Jancso, Z. Szilagyi, A. Farkas, and C. Sulyok, “Bird sound recognition using a convolutional neural network,” in Proceedings of the 2018 IEEE 16th International Symposium on Intelligent Systems and Informatics (SISY), Subotica, Serbia (September 13–15, 2018), pp. 000295–000300.
47. J. Florentin, T. Dutoit, and O. Verlinden, “Detection and identification of European woodpeckers with deep convolutional neural networks,” Ecol. Inf. 55, 101023 (2020).
48. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv:1409.4842 (2014).
49. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861 (2017).
50. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV (June 27–30, 2016), pp. 770–778.
51. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556 (2015).
52. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proceedings of CVPR09, Miami Beach, FL (June 20–25, 2009).
53. F. Chollet, “Keras,” https://github.com/fchollet/keras (Last viewed January 12, 2022).
54. “TensorFlow,” https://zenodo.org/record/4724125 (Last viewed November 9, 2021).
55. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980 (2017).
56. C. N. Silla and A. A. Freitas, “A survey of hierarchical classification across different application domains,” Data Min. Knowl. Discov. 22(1–2), 31–72 (2011).