Many odontocetes produce whistles that feature characteristic contour shapes in spectrogram representations of their calls. Automatically extracting the time × frequency tracks of whistle contours has numerous downstream applications, including species classification, identification, and density estimation. Deep-learning-based methods, which train models on analyst-annotated whistles, offer a promising way to reliably extract whistle contours. However, the application of such methods can be limited by the significant amount of time and labor required for analyst annotation. To overcome this challenge, a technique that learns from automatically generated pseudo-labels has been developed. These annotations are less accurate than those produced by human analysts but far more cost-effective to generate. It is shown that standard training methods do not learn effective models from these pseudo-labels. An improved loss function, designed to compensate for pseudo-label error, is introduced and significantly increases whistle extraction performance. The experiments show that the developed technique performs well when trained with pseudo-labels generated by two different algorithms. Models trained with the generated pseudo-labels can extract whistles with an *F*1-score (the harmonic mean of precision and recall) of 86.31% and 87.20% for the two sets of pseudo-labels considered. This performance is competitive with a model trained on 12 539 expert-annotated whistles (*F*1-score of 87.47%).

## I. INTRODUCTION

There are currently 72 recognized species of odontocetes (compared to only 14 baleen whale species) of which approximately two-thirds are known to produce whistles (Wursig and Perrin, 2009). Odontocete whistles are highly complex and variable communication signals and contain not only information about the species that produced the vocalization (Gillespie et al., 2013; Jiang et al., 2019) but also behavioral states (Taruski, 1979; Sjare and Smith, 1986), and, in some cases, individual identity (Caldwell and Caldwell, 1968; van Parijs and Corkeron, 2001; Janik et al., 2013; Sayigh et al., 2013; Kaplan et al., 2014). Consequently, marine biologists frequently deploy hydrophones to study these marine mammals. Mid- to high-frequency signals, such as whistles and echolocation clicks (Oswald et al., 2003), are frequently recorded at high sample rates (e.g., ≥192 kHz), resulting in extensive sound archives to be analyzed. Automated extraction (and subsequent species classification) of whistles from these data remains a significant challenge in the field of animal bioacoustics, and new methods are needed to make the extraction process more efficient and reliable.

Most odontocete whistles feature characteristic contour shapes in the time-frequency (t-f) domain. Whistle extraction aims to determine the t-f bins of whistles in spectrograms, which then facilitates subsequent tasks, e.g., classification of these acoustic signals to the species level. Although biologists can manually extract whistles as t-f contours in spectrograms, this task is highly labor intensive. To speed up the acoustic analysis process, various automated whistle extraction algorithms have been developed over the years (e.g., Mallawaarachchi et al., 2008; White and Hadley, 2008; Mellinger et al., 2011; Roch et al., 2011; Gillespie et al., 2013; Gruden and White, 2020; Li et al., 2020; Wang et al., 2021; Conant et al., 2022).

Whistle extraction methods (e.g., Roch et al., 2011) typically contain two steps. Most algorithms start by using peak detection algorithms to find regions of high energy that may belong to a whistle. In most cases, these are then examined to see if they are near other peaks and, therefore, likely to be parts of a whistle. This may be performed using deterministic (e.g., Mellinger et al., 2011) or probabilistic (e.g., Gruden and White, 2020) trajectory models. These sets of peaks may be subjected to additional processing but are eventually reported as a whistle contour.

More recently, deep-learning-based methods have been applied to whistle extraction. Li et al. (2020) trained convolutional neural networks (CNNs) to find candidate t-f bins of whistles in spectrograms. The CNN model outputs a confidence map for the spectrogram, where each t-f bin has a confidence score of whether this bin contains part of a whistle signal. The t-f bins with confidence scores above a predefined threshold are connected into whistle contours using a graph search method (Roch et al., 2011). Compared to using spectral peaks to extract whistles, the deep-learning-based method improved the *F*1-score by around 20% on a two-species (*Delphinus capensis* and *Tursiops truncatus*) benchmark dataset. However, training the model used thousands of manually annotated whistles, and analyst annotations were produced over several months.

To facilitate the application of deep-learning-based methods in situations where large, annotated datasets are unavailable, we explore ways to train the model with pseudo-labels. Pseudo-labels are approximate labels of whistles consisting of sequences of time × frequency coordinates that trace the whistle in a spectrogram representation. In this work, we generated these by using previously published whistle extractors that require no or very little analyst training data. We do not expect the pseudo-labels to be as accurate as those produced by human analysts. We expect that some whistles will not be labeled, spurious labels may be generated, and errors in time and frequency labels may occur. Our setting requires human analysts neither to annotate whistles nor to validate the generated pseudo-labels. Training deep neural networks (DNNs) with pseudo-labels is challenging due to the increased errors in the pseudo-labels as compared to analyst-generated ones. Neural networks learn by adjusting network parameters to minimize a loss function that measures the difference between predictions and expected labels. Consequently, when a DNN is trained to fit these less accurate pseudo-labels well, the model will typically deliver unsatisfactory performance.

Errors in the pseudo-labels can be thought of as a form of noise in the label set. The machine learning community has proposed three categories of modified loss functions to improve model robustness to label noise (Song et al., 2022): using distance metrics robust to label noise, reweighting samples according to their label quality, and correcting the loss with noise estimates.

First, researchers developed novel distance metrics in loss functions. Ghosh et al. (2017) showed that symmetric loss functions, e.g., mean absolute error (MAE), led to a smaller performance drop compared to nonsymmetric loss functions, e.g., categorical cross-entropy (CCE), when there are noisy labels. Zhang and Sabuncu (2018) found that MAE may perform poorly with DNNs and proposed the negative Box-Cox transformation as a noise-robust loss function, which surpassed MAE and CCE in varying label noise scenarios. Wang et al. (2019b) added reversed cross-entropy to the original cross-entropy loss, forming a symmetric cross-entropy loss that reduced model overfitting to noisy labels. Ma et al. (2020) normalized loss functions by dividing by the sum of the loss over all possible labels, but the resultant model tended to underfit. To address this problem, they further proposed the active passive loss, which combined normalized losses of two types: one optimized only on the label class (active loss) and one optimized on all classes (passive loss). Kim et al. (2019) and Kim et al. (2021) proposed the negative learning (NL) loss to deal with noisy labels on an image classification task. Compared to classical loss functions, which encourage neural network models to maximize the score of the label class, NL minimizes the scores of the complementary non-label classes. Assuming that label noise is not too great, the NL loss tends to reduce the impact of a mislabeled example.

Second, samples may be weighted differently in the loss function. Natarajan et al. (2013) assumed the existence of class-dependent label noise on a binary classification dataset, and they modified the original loss to a weighted surrogate loss according to manually assigned noise rates and sample labels. Wang et al. (2019a) calculated the gradients of the training loss with respect to the logit vector and improved MAE by giving samples different weights according to the magnitude of the gradients. Su et al. (2021) applied an annotator-robust loss to edge detection data annotated by multiple annotators. If annotators could not agree on the label of a pixel, that pixel was removed from the loss calculation. As the number of edge pixels is usually much smaller than that of background pixels, the annotator-robust loss balanced their contributions by using different weight factors for the two classes.

Finally, the noise distribution may be estimated to correct the loss function. Goldberger and Ben-Reuven (2017) viewed the correct label as a latent random variable and modeled the noise by an additional softmax layer, which predicted the probability of correct hidden labels. Patrini et al. (2017) combined noise rate estimation algorithms and DNNs, where the estimated transition matrix corrected the loss function to make it equal to the original loss computed on clean labels. Tanno et al. (2019) modeled the annotation errors of each annotator with a confusion matrix, and they added a regularization term that jointly optimized the confusion matrix and model predictions. Xia et al. (2019) trained the classifier with noisy labels, initialized the label transition matrix based on the classifier's predictions, and then retrained the model with a learnable variable that automatically revised the transition matrix.

All of the above methods work well on their target image classification tasks, but translating them to two-class (noise vs whistle energy) t-f bin prediction in audio requires adapting the algorithms to a domain that leverages the local context of surrounding t-f bins to make correct predictions. In some cases, e.g., careful manual estimation of noise distributions, the methods are not well aligned with the goal of this study, which is to perform predictions without the need for extensive analysis of the pseudo-labels. Because direct application of these methods to a whistle extraction task did not seem likely to be fruitful, we designed a method inspired by these techniques. To the best of our knowledge, no previous method uses pseudo-label training for the whistle extraction task. Specifically, we propose a method to re-weight different components in the loss function, which reduces the effect of incorrect labels. We observe that our pseudo-labels frequently miss whistle signals. When models are trained directly with a loss function that assigns the same weight to each t-f bin, they may prefer to predict whistles as background noise due to the presence of mislabeled whistle energy within the set of pseudo-labels. Consequently, we observe a tendency for the network to predict whistles with high precision but low recall.

To encourage the network to predict more whistles, we divide the t-f bins into two categories: foreground bins, where a pseudo-label indicates that whistle energy occurs, and background bins otherwise. We add a regularization term to re-weight foreground and background t-f bins in the loss function. The modified loss encourages the model to make correct predictions under expected label noise, e.g., when the pseudo-label misses part of a whistle. However, pseudo-labels may contain multiple types of errors, and although the modified loss function may suppress one type of error, it may encourage the model to make another type of error. For example, increasing the weight of foreground t-f bins may help reduce false negative predictions, but it also increases the chance of false positive predictions. Inspired by the focal loss (Lin et al., 2017), we add a multiplicative factor to the regularization term so that the weight is dynamically adjusted according to the prediction and pseudo-label to reduce the undesired errors.

To examine our method and eliminate the need for manual annotations, we applied the proposed method to train a CNN-based whistle extractor on two different sets of pseudo-labels, generated by two different whistle extraction algorithms selected from the existing literature.

## II. METHODS

### A. Dataset

We used the acoustic data from the Detection, Classification, Localization, and Density Estimation (DCLDE) workshop (DCLDE Organizing Committee, 2011) for model training and evaluation. This dataset consists of approximately 32 h of recordings collected for five species of odontocetes: bottlenose dolphins (*T. truncatus*), long- and short-beaked common dolphins (*D. capensis*, *Delphinus delphis*), melon-headed whales (*Peponocephala electra*), and spinner dolphins (*Stenella longirostris*). Two types of hydrophones were used to collect the data: ITC 1042 (International Transducer Corp., Santa Barbara, CA) and HS 150 (Sonar Research and Development Ltd., Beverly, UK) hydrophones. The hydrophones were towed by the R/V David Starr Jordan, mounted to the stationary platform R/P FLIP (Fisher and Spiess, 1963), and deployed from small boats. The deployment depths of the hydrophones were 10–30 m. The acoustic signals were sampled at 192 kHz with 16- or 24-bit quantization.

#### 1. Data preparation

As in our previous work (Li et al., 2020), we transformed the acoustic data into log-magnitude spectrograms before using them to generate pseudo-labels or as input to a trained whistle extraction model. Discrete Fourier transforms (DFTs) were performed on 8 ms Hamming-windowed frames (125 Hz bandwidth) every 2 ms. These parameters were empirically set in Roch et al. (2011) as a trade-off between frequency and time resolution. Longer analysis windows with better frequency resolution tend to blur rapidly changing whistle signals. Examples of whistle signals from the five species are shown in Fig. 9 in the Appendix. We empirically clamped the log10-magnitude spectrogram values to the range [0, 6], which corresponds to an uncalibrated intensity range of 0–120 dB, and then normalized them to the range [0, 1]. We limited the spectrogram to the frequency range of 5–50 kHz (361 frequency bins), which covers most delphinid whistles and their harmonics. The spectrograms were divided into 3-s long nonoverlapping segments for model training and evaluation. The spectrogram segments from training datasets were further divided into patches of size 64 (128 ms) × 64 (8 kHz) before model training.
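The clamping and scaling step can be written in a few lines of numpy; the function name and array layout below are our own illustration, not the original implementation:

```python
import numpy as np

def normalize_spectrogram(log_mag: np.ndarray) -> np.ndarray:
    """Clamp a log10-magnitude spectrogram to [0, 6] and rescale to [0, 1].

    The [0, 6] clamp corresponds to an uncalibrated intensity range of
    0-120 dB, as described in the text.
    """
    clamped = np.clip(log_mag, 0.0, 6.0)
    return clamped / 6.0
```

A normalized 3-s segment would then be split into nonoverlapping 64 × 64 patches for training, as described above.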

#### 2. Training dataset

We used two nonoverlapping subsets of the DCLDE data for model training. First, we used a labeled subset for our supervised training experiments. The DCLDE dataset provides detailed t-f annotations of whistles for 45 recordings with a total duration of approximately 3 h. Annotations were previously produced for these data using an interactive software tool that let expert analysts trace whistles by placing control knots of cubic splines, the details of which can be found in Roch et al. (2011). Each recording contains a single species based on visual identification. Among these annotated recordings, we chose 30 recordings that were not used for evaluation in Roch et al. (2011) as our "labeled dataset." These audio files contain 127 min of odontocete calls and include 12 539 annotated whistles. Pseudo-label experiments used a larger unlabeled subset of the DCLDE data. These data consist of 348 recordings with a total duration of around 29 h. This set of data is referred to as our "unlabeled dataset."

#### 3. Evaluation dataset

We used a subset of annotated acoustic data from the DCLDE workshop 2011 for evaluation. This subset consists of 12 audio files for bottlenose dolphins, long-beaked common dolphins, melon-headed whales, and spinner dolphins. The total duration of these recordings was around 43 min, and the t-f coordinates of 6011 whistles were annotated by analysts. All of these files were used for evaluation in the work of Roch et al. (2011). We did not use the recordings of short-beaked common dolphins because of annotation errors. The number of annotated whistles per species that we expected to retrieve is summarized in Table I. The criteria for which whistles were expected to be retrieved are detailed in the metrics section (Sec. II E), and the specific files are summarized in Table III in the Appendix.

TABLE I. Number of annotated whistles expected to be retrieved and total duration per species in the evaluation dataset.

| Species | Number of whistles | Duration (s) |
|---|---|---|
| Bottlenose dolphin | 354 | 652.2 |
| Long-beaked common dolphin | 557 | 833.9 |
| Melon-headed whale | 338 | 680.7 |
| Spinner dolphin | 686 | 425.9 |


### B. Pseudo-label generation

We use the spectral peak detection and graph search algorithm implemented in *Silbido*^{1} (Roch et al., 2011) to extract whistles from the unlabeled dataset. The spectral peak detection algorithm smooths the spectrograms with a median filter over each 3 × 3 t-f grid. Then it subtracts the mean value over a 3-s window in each frequency bin. If a t-f bin has a signal-to-noise ratio (SNR) larger than 10 dB and no other bin within ±250 Hz has a larger magnitude, it is considered a spectral peak. Next, the graph search algorithm manages the candidate detections with sets of graphs. Each graph depicts one or more candidate whistle contours in which a sequence of spectral peaks is connected. Each spectral peak either starts a new graph or is added to existing graphs. Peaks are added to existing graphs if they are a good fit to adaptive polynomial predictions of graph trajectories and are otherwise used to seed new graphs. The polynomial order is driven by goodness of fit as measured by the adjusted *R*^{2} coefficient (Dillon and Goldstein, 1984), and spectral peaks are merged into an existing graph when they are within 50 ms of the last end point in the graph and within 1000 Hz of the fitted polynomial curve. Graph state is maintained across 3-s blocks, permitting graphs to represent spectral peaks from whistles that cross processing blocks. Once a graph is no longer eligible to incorporate additional spectral peaks, whistles are extracted from the graph. When interior nodes have more than a pair of edges, the rate of change on both sides is examined to determine if multiple whistles crossed the interior node. We remove detected whistles that are shorter than 150 ms as per Roch et al. (2011).

To further examine our method on pseudo-labels with different characteristics, we used the sequential Monte Carlo probability hypothesis density (SMC-PHD) whistle extractor^{2} by Gruden and White (2020) to generate the second set of pseudo-labels. Briefly, the algorithm uses a computationally tractable approximation of the multi-target Bayes filter to track whistle contours based on spectral peaks from preprocessed spectrograms. Preprocessing of spectrograms is based on established methods (Gillespie et al., 2013; Gruden and White, 2016) to reduce noise and interfering signals. If t-f bins have magnitudes larger than 8 dB on the normalized spectrogram and are within the frequency range of 2–50 kHz, they are considered spectral peaks. These peaks are used as measurements for the SMC-PHD algorithm to track whistles. The SMC-PHD filter is a recursive filter that propagates the first-order moment of the multi-target posterior [called the probability hypothesis density (PHD)] in time through prediction and update steps. The PHD function at each time step is approximated by a cloud of weighted particles. Particle locations and weights are predicted and updated according to sequential Monte Carlo principles and the PHD equations, respectively. The SMC-PHD implementation of Gruden and White (2020) used in this work employs a trained radial basis function (RBF) network to estimate the particle locations in the prediction step. The training data consist of 3 min of recording and 185 annotated whistles, and these data are not included in our training or evaluation dataset. Consequently, this second algorithm requires some analyst-annotated training data, but the amount is trivial compared to the requirements of deep learning algorithms. New whistles are introduced to the filter through a birth model that incorporates measurements and priors based on training data. Additionally, the filter incorporates false alarms and missed detections in the problem formulation. At each time step, whistle states (representing whistle contour peaks) are estimated and their identities tracked based on labeled particles, as outlined in Gruden and White (2020).

Irrespective of the whistle extraction algorithm, we generate bin-wise pseudo-labels for each 3-s spectrogram segment. The pseudo-label is initialized as a zero matrix of the same size as the spectrogram segment. We draw the whistle contour on the matrix with the cv2.polylines() method in the Python OpenCV library (Bradski, 2000). The thickness of the polyline is empirically set to two. The pseudo-labels take the value one at whistle bins and zero at background bins. Similar to the training spectrograms described in Sec. II A 1, the pseudo-labels are divided into 64 × 64 patches that match the spectrogram patches for model training. If the pseudo-label marks at least one t-f bin in the patch as containing whistle energy, we consider this patch to be a "positive patch." Otherwise, the patch is considered to be a "negative patch." As there are many more negative patches than positive patches, we balance the training dataset by randomly selecting the same number of negative patches as positive patches for model training.
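The rasterization step can be approximated without OpenCV; the sketch below is a numpy-only stand-in for the cv2.polylines() call described above, with our own densification and dilation scheme replacing OpenCV's line drawing (so the exact set of marked bins may differ slightly):

```python
import numpy as np

def rasterize_contour(shape, points, thickness=2):
    """Rasterize a whistle contour as a binary pseudo-label mask.

    `shape` is (freq_bins, time_frames); `points` is a sequence of
    (time_idx, freq_idx) vertices tracing the contour.  Each segment
    is sampled densely and dilated to roughly the given thickness.
    """
    mask = np.zeros(shape, dtype=np.float32)
    half = thickness // 2
    for (t0, f0), (t1, f1) in zip(points[:-1], points[1:]):
        n = max(abs(t1 - t0), abs(f1 - f0)) + 1
        for s in np.linspace(0.0, 1.0, n * 2):
            t = int(round(t0 + s * (t1 - t0)))
            f = int(round(f0 + s * (f1 - f0)))
            mask[max(0, f - half):f + half + 1,
                 max(0, t - half):t + half + 1] = 1.0  # 1 = whistle, 0 = background
    return mask
```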

### C. CNN-based whistle extraction

We use the deep whistle contour (DWC) detector^{3} implemented by Li et al. (2020) as our model for whistle extraction (Fig. 1). First, a CNN model, the whistle extraction network, takes a spectrogram as input and predicts a confidence map of the same size as the input spectrogram. The confidence within each t-f bin indicates the probability that the bin contains whistle energy. Bins of the confidence map are labeled as peaks when the probability of attribution to whistle energy is larger than 0.5 and the bin is a local maximum along the frequency axis. Whistle contours are then produced from the set of peaks using a modified version of the graph search method summarized in Sec. II B.

The whistle extraction network is trained with a baseline loss function $L_{\mathrm{base}}$, the *L*2-norm of the difference between the label $y$ and the predicted confidence map $\hat{y}$,

$$L_{\mathrm{base}} = \lVert y - \hat{y} \rVert_2. \qquad (1)$$

This baseline loss function encourages the CNN model to predict the value of the pseudo-label. We train the model for $1 \times 10^{6}$ iterations (around 88 epochs). The learning rate is initially 0.001 and is multiplied by 0.1 every 400 000 iterations. The other training hyperparameters and graph search parameters are the same as those in the implementation of Li et al. (2020). Tuning of these hyperparameters is beyond the scope of this paper, but we empirically found that this set of parameters worked well in our experiments.
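The step-decay schedule described above amounts to a one-line helper (the function name is ours, for illustration):

```python
def learning_rate(iteration: int, base_lr: float = 1e-3,
                  decay: float = 0.1, step: int = 400_000) -> float:
    """Step-decay schedule: start at 1e-3 and multiply by 0.1
    every 400 000 iterations."""
    return base_lr * decay ** (iteration // step)
```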

### D. Pseudo-label learning

Let us consider the errors in the pseudo-labels. Figure 2 shows two typical examples of whistles extracted by graph search. The extracted whistles typically have high bin-wise precision but low bin-wise recall, i.e., the extracted contours mostly cover t-f bins that contain whistles, but a significant number of whistle t-f bins are missed.
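The regularized loss functions themselves [Eqs. (2) and (3)] are not reproduced in this excerpt. The sketch below is therefore only one plausible form, assuming that $L_{\mathrm{recall}}$ adds a focal-style penalty on foreground bins (pseudo-label equal to one) whose weight $(1-\hat{y})^{\gamma}$ grows when whistle energy is under-predicted; the function and variable names are hypothetical:

```python
import numpy as np

def l_recall(y_pred, y_pseudo, lam=4.0, gamma=1.0):
    """Hypothetical recall-oriented loss: baseline squared-error term
    plus a focal-style penalty on foreground bins that discourages
    false negative predictions.  `lam` and `gamma` play the roles of
    the lambda and gamma hyperparameters discussed in the text."""
    base = np.mean((y_pred - y_pseudo) ** 2)
    fg = y_pseudo > 0.5  # foreground bins marked by the pseudo-label
    if not fg.any():
        return base
    penalty = np.mean((1.0 - y_pred[fg]) ** gamma
                      * (y_pseudo[fg] - y_pred[fg]) ** 2)
    return base + lam * penalty
```

Under this form, under-predicting foreground energy is penalized more heavily than over-predicting background, which pushes the model toward higher recall, matching the behavior reported in the experiments.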

### E. Metrics

We evaluate the model performance using the evaluation code in *Silbido*. The evaluation starts with a matching process between detections and ground truth labels. If detected and annotated whistles overlap in time for more than 30% of the annotated whistle and the mean frequency difference is less than 350 Hz, this pair of detected and annotated whistles is considered a match. As in Roch et al. (2011) and Li et al. (2020), we only consider analyst-annotated whistles with a duration ≥150 ms and an SNR ≥ 10 dB over at least one-third of the whistle. Any annotated whistles that did not meet these criteria were omitted from the analysis. Detections that matched discarded ground truth annotations were neither counted toward nor against performance.
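The matching rule can be sketched as follows; the tuple representation is a simplification we introduce for illustration (the real evaluation compares the frequency tracks bin by bin rather than via a single mean frequency):

```python
def is_match(det, ann, min_overlap=0.3, max_freq_dev=350.0):
    """Sketch of the matching rule described above: a detection matches
    an annotation if they overlap in time for more than 30% of the
    annotation's duration and the mean frequency difference is below
    350 Hz.  `det` and `ann` are (t_start, t_end, mean_freq_hz) tuples.
    """
    d0, d1, df = det
    a0, a1, af = ann
    overlap = max(0.0, min(d1, a1) - max(d0, a0))
    if (a1 - a0) <= 0:
        return False
    return overlap / (a1 - a0) > min_overlap and abs(df - af) < max_freq_dev
```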

After matching, *Silbido* provides five metrics for whistle extraction performance. Precision is the percentage of detections that match ground truth whistles. Recall is the percentage of ground truth whistles that are detected. The next three metrics are designed to measure the quality of the detections. Deviation is the average frequency deviation of the detected whistle from its matched annotation. For annotated whistles that are matched to detections, coverage is the percentage of those whistles' durations that was detected. Finally, the fragmentation metric is the average number of detections matched to the same ground truth whistle and indicates how often a whistle is split into multiple segments during detection. Ideally, one would have detections with zero deviation, 100% coverage, and a fragmentation score of one.

We calculate the *F*1-score, the harmonic mean of precision and recall, as an overall metric of extraction performance. We evaluate our model on each species independently and report the performance averaged over species.
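For reference, the overall score reduces to the following; the helper names are ours, and the per-species averaging mirrors the evaluation protocol described above:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same units,
    e.g., percentages or fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def species_averaged_f1(per_species) -> float:
    """Average the per-species F1-scores; `per_species` is a list of
    (precision, recall) pairs, one per species."""
    return sum(f1_score(p, r) for p, r in per_species) / len(per_species)
```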

## III. EXPERIMENTS

### A. Pseudo-label generated by graph search

We designed a series of experiments to validate our proposed methods. First, we extracted whistles with the spectral peak detection and graph search method in *Silbido*; this experiment is referred to as "graph search." Next, we trained two models with the $L_{\mathrm{base}}$ loss function [Eq. (1)]. The first of these models used the analyst annotations and is referred to as "$L_{\mathrm{base}}$-annotation." The second model, "$L_{\mathrm{base}}$-graph," used the same loss function but was trained with *Silbido* graph search generated pseudo-labels from the larger unlabeled dataset.

To examine the effectiveness of the proposed regularization penalties $L_{\mathrm{recall}}$ and $L_{\mathrm{prec}}$, we trained additional models on the unannotated data using graph search labels. Experiments using these models are denoted "$L_{\mathrm{recall}}$-graph" and "$L_{\mathrm{prec}}$-graph." Various values of $\lambda$ and $\gamma$ were empirically explored to find the best parameter setting on our dataset. Specifically, we used values of 0, 1, 2, or 4 for the exponent parameter $\gamma$ in $L_{\mathrm{recall}}$ and $L_{\mathrm{prec}}$. For each fixed value of $\gamma$, we varied $\lambda$ until we found a peak *F*1-score. In the $L_{\mathrm{recall}}$ experiments, we used $\lambda \in$ {0.5, 1, 2, 3, 4}, {2, 4, 6, 8}, {4, 6, 8, 10}, and {4, 6, 8, 10, 15, 20, 25, 30} for $\gamma =$ 0, 1, 2, and 4, respectively. We explored larger values of $\lambda$ for larger $\gamma$ in $L_{\mathrm{recall}}$ because a larger $\gamma$ lowers the weight of the regularization term. For $L_{\mathrm{prec}}$, we used the same set of $\gamma$ values and $\lambda \in$ {0.01, 0.1}, as we observed that larger $\lambda$ resulted in lower *F*1-scores and experiments with $\lambda = 0$ (removal of the regularization term) had the best *F*1-scores.
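The sweep described above (fixing $\gamma$ and scanning $\lambda$ for a peak *F*1-score) can be sketched as a simple search loop; `evaluate` is a hypothetical callback that would train and score a model for a given $\lambda$:

```python
def find_peak_lambda(lambdas, evaluate):
    """Walk candidate lambda values in order and keep the one whose
    F1-score (returned by the `evaluate` callback) is highest."""
    best_lam, best_f1 = None, float("-inf")
    for lam in lambdas:
        f1 = evaluate(lam)
        if f1 > best_f1:
            best_lam, best_f1 = lam, f1
    return best_lam, best_f1
```

In practice each `evaluate` call is a full training run, so the candidate sets above were kept small and extended only when the peak had not yet been bracketed.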

### B. Pseudo-labels generated by SMC-PHD

To further validate our proposed method, we substituted an alternative whistle extraction method to generate a different set of pseudo-labels. We used SMC-PHD (Gruden and White, 2020) to extract whistles; the extraction result is referred to as "SMCPHD." We used the RBF motion model, which requires a modest amount of analyst-annotated training data: Gruden and White used a small training set of 185 whistles from several minutes of annotated data that do not overlap with our test data. Consequently, this method is not entirely free of analyst annotations. As our method and graph search discarded detections shorter than 150 ms, we created a second set of pseudo-labels in which only detections of at least 150 ms were retained: "SMCPHD ≥ 150 ms."

We trained the whistle extraction model with the $L_{\mathrm{base}}$ loss function on these two sets of pseudo-labels; these experiments are hereafter referred to as "$L_{\mathrm{base}}$-SMCPHD" and "$L_{\mathrm{base}}$-SMCPHD ≥ 150 ms." As SMC-PHD shares with graph search the tendency to produce more false negatives than false positives, we trained models using the $L_{\mathrm{recall}}$ loss function and varied the values of $\gamma$ and $\lambda$ on these two sets of pseudo-labels; these experiments are referred to as "$L_{\mathrm{recall}}$-SMCPHD" and "$L_{\mathrm{recall}}$-SMCPHD ≥ 150 ms," respectively. We used $\gamma = 0, 1$ for $L_{\mathrm{recall}}$. Specifically, we used $\lambda \in$ {1, 2, 3, 4, 5, 6} for $\gamma = 0$ and $\lambda \in$ {2, 4, 6, 8, 10, 11, 12, 14, 16} for $\gamma = 1$ for $L_{\mathrm{recall}}$-SMCPHD ≥ 150 ms. For the experiment that did not discard short detections, $L_{\mathrm{recall}}$-SMCPHD, we used $\lambda \in$ {1, 2, 3, 4} for $\gamma = 0$ and $\lambda \in$ {2, 3, 4, 5, 6, 8, 10} for $\gamma = 1$. We did not run experiments with the $L_{\mathrm{prec}}$ loss function because SMC-PHD exhibits similar patterns of error in its pseudo-labels.

## IV. RESULTS

### A. Comparison between the proposed method and baselines

We summarize the performance of our method and several baselines in Table II. There are three types of baselines: (i) the model trained with analyst annotations, (ii) the algorithms used to produce the pseudo-labels, and (iii) the models trained with pseudo-labels and $L_{\mathrm{base}}$. We also show the best performance using $L_{\mathrm{recall}}$ and pseudo-labels. Experiments using the $L_{\mathrm{prec}}$ loss are included in the ablation study of Sec. IV B.

TABLE II. Whistle extraction performance of our method and baselines.

| Method | F1 (%) | Precision (%) | Recall (%) | $\mu_\sigma$ (Hz) | Coverage (%) | Fragmentation (detected segments/whistle) |
|---|---|---|---|---|---|---|
| $L_{\mathrm{base}}$-annotation | 87.47 | 89.50 | 85.93 | 92.00 | 88.08 | 1.13 |
| Graph search | 75.95 | 81.13 | 72.28 | 101.00 | 81.05 | 1.23 |
| SMCPHD | 83.40 | 76.55 | 92.45 | 108.00 | 70.93 | 1.80 |
| SMCPHD ≥ 150 ms | 74.38 | 95.85 | 60.88 | 103.50 | 72.88 | 1.23 |
| $L_{\mathrm{base}}$-graph | 74.14 | 94.15 | 61.53 | 144.75 | 77.50 | 1.18 |
| $L_{\mathrm{base}}$-SMCPHD | 82.34 | 94.98 | 72.75 | 122.75 | 81.75 | 1.18 |
| $L_{\mathrm{base}}$-SMCPHD ≥ 150 ms | 70.97 | 98.45 | 56.08 | 119.75 | 73.70 | 1.20 |
| $L_{\mathrm{recall}}$-graph | 86.31 | 89.55 | 83.33 | 154.25 | 86.73 | 1.18 |
| $L_{\mathrm{recall}}$-SMCPHD | 86.42 | 87.18 | 85.85 | 134.25 | 87.30 | 1.15 |
| $L_{\mathrm{recall}}$-SMCPHD ≥ 150 ms | 87.20 | 88.78 | 85.78 | 134.75 | 87.30 | 1.18 |


The network model trained using analyst annotations and the baseline loss function, "$L_{\mathrm{base}}$-annotation," achieves a whistle extraction recall of 85.93% and a precision of 89.50%. This is the target performance that we wish to achieve using pseudo-labels. By applying graph search on spectral peaks, graph search detects 72.28% (recall) of the analyst-annotated whistles with a precision of 81.13% (*F*1, 75.95%). As a competitive baseline, SMC-PHD detects 92.45% (recall) of annotated whistles with a precision of 76.55% (*F*1, 83.40%). After removing the detections shorter than 150 ms, SMC-PHD achieves a precision of 95.85% while the recall drops to 60.88% (*F*1, 74.38%), indicating that SMC-PHD does not retrieve longer whistles as reliably.

Replacing the analyst-annotated training data with pseudo-labels generated by graph search, $L_{base}$-graph, extracts whistles with a recall of 61.53% and a precision of 94.15% (*F*1, 74.14%). When we trained the model with SMC-PHD detections and $L_{base}$, we obtained a recall of 72.75% and a precision of 94.98% (*F*1, 82.34%). Removing shorter detections resulted in a recall of 56.08% and a precision of 98.45% (*F*1, 70.97%). These models achieved lower *F*1-scores than the corresponding methods used to generate the pseudo-labels.

In contrast, our modified loss function [Eq. (3)] leads to an *F*1-score of 87.20% with $L_{recall}$-SMC-PHD ≥ 150 ms, which is almost identical to the model trained with a large analyst-annotated dataset (*F*1-score of 87.47%). Additionally, we observe relative improvements in coverage of 6.8% to 18.5% when we train models with $L_{recall}$ as compared to models using the $L_{base}$ loss function, and detected whistles are less fragmented in the $L_{recall}$ experiments. Together, these observations show that models trained with $L_{recall}$ correctly predict more t-f bins as whistles. Finally, although we observe a higher mean frequency deviation in the pseudo-label experiments than for graph search and SMC-PHD, the increase is less than one frequency bin width (125 Hz) on our spectrogram, which makes it negligible for subsequent applications that use the detected tonals. Details of the performance on each species are summarized in Tables IV–VII in the Appendix.
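To make the role of the regularized loss concrete, the following is a minimal sketch of a recall-oriented loss in the spirit of Eq. (3): a pixel-wise binary cross-entropy baseline plus a term, weighted by $\lambda$, that rewards confidence above a threshold $\gamma$ on bins the pseudo-labels mark as noise (candidate missed whistles). The exact functional form of Eq. (3) is not reproduced in this section, so the function bodies below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def l_base(pred, pseudo, eps=1e-7):
    """Baseline pixel-wise binary cross-entropy against the pseudo-label map."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-(pseudo * np.log(pred) + (1.0 - pseudo) * np.log(1.0 - pred)).mean())

def l_recall(pred, pseudo, lam=2.0, gamma=0.0):
    """Recall-oriented loss (illustrative): baseline BCE minus a reward,
    weighted by `lam`, for confidence above `gamma` on pseudo-negative bins.
    This softens the penalty on likely false-negative pseudo-labels."""
    neg = pseudo == 0
    reward = float(np.maximum(pred[neg] - gamma, 0.0).mean()) if neg.any() else 0.0
    return l_base(pred, pseudo) - lam * reward
```

A precision-oriented counterpart ($L_{prec}$) would instead discount confident predictions on pseudo-positive bins, tolerating false positives in the labels.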

### B. Ablation study on pseudo-label generated by graph search

We designed an ablation study over the loss functions ($L_{base}$, $L_{recall}$, $L_{prec}$) and hyperparameter choices as stated in Sec. III. Pseudo-labels were generated by graph search. We present the experimental results in Fig. 5, where each point reports the performance of one whistle extraction method. Points joined into a curve show the results for a fixed $\gamma$ as $\lambda$ is varied, as specified in Sec. III. The curves of $L_{recall}$ show a tendency toward increased recall and decreased precision as $\lambda$ is increased. In contrast, larger $\lambda$ in $L_{prec}$ leads to higher precision but lower recall. Detailed performance is given in Tables VIII and IX in the Appendix.

For models trained with $L_{recall}$, we observe a significant increase in recall (>21% in the best case) compared to $L_{base}$-graph while still achieving a reasonable precision (89.6%). For models trained with pseudo-labels and the $L_{prec}$ loss, precision increases slightly over the unaltered loss function, $L_{base}$, when we apply a larger value of $\lambda$ (0.01), but recall decreases slightly. In comparison to graph search, precision is greatly increased at the cost of a significant loss in recall. We observe that increasing $\lambda$ further decreases the *F*1-score.

These comparisons show that the new loss terms, $L_{recall}$ and $L_{prec}$, can increase recall or precision, respectively, relative to the algorithm that generated the pseudo-labels. Models trained with $L_{prec}$ yield *F*1-scores comparable to $L_{base}$-graph, while $L_{recall}$ results in a significant *F*1-score increase (12.17%). The $L_{recall}$-graph model produces results similar to those obtained with human analyst training data: its *F*1-score of 86.31% ($\lambda = 2$, $\gamma = 0$) approaches the $L_{base}$-annotation *F*1-score of 87.47%.

### C. Ablation study on pseudo-labels generated by SMC-PHD

We perform a similar ablation study with models trained on SMC-PHD pseudo-labels. The precision-recall performance is shown in Fig. 6. We observe a similar trend in the curves of $L_{recall}$, where increasing $\lambda$ results in increased recall and decreased precision, and we achieve the best *F*1-score using the SMC-PHD ≥ 150 ms pseudo-labels when $\lambda = 14$ and $\gamma = 1$. The maximum *F*1-score is 87.20% in the $L_{recall}$-SMC-PHD ≥ 150 ms experiment, a significant increase compared to $L_{base}$-SMC-PHD ≥ 150 ms, $L_{base}$-SMC-PHD, SMC-PHD ≥ 150 ms, and SMC-PHD. Detailed performance is given in Tables X and XI in the Appendix.

### D. Visualization of model output and whistle extraction result

We show examples of network output and extracted whistles produced by our algorithms in Figs. 7 and 8. We compare typical network outputs (confidence maps) of models trained with $L_{recall}$ and $L_{base}$ in Fig. 7. The model trained with $L_{recall}$ had a higher response to whistle energy and produced more continuous coverage of whistles than the model trained with $L_{base}$. When the confidence map predictions are processed by *Silbido*'s graph search algorithm (and likely many other whistle extraction algorithms), this results in more and longer extracted whistles (Fig. 8). More examples of whistle extraction results are displayed in Fig. 10 in the Appendix.
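The hand-off from confidence map to contours can be sketched as follows. This is a deliberately simplified stand-in for *Silbido*'s graph search (which also handles crossing whistles and candidate graphs): threshold the map, take the peak frequency bin per time frame, and greedily link peaks across adjacent frames. The function name and parameters are hypothetical:

```python
import numpy as np

def extract_tracks(conf, thresh=0.5, max_gap_bins=2):
    """Greedy stand-in for a graph-search tracker over a confidence map
    `conf` of shape (freq_bins, time_frames): keep the per-frame peak bin
    where confidence exceeds `thresh`, then link peaks in adjacent frames
    whose bins differ by at most `max_gap_bins`."""
    tracks, active = [], None
    for t in range(conf.shape[1]):
        col = conf[:, t]
        f = int(np.argmax(col))
        if col[f] >= thresh:
            if active is not None and abs(f - active[-1][1]) <= max_gap_bins:
                active.append((t, f))        # extend the current contour
            else:
                if active is not None:
                    tracks.append(active)    # close contour on a frequency jump
                active = [(t, f)]
        else:
            if active is not None:
                tracks.append(active)        # close contour on low confidence
            active = None
    if active is not None:
        tracks.append(active)
    return tracks
```

A more confident, more continuous confidence map (as produced by $L_{recall}$) yields longer, less fragmented tracks under this kind of linking, which is the behavior seen in Fig. 8.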

## V. DISCUSSION

Our results show that it is feasible to train competitive deep-learning-based models for whistle extraction without using analyst annotations as training data. With pseudo-labels generated by graph search and the proposed loss function, we extract whistles with an *F*1-score of 86.31%, which significantly surpasses graph search itself (*F*1-score of 75.95%). With pseudo-labels generated by SMC-PHD, which uses 185 annotated whistles for training, we further improve whistle extraction performance to an *F*1-score of 87.20%, close to that of the whistle extractor trained with 12 539 annotated whistles. By using different whistle extractors to generate pseudo-labels, the proposed method can eliminate or greatly reduce the human analyst annotation effort.

Comparing the algorithms used to generate pseudo-labels against models trained on those labels with the $L_{base}$ loss shows that the baseline-loss models are fairly accurate in their detections, but they miss many whistles that the pseudo-label-generating algorithm would have detected. We also observe that SMC-PHD tends to extract more short whistle contours than graph search, and its longer detections are more likely to be correct. Adjusting the time threshold for SMC-PHD might achieve a better *F*1-score on our evaluation datasets, but this experiment is beyond the scope of this paper.

The $L_{recall}$ loss showed strong *F*1 performance gains over the algorithms used to produce the pseudo-labels used to train the CNNs. We observe that the system improves whistle coverage and reduces fragmentation in both quantitative (Table II) and qualitative results (Figs. 7 and 8). The longer whistle detections with fewer gaps may better facilitate downstream research, e.g., species identification. The $L_{prec}$ loss, which was only tested on labels generated by an algorithm that tends to produce more false negatives than false positives, provided gains in precision at the cost of significant drops in recall on these data. We suspect that it would fare better on pseudo-label sets with higher false positive rates. In contrast, models trained using the baseline loss function, $L_{base}$, were unable to produce *F*1-scores that exceeded the performance of the algorithms used to produce the pseudo-labels. When some labels are present (e.g., evaluation data), it is relatively simple to score the pseudo-labels and determine whether they are more likely to contain false positives or false negatives. When such labels are not available, manual inspection of the pseudo-labels can provide intuition about which type of error is more prevalent.

There were likely more false negatives than false positives in our pseudo-labels for both label generation methods. We observed a higher precision than recall for graph search and SMC-PHD ≥ 150 ms. Although SMC-PHD has higher recall (92.45%) than precision (76.55%), its coverage was around 71%, suggesting that roughly 29% of the t-f bins in whistles were not detected and that t-f bin-level recall was lower. Furthermore, because we balanced the number of negative and positive patches in the training dataset, and false negative patches covered only a small portion of the negative patches, many false negatives were excluded from training. While the proposed $L_{recall}$ and $L_{prec}$ were effective in improving whistle extraction recall or precision, respectively, $L_{recall}$ increased the *F*1-score significantly more than $L_{prec}$. This observation also indicates that false negatives (missed whistles) in the pseudo-labels affect our whistle extraction model more than false positives.
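The patch balancing described above, which incidentally excludes many false-negative patches from training, can be sketched as follows (the helper name and sampling details are illustrative assumptions; the paper's exact procedure is given in its methods section):

```python
import random

def balance_patches(pos_patches, neg_patches, seed=0):
    """Subsample the majority class so whistle (positive) and noise (negative)
    spectrogram patches are equally represented in the training set. Because
    negatives typically far outnumber positives, most negative patches --
    including many mislabeled (false-negative) ones -- are discarded."""
    rng = random.Random(seed)
    n = min(len(pos_patches), len(neg_patches))
    return rng.sample(pos_patches, n) + rng.sample(neg_patches, n)
```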

The pseudo-label regularization terms allowed trained models to outperform the algorithms used to produce the pseudo-labels. Because some whistle t-f nodes are labeled as noise in the pseudo-labels (e.g., Fig. 3), learning without regularization produced models that had high precision but sacrificed the ability to produce high confidence map scores for many whistles, resulting in low recall. The regularization in $L_{recall}$-graph sacrifices some of the precision attained with $L_{base}$-graph but compensates with much higher recall, leading to an overall superior *F*1-score (Table II). Similar trends are observed with pseudo-labels generated by SMC-PHD.

## VI. CONCLUSION

We have developed a convolutional DNN that can be trained without any analyst annotations and extracts whistles with performance comparable to a model trained on a rich set of analyst annotations. Instead of using the expensive and time-consuming annotations produced by analysts, we used methods that required no training data (graph search) or minimal training data (SMC-PHD) to generate whistle annotations used as pseudo-labels for model training. We evaluated extraction performance on a diverse four-species evaluation dataset consisting of 1935 analyst-annotated whistles (duration ≥ 150 ms; ≥ 1/3 of the t-f bins with an SNR ≥ 10 dB). A baseline CNN model using a standard loss function ($L_{base}$) produced *F*1-scores comparable to those of the algorithms used to produce the pseudo-labels, but with a tendency to increase precision at a nontrivial cost to recall.

The proposed loss functions significantly improve whistle extraction performance. Regularization penalties compensated for errors in the pseudo-labels and prioritized recall [$L_{recall}$, Eq. (3)] or precision [$L_{prec}$, Eq. (5)]. Our experiments demonstrated that missed whistles in the pseudo-labels affect the CNN model more than incorrectly detected whistles, and the proposed $L_{recall}$ loss function outperformed $L_{base}$ with an absolute *F*1-score increase of 12.17% (graph search pseudo-labels) and 3.8% (SMC-PHD pseudo-labels). In the best case, a model trained without any analyst annotations, using SMC-PHD detections of at least 150 ms duration, detected 85.78% of the whistles with a precision of 88.78%. Its *F*1-score (87.20%) was comparable to that of a model trained with 12 539 annotated whistles (87.47%), showing the potential to build whistle extraction models with near state-of-the-art performance and little to no human annotation effort.

## ACKNOWLEDGMENTS

We thank the DCLDE organizers for providing the DCLDE 2011 dataset used in this work, as well as the numerous crews and science staff responsible for hardware development and deployment, visual observations, and annotation that resulted in these public datasets. Our thanks to Dr. Michael Weise, Office of Naval Research, for financial support (Grant No. N00014-21-1-2567). We also thank the anonymous reviewers who contributed insightful suggestions that improved this manuscript.

### APPENDIX

See Figs. 9 and 10 as well as Tables III–XI for additional details of the experiments conducted in this study.

| Species | DCLDE 2011 Files |
|---|---|
| Bottlenose dolphin (*T. truncatus*) | Qx-Tt-SCI0608-N1-060814-121518 |
| | palmyra092007FS192-070924-205305 |
| | palmyra092007FS192-070924-205730 |
| Long-beaked common dolphin (*D. capensis*) | Qx-Dc-CC0411-TAT11-CH2-041114-154040-s |
| | QX-Dc-FLIP0610-VLA-061015-165000 |
| Melon-headed whale (*P. electra*) | QX-Dc-FLIP0610-VLA-061015-165000 |
| | palmyra092007FS192-071004-032342 |
| | palmyra102006-061020-204327_4 |
| Spinner dolphin (*S. longirostris*) | palmyra092007FS192-070927-224737 |
| | palmyra102006-061103-213127_4 |


| Method | F1 (%) | Precision (%) | Recall (%) | Mean freq. deviation (Hz) | Coverage (%) | Fragmentation (detected segments/whistle) |
|---|---|---|---|---|---|---|
| $L_{base}$-annotation | 90.64 | 96.30 | 85.60 | 98.00 | 85.80 | 1.20 |
| Graph search | 88.06 | 92.30 | 84.20 | 112.00 | 79.30 | 1.20 |
| SMC-PHD | 90.52 | 85.80 | 95.80 | 122.00 | 72.60 | 2.00 |
| SMC-PHD ≥ 150 ms | 77.05 | 97.70 | 63.60 | 116.00 | 74.10 | 1.30 |
| $L_{base}$-graph | 69.17 | 96.20 | 54.00 | 170.00 | 72.30 | 1.40 |
| $L_{base}$-SMC-PHD | 70.65 | 97.50 | 55.40 | 150.00 | 72.70 | 1.40 |
| $L_{base}$-SMC-PHD ≥ 150 ms | 83.38 | 96.50 | 73.40 | 148.00 | 81.50 | 1.30 |
| $L_{recall}$-graph | 91.49 | 92.60 | 90.40 | 188.00 | 86.90 | 1.20 |
| $L_{recall}$-SMC-PHD | 91.32 | 92.90 | 89.80 | 159.00 | 87.20 | 1.20 |
| $L_{recall}$-SMC-PHD ≥ 150 ms | 91.90 | 94.10 | 89.80 | 159.00 | 86.80 | 1.20 |


| Method | F1 (%) | Precision (%) | Recall (%) | Mean freq. deviation (Hz) | Coverage (%) | Fragmentation (detected segments/whistle) |
|---|---|---|---|---|---|---|
| $L_{base}$-annotation | 83.98 | 89.90 | 78.80 | 96.00 | 87.50 | 1.10 |
| Graph search | 60.20 | 75.40 | 50.10 | 75.00 | 83.90 | 1.30 |
| SMC-PHD | 73.60 | 62.90 | 88.70 | 103.00 | 69.80 | 1.70 |
| SMC-PHD ≥ 150 ms | 70.27 | 93.20 | 56.40 | 96.00 | 73.80 | 1.20 |
| $L_{base}$-graph | 63.14 | 84.80 | 50.30 | 146.00 | 79.40 | 1.10 |
| $L_{base}$-SMC-PHD | 60.65 | 99.60 | 43.60 | 105.00 | 73.20 | 1.10 |
| $L_{base}$-SMC-PHD ≥ 150 ms | 72.30 | 86.90 | 61.90 | 113.00 | 81.80 | 1.10 |
| $L_{recall}$-graph | 73.81 | 77.20 | 70.70 | 151.00 | 86.50 | 1.20 |
| $L_{recall}$-SMC-PHD | 74.44 | 70.70 | 78.60 | 122.00 | 86.20 | 1.10 |
| $L_{recall}$-SMC-PHD ≥ 150 ms | 75.59 | 73.60 | 77.70 | 125.00 | 87.00 | 1.20 |


| Method | F1 (%) | Precision (%) | Recall (%) | Mean freq. deviation (Hz) | Coverage (%) | Fragmentation (detected segments/whistle) |
|---|---|---|---|---|---|---|
| $L_{base}$-annotation | 82.11 | 77.50 | 87.30 | 86.00 | 91.70 | 1.10 |
| Graph search | 69.20 | 66.70 | 71.90 | 95.00 | 80.50 | 1.10 |
| SMC-PHD | 77.56 | 67.20 | 91.70 | 92.00 | 69.10 | 1.50 |
| SMC-PHD ≥ 150 ms | 70.49 | 95.40 | 55.90 | 88.00 | 74.10 | 1.10 |
| $L_{base}$-graph | 78.75 | 97.60 | 66.00 | 135.00 | 80.00 | 1.10 |
| $L_{base}$-SMC-PHD | 70.83 | 98.50 | 55.30 | 109.00 | 75.00 | 1.10 |
| $L_{base}$-SMC-PHD ≥ 150 ms | 85.26 | 98.60 | 75.10 | 113.00 | 83.80 | 1.10 |
| $L_{recall}$-graph | 88.74 | 93.30 | 84.60 | 142.00 | 88.80 | 1.10 |
| $L_{recall}$-SMC-PHD | 87.41 | 89.40 | 85.50 | 122.00 | 89.70 | 1.10 |
| $L_{recall}$-SMC-PHD ≥ 150 ms | 89.28 | 92.70 | 86.10 | 123.00 | 89.40 | 1.10 |


| Method | F1 (%) | Precision (%) | Recall (%) | Mean freq. deviation (Hz) | Coverage (%) | Fragmentation (detected segments/whistle) |
|---|---|---|---|---|---|---|
| $L_{base}$-annotation | 93.14 | 94.30 | 92.00 | 88.00 | 87.30 | 1.10 |
| Graph search | 86.35 | 90.10 | 82.90 | 122.00 | 80.50 | 1.30 |
| SMC-PHD | 91.92 | 90.30 | 93.60 | 115.00 | 72.20 | 2.00 |
| SMC-PHD ≥ 150 ms | 79.71 | 97.10 | 67.60 | 114.00 | 69.50 | 1.30 |
| $L_{base}$-graph | 85.48 | 98.00 | 75.80 | 128.00 | 78.30 | 1.10 |
| $L_{base}$-SMC-PHD | 81.74 | 98.20 | 70.00 | 115.00 | 73.90 | 1.20 |
| $L_{base}$-SMC-PHD ≥ 150 ms | 88.41 | 97.90 | 80.60 | 117.00 | 79.90 | 1.20 |
| $L_{recall}$-graph | 91.20 | 95.10 | 87.60 | 136.00 | 84.70 | 1.20 |
| $L_{recall}$-SMC-PHD | 92.50 | 95.70 | 89.50 | 134.00 | 86.10 | 1.20 |
| $L_{recall}$-SMC-PHD ≥ 150 ms | 92.03 | 94.70 | 89.50 | 132.00 | 86.00 | 1.20 |


| $\gamma$ | $\lambda$ | F1 (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| 0 | 0.5 | 79.30 | 90.18 | 70.93 |
| 0 | 1 | 83.14 | 88.68 | 78.38 |
| 0 | 2 | 86.31 | 89.55 | 83.33 |
| 0 | 3 | 84.38 | 84.08 | 84.78 |
| 0 | 4 | 82.01 | 79.33 | 85.18 |
| 1 | 2 | 83.37 | 88.70 | 78.85 |
| 1 | 4 | 84.80 | 86.90 | 82.95 |
| 1 | 6 | 84.88 | 86.08 | 83.85 |
| 1 | 8 | 81.61 | 78.68 | 85.20 |
| 2 | 4 | 83.52 | 87.85 | 79.85 |
| 2 | 6 | 83.88 | 85.75 | 82.48 |
| 2 | 8 | 84.39 | 85.78 | 83.30 |
| 2 | 10 | 83.98 | 83.30 | 84.90 |
| 4 | 4 | 81.87 | 91.20 | 74.35 |
| 4 | 6 | 82.47 | 89.95 | 76.28 |
| 4 | 8 | 82.82 | 88.88 | 77.80 |
| 4 | 10 | 83.03 | 88.28 | 78.70 |
| 4 | 15 | 83.30 | 85.35 | 81.98 |
| 4 | 20 | 83.59 | 83.63 | 84.43 |
| 4 | 25 | 81.08 | 79.90 | 83.88 |
| 4 | 30 | 79.83 | 76.83 | 85.10 |


| $\gamma$ | $\lambda$ | F1 (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| 0 | 0.01 | 74.51 | 93.88 | 62.20 |
| 0 | 0.1 | 72.25 | 95.73 | 58.53 |
| 1 | 0.01 | 74.52 | 93.53 | 62.33 |
| 1 | 0.1 | 72.39 | 96.03 | 58.80 |
| 2 | 0.01 | 74.55 | 94.80 | 61.83 |
| 2 | 0.1 | 72.16 | 95.55 | 58.58 |
| 4 | 0.01 | 74.81 | 95.38 | 61.95 |
| 4 | 0.1 | 72.85 | 95.68 | 59.40 |


| $\gamma$ | $\lambda$ | F1 (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| 0 | 1 | 85.65 | 89.35 | 82.40 |
| 0 | 2 | 86.42 | 87.18 | 85.85 |
| 0 | 3 | 86.22 | 85.15 | 87.34 |
| 0 | 4 | 85.81 | 83.63 | 88.20 |
| 1 | 2 | 85.16 | 89.88 | 81.10 |
| 1 | 4 | 86.14 | 88.80 | 84.05 |
| 1 | 6 | 86.26 | 88.28 | 84.60 |
| 1 | 8 | 86.20 | 87.18 | 85.40 |
| 1 | 10 | 85.90 | 85.73 | 86.35 |
| 1 | 11 | 85.94 | 84.90 | 87.10 |
| 1 | 12 | 86.09 | 84.40 | 87.95 |


| $\gamma$ | $\lambda$ | F1 (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| 0 | 1 | 81.95 | 93.88 | 72.78 |
| 0 | 2 | 84.91 | 91.98 | 78.88 |
| 0 | 3 | 85.92 | 89.43 | 82.95 |
| 0 | 4 | 86.82 | 89.23 | 84.65 |
| 0 | 5 | 86.16 | 87.40 | 85.10 |
| 0 | 6 | 86.48 | 86.33 | 86.78 |
| 1 | 2 | 81.58 | 95.40 | 71.35 |
| 1 | 4 | 84.03 | 92.10 | 77.30 |
| 1 | 6 | 85.00 | 90.95 | 79.93 |
| 1 | 8 | 86.11 | 90.68 | 82.13 |
| 1 | 10 | 86.74 | 90.28 | 83.68 |
| 1 | 11 | 86.51 | 89.45 | 83.98 |
| 1 | 12 | 86.77 | 88.40 | 85.48 |
| 1 | 14 | 87.20 | 88.78 | 85.78 |
| 1 | 16 | 86.38 | 87.33 | 85.58 |


1. We used the beta2 version of *Silbido* at https://roch.sdsu.edu/index.php/software/. The latest version of *Silbido* is available at https://github.com/MarineBioAcousticsRC/silbido (Last viewed July 18, 2023).

2. The preprocessing code is available at https://doi.org/10.5258/SOTON/D0316. The SMC-PHD code is available at https://github.com/PinaGruden/SMC-PHD_whistle_contour_tracking (Last viewed July 18, 2023).

3. Code is available at https://github.com/Paul-LiPu/DeepWhistle (Last viewed July 18, 2023).
