A convolutional neural network (CNN) was trained to identify multi-modal gunshots (impulse calls) within large acoustic datasets in shallow-water environments. South Atlantic right whale gunshots were used to train the CNN, and North Pacific right whale (NPRW) gunshots, to which the network was naive, were used for testing. The classifier generalizes to new gunshots from the NPRW and is shown to identify calls which can be used to invert for source range and/or environmental parameters. This can save human analysts hours of manually screening large passive acoustic monitoring datasets.

The propagation of acoustic signals in shallow water is highly impacted by interaction with the sea surface and the seabed. These environmental interactions can cause low-frequency signals to propagate dispersively. In this case, the received signal can be modeled as a set of frequency-dependent components, or modes, with each mode traveling at its own group speed. Having passed through the propagation channel, the modes contain information that can be used to estimate source location and/or environmental characteristics1–5 (hereinafter referred to as "environmental inversion"). However, before environmental inversion can be performed, it is necessary to estimate the modes. Modal filtering has historically been done using arrays of synchronized hydrophones, but it is now established that it can be done with a single sensor, which is our context here. Several methods can be used to filter modes from a single hydrophone, including sparse Bayesian learning,6 particle filtering,7 the dispersion-based short-time Fourier transform,8 and warping.9 Once the modes have been filtered, they can be used as input for environmental inversion problems. To obtain meaningful results, it is desirable for signals that are used in environmental inversion to have at least two modes. Single-hydrophone modal dispersion has been successfully used to estimate range1,2 and/or depth of marine mammals,3 as well as environmental properties.4,5 Modal estimation and subsequent inversion can be carried out using tomographic sound sources such as the transmission of pseudo-random noise,10 SUS (Signal, Underwater Sound) charge explosions,5 or light bulb implosions.11 Alternatively, natural acoustic sources of opportunity, notably baleen whale vocalizations,1–3,12 are also used. In the context of natural sources, modal inversion has been limited to small datasets, because finding suitable signals in large passive acoustic monitoring (PAM) datasets is a tedious task. The objective of our research is to facilitate the processing of large PAM datasets for environmental inversion. High generalizability to PAM datasets containing vocalizations from similar species is also desirable, for it would increase the number of environments in which our method could be applied.

This study focuses on the automatic detection of dispersed gunshot calls suitable for environmental inversion. Gunshot calls (hereafter referred to as "gunshots") are impulsive vocalizations which have been reported for the North Pacific right whale (NPRW, Eubalaena japonica), North Atlantic right whale (NARW, Eubalaena glacialis), South Atlantic right whale (SARW, Eubalaena australis), and bowhead whale13,14 (Balaena mysticetus). Gunshots are prime acoustic sources of opportunity for environmental inversion because their impulsive nature tends to produce dispersive signals in shallow-water waveguides.

Here, we present a convolutional neural network (CNN) used to scan large acoustic datasets containing vocalizations from the SARW and NPRW species and detect multi-modal gunshots. We find that the CNN is able to learn a complex nonlinear function to classify a spectrogram as (1) no gunshot present, (2) gunshot with one mode (or no dispersion), or (3) gunshot with at least two modes. Our target application, environmental inversion, requires the extraction of calls with a high signal-to-noise ratio (SNR). This differs from the typical goal of bioacoustic detectors, which aim to find all the calls. In terms of performance, bioacoustic detectors need to operate with a high true positive rate (TPR) and minimal false negative (FNR) and false positive (FPR) rates. In contrast, our aim is to detect the highest-SNR multi-modal gunshots with a minimal FPR, and we do not focus on detecting the majority of the calls. These metrics enable the automatic scanning of big-data scale acoustic datasets to find calls that are suitable for environmental inversion. Hence, a large number of false-negative classifications is not a problem for our application. Also, it is likely that any missed multi-modal calls would not have an SNR high enough to be effectively post-processed. Taking a data-driven approach to the identification of multi-modal gunshots drastically reduces the time it takes to collect source-of-opportunity data for environmental inversion compared to manual search by a human analyst, which, as far as we know, is the current state of the art. Although existing bioacoustic detectors can find gunshots, none of the available methods have the capacity to discriminate between signals based on modal content. Moreover, this approach provides opportunities to ask data-heavy research questions. This step is crucial in the development of scalable analytical methods which can be readily applied to massive existing PAM datasets15 as well as through-the-sensor environmental characterization methods. For background material regarding the mathematical and practical elements of deep learning, the book Deep Learning16 is a highly recommended resource. Additionally, all the code and labeled spectrograms used in this article can be found on GitHub17 and figshare,18 respectively.

This paper is organized as follows. In Sec. 2, the data are introduced. Next, data preprocessing and the CNN's architecture are presented in Sec. 3. The training and testing procedures and results are then shown in Secs. 4 and 5, respectively. Finally, Sec. 6 provides concluding remarks.

The PAM data used to train the CNN were collected in Bahía San Antonio, Argentina over the course of eight days from late August to early September 2015 as a part of the San Antonio model BAY (SAMBAY) experiment. The data were collected using a SoundTrap 202 STD with a 96 kHz sampling rate and the amplifier set to "high" (maximum level before clipping is 169.2 dB re 1 μPa²) at a depth of 10 m, while the water depth was tide-dependent and ranged between 10 and 25 m. The data were then split into hour-long audio files (WAV), totaling 215 h. In the data, the start and end times of SARW gunshots were manually annotated by an analyst. The gunshots were later separated into two of the classes of interest as described in Sec. 3.1.

The PAM data used to test the CNN were collected in the Bering Sea from May to October 2014 from AURAL recorders attached to the sub-surface mooring station M2 (AURAL deployments BS14_AU_PM02-a and BS14_AU_PM02-b). The recorder was positioned at 65 m in 73-m water depth. Raw data from the recorder were converted into 10-min audio files (WAV), totaling 3240 h of acoustic data, of which 1124 h were manually screened for NPRW vocalizations (gunshots and frequency-modulated calls referred to as "upsweeps"19) by an analyst using an in-house MATLAB program called "soundchecker." The data were preselected so that every audio file analyzed in this study contained calls, yielding 12.5 h of acoustic data. A subset of the data containing exclusively gunshots and another subset containing exclusively upsweeps were selected to perform two experiments to test the network. The considered data subset was collected during August and September, a time period during which the water column is highly stratified.

Initially, the training data (SARW) were decimated to a sampling frequency of 1 kHz to limit the frequency range to the band where lower-frequency modes were anticipated. These calls may or may not be dispersive. Also, among the dispersive calls, there may be different numbers of modes present (Fig. 1). To label the SARW data by the number of modes present, spectrograms of the calls were computed using the annotated start and end times as bounds. The short-time Fourier transform (STFT) was carried out using a 31-sample Hamming window with an empirically determined overlap of 26 samples and 256 frequency bins. Finally, the spectrograms were transformed onto a dB scale and normalized to the range [0,1]. These parameters made the modes of dispersed gunshots visually discernible, and the resulting spectrograms were used to label the calls as having either (1) one mode or (2) at least two modes. After all examples were labeled, each gunshot was centered in a 0.634-s time snippet (the snippet length was chosen in part to fit the longest call in the dataset). A uniform target image size of 128 × 128 pixels was chosen: spectrograms of that size are not too computationally intensive to generate, and square images are commonly used with CNNs. To achieve this, the number of frequency bins was first set to 256, and only the first 128 bins (corresponding to positive frequencies) were kept; the overlap and snippet length were then fine-tuned so that each row contained 128 pixels.
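For concreteness, a minimal sketch of this spectrogram computation in Python follows (using scipy; the function name and the final cropping to exactly 128 × 128 are illustrative assumptions, and the published code on GitHub17 remains the authoritative version):

```python
import numpy as np
from scipy.signal import stft

# Parameters from the text: 31-sample Hamming window, 26-sample overlap,
# 256-point FFT, 0.634-s snippets decimated to 1 kHz.
def gunshot_spectrogram(snippet, fs=1000):
    """Return a [0, 1]-normalized 128 x 128 dB-scale spectrogram."""
    _, _, Z = stft(snippet, fs=fs, window="hamming",
                   nperseg=31, noverlap=26, nfft=256)
    S = 20.0 * np.log10(np.abs(Z) + 1e-12)        # dB scale; avoid log(0)
    S = S[:128, :128]                             # first 128 positive-frequency
                                                  # bins and 128 time frames
    return (S - S.min()) / (S.max() - S.min())    # normalize to [0, 1]
```

With a 5-sample hop (31 − 26), a 634-sample snippet at 1 kHz produces roughly 128 STFT frames, consistent with the fine-tuning described above; minor padding or cropping to exactly 128 frames is assumed here.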

Fig. 1.

(a) A spectrogram from the SARW dataset with no gunshot present. (b) A spectrogram from the SARW dataset containing a gunshot with one mode. (c) A spectrogram from the SARW dataset containing a gunshot with four modes. (d) A grid of 64 spectrograms from the SARW dataset containing no gunshots. (e) A grid of 64 spectrograms from the SARW dataset containing gunshots with one mode. (f) A grid of 64 spectrograms from the SARW dataset containing gunshots with at least two modes.


Because the objective of the study was to build a CNN that can scan large datasets, the network must also be able to identify and ignore sections of the data which do not contain a gunshot. Thus, examples with no gunshot present were collected by selecting random 0.634-s time snippets of the SARW training data and manually inspecting the calculated spectrograms to ensure the absence of any gunshot calls. Examples were compiled in this manner until the training dataset was approximately balanced among the three classes. The no call, one mode, and at least two modes classes constitute 500 examples (32.3%), 620 examples (40.1%), and 427 examples (27.6%) of the labeled data, respectively. Figures 1(a)–1(c) show individual examples from each of the three classes, while Figs. 1(d)–1(f) show a grid of 64 examples from each of the three classes. The rightmost column [Figs. 1(c) and 1(f)] contains the multi-modal calls which are the interest of this study.

The CNN architecture is presented in Table 1. Specifically, the convolutional section of the CNN consists of four layers, which are followed by three hidden dense layers and a classification layer. The kernel sizes of the convolutional layers are 5 × 5 and 3 × 3; each of the dense layers consists of 1024 units, while the classification layer consists of three units (one for each class). The rectified linear unit (ReLU) activation function is applied to all layers. An L2 regularization of 0.001 is applied to each layer, and dropout with a probability of 0.2 for the convolutional layers and 0.5 for the dense layers is applied to prevent overfitting. A 2 × 2 max pooling layer is included between each convolutional layer and its dropout to aid in the dimensional reduction of the data and to contribute to the network's translational invariance. The sparse categorical cross-entropy between the true and predicted labels is used as the cost function. The CNN was implemented using tensorflow20 and trained with the Adam21 optimization method, using a constant learning rate of 0.0005, a batch size of 32, and 130 epochs.
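As a concrete illustration, a sketch of this architecture in tensorflow/Keras follows. The per-layer filter counts and the ordering of the kernel sizes are assumptions (the text gives only the kernel sizes, unit counts, and total parameter count), so the parameter total of this sketch will not match the reported 6 919 043 exactly:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_cnn(n_classes=3, weight_decay=1e-3):
    model = tf.keras.Sequential([tf.keras.Input(shape=(128, 128, 1))])
    # Four convolutional layers (5x5 and 3x3 kernels; filter counts assumed),
    # each followed by 2x2 max pooling and dropout(0.2).
    for kernel in (5, 5, 3, 3):
        model.add(layers.Conv2D(32, kernel, activation="relu",
                                kernel_regularizer=regularizers.l2(weight_decay)))
        model.add(layers.MaxPooling2D(2))
        model.add(layers.Dropout(0.2))
    model.add(layers.Flatten())
    # Three hidden dense layers of 1024 units with dropout(0.5).
    for _ in range(3):
        model.add(layers.Dense(1024, activation="relu",
                               kernel_regularizer=regularizers.l2(weight_decay)))
        model.add(layers.Dropout(0.5))
    # Classification layer; the text applies ReLU here as well, so the loss
    # treats the three (non-negative) outputs as logits.
    model.add(layers.Dense(n_classes, activation="relu",
                           kernel_regularizer=regularizers.l2(weight_decay)))
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"])
    return model  # train with batch_size=32 for 130 epochs, per the text
```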

Table 1.

A high-level view of the CNN architecture used. In total, there are 6 919 043 tunable parameters.

Layer name                 Components
Convolutional layer (×4)   Conv + L2 + ReLU → Max-Pool → Dropout
Dense layer (×3)           Dense + L2 + ReLU → Dropout
Classification             Dense + L2 + ReLU

Given the relatively small size of the training set (1547 spectrograms), the CNN was trained using fivefold cross-validation to make the most of the available data. First, the SARW spectrograms were shuffled and split into five groups, or "folds." Next, five CNNs of the architecture shown in Table 1 were trained, each using a different combination of four folds and validated on the held-out fifth fold, which the network had not seen. Calculating performance metrics for the CNN requires that each processed spectrogram is manually verified to be one of true positive (TP), false positive (FP), true negative (TN), or false negative (FN). Note that a positive example is a spectrogram containing a gunshot with at least two modes, and a negative example is a spectrogram containing either a gunshot with one mode or no gunshot. To bolster the network's translational invariance, data augmentation was performed during training by moving a constrained random number of columns from the right side of a spectrogram image to the left side (wrapping) to simulate a time shift. Each spectrogram was shifted such that the call itself would not be subject to partial wrapping. This process promotes translational invariance with respect to time, so that regardless of where a gunshot, single or multi-modal, appears in the scan window, it can be properly classified by the CNN.
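A minimal sketch of this wrap-around time shift follows, assuming 128 × 128 NumPy spectrograms and that the column where the call ends is known so the shift can be constrained (the annotation name is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def wrap_time_shift(spec, call_end_col):
    """Simulate a time shift by wrapping right-edge columns around to the left.

    `spec` is a (freq, time) spectrogram with the call centered in time, and
    `call_end_col` (a hypothetical annotation) is the last column the call
    occupies; the shift is capped so the call never partially wraps.
    """
    max_shift = spec.shape[1] - 1 - call_end_col
    shift = int(rng.integers(0, max_shift + 1))
    # Columns pushed past the right edge reappear on the left.
    return np.roll(spec, shift, axis=1)
```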

At the end of the training process, the metrics for each CNN were averaged to assess performance. Notably, across the five folds of training, the average accuracy [i.e., (TP+TN)/(TP+TN+FP+FN)] of the CNN is 0.90, and the average precision [i.e., TP/(TP+FP)] for the class of spectrogram containing a gunshot with at least two modes is 0.88, as shown in Table 2. A confusion matrix was generated for each of the five validation folds; their sum is shown in Table 3. Additionally, the precision for the class of spectrogram with no call present across all five validation folds is 0.98.

Table 2.

The accuracy and precision for the class of spectrogram containing a gunshot with at least two modes for each of the five folds as well as their averages.

Fold      Accuracy   Precision of at least two modes class
1         0.89       0.83
2         0.92       0.82
3         0.90       0.90
4         0.89       0.97
5         0.90       0.87
Average   0.90       0.88
Table 3.

Sum of the confusion matrices generated from each fold of training.

                            Predicted labels
                     1 mode    2 modes    no call
True     1 mode         565         48          7
labels   2 modes         90        336          1
         no call                              449

A high precision for the class of spectrogram containing a gunshot with at least two modes means that a high proportion of the time snippets marked by the CNN as containing a multi-modal gunshot actually contain this call type. This is important in decreasing the time spent manually screening for these ideal calls in unknown datasets. Most of the misclassifications occur between the calls with one mode and the calls with at least two modes. These two classes of spectrogram are more similar to each other than either is to the spectrograms with no gunshot present, a class with a notably high precision.

To ensure that the samples used to train the network were representative of multiple source ranges, the same training procedure was carried out with the addition of randomly stretching and interpolating the multi-modal gunshot spectrograms in the training data for each fold. Each multi-modal gunshot was stretched horizontally from 128 pixels to 200, 250, 300, 350, or 400 pixels, or not at all (chosen uniformly at random). Note that only the multi-modal gunshots were stretched, because stretching the single-mode gunshots falsely gave them a highly dispersive appearance. Both the average accuracy and the precision of the class of spectrogram that contains a multi-modal gunshot decreased by 0.01. The only metric that improved was the average precision of the class of spectrograms with no gunshot calls, which increased by 0.013. Because these changes were small and did not uniformly improve or worsen the metrics, we believe the model already incorporates knowledge of the ranges present in the dataset during training. The final model did not use this augmentation technique.
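A sketch of this horizontal stretch is given below; scipy's zoom is used here for the interpolation, and cropping the stretched image back to the 128-column input width is our assumption for illustration, as the text does not specify how the wider images were handled:

```python
import numpy as np
from scipy.ndimage import zoom

rng = np.random.default_rng(0)

def random_stretch(spec):
    """Randomly stretch a (128, 128) multi-modal gunshot spectrogram in time."""
    width = rng.choice([128, 200, 250, 300, 350, 400])  # target widths from the text
    if width == 128:                                    # "not at all"
        return spec
    factor = width / spec.shape[1]
    stretched = zoom(spec, (1.0, factor), order=1)      # bilinear interpolation
    return stretched[:, :128]                           # crop to 128 columns (assumed)
```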

The utility of the CNN is determined by its ability to generalize to impulsive calls vocalized by different whale species in different places. This is desirable because it expands the set of environments in which the classifier can be used to collect multi-modal gunshots for environmental inversion. Accordingly, because the CNN was trained on SARW data (South Atlantic region), its performance was measured using acoustic data containing NPRW vocalizations (North Pacific region), to which the network was naive.

When the CNN is applied to a large unknown acoustic data file, all adjacent (non-overlapping) 0.634-s time snippets are transformed into spectrograms using the methodology described in Sec. 3.1, except that no labels or annotated start and end times are available. The spectrograms are then processed by the CNN to predict their classes. The primary goal of testing is to evaluate the ability of the CNN to detect single-mode and multi-modal gunshots and to differentiate them from each other and from any other extraneous sound sources. The labels predicted by the CNN are manually verified, which leads to the categorization of all processed spectrograms as TP, FP, TN, or FN.
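A sketch of this scanning loop follows, reusing gunshot_spectrogram from the preprocessing sketch above; the soundfile I/O choice, the class index of the multi-modal class, and the assumption that the file has already been decimated to 1 kHz are ours:

```python
import numpy as np
import soundfile as sf        # file I/O choice is an assumption
import tensorflow as tf

SNIPPET_S = 0.634             # scan-window length (s)
MULTIMODE = 2                 # assumed index of the "at least two modes" class

def scan_file(path, model, threshold=0.61):
    """Flag adjacent, non-overlapping 0.634-s snippets as multi-modal gunshots."""
    audio, fs = sf.read(path)                    # assumed already decimated to 1 kHz
    hop = int(SNIPPET_S * fs)
    detections = []
    for start in range(0, len(audio) - hop + 1, hop):
        spec = gunshot_spectrogram(audio[start:start + hop], fs)
        logits = model(spec[np.newaxis, :, :, np.newaxis], training=False)
        probs = tf.nn.softmax(logits).numpy()[0]
        if probs[MULTIMODE] >= threshold:
            detections.append(start / fs)        # detection time in seconds
    return detections
```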

Ideally, the CNN extracts a sufficient number of multi-modal gunshots from the files known to contain gunshots. When applied to the upsweep data, the goal is to produce as few FP detections as possible and to identify any gunshots that were not manually annotated. Because dispersed gunshots are relatively sparse, the test set is unbalanced, with 437 time snippets that contain multi-modal gunshots and 37 063 time snippets that contain either a gunshot with one mode or no call.

Figures 2(a) and 2(b) show detected gunshots with one mode and with five modes, respectively. The dispersed gunshot in Fig. 2(b) has modes that are longer and closer together than the examples shown in Fig. 1(f). This illustrates the CNN's generalization capability.

Fig. 2.

(a) Gunshot with a single mode detected by the CNN in the NPRW dataset. (b) Gunshot with five modes detected by the CNN in the NPRW dataset.


The precision-recall curve and receiver operating characteristic (ROC) illustrate results which stray from those of an ideal model because of the imbalance of the test data22 [Figs. 3(a) and 3(b)]. However, given that the goal of the network is to extract multi-modal gunshots with as few FP detections as possible, we argue that this model is sufficient. Here, the detection threshold can still be tuned to have a precision for the at least two modes class that is greater than 0.80. Moreover, although the maximized true positive rate [TPR—i.e., TP/(TP+FN)] is close to 0.25, this is still two orders of magnitude greater than the false positive rate [FPR—i.e., FP/(FP+TN)]. Note that the presented mathematical definitions of TPR and FPR are common in machine learning; however, some may refer to these quantities as TP probability and FP probability, respectively.
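These curves can be reproduced from the manually verified labels and the CNN's scores with standard tooling; a short sketch using scikit-learn follows (the array contents here are toy values for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

# y_true: 1 if a snippet truly contains a multi-modal gunshot, else 0.
# y_score: the CNN's score for that class. Toy values for illustration:
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.05, 0.9])

precision, recall, pr_thr = precision_recall_curve(y_true, y_score)
fpr, tpr, roc_thr = roc_curve(y_true, y_score)

# Lowest score threshold whose precision still exceeds 0.80
# (precision[:-1] aligns element-wise with pr_thr).
candidates = pr_thr[precision[:-1] > 0.80]
print(candidates.min() if candidates.size else "no threshold reaches 0.80")
```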

Fig. 3.

(a) Precision-recall curve for the class of spectrogram with at least two modes. (b) Receiver operating characteristic (ROC) for the class of spectrogram with at least two modes.


With these metrics, our CNN is sufficient for application to large acoustic datasets that contain dispersed gunshots. Considering the TPR range shown in Fig. 3(b), a reasonable detection threshold will cause the CNN to capture between one in every four and one in every ten multi-modal gunshots that it encounters. Setting a threshold of 0.61 to achieve a baseline TPR of 0.1, the corresponding FPR is 0.00029. This means that for every 10 000 time snippets that do not contain a gunshot or that contain a gunshot with one mode, approximately three will be falsely labeled as a multi-modal gunshot. In one week of PAM data, this yields 286 FP detections, assuming the duration of the time snippet is unchanged (a week comprises 7 × 86 400 s/0.634 s ≈ 9.5 × 10^5 snippets, and 9.5 × 10^5 snippets × 3 × 10^-4 ≈ 286). These metrics are sufficient to efficiently screen multiple hours of PAM data and select a few multi-modal gunshots to use for environmental inversion. Using a threshold of 0.61, a total of 45 multi-modal gunshots (TP detections) were detected by the CNN, with 11 FP detections. There were 1569 3-min bins in the NPRW dataset marked as containing at least one gunshot, so there is an abundance of opportunities for the network to identify high-SNR gunshots with multiple modes. It is an important result that the network is shown to generalize across multiple receivers and water depths. As described in Sec. 2, the SARW data (training and validation) were collected in an environment with tide-dependent water depth between 10 and 25 m, while the NPRW data (testing) were collected in 73-m water depth. The ability to generalize across water depths is particularly significant, since water depth is one of the primary determinants of the presence and timing of mode arrivals in shallow water.

Note that the 0.634-s time snippet used to prepare a spectrogram for the CNN was chosen specifically for the datasets used to train and test the network in this study. The time snippet is large enough to contain the longest dispersed gunshots in the dataset, but small enough that multiple gunshots are not captured simultaneously. To adapt the network to a different context (e.g., a multi-second dispersed signal from a man-made source), the CNN would need to be retrained using an appropriately sized time snippet.

Last, to highlight a notable result, the acoustic data used in testing included 8.5 h of sub-sampled right whale upsweeps. The CNN produced no FP detections among the upsweeps and identified four multi-modal dispersed gunshots (a precision of 1.0 for the class of spectrogram with at least two modes) that had not previously been annotated by the human analysts.

In this article, we show that a CNN can be used to detect and classify dispersed signals in PAM datasets. This was notably validated on a dataset to which the network was naive: the training and testing datasets were collected in different oceans (South Atlantic and North Pacific) using different acoustic recorders and contain vocalizations from different species (SARW and NPRW). This study thus demonstrates excellent generalization of the proposed method.

The presented CNN has a high enough TPR to detect many high-SNR dispersed gunshots while having an FPR that is two orders of magnitude lower. Although the network was trained using only 1547 spectrograms across the three classes, these metrics are sufficient to routinely harvest multi-modal gunshots for environmental inversion without hours of manual screening by a human analyst. In the future, the CNN can be used to continuously harvest these ideal gunshots and speed up the processing pipeline from data collection to information extraction for impulsive signals from right whales in multiple ocean basins, with applications in baleen whale density estimation and through-the-sensor environmental estimation.

We thank the numerous field technicians involved in NOAA mooring deployment and retrieval, and the captains and crews of the F/V Aquila. We thank Victoria Warren and Elena Schall for the annotation of the SAMBAY datasets. NOAA mooring and recorder purchasing, deployment, and retrieval were provided by the BOEM-funded ARCWEST project (IA# M12PG00021). We thank Program managers Jeff Denton, Carol Fairfield, and Chuck Monnett. Manual analysis of the M2 mooring was funded by the National Fish and Wildlife Foundation. The findings and conclusions in the paper are those of the authors and do not necessarily represent the views of the National Marine Fisheries Service, NOAA. Reference to trade names does not imply endorsement by the National Marine Fisheries Service, NOAA. This research was supported by the Office of Naval Research (ONR) under Grants Nos. N00014-19-1-2627 and N00014-18-1-2811, by the German Federal Ministry for Education and Science (BMBF) under Grant No. 01DN15019 and by the Woods Hole Oceanographic Institution's Summer Student Fellowship program.

1. J. Bonnel, A. M. Thode, S. B. Blackwell, K. Kim, and A. Michael Macrander, "Range estimation of bowhead whale (Balaena mysticetus) calls in the Arctic using a single hydrophone," J. Acoust. Soc. Am. 136(1), 145–155 (2014).
2. S. M. Wiggins, M. A. McDonald, L. M. Munger, S. E. Moore, and J. A. Hildebrand, "Waveguide propagation allows range estimates for North Pacific right whales in the Bering Sea," Can. Acoust. 32(2), 146–154 (2004).
3. A. Thode, J. Bonnel, M. Thieury, A. Fagan, C. Verlinden, D. Wright, C. Berchok, and J. Crance, "Using nonlinear time warping to estimate North Pacific right whale calling depths in the Bering Sea," J. Acoust. Soc. Am. 141(5), 3059–3069 (2017).
4. J. Bonnel, S. E. Dosso, D. Eleftherakis, and N. R. Chapman, "Trans-dimensional inversion of modal dispersion data on the New England mud patch," IEEE J. Ocean. Eng. 45(1), 116–130 (2020).
5. G. R. Potty, J. H. Miller, J. F. Lynch, and K. B. Smith, "Tomographic inversion for sediment parameters in shallow water," J. Acoust. Soc. Am. 108(3), 973–986 (2000).
6. H. Niu, P. Gerstoft, R. Zhang, Z. Li, Z. Gong, and H. Wang, "Mode separation with one hydrophone in shallow water: A sparse Bayesian learning approach based on phase speed," J. Acoust. Soc. Am. 149(6), 4366–4376 (2021).
7. I. Zorych and Z.-H. Michalopoulou, "Particle filtering for dispersion curve tracking in ocean acoustics," J. Acoust. Soc. Am. 124(2), EL45–EL50 (2008).
8. J.-C. Hong, K. H. Sun, and Y. Y. Kim, "Dispersion-based short-time Fourier transform applied to dispersive wave analysis," J. Acoust. Soc. Am. 117(5), 2949–2960 (2005).
9. J. Bonnel, A. Thode, D. Wright, and R. Chapman, "Nonlinear time-warping made simple: A step-by-step tutorial on underwater acoustic modal separation with a single hydrophone," J. Acoust. Soc. Am. 147(3), 1897–1926 (2020).
10. Z.-H. Michalopoulou and N. Aunsri, "Environmental inversion using dispersion tracking in a shallow water environment," J. Acoust. Soc. Am. 143(3), EL188–EL193 (2018).
11. M. Taroudakis, C. Smaragdakis, and N. Ross Chapman, "Inversion of acoustical data from the 'Shallow Water 06' experiment by statistical signal characterization," J. Acoust. Soc. Am. 136(4), EL336–EL342 (2014).
12. C. Ioana, A. Jarrot, C. Gervaise, Y. Stéphan, and A. Quinquis, "Localization in underwater dispersive channels using the time-frequency-phase continuity of signals," IEEE Trans. Signal Process. 58(8), 4093–4107 (2010).
13. J. L. Crance, C. L. Berchok, and J. L. Keating, "Gunshot call production by the North Pacific right whale, Eubalaena japonica, in the southeastern Bering Sea," Endangered Species Res. 34, 251–267 (2017).
14. B. Würsig and C. Clark, "Behavior," in The Bowhead Whale, Special Publication No. 2, edited by J. J. Burns, J. J. Montague, and C. J. Cowles (The Society for Marine Mammalogy, Yarmouth Port, MA, 1993).
15. G. E. Davis, M. F. Baumgartner, J. M. Bonnell, J. Bell, C. Berchok, J. B. Thornton, S. Brault, G. Buchanan, R. A. Charif, D. Cholewiak, C. W. Clark, P. Corkeron, J. Delarue, K. Dudzinski, L. Hatch, J. Hildebrand, L. Hodge, H. Klinck, S. Kraus, B. Martin, D. K. Mellinger, H. Moors-Murphy, S. Nieukirk, D. P. Nowacek, S. Parks, A. J. Read, A. N. Rice, D. Risch, A. Širović, M. Soldevilla, K. Stafford, J. E. Stanistreet, E. Summers, S. Todd, A. Warde, and S. M. Van Parijs, "Long-term passive acoustic recordings track the changing distribution of North Atlantic right whales (Eubalaena glacialis) from 2004 to 2014," Sci. Rep. 7, 13460 (2017).
16. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, Cambridge, 2016).
17. M. Goldwater, "ssf_goldwater," GitHub repository, https://github.com/whoi-mars/ssf_goldwater (2020) (Last viewed 9/16/2021).
18. M. Goldwater, "Data set for dispersed gunshot CNN," figshare (2021) (Last viewed 9/16/2021).
19. M. McDonald and S. Moore, "Calls recorded from North Pacific right whales (Eubalaena japonica) in the eastern Bering Sea," J. Cetacean Res. Manage. 4, 261–266 (2002).
20. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," https://www.tensorflow.org (2015) (Last viewed 10/8/2021).
21. D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980v9 (2017).
22. J. Brownlee, "ROC curves and precision-recall curves for imbalanced classification," https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/ (2020) (Last viewed 10/8/2021).