Deep clustering was applied to unlabeled, automatically detected signals in a coral reef soundscape to distinguish fish pulse calls from segments of whale song. Deep embedded clustering (DEC) learned latent features and formed classification clusters using fixed-length power spectrograms of the signals. Handpicked spectral and temporal features were also extracted and clustered with Gaussian mixture models (GMM) and conventional clustering. DEC, GMM, and conventional clustering were tested on simulated datasets of fish pulse calls (fish) and whale song units (whale) with randomized bandwidth, duration, and SNR. Both GMM and DEC achieved high accuracy and identified clusters with fish, whale, and overlapping fish and whale signals. Conventional clustering methods had low accuracy in scenarios with unequal-sized clusters or overlapping signals. Fish and whale signals recorded near Hawaii in February–March 2020 were clustered with DEC, GMM, and conventional clustering. DEC features demonstrated the highest accuracy of 77.5% on a small, manually labeled dataset for classifying signals into fish and whale clusters.

Deep learning, a powerful, recent subfield of machine learning,1 excels at learning representations of large amounts of data and often outperforms shallower machine learning methods.2–8 Deep convolutional networks have been particularly effective for image classification due to their scalable capacity. In ocean bioacoustics, machine learning has become an effective signal detection and classification tool2,9–25 with research encompassing both supervised methods,9–13,18 in which reliable labeled data are present, and unsupervised methods16,17,19,26 when labeled data are limited or unavailable. Deep convolutional learning of marine mammal signals in the time-frequency domain has shown promising results for supervised detection and classification.15,20–25 For other ocean bioacoustic signals, particularly marine fishes, just a few studies have considered deep learning classification.26–28 This study proposes a variant of deep convolutional learning dubbed deep clustering as a method for time-frequency representation learning and classification of unlabeled, automatically detected signals in a cacophonous coral reef environment, characterized by significant presence of spatially dense calling fish.

Coral reefs are amongst the most biodiverse ecosystems in the ocean but are under threat from global climate change, overfishing, and pollution.29–32 Passive acoustic studies allow non-invasive study of reef ecology over continuous time scales of days to years, and offer a complement to measurements that traditionally have been collected through direct observation by SCUBA divers or snorkelers, taking point measurements in time.33 Coral reef ambient biological sound, or soundscapes, are an emerging topic of interest in the coral reef scientific community. Reef sound has been linked to relative ecosystem health, abundance of both reef building coral and fleshy macroalgae, and fish density.34,35 Healthy reef sounds have also been shown to enhance larval recruitment,36 suggesting that reef soundscapes are not just a by-product of biological activity but an integral part of complex ecosystem function. The biological components of reef soundscapes comprise of inadvertent sounds from organism activities such as feeding, locomotion,37 and photosynthesis,38 and acoustic communication such as the fish calls discussed here. Acoustic classification of individual fish calls informs our understanding of spatial and temporal movement, species assemblage, and response to human activities.

Acoustic classification of tropical marine fishes, such as damselfish (family Pomacentridae), has been improved through passive acoustic field experiments39–41 but lacks established terminology across studies.42 To reduce the labeling burden, a few studies have considered automatic classification of fish calls by utilizing machine learning tools.18,26 Unsupervised methods such as Gaussian mixture models (GMM) showed improved detection of fish chorusing events compared to the conventional spectral energy detector.26 Deep neural networks, including convolutional networks (CNN), recurrent networks (RNN),27 and sparse autoencoders (SAE)43 have achieved higher classification accuracy of grouper calls than a sparse feature classification method.28 

In this study, deep clustering is applied to automatically detected, unlabeled signals, combining the benefits of deep learning with the flexibility of clustering. Due to the complexity of the dataset, clustering is restricted to two distinct signals: low-frequency fish pulse trains (fish) and segments of whale song units (whale). Standard GMM-based clustering and other conventional clustering of handpicked features are conducted for comparison.

  1. Spectral and time domain features were manually chosen, or handpicked, based on their observed relation to coral reef fish calling and on studies of fish calling spectral and temporal properties.39–41 Clustering of events using handpicked features provides physical intuition about the event signals but is difficult due to the varying feature properties. Thus, the feature extraction and classification steps are combined into one algorithm using deep learning.

  2. Fixed-length spectrograms were used in deep embedded clustering (DEC),44 a deep-learning image compression algorithm that produces an accurate image reconstruction. The DEC latent features were directly clustered with GMM, then the DEC was trained further using a joint clustering loss to encourage deep cluster formation among the latent features.45–47 

Simulated signals were used to compare the limitations of the two approaches before applying them to data recorded on a Hawaiian coral reef between February and March 2020.48 

In Sec. II, the clustering methods are overviewed and metrics are presented for measuring classification success. Section III outlines the simulated signals used for method comparison. Experimental data collection and automatic signal detection on a Hawaiian coral reef in February–March 2020 are discussed in Sec. IV along with discussion of handpicked feature extraction. A comparison of clustering results on the simulated datasets are presented in Sec. V A, with an analysis of the effect of the DEC latent dimension. Finally, Sec. V B presents DEC, GMM, and conventional clustering of two signal types for the experimentally detected signals. Section VI summarizes the approach and discusses challenges associated with the methods and dataset.

Clustering methods are unsupervised frameworks for categorizing data according to their similarities.49 This section describes DEC for learning the latent feature vectors and forming clusters directly from high-dimensional data. For comparison, the GMM, K-means, and hierarchical agglomerative clustering algorithms, common methods that rely on approximations of the input feature vector properties, are also presented.

DEC is a modified convolutional autoencoder, a deep feature learning method,50 that uses a joint loss function to transform a set of N input data, X={X1,,XN},Xnd1×d2, into a set of latent feature vectors, Z={z1,,zN},znP. The output is a reconstruction of the input data, X̂nd1×d2, estimated using the latent feature vectors. The DEC structure consists of two stacked networks (Fig. 1): the encoder network, which maps input data into a lower-dimensional latent space, and the decoder network, which reconstructs an approximation of the input from the latent space. For a given latent feature dimension, P, DEC learns representative latent features using the mean squared reconstruction error of the inputs,

(1)

Inspired by recent applications to spectrograms of unlabeled seismic events,46,47 the compressive architecture downsamples the input using convolutions of stride length 2. The rectified linear unit (ReLU) was used to transform the outputs at each layer. The neural network structure46 was adapted by reducing the convolutional filter size and adding convolutional layers to downsample the input.

FIG. 1.

(Color online) Architecture of the deep embedded encoding (DEC) convolutional model. Convolutional filters were sized 3 × 3 and used the ReLU activation function.

FIG. 1.

(Color online) Architecture of the deep embedded encoding (DEC) convolutional model. Convolutional filters were sized 3 × 3 and used the ReLU activation function.

Close modal

The DEC weights were pretrained for 1000 epochs using the MSE of fixed-length spectrogram images containing Nf frequencies and Nt time samples, XnNf×Nt, and their reconstructions, X̂nNf×Nt. The Adam optimizer51 was used with learning rate of 103. The spectrogram length Nt corresponds to a signal duration of 0.5 s = NtdT and was motivated by the signals of interest in the coral reef soundscape. Events longer than 0.5 s were clipped to length to ensure a fixed size.

Each latent feature vector zn is assumed to belong to one of K clusters, Ck, with means μkP,k=1,,K. The latent features zn were clustered with GMM to initialize the deep clustering, with means μk,k=1,,K. These clustering results are reported as “GMM, latent features” for the simulated and experimental data.

Last, the DEC latent feature vectors were updated by incorporating the Kullback-Leibler (KL) divergence into the training loss. DEC was trained for an additional 20 epochs using the joint clustering/reconstruction loss function44,45

(2)
(3)
(4)
(5)

where Eq. (4) is the empirically estimated Student's t-distribution of the latent feature vectors around each cluster mean and Eq. (5) further penalizes points that are distant from cluster centers.45 The objective of additional training is to learn latent feature vectors that are both disjoint in feature space and representative of the inputs. The weights for each loss term in Eq. (2) are from Ref. 45. Details of the model are given in Fig. 1 and Table I. The DEC was written in keras52 using tensorflow.53 

TABLE I.

Network architecture used in DEC. The input and output shapes are given as [height, width, depth]. The kernel size is the shape of the two dimensional convolutional filters in [height, width].

Layer NameLayer TypeInput shapeFiltersKernel sizeStrideActivationOutput shapeParameters
Conv1 2D Convolution [90,20,1] [3,3] [2,2] ReLU [45,10,8] 80 
 2D Conv. [45,10,8] [2,1] [1,1] ReLU [44,10,8] 136 
Conv2 2D Conv. [44,10,8] 16 [3,3] [2,2] ReLU [22,5,16] 1168 
 2D Conv. [22,5,16] 16 [1,2] [1,1] ReLU [22,4,16] 528 
Conv3 2D Conv. [22,4,16] 32 [2,1] [2,1] ReLU [11,4,32] 1056 
 2D Conv. [11,4,32] 64 [2,1] [1,1] ReLU [10,4,64] 4160 
Conv4 2D Conv. [10,4,64] 64 [2,1] [2,1] ReLU [5,4,64] 8256 
Flatten Flatten [5,4,64] − − − − [1280] 
Encoded Fully Connected [1280] − − − ReLU [10] 6405 
Dense Fully Connected [15] − − − ReLU [1280] 7680 
Reshape  [1280] − − − − [5,4,64] 
TConv4 Transposed − − − − − − − 
 Convolution [5,4,64] 32 [2,1] [2,1] ReLu [10,4,32] 4128 
 T. Conv. [10,4,32] 32 [2,1] [1,1] ReLu [11,4,32] 2080 
TConv3 T. Conv. [11,4,32] 16 [2,1] [2,1] ReLu [22,4,16] 1040 
 T. Conv. [22,4,16] 16 [1,2] [1,1] ReLu [22,5,16] 528 
TConv2 T. Conv. [22,5,16] [3,3] [2,2] ReLu [44,10,8] 1160 
 T. Conv. [44,10,8] [2,1] [1,1] ReLu [45,10,8] 136 
TConv1 T. Conv. [45,10,8] [3,3] [2,2] Linear [90,20,1] 73 
Layer NameLayer TypeInput shapeFiltersKernel sizeStrideActivationOutput shapeParameters
Conv1 2D Convolution [90,20,1] [3,3] [2,2] ReLU [45,10,8] 80 
 2D Conv. [45,10,8] [2,1] [1,1] ReLU [44,10,8] 136 
Conv2 2D Conv. [44,10,8] 16 [3,3] [2,2] ReLU [22,5,16] 1168 
 2D Conv. [22,5,16] 16 [1,2] [1,1] ReLU [22,4,16] 528 
Conv3 2D Conv. [22,4,16] 32 [2,1] [2,1] ReLU [11,4,32] 1056 
 2D Conv. [11,4,32] 64 [2,1] [1,1] ReLU [10,4,64] 4160 
Conv4 2D Conv. [10,4,64] 64 [2,1] [2,1] ReLU [5,4,64] 8256 
Flatten Flatten [5,4,64] − − − − [1280] 
Encoded Fully Connected [1280] − − − ReLU [10] 6405 
Dense Fully Connected [15] − − − ReLU [1280] 7680 
Reshape  [1280] − − − − [5,4,64] 
TConv4 Transposed − − − − − − − 
 Convolution [5,4,64] 32 [2,1] [2,1] ReLu [10,4,32] 4128 
 T. Conv. [10,4,32] 32 [2,1] [1,1] ReLu [11,4,32] 2080 
TConv3 T. Conv. [11,4,32] 16 [2,1] [2,1] ReLu [22,4,16] 1040 
 T. Conv. [22,4,16] 16 [1,2] [1,1] ReLu [22,5,16] 528 
TConv2 T. Conv. [22,5,16] [3,3] [2,2] ReLu [44,10,8] 1160 
 T. Conv. [44,10,8] [2,1] [1,1] ReLu [45,10,8] 136 
TConv1 T. Conv. [45,10,8] [3,3] [2,2] Linear [90,20,1] 73 

GMM aim to partition a set of N feature vectors, X={x1,,xN} into K clusters, Ck, k=1,,K, assuming the kth cluster is Gaussian with prior probability πk, mean μk, and covariance Σk.54Ck is a nonempty set. If the number of clusters, K, is assumed a priori, then the Gaussian parameters θk=(πk,μk,Σk) can be iteratively estimated.

GMM employs the expectation-maximization (EM) algorithm to estimate θk from X via the complete data likelihood.54 EM is an alternating algorithm that updates the weighted posterior probability (responsibility, rnk) and the prior, mean, and covariance by54 

(6)
(7)
(8)

for xnX,n=1,,N. θk={πk,μk,Σk} represents the estimated parameters of the kth cluster at the tth iteration. In GMM, the data likelihood p is Gaussian,

(9)

where represents proportionality. EM is iterated until πk, μk, and Σk converge for all k. The estimated covariance matrix can become ill-conditioned if there are fewer than K true clusters.

The cluster priors πk, means μk, and covariance Σk,k=1,,K were initialized for 100 starts using the K-means++ algorithm (see  Appendix A).55 

K-means56 is an approximation to the EM algorithm that partitions a set of data-derived feature vectors X={x1,,xN} into K clusters, Ck, k=1,,K, where Ck is a nonempty set. The number of clusters K must be assumed a priori. The K-means algorithm is derived assuming the feature vectors belonging to the kth cluster are drawn from a Gaussian distribution with covariance Σ=σ2I and prior probability πk=1/K, where I is the identity matrix,57 

(10)

for n=1,,N. |Ck| denotes the cardinality or size of the kth cluster.

The cluster means μj,j=1,,K were initialized for 100 starts using the K-means++ algorithm ( Appendix A).55 

K-means is commonly used due to its computational speed and interpretability, but its validity is subject to the isotropic covariance assumption and the equality of the cluster priors. If the true number of classes differs from K, the clusters will be incorrectly estimated.

Agglomerative hierarchical clustering,49,54 also called bottom-up clustering, partitions a set of N data-derived feature vectors, X={x1,,xN}, into K clusters Ck, k=1,,K, by grouping the most similar data at each step. Hierarchical clustering successively merges nearby clusters until the stop criterion is achieved. In this case, the stop criterion is met when K or fewer clusters remain, where K must be set by the practitioner.

Agglomerative clustering does not require the number of clusters to be assumed in advance. As demonstrated with classification of dolphin echolocation clicks,16,17 bottom-up methods allow for detailed clusters to be sequentially merged as additional information or labels become available.

In agglomerative clustering, the kth cluster at the mth iteration is represented as Ckm,km=1,,Nm+1. The number of clusters is decreased to N – m by merging the two closest clusters that satisfy

(11)

d(Ci,Ci) is a metric measuring the distance between all points in sets i and i. This agglomerative process is repeated NK+1 times, until there are at most K clusters Ck remaining, with k=1,,K.

The distance metric d chosen here is from Ward's method.58 Ward's method measures the incremental sum-of-squares resulting from merging two sub-clusters.59 For clusters Cj and Ck with cluster means μj and μk that were merged for form Ck with mean μk, the Ward's method distance is

(12)

|Cj| and |Ck| are the cardinality of cluster Cj and Ck. Ward's method has been empirically successful,60 but has been shown to perform worse in cases with unequal-sized clusters.61 Single-linkage, centroid, and complete-linkage methods were also considered, but had low accuracy.

The performance of the handpicked feature clustering and DEC was measured by

(13)
(14)
(15)

where ti is the true label and t̂i is the categorical prediction. The indicator function I(x)=1 if x is true and I(x)=0 otherwise. Here, whale (label 0) was the true category. Fish were indicated by label 1. The label 2 was used in a scenario to indicate overlapping fish/whale signals. Precision (positive predictive value or PPV) measured the ratio of correctly predicted whale to total predicted whale. Recall (hit rate or true positive rate) measured the ratio of correctly classified whale to the true total. Higher metrics correspond to improved performance, with perfect performance given by accuracy, precision, and recall all equal to 1.

A set of simulated coral reef bioacoustic signals, motivated by observed signals from a coral reef soundscape near Hawaii, was simulated to compare conventional clustering of handpicked features to DEC.

One hundred simulation trials were conducted, with 10 000 signals simulated for each trial. The signals were simulated as timeseries sampled at 1 kHz and mimicked whale and fish signals (Fig. 2). A spectrogram image of each signal was generated using a 256–pt FFT with 90% overlap. The fixed-length spectrograms were created by clipping all signals to 0.5 s.

FIG. 2.

(Color online) Simulated (a), (b) and measured (c), (d) coral reef bioacoustic signals before clipping. Whales were simulated with an FM upsweep (a), (c), and fish were simulated with superimposed Gaussian pulses (b), (d). White noise was added to both signals at 25 dB SNR relative to the mean signal power.

FIG. 2.

(Color online) Simulated (a), (b) and measured (c), (d) coral reef bioacoustic signals before clipping. Whales were simulated with an FM upsweep (a), (c), and fish were simulated with superimposed Gaussian pulses (b), (d). White noise was added to both signals at 25 dB SNR relative to the mean signal power.

Close modal

The DEC latent feature dimension was optimized over a range of values (Fig. 7). The optimal DEC performance was compared to clustering with automatically extracted handpicked features. Then, a third category containing an overlapping fish and whale signals was included, with 33% of the total samples belonging to each category.

Quadratic FM sweeps mimic parts of humpack whale song units (whale). The equation for the instantaneous frequency of a quadratic sweep is62 

(16)

where t is the continuous time variable, f0 is the initial sweep frequency, Δf is the total bandwidth, and T is the duration of the signal. The signal is an FM upsweep when β>0, an FM downsweep when β<0 and a tonal when β = 0. The phase Φ(t) of the time domain signal is found by integrating the instantaneous frequency62 

(17)
(18)

where t0 is the signal start time. For simulation, t is discretized at 0.001 s (sampling frequency of 1 kHz).

Timeseries of impulses, or pulse trains, simulated fish pulse calling (fish). The pulses were a set of Np superimposed Gaussian-modulated sinusoids63 spaced Δt apart

(19)

where τ is the half-power pulse width and fc is the center frequency.

The signal parameters were varied randomly for each sample (Table II). Pulse width and center frequency were fixed to achieve a representative pulse structure. The number of pulses and spacing were drawn from experimentally estimated distributions. The duration, initial frequency, and total bandwidth of the FM sweep were drawn at uniform random from a range of realistically observed values. All signals were centered within the 0.5 s spectrogram and assigned a random delay of within ±0.1 s.

TABLE II.

Signal parameters were drawn from random distributions for each simulated event.

FM SweepPulse train
Duration/width (s) TU(0.2,0.4) τ=0.005 
Delay (s) t0U(0.1,0.1) t0U(0.1,0.1) 
Frequency (Hz) f0U(100,400) fc = 200 
 ΔfU(150,150)  
Peak spacing (s)  Δt0.47 beta(4, 23) 
Number of peaks  Np13beta(3.5,8) 
SNR (dB) SNRU(15,30) SNRU(0,30) 
FM SweepPulse train
Duration/width (s) TU(0.2,0.4) τ=0.005 
Delay (s) t0U(0.1,0.1) t0U(0.1,0.1) 
Frequency (Hz) f0U(100,400) fc = 200 
 ΔfU(150,150)  
Peak spacing (s)  Δt0.47 beta(4, 23) 
Number of peaks  Np13beta(3.5,8) 
SNR (dB) SNRU(15,30) SNRU(0,30) 

White noise was added to the simulated signals using a fixed signal-to-noise ratio (SNR),

(20)
(21)

where σs2 is the signal power and σn2 is the noise power. The SNR range of each signal was determined from the experimental spectrograms during manual labeling. The SNR in dB was estimated as the difference of the peak signal power to the median power of the background.

The signal power was estimated as the bandwidth-normalized mean power over the signal duration,64 

(22)
(23)

Three handpicked features were extracted: peak frequency (25), kurtosis (26), and number of timeseries peaks (27). Duration, median power, and cross-sensor coherence were excluded due to limitations of the fixed simulation parameters. Then, GMM, K-means, and hierarchical clustering were applied to the feature matrix to discover K = 2 clusters (fish or whale) or K = 3 (fish, whale, or both).

Three directional autonomous seafloor acoustic recorders (DASARs) were deployed adjacent to a coral reef westward of the island of Hawaii. The DASARs, labeled N, W, and S for the north-most, middle, and south-most sensors, measured pressure and horizontal particle velocity with x– and y–components oriented at orthogonal compass directions. The array was roughly oriented N-S with inter-sensor spacing about 15 m (Fig. 3).

FIG. 3.

(Color online) Diagram of the DASAR array deployed adjacent to a coral reef on the island of Hawaii. The estimated detection locations are shown as gray dots. The majority of the reef is located due east of the array. The sensor positions were measured on the seafloor relative to DASAR W.

FIG. 3.

(Color online) Diagram of the DASAR array deployed adjacent to a coral reef on the island of Hawaii. The estimated detection locations are shown as gray dots. The majority of the reef is located due east of the array. The sensor positions were measured on the seafloor relative to DASAR W.

Close modal

The DASARs recorded continuously for 7 days with a sampling rate of 1 kHz. On February 25, 2020, the dominant soundscape contributors below 500 Hz were reef fish, humpback whales, and motor noise from transiting surface boats, with boat noise occurring predominantly during daylight and fish calling most pronounced during dusk. The data were processed in 5 min chunks to account for DASAR clock drift. For each chunk, the pressure spectrograms on the three sensors were cross-correlated and aligned to within a time bin.65 The spectral density power spectrogram, denoted as matrix XNf×Nt, was generated using a 256–point FFT with 90% overlap and Hanning window and normalized to have units of logμPa Hz−1, with dt=0.026 s and df=3.91 Hz.

A directional event detector65 was developed to utilize the DASARs' directional capability by combining two DASARs ( Appendix B). For this study, the detector parameters were Δθ=90°, T = 120 Hz, Msep=1(Msepdt=0.0256 s), and Tmax=2 s. Detected events were localized66 to ensure that their signal had sufficient bandwidth for feature extraction. Detections for which the localization algorithm failed to converge were discarded. The remaining events were spatially filtered within a 100 m by 100 m box from DASAR S (Fig. 3). A total of 92 736 potentially localizable detections within 100 m were kept for further analysis, on average about 1 detection per second.

The handpicked features (Table III, Fig. 4) are time-frequency properties known to relate to fish call type including power, duration, and peak frequency39–41 as well as timeseries estimates of impulsive noise.18,67

TABLE III.

Event features estimated for automatically detected signals.

Feature Name (units)Description
Kurtosis Fourth moment normalized by the 
 squared variance 
Npeaks (count) Number of peaks with at least 5 
 dB prominence re the standard 
 deviation 
Peak frequency (Hz) Frequency of the peak power 
 spectral density 
Features for experimental data only  
Duration (s) Length of detected event 
Coherence Normalized time coherence 
 between DASARs 
Median PSD (dB) Median power spectral density (PSD) 
 across event mask 
Feature Name (units)Description
Kurtosis Fourth moment normalized by the 
 squared variance 
Npeaks (count) Number of peaks with at least 5 
 dB prominence re the standard 
 deviation 
Peak frequency (Hz) Frequency of the peak power 
 spectral density 
Features for experimental data only  
Duration (s) Length of detected event 
Coherence Normalized time coherence 
 between DASARs 
Median PSD (dB) Median power spectral density (PSD) 
 across event mask 
FIG. 4.

(Color online) Handpicked features of a fish call event on February 25, 07:12 HST, measured on directional autonomous seafloor acoustic recorders (DASARs) N, W, and S at 1 kHz. The spectrograms were generated using a 256–point FFT with 90% overlap. A call duration of 0.52 s was determined during the detection process. (a)–(c) The spectrograms were used to extract peak frequencies (black star) from 219 to 332 Hz and median PSD from 72.8 to 76.3 dB. (d)–(f) The timeseries envelope was used to extract the kurtosis values of 18–21, cross-sensor coherences of 0.72–0.74, and 8 temporal peaks. DASAR S was the closest to the call.

FIG. 4.

(Color online) Handpicked features of a fish call event on February 25, 07:12 HST, measured on directional autonomous seafloor acoustic recorders (DASARs) N, W, and S at 1 kHz. The spectrograms were generated using a 256–point FFT with 90% overlap. A call duration of 0.52 s was determined during the detection process. (a)–(c) The spectrograms were used to extract peak frequencies (black star) from 219 to 332 Hz and median PSD from 72.8 to 76.3 dB. (d)–(f) The timeseries envelope was used to extract the kurtosis values of 18–21, cross-sensor coherences of 0.72–0.74, and 8 temporal peaks. DASAR S was the closest to the call.

Close modal

If tn,1 and tn,2 represent the absolute start and end of the nth detected signal in seconds, the event duration is

(24)

The median power and peak frequency of the nth detection were extracted from the spectrogram, [Fig. 4(a)–4(c)], with

The spectral features extracted at DASAR S, closest to the majority of detected signals (Fig. 3), were used for clustering.

The pressure timeseries of the nth detected signal ynM was extracted between tn,1 and tn,2 (Fig. 4) and contained M time samples. For the experimental data, the vector sensor x– and y–velocity during the nth detected signal, vn,xM and vn,yM, were used to create a beamformed pressure timeseries, ybM, for improved detection SNR,

(25)

where Z0=ρc is the impedance in water with density ρ (kg m3) and sound speed c (m s−1).48 θ̂ is the signal azimuth estimated during the directional detector processing.65 For the simulated data, vn,x=vn,y=0.

Three metrics were chosen for timeseries extraction: kurtosis, number of peaks, and cross-sensor coherence. The kurtosis is a ratio of moments and has recently been applied to the task of differentiating impulsive from non-impulsive sounds,18,67

(26)

σ2 is the variance and the overbar represents the arithmetic mean.

The number of peaks and cross-sensor coherence were extracted from the Hilbert transform H() of the beamformed pressure timeseries (Fig. 4)

Here, the number of peaks is the number of local maxima with at least 5 dB prominence relative to the standard deviation

(27)

where (a, b) is an interval in Yb,maxminjYb[j] indicates the highest trough in (a, b), and σ is the standard deviation of Yb.

Last, for the experimental detections, the normalized correlation coefficient of the timeseries envelope across DASARs was computed to measure the spatial coherence of the signal propagation,

(28)

where YbN and YbS are the Hilbert transform of the beamformed timeseries at the North and South DASARs. The measure of coherence assumes that a good quality biological signal will be received as similar timeseries on two spatially separated sensors.

Figure 5 shows the number of events detected along with extracted feature median, 10%, and 90% levels for every 15 min. The number of peaks in the timeseries has been shown to be an indicator of fish species and call context for Hawaiian reef fish.40,41 In this study, fish were most common during nighttime [Fig. 5(g)], with pulse trains peaking during the evening chorus after nautical twilight (19:15 HST). The evening chorus corresponded to a visible increase in median power, event duration, and number of time peaks and a visible decrease in the 90th percentiles of peak frequency and kurtosis.

FIG. 5.

(Color online) (a) Number of detected events per 15 min on February 25 and (b)–(g) 10%, 50%, and 90% levels per 15 min for each feature: (b) peak frequency, (c) median time-frequency power, (d) event duration, (e) normalized time coherence between sensors, (f) kurtosis, and (g) number of time peaks. All features were measured on the DASAR S. The coherence is the normalized correlation lag coefficient between DASARs N and S.

FIG. 5.

(Color online) (a) Number of detected events per 15 min on February 25 and (b)–(g) 10%, 50%, and 90% levels per 15 min for each feature: (b) peak frequency, (c) median time-frequency power, (d) event duration, (e) normalized time coherence between sensors, (f) kurtosis, and (g) number of time peaks. All features were measured on the DASAR S. The coherence is the normalized correlation lag coefficient between DASARs N and S.

Close modal

The handpicked features were examined for separability for K = 2 equal-sized clusters (fish, whale) and K = 3 clusters (fish, whale, both). Figure 6 shows the feature values colored by true signal type for one simulation trial. The whale and fish signals overlapped in their peak frequency and in the number of automatically extracted temporal peaks. The signals were most strongly separated in kurtosis, with whale signals having very low kurtosis.

FIG. 6.

(Color online) Handpicked features demonstrate some separability by kurtosis, number of timeseries peaks, and peak frequency but with overlapping clusters for (a)–(c) two clusters (whale, fish) and (d)–(f) (bottom) three clusters (whale, fish, both). The dots are scaled to indicate density of feature pairs, with each dot increased by 1 pt for every 100 samples.

FIG. 6.

(Color online) Handpicked features demonstrate some separability by kurtosis, number of timeseries peaks, and peak frequency but with overlapping clusters for (a)–(c) two clusters (whale, fish) and (d)–(f) (bottom) three clusters (whale, fish, both). The dots are scaled to indicate density of feature pairs, with each dot increased by 1 pt for every 100 samples.

Close modal

For two clusters, all clustering methods assumed K = 2. When the cluster sizes were equal, all clustering methods achieved high accuracy (Table IV). For the unequal cluster scenario with 25% whale and 75% fish (2500/7500), GMM achieved much higher accuracy than either K-means or hierarchical agglomerative clustering on the handpicked features (Table IV). These simulation results are supported by a supplementary example of the clustering algorithm assumptions ( Appendix A), which demonstrates that GMM has fewer prior assumptions and is thus more flexible. The classification accuracy of all methods for three equal-sized clusters (Table IV), with whale, fish, and a cluster containing a temporally overlapping whale and fish, was lower than for two equal-sized clusters due to the signal overlap. GMM is capable of estimating overlapping clusters and achieved the best accuracy and recall on the handpicked features of the overlapping signal scenario.

TABLE IV.

Median of classification accuracy, recall, and precision across 100 simulations for clustering with hand-crafted features (K-means, GMM, hierarchical) and deep learning (DEC). Whale was defined as the positive category.

MethodAccuracyRecallPrecisionAccuracyRecallPrecisionAccuracyRecallPrecision
 K = 2 simulated clusters K = 3 simulated clusters K = 2 (unequal) 
 (DEC P = 10) (DEC P = 15) simulated clusters 
       (DEC P = 10) 
K-means 0.96 1.0 0.93 0.56 0.65 0.99 0.68 1.0 0.44 
GMM 0.98 1.0 0.96 0.86 1.0 0.94 0.97 1.0 0.88 
Hierarchical 0.98 0.99 0.98 0.60 0.58 0.99 0.66 1.0 0.42 
GMM, latent features 0.95 0.96 0.99 0.69 0.90 1.0 0.96 0.89 1.0 
DEC 0.98 0.99 0.99 0.75 0.92 0.99 0.77 1.0 0.52 
MethodAccuracyRecallPrecisionAccuracyRecallPrecisionAccuracyRecallPrecision
 K = 2 simulated clusters K = 3 simulated clusters K = 2 (unequal) 
 (DEC P = 10) (DEC P = 15) simulated clusters 
       (DEC P = 10) 
K-means 0.96 1.0 0.93 0.56 0.65 0.99 0.68 1.0 0.44 
GMM 0.98 1.0 0.96 0.86 1.0 0.94 0.97 1.0 0.88 
Hierarchical 0.98 0.99 0.98 0.60 0.58 0.99 0.66 1.0 0.42 
GMM, latent features 0.95 0.96 0.99 0.69 0.90 1.0 0.96 0.89 1.0 
DEC 0.98 0.99 0.99 0.75 0.92 0.99 0.77 1.0 0.52 

The effect of the DEC latent feature dimension was examined across 100 random simulation trials (Fig. 7). The DEC model, motivated by a successful architecture for classifying seismic spectrograms,46 was retrained and tested on all trials at varying latent feature vector dimensionality (P) for the K = 2 equal-sized clusters and K = 3 cluster scenarios (Fig. 7). When the latent dimension was low, e.g., P < 8, all features were likely to be clustered in a single cluster. Above P > 8, DEC accuracy was consistent. Values of P = 10 and P = 15 were chosen for the K = 2 and K = 3 cluster scenarios.

FIG. 7.

(Color online) Accuracy of DEC over 100 trials for each latent dimension (P). The whiskers extend to the 10th (lower) and 90th (upper) percentiles. The box marks the 25th to 75th percentile. The method fails at low P but is otherwise stable.

FIG. 7.

(Color online) Accuracy of DEC over 100 trials for each latent dimension (P). The whiskers extend to the 10th (lower) and 90th (upper) percentiles. The box marks the 25th to 75th percentile. The method fails at low P but is otherwise stable.

Close modal

Figure 8 shows t-SNE visualizations from a representative trial of the latent features clustered with GMM [Figs. 8(b), 8(e), and 8(h)], and the DEC latent features and cluster predictions after 20 additional training epochs using (2) [Figs. 8(c), 8(f), and 8(i)]. Additional training with (2) increased the accuracy of DEC clustering prediction from 96% to 99.2% in the case of 2 equal sized clusters and from 70.8% to 78.2% in the case of 3 clusters. In the scenario containing 2 unequal-sized clusters, the additional DEC training reduced the accuracy from 97.6% to 78.3%. In that case, high recall and low precision (Table IV) indicate that DEC overpredicted the smaller cluster. GMM clustering of the latent features learned from spectrograms achieved comparable accuracy to GMM clustering of handpicked features in the non-overlapping scenarios with K = 2 (Table IV). For the K = 3 scenario, the highest classification confusion for the spectrogram-based deep learning was between fish and the overlapping signal cluster. Figure 9 shows that fish signals with large bandwidth and duration may dominate the spectral signature, which may lead to increased misclassification.

FIG. 8.

(Color online) t-SNE representation of 10 000 deep embedded feature vectors from a sample trial (a), (b), (d), (e), (g), (h) before, and (c), (f), (i) after deep clustering, with (a)–(c) equal-sized clusters, (d)–(f) unequal-sized clusters of fish and whale, and (g)–(i) fish, whale, and fish/whale clusters. The perplexity for visualization was 200.

FIG. 8.

(Color online) t-SNE representation of 10 000 deep embedded feature vectors from a sample trial (a), (b), (d), (e), (g), (h) before, and (c), (f), (i) after deep clustering, with (a)–(c) equal-sized clusters, (d)–(f) unequal-sized clusters of fish and whale, and (g)–(i) fish, whale, and fish/whale clusters. The perplexity for visualization was 200.

Close modal
FIG. 9.

(Color online) Simulated coral reef bioacoustic signals successfully classified using DEC. From top to bottom rows: input spectrograms, simulated timeseries, initial DEC reconstruction, and DEC reconstruction after adding clustering loss.

FIG. 9.

(Color online) Simulated coral reef bioacoustic signals successfully classified using DEC. From top to bottom rows: input spectrograms, simulated timeseries, initial DEC reconstruction, and DEC reconstruction after adding clustering loss.

Close modal

The clustering methods were applied to a subset of 4000 unlabeled detections randomly selected from the Hawaii 2020 experiment. Each detection contained one or more directional signals. All clustering methods assumed K = 2. Then, for post-clustering analysis, labels of whale/no whale (only fish) were manually assigned to all 4000 samples, based on the signal within the detection window. About two-thirds were labeled as no whale and contained only fish.

Figure 10 shows the clustering results using t-SNE representations of the learned latent feature vectors [Figs. 10(a)–10(c)] and the DEC latent features after additional training [Fig. 10(d)]. Clustering of the P = 6–dimensional handpicked feature vectors did not correspond well to the manual labels, with K-means achieving the highest accuracy of 67%. GMM clustering of the latent features learned from the signal spectrograms achieved an accuracy of 78% compared to the manual labels, which was the highest accuracy of all methods examined. GMM of latent features had high precision and low recall (Table V).

FIG. 10.

(Color online) Features of experimentally detected signals from a Hawaiian coral reef shown as a t-SNE representation of (a)–(c) learned latent features and (d) modified DEC features of 4000 manually labeled detections. All clustering methods assumed K = 2.

FIG. 10.

(Color online) Features of experimentally detected signals from a Hawaiian coral reef shown as a t-SNE representation of (a)–(c) learned latent features and (d) modified DEC features of 4000 manually labeled detections. All clustering methods assumed K = 2.

Close modal
TABLE V.

Classification accuracy determined from manually labeled experimental detections for clustering with handpicked features (K-means, GMM, hierarchical) and deep learning (DEC).

MethodAccuracyRecallPrecision
K-means (K = 2) 0.66 0.42 0.47 
GMM (K = 2) 0.55 0.84 0.40 
Hierarchical (K = 2) 0.63 0.23 0.38 
GMM (K = 2), latent features 0.78 0.33 0.90 
DEC (P = 15, K = 2) 0.66 0.75 0.48 
MethodAccuracyRecallPrecision
K-means (K = 2) 0.66 0.42 0.47 
GMM (K = 2) 0.55 0.84 0.40 
Hierarchical (K = 2) 0.63 0.23 0.38 
GMM (K = 2), latent features 0.78 0.33 0.90 
DEC (P = 15, K = 2) 0.66 0.75 0.48 

In the experimental signals, additional training of the DEC latent features reduced the classification accuracy [Fig. 10(d)]. Similar to the simulated unequal-sized cluster scenario, DEC obtained higher recall with lower precision (Table V).

Spectrograms correctly classified by DEC are shown in Fig. 11, with the reconstructed spectrograms shown before and after additional DEC training using (2). These demonstrate that whale signals were primarily identified by their narrow bandwidth and temporal extent, whereas fish signals were identified as broadband. The timeseries in Fig. 11 demonstrate the magnitude variation between events that was not attributed to signal type, motivating the normalization of the spectrograms.

FIG. 11.

(Color online) Experimentally detected coral reef bioacoustic signals successfully classified using DEC. From top to bottom rows: input spectrograms, recorded timeseries, initial DEC reconstruction, and DEC reconstruction after adding clustering loss.

FIG. 11.

(Color online) Experimentally detected coral reef bioacoustic signals successfully classified using DEC. From top to bottom rows: input spectrograms, recorded timeseries, initial DEC reconstruction, and DEC reconstruction after adding clustering loss.

Close modal

A deep clustering approach was presented for interpreting unlabeled coral reef bioacoustic detections. This approach leverages deep learning and clustering, motivated by recent improvements in classification accuracy of fish calling using unsupervised detection and deep neural network classifiers.18,26,27,43

Clustering of simulated fish (Gaussian pulses) and whale (FM sweeps) demonstrated that unique signal categories could be clustered with GMM of deep latent features learned from spectrograms, DEC clustering,45,46 or with clustering of handpicked features motivated by studies of coral reef fish.39–41 GMM applied to either deep latent features or handpicked features achieved the highest accuracy when there were two clusters of unequal size, and GMM of handpicked features had the highest accuracy when the clusters contained overlapping signals. Deep clustering learned features directly from spectrograms and was successful in separating fish and whale clusters. DEC trained with a joint clustering loss function increased the classification accuracy in simulations compared to clustering the deep latent features with GMM, except when the cluster sizes were significantly different. For unequal-sized simulated clusters, applying GMM to the learned latent features demonstrated higher classification accuracy.

Broadband bioacoustic events detected on a Hawaiian coral reef in February–March 2020 using a directional detector were analyzed with deep clustering and clustering of handpicked features. A labeled subset of these detections with whale/no whale indicated that about two-thirds of the detections contained primarily fish and one-third contained a whale signal. GMM with K = 2 applied to latent features learned from the signal spectrograms achieved the highest accuracy and tended to overpredict the larger cluster containing fish. After further training DEC with a joint clustering loss, the DEC recall increased but DEC classification accuracy was reduced, indicating that whale was overpredicted. DEC reconstructions of the input spectrograms demonstrate that the learned features are representative of spectral features identified by manual labelers. Clustering of handpicked features with K = 2 on the experimental, manually labeled data achieved lower accuracy, precision, and recall.

These results demonstrate that deep clustering is a promising method for classifying unlabeled bioacoustic signals with distinct spectral signatures. In simulated and experimental studies, both GMM applied to deep latent features and DEC trained with a joint clustering loss achieved competitive classification accuracy for a range of latent feature dimensions, without requiring the selection of handpicked features. Our results indicate that the accuracy of handpicked feature clustering depends strongly on the feature properties and the choice of clustering method. Finally, the choice of clustering algorithm is an important consideration for applications that are subject to signal clusters of unequal size.

Thank you to Greeneridge Sciences for providing the DASAR sensors and to Alex Conrad for assisting with DASAR post-processing. Thank you to Richard Walsh for assistance with hardware deployment and recovery. This work was supported by the Office of Naval research under Award No. N00014-18-1-2065.

For a continuous data variable, x, that is Gaussian with mean μk and covariance Σk, the posterior probability of the cluster is given by57 

(A1)
(A2)

where Ck and Cj represent the kth and jth classes and |Σk| is the determinant of Σk. The boundary between the two classes occurs when there is an equal probability of x belonging to either class,

(A3)

The general solution for x in Eq. (A3) is a multi-dimensional parabola, which simplifies to a linear boundary if the distributions have a shared covariance such that Σk=Σjj,k. The solution in Eq. (A3) is extensible to K > 2 classes by considering the joint distributions of all classes (see Bishop57 Chap. 4.2.1 for details).

1. Clustering simulations

Three 2D Gaussian distributions, each with N = 10 000 points, were used to simulate overlapping clusters (Fig. 12). The true cluster means were μ1=(0,2),μ2=(10,8), and μ3=(21,3). The covariance of the first dataset was

(A4)

with σx2=σy2=3. For the first dataset [Fig. 12(a)], there were no off diagonal covariance terms.

FIG. 12.

(Color online) Clustering on two data distributions using (a), (b) GMM, (c), (d) K-means, and (e), (f) agglomerative clustering with Ward's method, with maximum likelihood boundaries shown by black lines.

FIG. 12.

(Color online) Clustering on two data distributions using (a), (b) GMM, (c), (d) K-means, and (e), (f) agglomerative clustering with Ward's method, with maximum likelihood boundaries shown by black lines.

Close modal

The second dataset [Fig. 12(b)] was generated by rotating the data counterclockwise at θ, with

(A5)

with σx2=6,σy2 = 3. The three clusters were rotated by θ=120°,25°, and 0°. The non-isotropic covariance is a result of correlation between variables and indicates feature correlations. A non-isotropic covariance matrix of a feature vector indicates correlation between features.

GMM assumes that the clusters are Gaussian distributed around a cluster mean. The K-means algorithm further assumes the Gaussians have the same diagonal covariance and uniform prior probability. Thus, GMM and K-means perform best for the first dataset [Figs. 12(a), 12(c), 12(e)] with identical, diagonal covariance for all clusters. By incorporating intercluster distance through the Ward metric, hierarchical agglomerative identifies 3 similar clusters.

When the cluster covariances are not of the form σ2I, K-means and agglomerative clustering provide solutions that are suboptimal to the true class boundaries. GMM achieves better accuracy [Figs. 12(b), 12(d), and 12(f)] but does not converge to the maximum likelihood solution due to estimation errors in the covariances. All three methods provide insight into the cluster memberships despite invalid class assumptions, but GMM is the most versatile clustering method.

2. K-means++ algorithm

K-means++ is an algorithm proposed to efficiently initialize parameters for Gaussian clustering methods, including GMM and K-means. K-means++ improves the runtime of the clustering algorithms and the quality of the final solution.55 For a set of data samples X={x1,,xN}, the K-means++ algorithm follows:55 

  1. Initialize the priors πk=1/K,k=1,,K.

  2. Initialize the covariance matrices as Σ=σ2I, with σ2= var [X].

  3. Select the first cluster mean as a random data sample, μ1=xl,lU(1,N).

  4. Assign all data samples to the cluster with the nearest mean. We denote that the mth sample belongs to cluster Ci as xmCi, satisfying d(xm,μi)d(xm,μp),i,p=1,,j,jK. Note that when j = 1 (first step), xmC1m, i.e., all data samples belong to one cluster.

  5. Select the j + 1-th cluster mean at random from the remaining data samples with probability
    (A6)
    d is the Mahalanobis distance,
    (A7)

    Equation (A6) ensures that each mean is selected with a probability proportional to its distance from all existing means.

  6. Repeat steps 4 and 5 until K cluster means are chosen.

The initialized parameters define θ0=(πk,μk,Σk),k=1,,K for the first E-step of the GMM algorithm in (6). For K-means, the prior and covariance are assumed fixed, and the means are updated according to Eq. (10).

3. Visualization of high-dimensional data

For data with more than two dimensions, xnP for P > 2, clusters may be visualized by applying dimensionality reduction. In this study, 2D t-stochastic neighbor embedding (t-SNE)68 was used to visualize P–dimensional features.

The similarity of one point, xiP, to another point, xjP, is found from the conditional probability that the points are neighbors within a Gaussian density with mean xi,68 

(A8)

for N points i, j = 1,,N. The neighborhood of xi, as determined by σi, is defined implicitly in terms of the perplexity (Fig. 13),68,69

(A9)
(A10)

where Pi=Σjpj|i, and H is the Shannon entropy. The optimal value of σi in Eq. (A8) for each point is solved with a binary search for a given value of perplexity.69 

FIG. 13.

(Color online) Varying values of perplexity for t-SNE on N = 1000 randomly drawn points xn3,n=1,,N.

FIG. 13.

(Color online) Varying values of perplexity for t-SNE on N = 1000 randomly drawn points xn3,n=1,,N.

Close modal

Then, a set of projected data is randomly initialized with zero-mean Gaussians of low variance,68yiN(0,104I),yi2. The yi are iteratively updated until the Kullback-Leibler (KL) divergence between the conditional probability in Eq. (A8) and the Student's t-distribution of the projected data is minimized,

(A11)
(A12)

In contrast to classic SNE which uses only Gaussians, t-SNE's use of the Student t-distribution further penalizes outliers in the projected data,68 resulting in a more compact representation. As shown in Fig. 13, the value of perplexity should be varied according to user preference to obtain the desired visualization.

A directional event detector65 was developed to utilize the DASARs' directional capability by combining two DASARs. The complex spectrograms of the x– and y–particle velocity, matrices VxNf×Nt and VyNf×Nt, were generated identically to the spectrogram X with units m s−1.

The active intensity, a measure of in-plane energy, was used to determine the noise directionality

(B1)

where the asterisk is the complex conjugate and the real component. a tan2 is a piecewise function that computes the elementwise angle between the elements of two matrices, with domain (0°,359°) defined counterclockwise from the y axis (0° = North). A is therefore the time-frequency representation of compass directionality. In the following, matrix AN is called the azigram70 for DASAR N, likewise for AW and AS.

The event detector makes the following signal assumptions:

  1. An event arrives from a constant azimuthal sector for each DASAR.

  2. Target events are broadband below 500 Hz. The minimum required bandwidth was set with an empirical threshold (see Appendix in Thode et al.65).

The detection algorithm is demonstrated in Fig. 14. First, the azigrams for the north- and southmost DASARs, AN and AS, were used to create binary maps BN and BS of time-frequency points within a fixed azimuthal sector [Fig. 14(a)],

(B2)

and likewise for BS [Fig. 14(b)]. I is the elementwise indicator function, with I(true)=1. Binary maps were generated for all combinations of azimuthal sectors θN,θS([0,Δθ]T,[Δθ/2,3Δθ/2]T,,[(360°Δθ),360°]). Here, Δθ=90°. Next, overlapping signals on both DASARs were discovered by creating a combined map [Fig. 14(c)], which was summed across frequency to determine the detection timeseries,

(B3)
(B4)

d measures the bandwidth of an event. Event start and end times were determined for dj>T,j=1,,Nt for threshold T. Events separated by less than Msepdt were merged and events longer than Tmax were removed.

FIG. 14.

Directional detector for a fish call event on February 25, 07:12 HST, with DASAR N looking between 135° and 225° and DASAR S looking between 45° and 135° (clockwise from north). The overlap of the binary masks, summed across frequency, defines the detection timeseries.

FIG. 14.

Directional detector for a fish call event on February 25, 07:12 HST, with DASAR N looking between 135° and 225° and DASAR S looking between 45° and 135° (clockwise from north). The overlap of the binary masks, summed across frequency, defines the detection timeseries.

Close modal
1.
Y.
LeCun
,
Y.
Bengio
, and
G.
Hinton
, “
Deep Learning
,”
Nature
521
,
436
444
(
2015
).
2.
M. J.
Bianco
,
P.
Gerstoft
,
J.
Traer
,
E.
Ozanich
,
M. A.
Roch
,
S.
Gannot
, and
C.-A.
Deledalle
, “
Machine learning in acoustics: Theory and applications
,”
J. Acoust. Soc. Am.
146
,
3590
3628
(
2019
).
3.
H.
Niu
,
Z.
Gong
,
E.
Ozanich
,
P.
Gerstoft
,
H.
Wang
, and
Z.
Li
, “
Deep-learning source localization using multi-frequency magnitude-only data
,”
J. Acoust. Soc. Am.
146
,
211
222
(
2019
).
4.
Z.
Huang
,
J.
Xu
,
Z.
Gong
,
H.
Wang
, and
Y.
Yan
, “
Source localization using deep neural networks in a shallow water environment
,”
J. Acoust. Soc. Am.
143
,
2922
2932
(
2018
).
5.
W.
Wang
,
H.
Ni
,
L.
Su
,
T.
Hu
,
Q.
Ren
,
P.
Gerstoft
, and
L.
Ma
, “
Deep transfer learning for source ranging: Deep-sea experiment results
,”
J. Acoust. Soc. Am.
146
(
4
),
EL317
EL322
(
2019
).
6.
E.
Ozanich
,
P.
Gerstoft
, and
H.
Niu
, “
A feedforward neural network for direction-of-arrival estimation
,”
J. Acoust. Soc. Am.
147
,
2035
2048
(
2020
).
7.
C.
Frederick
,
S.
Villar
, and
Z.-H.
Michalopoulou
, “
Seabed classification using physics-based modeling and machine learning
,”
J. Acoust. Soc. Am.
148
,
859
872
(
2020
).
8.
H.
Cao
,
W.
Wang
,
L.
Su
,
H.
Ni
,
P.
Gerstoft
,
Q.
Ren
, and
L.
Ma
, “
Deep transfer learning for underwater direction of arrival using one vector sensor
,”
J. Acoust. Soc. Am.
149
,
1699
1711
(
2021
).
9.
E.
Smirnov
, “
North Atlantic right whale call detection with convolutional neural networks
,” in
International Conference on Machine Learning
, Citeseer (
2013
), pp.
78
79
.
10.
D. K.
Mellinger
and
C. W.
Clark
, “
Methods for automatic detection of mysticete sounds
,”
Marine Freshw. Behav. Phys.
29
,
163
181
(
1997
).
11.
W. W.
Steiner
, “
Species-specific differences in pure tonal whistle vocalizations of five western North Atlantic dolphin species
,”
Behav. Ecol. Sociobiol.
9
,
241
246
(
1981
).
12.
B.
McCowan
, “
A new quantitative technique for categorizing whistles using simulated signal and whistles from captive bottlenose dolphins (Delphinidae, Tursiops truncates)
,”
Ethology
100
,
177
193
(
2010
).
13.
V. B.
Deecke
and
V. M.
Janik
, “
Automated categorization of bioacoustic signals: Avoiding perceptual pitfalls
,”
J. Acoust. Soc. Am.
119
,
645
653
(
2006
).
14.
M. A.
Roch
,
H.
Klinck
,
S.
Baumann-Pickering
,
D. K.
Mellinger
,
S.
Qui
,
M. S.
Soldevilla
, and
J. A.
Hildebrand
, “
Classification of echolocation clicks from odontocetes in the Southern California Bight
,”
J. Acoust. Soc. Am.
129
,
467
475
(
2011
).
15.
X. C.
Halkias
,
S.
Paris
, and
H.
Glotin
, “
Classification of mysticete sounds using machine learning techniques
,”
J. Acous. Soc. Am.
134
,
3496
3505
(
2013
).
16.
K. E.
Frasier
,
E.
Henderson
,
H. R.
Bassett
, and
M. A.
Roch
, “
Automated identification and clustering of subunits within delphinid vocalizations
,”
Mar. Mammal Sci.
32
,
911
930
(
2016
).
17.
K. E.
Frasier
,
M. A.
Roch
,
M. S.
Soldevilla
,
S. M.
Wiggins
,
L. P.
Garrison
, and
J. A.
Hildebrand
, “
Automated classification of dolphin echolocation click types from the Gulf of Mexico
,”
PLoS Comput. Biol.
13
,
e1005823
(
2017
).
18.
M.
Malfante
,
J. I.
Mars
,
M. D.
Mura
, and
C.
Gervaise
, “
Automatic fish sounds classification
,”
J. Acoust. Soc. Am.
143
,
2834
2846
(
2018
).
19.
D.
Stowell
and
M. D.
Plumbley
, “
Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning
,”
PeerJ
2
,
e488
(
2014
).
20.
W.
Lee
and
V.
Staneva
, “
Compact representation of temporal processes in echosounder time series via matrix decomposition
,”
J. Acoust. Soc. Am.
148
,
3429
3442
(
2020
).
21.
P. C.
Bermant
,
M. M.
Bronstein
,
R. J.
Wood
,
S.
Gero
, and
D. F.
Gruber
, “
Deep machine learning techniques for the detection and classification of sperm whale bioacoustics
,”
Sci. Rep.
9
,
12588
(
2019
).
22.
Y.
Shiu
,
K. J.
Palmer
,
M.
Roch
,
E.
Fleishman
,
X.
Liu
,
E.-M.
Nosal
,
T.
Helble
,
D.
Cholewiak
,
D.
Gillespie
, and
H.
Klinck
, “
Deep neural networks for automated detection of marine mammal species
,”
Sci. Rep.
10
,
607
(
2020
).
23.
M.
Zhong
,
M.
Castellote
,
R.
Dodhia
,
J. L.
Ferres
,
M.
Keogh
, and
A.
Brewer
, “
Beluga whale acoustic signal classification using deep learning neural networks
,”
J. Acoust. Soc. Am.
147
,
1834
1841
(
2020
).
24.
O. S.
Kirsebom
,
F.
Frazao
,
Y.
Simard
,
N.
Roy
,
S.
Matwin
, and
S.
Giard
, “
Performance of a deep neural network at detecting North Atlantic right whale upcalls
,”
J. Acoust. Soc. Am.
147
,
2636
2646
(
2020
).
25.
C.
Bergler
,
H.
Schröter
,
R. X.
Cheng
,
V.
Barth
,
M.
Weber
,
E.
Nöth
,
H.
Hofer
, and
A.
Maier
, “
ORCA-SPOT: An automatic killer whale sound detection toolkit using deep learning
,”
Sci. Rep.
9
,
1
17
(
2019
).
26.
T.-H.
Lin
,
Y.
Tsao
, and
T.
Akamatsu
, “
Comparison of passive acoustic soniferous fish monitoring with supervised and unsupervised approaches
,”
J. Acoust. Soc. Am.
143
,
EL278
EL284
(
2018
).
27.
A. K.
Ibrahim
,
H.
Zhuang
,
L. M.
Chérubin
,
M. T.
Schärer-Umpierre
, and
N.
Erdol
, “
Automatic classification of grouper species by their sounds using deep neural networks
,”
J. Acoust. Soc. Am.
144
,
EL196
EL202
(
2018
).
28.
A. K.
Ibrahim
,
L. M.
Chérubin
,
H.
Zhuang
,
M. T. S.
Umpierre
,
F.
Dalgeish
,
N.
Erdol
,
B.
Ouyang
, and
A.
Dalgeish
, “
An approach for automatic classification of grouper vocalizations with passive acoustic monitoring
,”
J. Acoust. Soc. Am.
143
,
666
676
(
2018
).
29.
C. M.
Roberts
, “
Effects of fishing on the ecosystem structure of coral reefs
,”
Cons. Biol.
9
,
988
995
(
1995
).
30.
N.
Knowlton
,
R. E.
Brainard
,
R.
Fisher
,
M.
Moews
,
L.
Plaisance
, and
M. J.
Caley
, “
Coral reef biodiversity
,” in
Life in the World's Oceans: Diversity, Distribution, and Abundance
, edited by
A. D.
McIntyre
(
Blackwell
,
London
,
2010
), pp.
65
74
.
31.
O.
Hoegh-Guldberg
,
E. S.
Poloczanska
,
W.
Skirying
, and
S.
Dove
, “
Coral reef ecosystems under climate change and ocean acidification
,”
Front. Mar. Sci.
4
,
158
(
2017
).
32.
T. P.
Hughes
,
K. D.
Anderson
,
S. R.
Connolly
,
S. F.
Heron
,
J. T.
Kerry
,
J. M.
Lough
,
A. H.
Baird
,
J. K.
Baum
,
M. L.
Berumen
,
T. C.
Bridge
,
D. C.
Claar
,
C. M.
Eakin
,
J. P.
Gilmour
,
N. A. J.
Graham
,
H.
Harrison
,
J.-P. A.
Hobbs
,
A. S.
Hoey
,
M.
Hoogenboom
,
R. J.
Lowe
,
M. T.
McCulloch
,
J. M.
Pandolfi
,
M.
Pratchett
,
V.
Schoepf
,
G.
Torda
, and
S. K.
Wilson
, “
Spatial and temporal patterns of mass bleaching of corals in the Anthropocene
,”
Science
359
,
80
83
(
2018
).
33.
C. S.
Rogers
,
G.
Garrison
,
R.
Grober
,
Z. M.
Hillis
, and
M. A.
Franke
, “
Coral reef monitoring manual for the Caribbean and Western Atlantic
,” report (Florida Integrated Science Center, St. Petersburg, FL,
1994
).
34.
L. A.
Freeman
and
S. E.
Freeman
, “
Rapidly obtained ecosystem indicators from coral reef soundscapes
,”
Mar. Eco. Prog. Ser.
561
,
69
82
(
2016
).
35.
S.
Elise
,
I.
Urbina-Barreto
,
R.
Pinel
,
V.
Mahamadaly
,
S.
Bureau
,
L.
Penin
,
M.
Adjeroud
,
M.
Kulbicki
, and
J. H.
Bruggemann
, “
Assessing key ecosystem functions through soundscapes: A new perspective from coral reefs
,”
Ecol. Indicators
107
,
105623
(
2019
).
36.
T. A.
Gordon
,
A. N.
Radford
,
I. K.
Davidson
,
K.
Barnes
,
K.
McCloskey
,
S. L.
Nedelec
,
M. G.
Meekan
,
M. I.
McCormick
, and
S. D.
Simpson
, “
Acoustic enrichment can enhance fish community development on degraded coral reef habitat
,”
Nat. Commun.
10
,
5414
(
2019
).
37.
S. E.
Freeman
,
F. L.
Rohwer
,
G. L.
D'Spain
,
A. M.
Friedlander
,
A. K.
Gregg
,
S. A.
Sandin
, and
M. J.
Buckingham
, “
The origins of ambient biological sound from coral reef ecosystems in the Line Islands archipelago
,”
J. Acoust. Soc. Am.
135
,
1775
1788
(
2014
).
38.
S. E.
Freeman
,
L. A.
Freeman
,
G.
Giorli
, and
A. F.
Haas
, “
Photosynthesis by marine algae produces sound, contributing to the daytime soundscape on coral reefs
,”
PloS One
13
,
e0201766
(
2018
).
39.
D. A.
Mann
and
P. S.
Lobel
, “
Propagation of damselfish (Pomacentridae) courtship sounds
,”
J. Acoust. Soc. Am.
101
,
3783
3791
(
1997
).
40.
K. P.
Maruska
,
K. S.
Boyle
,
L. R.
Dewan
, and
T. C.
Tricas
, “
Sound production and spectral hearing sensitivity in the Hawaiian sergeant damselfish, Abudefduf abdominalis
,”
J. Exp. Biol.
210
,
3990
4004
(
2007
).
41.
T. C.
Tricas
and
K. S.
Boyle
, “
Acoustic behaviors in Hawaiian coral reef fish communities
,”
Mar. Ecol. Prog. Ser.
511
,
1
16
(
2014
).
42.
M. C. P.
Amorim
, “
Diversity of sound production in fish
,” in Communication in Fishes, edited by F. Ladich, S. P. Collin, P. Moller, and B. G. Kapoor (Science Publishers, Enfield and Plymouth,
2006
), pp.
71
104
.
43.
A. K.
Ibrahim
,
H.
Zhuang
,
L. M.
Chérubin
,
M. T. S.
Umpierre
,
A. M.
Ali
,
R. S.
Nemeth
, and
N.
Erdol
, “
Classification of red hind group using random ensemble of stacked autoencoders
,”
J. Acoust. Soc. Am.
146
,
2155
2162
(
2019
).
44.
J.
Xie
,
R.
Girshick
, and
A.
Farhadi
, “
Unsupervised deep embedded for clustering analysis
,” in
Proceedings of the 33rd International Conference on Machine Learning
(
2016
).
45.
X.
Guo
,
L.
Gao
,
X.
Lui
, and
J.
Yin
, “
Improved deep embedded clustering with local structure preservation
,” in
Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI)
(
2017
), pp.
1753
1759
.
46.
D.
Snover
,
C. W.
Johnson
,
M. J.
Bianco
, and
P.
Gerstoft
, “
Deep clustering to identify sources of urban seismic noise in Long Beach, California
,”
Seismol. Res. Lett.
92
,
1
12
(
2020
).
47.
W. F.
Jenkins
,
P.
Gerstoft
,
M.
Bianco
, and
P. D.
Bromirski
, “
Unsupervised deep clustering of seismic data: Monitoring the Ross Ice Shelf, Antarctica
,” ESSOAr (
2021
), pp.
1
30
.
48.
A. M.
Thode
,
K. H.
Kim
,
R. G.
Norman
,
S. B.
Blackwell
, and
C. R.
Greene
, “
Acoustic vector sensor beamforming reduces masking from underwater industrial noise during passive monitoring
,”
J. Acoust. Soc. Am.
139
,
EL105
EL111
(
2016
).
49.
C.
Biemann
,
Structure Discovery in Natural Language: Theory and Applications of Natural Language Processing
(
Springer-Verlag
,
Berlin
,
2012
), pp.
73
75
.
50.
I.
Goodfellow
,
Y.
Bengio
, and
A.
Courville
,
Deep Learning
(
MIT Press
,
Cambridge
,
2016
), Chap. 14, pp.
163
215
, 493–495, 499–500.
51.
D. P.
Kingma
and
J.
Ba
, “
Adam: A method for stochastic optimization
,” in
Proceedings of the 3rd ICLR
(
2014
).
52.
F.
Chollet
and others, “
Keras
,” https://keras.io (
2015
).
53.
M.
Abadi
,
A.
Agarwal
,
P.
Barham
,
E.
Brevdo
,
Z.
Chen
,
C.
Citro
,
G. S.
Corrado
,
A.
Davis
,
J.
Dean
,
M.
Devin
,
S.
Ghemawat
,
I.
Goodfellow
,
A.
Harp
,
G.
Irving
,
M.
Isard
,
Y.
Jia
,
R.
Jozefowicz
,
L.
Kaiser
,
M.
Kudlur
,
J.
Levenberg
,
D.
Mane
,
R.
Monga
,
S.
Moore
,
D.
Murray
,
C.
Olah
,
M.
Schuster
,
J.
Shlens
,
B.
Steiner
,
I.
Sutskever
,
K.
Talwar
,
P.
Tucker
,
V.
Vanhoucke
,
V.
Vasudevan
,
F.
Viegas
,
O.
Vinyals
,
P.
Warden
,
M.
Wattenberg
,
M.
Wicke
,
Y.
Yu
, and
X.
Zheng
, “
TensorFlow: Large-scale machine learning on heterogeneous systems
,” arXiv:1603.04467 (
2015
).
54.
K. P.
Murphy
,
Machine Learning: A Probabilistic Perspective
(
MIT Press
,
Cambridge
,
2012
), pp.
389
397
, 897–900.
55.
D.
Arthur
and
S.
Vassilvistkii
, “
K-means++: The Advantages of Careful Seeding
,” in
SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms
(
2007
), pp.
1027
1035
.
56.
T.
Hastie
,
R.
Tibshirani
, and
J.
Friedman
,
Elements of Statistical Learning
, 2nd ed. (
Springer
,
New York
,
2001
),
Chap. 13.2.1
, p.
460
.
57.
C.
Bishop
,
Pattern Recognition and Machine Learning
(
Springer
,
New York
,
2006
),
Vol. 1
, pp.
424
428
.
58.
J. J. H.
Ward
, “
Hierarchical grouping to optimize an objective function
,”
Am. Stat. Ass. J.
58
,
236
244
(
1963
).
59.
Mathworks
,
Statistics and Machine Learning Toolbox: User's Guide (R2019b)
(
Mathworks
,
Natick, MA
,
2019
).
60.
B. S.
Everitt
,
S.
Landau
,
M.
Leese
, and
D.
Stahl
,
Cluster Analysis
, 5th ed. (
Wiley
,
New York
,
2011
), pp.
73
83
.
61.
S.
Hands
and
B.
Everitt
, “
A Monte Carlo study of the recovery of cluster structure in binary data by hierarchical clustering techniques
,”
Multivariate Behav. Res.
22
,
235
243
(
1987
).
62.
P.
Virtanen
,
R.
Gommers
,
T. E.
Oliphant
,
M.
Haberland
,
T.
Reddy
,
D.
Cournapeau
,
E.
Burovski
,
P.
Peterson
,
W.
Weckesser
,
J.
Bright
,
S. J. v d
Walt
,
M.
Brett
,
J.
Wilson
,
J. K.
Millman
,
N.
Mayorov
,
A. R. J.
Nelson
,
E.
Jones
,
R.
Kern
,
E.
Larson
,
C. J.
Carey
,
Ä.
Polat
,
Y.
Feng
,
E. W.
Moore
,
J.
VanderPlas
,
D.
Laxalde
,
J.
Perktold
,
R.
Cimrman
,
I.
Henriksen
,
E. A.
Quintero
,
C. R.
Harris
,
A. M.
Archibald
,
A. H.
Ribeiro
,
F.
Pedregosa
,
P. v
Mulbregt
, and
SciPy 1.0 Contributors
, “
SciPy 1.0: Fundamental algorithms for scientific computing in Python
,”
Nat. Methods
17
,
261
272
(
2020
).
63.
S.
Mann
and
S.
Haykin
, “
The Chirplet Transform: A Generalization of Gabor's Logon Ttransform
,”
Vis. Interface
1991
,
205
212
(
1991
).
64.
D.
Manolakis
and
J. G.
Proakis
,
Digital Signal Processing: Principles, Algorithms, and Applications
, 3rd ed. (
Prentice-Hall
,
Hoboken, NJ
,
1996
),
Chap. 2.1.2
, pp.
47
52
.
65.
A. M.
Thode
,
A. S.
Conrad
,
E.
Ozanich
,
R.
King
,
S. E.
Freeman
,
L. A.
Freeman
,
B.
Zgliczynski
,
P.
Gerstoft
, and
K. H.
Kim
, “
Automated two-dimensional localization of underwater acoustic transient impulses using vector sensor image processing (vector sensor localization)
,”
J. Acoust. Soc. Am.
149
,
770
787
(
2021
).
66.
R. V.
Lenth
, “
On finding the source of a signal
,”
Technometrics
23
,
149
154
(
1981
).
67.
S. B.
Martin
,
K.
Lucke
, and
D. R.
Barclay
, “
Techniques for distinguishing between impulsive and non-impulsive sound in the context of regulating sound exposure for marine mammals
,”
J. Acoust. Soc. Am.
147
,
2159
2176
(
2020
).
68.
L. van der
Maaten
and
G.
Hinton
, “
Visualizing data using t-SNE 2579–2605
,”
J. Mach. Learn. Res.
9
,
2579
2605
(
2008
).
69.
L. van der
Maaten
, “
Barnes-Hut-SNE
,” arXiv:1301.3342 (
2013
).
70.
A. M.
Thode
,
T.
Sakai
,
J.
Michalec
,
S.
Rankin
,
M. S.
Soldevilla
,
B.
Martin
, and
K. H.
Kim
, “
Displaying bioacoustic directional information from sonobuoys using ‘azigrams
,’ ”
J. Acoust. Soc. Am.
146
,
95
102
(
2019
).