Deep clustering was applied to unlabeled, automatically detected signals in a coral reef soundscape to distinguish fish pulse calls from segments of whale song. Deep embedded clustering (DEC) learned latent features and formed classification clusters using fixed-length power spectrograms of the signals. Handpicked spectral and temporal features were also extracted and clustered with Gaussian mixture models (GMM) and conventional clustering. DEC, GMM, and conventional clustering were tested on simulated datasets of fish pulse calls (fish) and whale song units (whale) with randomized bandwidth, duration, and SNR. Both GMM and DEC achieved high accuracy and identified clusters with fish, whale, and overlapping fish and whale signals. Conventional clustering methods had low accuracy in scenarios with unequal-sized clusters or overlapping signals. Fish and whale signals recorded near Hawaii in February–March 2020 were clustered with DEC, GMM, and conventional clustering. DEC features demonstrated the highest accuracy of 77.5% on a small, manually labeled dataset for classifying signals into fish and whale clusters.
I. INTRODUCTION
Deep learning, a powerful, recent subfield of machine learning,1 excels at learning representations of large amounts of data and often outperforms shallower machine learning methods.2–8 Deep convolutional networks have been particularly effective for image classification due to their scalable capacity. In ocean bioacoustics, machine learning has become an effective signal detection and classification tool2,9–25 with research encompassing both supervised methods,9–13,18 in which reliable labeled data are present, and unsupervised methods16,17,19,26 when labeled data are limited or unavailable. Deep convolutional learning of marine mammal signals in the time-frequency domain has shown promising results for supervised detection and classification.15,20–25 For other ocean bioacoustic signals, particularly marine fishes, just a few studies have considered deep learning classification.26–28 This study proposes a variant of deep convolutional learning dubbed deep clustering as a method for time-frequency representation learning and classification of unlabeled, automatically detected signals in a cacophonous coral reef environment, characterized by significant presence of spatially dense calling fish.
Coral reefs are amongst the most biodiverse ecosystems in the ocean but are under threat from global climate change, overfishing, and pollution.29–32 Passive acoustic studies allow non-invasive study of reef ecology over continuous time scales of days to years, and offer a complement to measurements that traditionally have been collected through direct observation by SCUBA divers or snorkelers, taking point measurements in time.33 Coral reef ambient biological sound, or soundscapes, are an emerging topic of interest in the coral reef scientific community. Reef sound has been linked to relative ecosystem health, abundance of both reef building coral and fleshy macroalgae, and fish density.34,35 Healthy reef sounds have also been shown to enhance larval recruitment,36 suggesting that reef soundscapes are not just a by-product of biological activity but an integral part of complex ecosystem function. The biological components of reef soundscapes comprise of inadvertent sounds from organism activities such as feeding, locomotion,37 and photosynthesis,38 and acoustic communication such as the fish calls discussed here. Acoustic classification of individual fish calls informs our understanding of spatial and temporal movement, species assemblage, and response to human activities.
Acoustic classification of tropical marine fishes, such as damselfish (family Pomacentridae), has been improved through passive acoustic field experiments39–41 but lacks established terminology across studies.42 To reduce the labeling burden, a few studies have considered automatic classification of fish calls by utilizing machine learning tools.18,26 Unsupervised methods such as Gaussian mixture models (GMM) showed improved detection of fish chorusing events compared to the conventional spectral energy detector.26 Deep neural networks, including convolutional networks (CNN), recurrent networks (RNN),27 and sparse autoencoders (SAE)43 have achieved higher classification accuracy of grouper calls than a sparse feature classification method.28
In this study, deep clustering is applied to automatically detected, unlabeled signals, combining the benefits of deep learning with the flexibility of clustering. Due to the complexity of the dataset, clustering is restricted to two distinct signals: low-frequency fish pulse trains (fish) and segments of whale song units (whale). Standard GMM-based clustering and other conventional clustering of handpicked features are conducted for comparison.
Spectral and time domain features were manually chosen, or handpicked, based on their observed relation to coral reef fish calling and on studies of fish calling spectral and temporal properties.39–41 Clustering of events using handpicked features provides physical intuition about the event signals but is difficult due to the varying feature properties. Thus, the feature extraction and classification steps are combined into one algorithm using deep learning.
Fixed-length spectrograms were used in deep embedded clustering (DEC),44 a deep-learning image compression algorithm that produces an accurate image reconstruction. The DEC latent features were directly clustered with GMM, then the DEC was trained further using a joint clustering loss to encourage deep cluster formation among the latent features.45–47
Simulated signals were used to compare the limitations of the two approaches before applying them to data recorded on a Hawaiian coral reef between February and March 2020.48
In Sec. II, the clustering methods are overviewed and metrics are presented for measuring classification success. Section III outlines the simulated signals used for method comparison. Experimental data collection and automatic signal detection on a Hawaiian coral reef in February–March 2020 are discussed in Sec. IV along with discussion of handpicked feature extraction. A comparison of clustering results on the simulated datasets are presented in Sec. V A, with an analysis of the effect of the DEC latent dimension. Finally, Sec. V B presents DEC, GMM, and conventional clustering of two signal types for the experimentally detected signals. Section VI summarizes the approach and discusses challenges associated with the methods and dataset.
II. METHODOLOGY
Clustering methods are unsupervised frameworks for categorizing data according to their similarities.49 This section describes DEC for learning the latent feature vectors and forming clusters directly from high-dimensional data. For comparison, the GMM, K-means, and hierarchical agglomerative clustering algorithms, common methods that rely on approximations of the input feature vector properties, are also presented.
A. DEC
DEC is a modified convolutional autoencoder, a deep feature learning method,50 that uses a joint loss function to transform a set of N input data, , into a set of latent feature vectors, . The output is a reconstruction of the input data, , estimated using the latent feature vectors. The DEC structure consists of two stacked networks (Fig. 1): the encoder network, which maps input data into a lower-dimensional latent space, and the decoder network, which reconstructs an approximation of the input from the latent space. For a given latent feature dimension, P, DEC learns representative latent features using the mean squared reconstruction error of the inputs,
Inspired by recent applications to spectrograms of unlabeled seismic events,46,47 the compressive architecture downsamples the input using convolutions of stride length 2. The rectified linear unit (ReLU) was used to transform the outputs at each layer. The neural network structure46 was adapted by reducing the convolutional filter size and adding convolutional layers to downsample the input.
(Color online) Architecture of the deep embedded encoding (DEC) convolutional model. Convolutional filters were sized 3 × 3 and used the ReLU activation function.
(Color online) Architecture of the deep embedded encoding (DEC) convolutional model. Convolutional filters were sized 3 × 3 and used the ReLU activation function.
The DEC weights were pretrained for 1000 epochs using the MSE of fixed-length spectrogram images containing Nf frequencies and Nt time samples, , and their reconstructions, . The Adam optimizer51 was used with learning rate of . The spectrogram length Nt corresponds to a signal duration of 0.5 s = NtdT and was motivated by the signals of interest in the coral reef soundscape. Events longer than 0.5 s were clipped to length to ensure a fixed size.
Each latent feature vector is assumed to belong to one of K clusters, Ck, with means . The latent features were clustered with GMM to initialize the deep clustering, with means . These clustering results are reported as “GMM, latent features” for the simulated and experimental data.
Last, the DEC latent feature vectors were updated by incorporating the Kullback-Leibler (KL) divergence into the training loss. DEC was trained for an additional 20 epochs using the joint clustering/reconstruction loss function44,45
where Eq. (4) is the empirically estimated Student's t-distribution of the latent feature vectors around each cluster mean and Eq. (5) further penalizes points that are distant from cluster centers.45 The objective of additional training is to learn latent feature vectors that are both disjoint in feature space and representative of the inputs. The weights for each loss term in Eq. (2) are from Ref. 45. Details of the model are given in Fig. 1 and Table I. The DEC was written in keras52 using tensorflow.53
Network architecture used in DEC. The input and output shapes are given as [height, width, depth]. The kernel size is the shape of the two dimensional convolutional filters in [height, width].
Layer Name . | Layer Type . | Input shape . | Filters . | Kernel size . | Stride . | Activation . | Output shape . | Parameters . |
---|---|---|---|---|---|---|---|---|
Conv1 | 2D Convolution | [90,20,1] | 8 | [3,3] | [2,2] | ReLU | [45,10,8] | 80 |
2D Conv. | [45,10,8] | 8 | [2,1] | [1,1] | ReLU | [44,10,8] | 136 | |
Conv2 | 2D Conv. | [44,10,8] | 16 | [3,3] | [2,2] | ReLU | [22,5,16] | 1168 |
2D Conv. | [22,5,16] | 16 | [1,2] | [1,1] | ReLU | [22,4,16] | 528 | |
Conv3 | 2D Conv. | [22,4,16] | 32 | [2,1] | [2,1] | ReLU | [11,4,32] | 1056 |
2D Conv. | [11,4,32] | 64 | [2,1] | [1,1] | ReLU | [10,4,64] | 4160 | |
Conv4 | 2D Conv. | [10,4,64] | 64 | [2,1] | [2,1] | ReLU | [5,4,64] | 8256 |
Flatten | Flatten | [5,4,64] | − | − | − | − | [1280] | 0 |
Encoded | Fully Connected | [1280] | − | − | − | ReLU | [10] | 6405 |
Dense | Fully Connected | [15] | − | − | − | ReLU | [1280] | 7680 |
Reshape | [1280] | − | − | − | − | [5,4,64] | 0 | |
TConv4 | Transposed | − | − | − | − | − | − | − |
Convolution | [5,4,64] | 32 | [2,1] | [2,1] | ReLu | [10,4,32] | 4128 | |
T. Conv. | [10,4,32] | 32 | [2,1] | [1,1] | ReLu | [11,4,32] | 2080 | |
TConv3 | T. Conv. | [11,4,32] | 16 | [2,1] | [2,1] | ReLu | [22,4,16] | 1040 |
T. Conv. | [22,4,16] | 16 | [1,2] | [1,1] | ReLu | [22,5,16] | 528 | |
TConv2 | T. Conv. | [22,5,16] | 8 | [3,3] | [2,2] | ReLu | [44,10,8] | 1160 |
T. Conv. | [44,10,8] | 8 | [2,1] | [1,1] | ReLu | [45,10,8] | 136 | |
TConv1 | T. Conv. | [45,10,8] | 1 | [3,3] | [2,2] | Linear | [90,20,1] | 73 |
Layer Name . | Layer Type . | Input shape . | Filters . | Kernel size . | Stride . | Activation . | Output shape . | Parameters . |
---|---|---|---|---|---|---|---|---|
Conv1 | 2D Convolution | [90,20,1] | 8 | [3,3] | [2,2] | ReLU | [45,10,8] | 80 |
2D Conv. | [45,10,8] | 8 | [2,1] | [1,1] | ReLU | [44,10,8] | 136 | |
Conv2 | 2D Conv. | [44,10,8] | 16 | [3,3] | [2,2] | ReLU | [22,5,16] | 1168 |
2D Conv. | [22,5,16] | 16 | [1,2] | [1,1] | ReLU | [22,4,16] | 528 | |
Conv3 | 2D Conv. | [22,4,16] | 32 | [2,1] | [2,1] | ReLU | [11,4,32] | 1056 |
2D Conv. | [11,4,32] | 64 | [2,1] | [1,1] | ReLU | [10,4,64] | 4160 | |
Conv4 | 2D Conv. | [10,4,64] | 64 | [2,1] | [2,1] | ReLU | [5,4,64] | 8256 |
Flatten | Flatten | [5,4,64] | − | − | − | − | [1280] | 0 |
Encoded | Fully Connected | [1280] | − | − | − | ReLU | [10] | 6405 |
Dense | Fully Connected | [15] | − | − | − | ReLU | [1280] | 7680 |
Reshape | [1280] | − | − | − | − | [5,4,64] | 0 | |
TConv4 | Transposed | − | − | − | − | − | − | − |
Convolution | [5,4,64] | 32 | [2,1] | [2,1] | ReLu | [10,4,32] | 4128 | |
T. Conv. | [10,4,32] | 32 | [2,1] | [1,1] | ReLu | [11,4,32] | 2080 | |
TConv3 | T. Conv. | [11,4,32] | 16 | [2,1] | [2,1] | ReLu | [22,4,16] | 1040 |
T. Conv. | [22,4,16] | 16 | [1,2] | [1,1] | ReLu | [22,5,16] | 528 | |
TConv2 | T. Conv. | [22,5,16] | 8 | [3,3] | [2,2] | ReLu | [44,10,8] | 1160 |
T. Conv. | [44,10,8] | 8 | [2,1] | [1,1] | ReLu | [45,10,8] | 136 | |
TConv1 | T. Conv. | [45,10,8] | 1 | [3,3] | [2,2] | Linear | [90,20,1] | 73 |
B. GMM
GMM aim to partition a set of N feature vectors, into K clusters, Ck, , assuming the kth cluster is Gaussian with prior probability πk, mean , and covariance .54 Ck is a nonempty set. If the number of clusters, K, is assumed a priori, then the Gaussian parameters can be iteratively estimated.
GMM employs the expectation-maximization (EM) algorithm to estimate from via the complete data likelihood.54 EM is an alternating algorithm that updates the weighted posterior probability (responsibility, rnk) and the prior, mean, and covariance by54
for . represents the estimated parameters of the kth cluster at the tth iteration. In GMM, the data likelihood p is Gaussian,
where represents proportionality. EM is iterated until πk, , and converge for all k. The estimated covariance matrix can become ill-conditioned if there are fewer than K true clusters.
The cluster priors πk, means , and covariance were initialized for 100 starts using the K-means++ algorithm (see Appendix A).55
C. K-Means
K-means56 is an approximation to the EM algorithm that partitions a set of data-derived feature vectors into K clusters, Ck, , where Ck is a nonempty set. The number of clusters K must be assumed a priori. The K-means algorithm is derived assuming the feature vectors belonging to the kth cluster are drawn from a Gaussian distribution with covariance and prior probability , where is the identity matrix,57
for . denotes the cardinality or size of the kth cluster.
The cluster means were initialized for 100 starts using the K-means++ algorithm ( Appendix A).55
K-means is commonly used due to its computational speed and interpretability, but its validity is subject to the isotropic covariance assumption and the equality of the cluster priors. If the true number of classes differs from K, the clusters will be incorrectly estimated.
D. Agglomerative hierarchical clustering
Agglomerative hierarchical clustering,49,54 also called bottom-up clustering, partitions a set of N data-derived feature vectors, , into K clusters Ck, , by grouping the most similar data at each step. Hierarchical clustering successively merges nearby clusters until the stop criterion is achieved. In this case, the stop criterion is met when K or fewer clusters remain, where K must be set by the practitioner.
Agglomerative clustering does not require the number of clusters to be assumed in advance. As demonstrated with classification of dolphin echolocation clicks,16,17 bottom-up methods allow for detailed clusters to be sequentially merged as additional information or labels become available.
In agglomerative clustering, the kth cluster at the mth iteration is represented as . The number of clusters is decreased to N – m by merging the two closest clusters that satisfy
is a metric measuring the distance between all points in sets i and . This agglomerative process is repeated times, until there are at most K clusters Ck remaining, with .
The distance metric d chosen here is from Ward's method.58 Ward's method measures the incremental sum-of-squares resulting from merging two sub-clusters.59 For clusters Cj and Ck with cluster means and that were merged for form with mean , the Ward's method distance is
E. Metrics
The performance of the handpicked feature clustering and DEC was measured by
where ti is the true label and is the categorical prediction. The indicator function if x is true and otherwise. Here, whale (label 0) was the true category. Fish were indicated by label 1. The label 2 was used in a scenario to indicate overlapping fish/whale signals. Precision (positive predictive value or PPV) measured the ratio of correctly predicted whale to total predicted whale. Recall (hit rate or true positive rate) measured the ratio of correctly classified whale to the true total. Higher metrics correspond to improved performance, with perfect performance given by accuracy, precision, and recall all equal to 1.
III. SIMULATED DATASETS
A set of simulated coral reef bioacoustic signals, motivated by observed signals from a coral reef soundscape near Hawaii, was simulated to compare conventional clustering of handpicked features to DEC.
One hundred simulation trials were conducted, with 10 000 signals simulated for each trial. The signals were simulated as timeseries sampled at 1 kHz and mimicked whale and fish signals (Fig. 2). A spectrogram image of each signal was generated using a 256–pt FFT with 90% overlap. The fixed-length spectrograms were created by clipping all signals to 0.5 s.
(Color online) Simulated (a), (b) and measured (c), (d) coral reef bioacoustic signals before clipping. Whales were simulated with an FM upsweep (a), (c), and fish were simulated with superimposed Gaussian pulses (b), (d). White noise was added to both signals at 25 dB SNR relative to the mean signal power.
(Color online) Simulated (a), (b) and measured (c), (d) coral reef bioacoustic signals before clipping. Whales were simulated with an FM upsweep (a), (c), and fish were simulated with superimposed Gaussian pulses (b), (d). White noise was added to both signals at 25 dB SNR relative to the mean signal power.
The DEC latent feature dimension was optimized over a range of values (Fig. 7). The optimal DEC performance was compared to clustering with automatically extracted handpicked features. Then, a third category containing an overlapping fish and whale signals was included, with 33% of the total samples belonging to each category.
Quadratic FM sweeps mimic parts of humpack whale song units (whale). The equation for the instantaneous frequency of a quadratic sweep is62
where t is the continuous time variable, f0 is the initial sweep frequency, is the total bandwidth, and T is the duration of the signal. The signal is an FM upsweep when , an FM downsweep when and a tonal when β = 0. The phase of the time domain signal is found by integrating the instantaneous frequency62
where t0 is the signal start time. For simulation, t is discretized at 0.001 s (sampling frequency of 1 kHz).
Timeseries of impulses, or pulse trains, simulated fish pulse calling (fish). The pulses were a set of Np superimposed Gaussian-modulated sinusoids63 spaced apart
where τ is the half-power pulse width and fc is the center frequency.
The signal parameters were varied randomly for each sample (Table II). Pulse width and center frequency were fixed to achieve a representative pulse structure. The number of pulses and spacing were drawn from experimentally estimated distributions. The duration, initial frequency, and total bandwidth of the FM sweep were drawn at uniform random from a range of realistically observed values. All signals were centered within the 0.5 s spectrogram and assigned a random delay of within ±0.1 s.
Signal parameters were drawn from random distributions for each simulated event.
. | FM Sweep . | Pulse train . |
---|---|---|
Duration/width (s) | ||
Delay (s) | ||
Frequency (Hz) | fc = 200 | |
Peak spacing (s) | beta(4, 23) | |
Number of peaks | ||
SNR (dB) |
. | FM Sweep . | Pulse train . |
---|---|---|
Duration/width (s) | ||
Delay (s) | ||
Frequency (Hz) | fc = 200 | |
Peak spacing (s) | beta(4, 23) | |
Number of peaks | ||
SNR (dB) |
White noise was added to the simulated signals using a fixed signal-to-noise ratio (SNR),
where is the signal power and is the noise power. The SNR range of each signal was determined from the experimental spectrograms during manual labeling. The SNR in dB was estimated as the difference of the peak signal power to the median power of the background.
The signal power was estimated as the bandwidth-normalized mean power over the signal duration,64
Three handpicked features were extracted: peak frequency (25), kurtosis (26), and number of timeseries peaks (27). Duration, median power, and cross-sensor coherence were excluded due to limitations of the fixed simulation parameters. Then, GMM, K-means, and hierarchical clustering were applied to the feature matrix to discover K = 2 clusters (fish or whale) or K = 3 (fish, whale, or both).
IV. HAWAIIAN CORAL REEF DATASET
A. Automatically detected signals in experimental data
Three directional autonomous seafloor acoustic recorders (DASARs) were deployed adjacent to a coral reef westward of the island of Hawaii. The DASARs, labeled N, W, and S for the north-most, middle, and south-most sensors, measured pressure and horizontal particle velocity with x– and y–components oriented at orthogonal compass directions. The array was roughly oriented N-S with inter-sensor spacing about 15 m (Fig. 3).
(Color online) Diagram of the DASAR array deployed adjacent to a coral reef on the island of Hawaii. The estimated detection locations are shown as gray dots. The majority of the reef is located due east of the array. The sensor positions were measured on the seafloor relative to DASAR W.
(Color online) Diagram of the DASAR array deployed adjacent to a coral reef on the island of Hawaii. The estimated detection locations are shown as gray dots. The majority of the reef is located due east of the array. The sensor positions were measured on the seafloor relative to DASAR W.
The DASARs recorded continuously for 7 days with a sampling rate of 1 kHz. On February 25, 2020, the dominant soundscape contributors below 500 Hz were reef fish, humpback whales, and motor noise from transiting surface boats, with boat noise occurring predominantly during daylight and fish calling most pronounced during dusk. The data were processed in 5 min chunks to account for DASAR clock drift. For each chunk, the pressure spectrograms on the three sensors were cross-correlated and aligned to within a time bin.65 The spectral density power spectrogram, denoted as matrix , was generated using a 256–point FFT with 90% overlap and Hanning window and normalized to have units of μPa Hz−1, with s and Hz.
A directional event detector65 was developed to utilize the DASARs' directional capability by combining two DASARs ( Appendix B). For this study, the detector parameters were , T = 120 Hz, ( s), and s. Detected events were localized66 to ensure that their signal had sufficient bandwidth for feature extraction. Detections for which the localization algorithm failed to converge were discarded. The remaining events were spatially filtered within a 100 m by 100 m box from DASAR S (Fig. 3). A total of 92 736 potentially localizable detections within 100 m were kept for further analysis, on average about 1 detection per second.
B. Spectral feature extraction
The handpicked features (Table III, Fig. 4) are time-frequency properties known to relate to fish call type including power, duration, and peak frequency39–41 as well as timeseries estimates of impulsive noise.18,67
Event features estimated for automatically detected signals.
Feature Name (units) . | Description . |
---|---|
Kurtosis | Fourth moment normalized by the |
squared variance | |
Npeaks (count) | Number of peaks with at least 5 |
dB prominence re the standard | |
deviation | |
Peak frequency (Hz) | Frequency of the peak power |
spectral density | |
Features for experimental data only | |
Duration (s) | Length of detected event |
Coherence | Normalized time coherence |
between DASARs | |
Median PSD (dB) | Median power spectral density (PSD) |
across event mask |
Feature Name (units) . | Description . |
---|---|
Kurtosis | Fourth moment normalized by the |
squared variance | |
Npeaks (count) | Number of peaks with at least 5 |
dB prominence re the standard | |
deviation | |
Peak frequency (Hz) | Frequency of the peak power |
spectral density | |
Features for experimental data only | |
Duration (s) | Length of detected event |
Coherence | Normalized time coherence |
between DASARs | |
Median PSD (dB) | Median power spectral density (PSD) |
across event mask |
(Color online) Handpicked features of a fish call event on February 25, 07:12 HST, measured on directional autonomous seafloor acoustic recorders (DASARs) N, W, and S at 1 kHz. The spectrograms were generated using a 256–point FFT with 90% overlap. A call duration of 0.52 s was determined during the detection process. (a)–(c) The spectrograms were used to extract peak frequencies (black star) from 219 to 332 Hz and median PSD from 72.8 to 76.3 dB. (d)–(f) The timeseries envelope was used to extract the kurtosis values of 18–21, cross-sensor coherences of 0.72–0.74, and 8 temporal peaks. DASAR S was the closest to the call.
(Color online) Handpicked features of a fish call event on February 25, 07:12 HST, measured on directional autonomous seafloor acoustic recorders (DASARs) N, W, and S at 1 kHz. The spectrograms were generated using a 256–point FFT with 90% overlap. A call duration of 0.52 s was determined during the detection process. (a)–(c) The spectrograms were used to extract peak frequencies (black star) from 219 to 332 Hz and median PSD from 72.8 to 76.3 dB. (d)–(f) The timeseries envelope was used to extract the kurtosis values of 18–21, cross-sensor coherences of 0.72–0.74, and 8 temporal peaks. DASAR S was the closest to the call.
If and represent the absolute start and end of the nth detected signal in seconds, the event duration is
The median power and peak frequency of the nth detection were extracted from the spectrogram, [Fig. 4(a)–4(c)], with
The spectral features extracted at DASAR S, closest to the majority of detected signals (Fig. 3), were used for clustering.
C. Temporal feature extraction
The pressure timeseries of the nth detected signal was extracted between and (Fig. 4) and contained M time samples. For the experimental data, the vector sensor x– and y–velocity during the nth detected signal, and , were used to create a beamformed pressure timeseries, , for improved detection SNR,
where is the impedance in water with density ρ (kg m3) and sound speed c (m s−1).48 is the signal azimuth estimated during the directional detector processing.65 For the simulated data, .
Three metrics were chosen for timeseries extraction: kurtosis, number of peaks, and cross-sensor coherence. The kurtosis is a ratio of moments and has recently been applied to the task of differentiating impulsive from non-impulsive sounds,18,67
is the variance and the overbar represents the arithmetic mean.
The number of peaks and cross-sensor coherence were extracted from the Hilbert transform of the beamformed pressure timeseries (Fig. 4)
Here, the number of peaks is the number of local maxima with at least 5 dB prominence relative to the standard deviation
where (a, b) is an interval in indicates the highest trough in (a, b), and σ is the standard deviation of Yb.
Last, for the experimental detections, the normalized correlation coefficient of the timeseries envelope across DASARs was computed to measure the spatial coherence of the signal propagation,
where and are the Hilbert transform of the beamformed timeseries at the North and South DASARs. The measure of coherence assumes that a good quality biological signal will be received as similar timeseries on two spatially separated sensors.
Figure 5 shows the number of events detected along with extracted feature median, 10%, and 90% levels for every 15 min. The number of peaks in the timeseries has been shown to be an indicator of fish species and call context for Hawaiian reef fish.40,41 In this study, fish were most common during nighttime [Fig. 5(g)], with pulse trains peaking during the evening chorus after nautical twilight (19:15 HST). The evening chorus corresponded to a visible increase in median power, event duration, and number of time peaks and a visible decrease in the 90th percentiles of peak frequency and kurtosis.
(Color online) (a) Number of detected events per 15 min on February 25 and (b)–(g) 10%, 50%, and 90% levels per 15 min for each feature: (b) peak frequency, (c) median time-frequency power, (d) event duration, (e) normalized time coherence between sensors, (f) kurtosis, and (g) number of time peaks. All features were measured on the DASAR S. The coherence is the normalized correlation lag coefficient between DASARs N and S.
(Color online) (a) Number of detected events per 15 min on February 25 and (b)–(g) 10%, 50%, and 90% levels per 15 min for each feature: (b) peak frequency, (c) median time-frequency power, (d) event duration, (e) normalized time coherence between sensors, (f) kurtosis, and (g) number of time peaks. All features were measured on the DASAR S. The coherence is the normalized correlation lag coefficient between DASARs N and S.
V. RESULTS
A. Simulations
The handpicked features were examined for separability for K = 2 equal-sized clusters (fish, whale) and K = 3 clusters (fish, whale, both). Figure 6 shows the feature values colored by true signal type for one simulation trial. The whale and fish signals overlapped in their peak frequency and in the number of automatically extracted temporal peaks. The signals were most strongly separated in kurtosis, with whale signals having very low kurtosis.
(Color online) Handpicked features demonstrate some separability by kurtosis, number of timeseries peaks, and peak frequency but with overlapping clusters for (a)–(c) two clusters (whale, fish) and (d)–(f) (bottom) three clusters (whale, fish, both). The dots are scaled to indicate density of feature pairs, with each dot increased by 1 pt for every 100 samples.
(Color online) Handpicked features demonstrate some separability by kurtosis, number of timeseries peaks, and peak frequency but with overlapping clusters for (a)–(c) two clusters (whale, fish) and (d)–(f) (bottom) three clusters (whale, fish, both). The dots are scaled to indicate density of feature pairs, with each dot increased by 1 pt for every 100 samples.
For two clusters, all clustering methods assumed K = 2. When the cluster sizes were equal, all clustering methods achieved high accuracy (Table IV). For the unequal cluster scenario with 25% whale and 75% fish (2500/7500), GMM achieved much higher accuracy than either K-means or hierarchical agglomerative clustering on the handpicked features (Table IV). These simulation results are supported by a supplementary example of the clustering algorithm assumptions ( Appendix A), which demonstrates that GMM has fewer prior assumptions and is thus more flexible. The classification accuracy of all methods for three equal-sized clusters (Table IV), with whale, fish, and a cluster containing a temporally overlapping whale and fish, was lower than for two equal-sized clusters due to the signal overlap. GMM is capable of estimating overlapping clusters and achieved the best accuracy and recall on the handpicked features of the overlapping signal scenario.
Median of classification accuracy, recall, and precision across 100 simulations for clustering with hand-crafted features (K-means, GMM, hierarchical) and deep learning (DEC). Whale was defined as the positive category.
Method . | Accuracy . | Recall . | Precision . | Accuracy . | Recall . | Precision . | Accuracy . | Recall . | Precision . |
---|---|---|---|---|---|---|---|---|---|
K = 2 simulated clusters | K = 3 simulated clusters | K = 2 (unequal) | |||||||
(DEC P = 10) | (DEC P = 15) | simulated clusters | |||||||
(DEC P = 10) | |||||||||
K-means | 0.96 | 1.0 | 0.93 | 0.56 | 0.65 | 0.99 | 0.68 | 1.0 | 0.44 |
GMM | 0.98 | 1.0 | 0.96 | 0.86 | 1.0 | 0.94 | 0.97 | 1.0 | 0.88 |
Hierarchical | 0.98 | 0.99 | 0.98 | 0.60 | 0.58 | 0.99 | 0.66 | 1.0 | 0.42 |
GMM, latent features | 0.95 | 0.96 | 0.99 | 0.69 | 0.90 | 1.0 | 0.96 | 0.89 | 1.0 |
DEC | 0.98 | 0.99 | 0.99 | 0.75 | 0.92 | 0.99 | 0.77 | 1.0 | 0.52 |
Method . | Accuracy . | Recall . | Precision . | Accuracy . | Recall . | Precision . | Accuracy . | Recall . | Precision . |
---|---|---|---|---|---|---|---|---|---|
K = 2 simulated clusters | K = 3 simulated clusters | K = 2 (unequal) | |||||||
(DEC P = 10) | (DEC P = 15) | simulated clusters | |||||||
(DEC P = 10) | |||||||||
K-means | 0.96 | 1.0 | 0.93 | 0.56 | 0.65 | 0.99 | 0.68 | 1.0 | 0.44 |
GMM | 0.98 | 1.0 | 0.96 | 0.86 | 1.0 | 0.94 | 0.97 | 1.0 | 0.88 |
Hierarchical | 0.98 | 0.99 | 0.98 | 0.60 | 0.58 | 0.99 | 0.66 | 1.0 | 0.42 |
GMM, latent features | 0.95 | 0.96 | 0.99 | 0.69 | 0.90 | 1.0 | 0.96 | 0.89 | 1.0 |
DEC | 0.98 | 0.99 | 0.99 | 0.75 | 0.92 | 0.99 | 0.77 | 1.0 | 0.52 |
The effect of the DEC latent feature dimension was examined across 100 random simulation trials (Fig. 7). The DEC model, motivated by a successful architecture for classifying seismic spectrograms,46 was retrained and tested on all trials at varying latent feature vector dimensionality (P) for the K = 2 equal-sized clusters and K = 3 cluster scenarios (Fig. 7). When the latent dimension was low, e.g., P < 8, all features were likely to be clustered in a single cluster. Above P > 8, DEC accuracy was consistent. Values of P = 10 and P = 15 were chosen for the K = 2 and K = 3 cluster scenarios.
(Color online) Accuracy of DEC over 100 trials for each latent dimension (P). The whiskers extend to the 10th (lower) and 90th (upper) percentiles. The box marks the 25th to 75th percentile. The method fails at low P but is otherwise stable.
(Color online) Accuracy of DEC over 100 trials for each latent dimension (P). The whiskers extend to the 10th (lower) and 90th (upper) percentiles. The box marks the 25th to 75th percentile. The method fails at low P but is otherwise stable.
Figure 8 shows t-SNE visualizations from a representative trial of the latent features clustered with GMM [Figs. 8(b), 8(e), and 8(h)], and the DEC latent features and cluster predictions after 20 additional training epochs using (2) [Figs. 8(c), 8(f), and 8(i)]. Additional training with (2) increased the accuracy of DEC clustering prediction from 96% to 99.2% in the case of 2 equal sized clusters and from 70.8% to 78.2% in the case of 3 clusters. In the scenario containing 2 unequal-sized clusters, the additional DEC training reduced the accuracy from 97.6% to 78.3%. In that case, high recall and low precision (Table IV) indicate that DEC overpredicted the smaller cluster. GMM clustering of the latent features learned from spectrograms achieved comparable accuracy to GMM clustering of handpicked features in the non-overlapping scenarios with K = 2 (Table IV). For the K = 3 scenario, the highest classification confusion for the spectrogram-based deep learning was between fish and the overlapping signal cluster. Figure 9 shows that fish signals with large bandwidth and duration may dominate the spectral signature, which may lead to increased misclassification.
(Color online) t-SNE representation of 10 000 deep embedded feature vectors from a sample trial (a), (b), (d), (e), (g), (h) before, and (c), (f), (i) after deep clustering, with (a)–(c) equal-sized clusters, (d)–(f) unequal-sized clusters of fish and whale, and (g)–(i) fish, whale, and fish/whale clusters. The perplexity for visualization was 200.
(Color online) t-SNE representation of 10 000 deep embedded feature vectors from a sample trial (a), (b), (d), (e), (g), (h) before, and (c), (f), (i) after deep clustering, with (a)–(c) equal-sized clusters, (d)–(f) unequal-sized clusters of fish and whale, and (g)–(i) fish, whale, and fish/whale clusters. The perplexity for visualization was 200.
(Color online) Simulated coral reef bioacoustic signals successfully classified using DEC. From top to bottom rows: input spectrograms, simulated timeseries, initial DEC reconstruction, and DEC reconstruction after adding clustering loss.
(Color online) Simulated coral reef bioacoustic signals successfully classified using DEC. From top to bottom rows: input spectrograms, simulated timeseries, initial DEC reconstruction, and DEC reconstruction after adding clustering loss.
B. Experiment
The clustering methods were applied to a subset of 4000 unlabeled detections randomly selected from the Hawaii 2020 experiment. Each detection contained one or more directional signals. All clustering methods assumed K = 2. Then, for post-clustering analysis, labels of whale/no whale (only fish) were manually assigned to all 4000 samples, based on the signal within the detection window. About two-thirds were labeled as no whale and contained only fish.
Figure 10 shows the clustering results using t-SNE representations of the learned latent feature vectors [Figs. 10(a)–10(c)] and the DEC latent features after additional training [Fig. 10(d)]. Clustering of the P = 6–dimensional handpicked feature vectors did not correspond well to the manual labels, with K-means achieving the highest accuracy of 67%. GMM clustering of the latent features learned from the signal spectrograms achieved an accuracy of 78% compared to the manual labels, which was the highest accuracy of all methods examined. GMM of latent features had high precision and low recall (Table V).
(Color online) Features of experimentally detected signals from a Hawaiian coral reef shown as a t-SNE representation of (a)–(c) learned latent features and (d) modified DEC features of 4000 manually labeled detections. All clustering methods assumed K = 2.
(Color online) Features of experimentally detected signals from a Hawaiian coral reef shown as a t-SNE representation of (a)–(c) learned latent features and (d) modified DEC features of 4000 manually labeled detections. All clustering methods assumed K = 2.
Classification accuracy determined from manually labeled experimental detections for clustering with handpicked features (K-means, GMM, hierarchical) and deep learning (DEC).
Method . | Accuracy . | Recall . | Precision . |
---|---|---|---|
K-means (K = 2) | 0.66 | 0.42 | 0.47 |
GMM (K = 2) | 0.55 | 0.84 | 0.40 |
Hierarchical (K = 2) | 0.63 | 0.23 | 0.38 |
GMM (K = 2), latent features | 0.78 | 0.33 | 0.90 |
DEC (P = 15, K = 2) | 0.66 | 0.75 | 0.48 |
Method . | Accuracy . | Recall . | Precision . |
---|---|---|---|
K-means (K = 2) | 0.66 | 0.42 | 0.47 |
GMM (K = 2) | 0.55 | 0.84 | 0.40 |
Hierarchical (K = 2) | 0.63 | 0.23 | 0.38 |
GMM (K = 2), latent features | 0.78 | 0.33 | 0.90 |
DEC (P = 15, K = 2) | 0.66 | 0.75 | 0.48 |
In the experimental signals, additional training of the DEC latent features reduced the classification accuracy [Fig. 10(d)]. Similar to the simulated unequal-sized cluster scenario, DEC obtained higher recall with lower precision (Table V).
Spectrograms correctly classified by DEC are shown in Fig. 11, with the reconstructed spectrograms shown before and after additional DEC training using (2). These demonstrate that whale signals were primarily identified by their narrow bandwidth and temporal extent, whereas fish signals were identified as broadband. The timeseries in Fig. 11 demonstrate the magnitude variation between events that was not attributed to signal type, motivating the normalization of the spectrograms.
(Color online) Experimentally detected coral reef bioacoustic signals successfully classified using DEC. From top to bottom rows: input spectrograms, recorded timeseries, initial DEC reconstruction, and DEC reconstruction after adding clustering loss.
(Color online) Experimentally detected coral reef bioacoustic signals successfully classified using DEC. From top to bottom rows: input spectrograms, recorded timeseries, initial DEC reconstruction, and DEC reconstruction after adding clustering loss.
VI. CONCLUSION
A deep clustering approach was presented for interpreting unlabeled coral reef bioacoustic detections. This approach leverages deep learning and clustering, motivated by recent improvements in classification accuracy of fish calling using unsupervised detection and deep neural network classifiers.18,26,27,43
Clustering of simulated fish (Gaussian pulses) and whale (FM sweeps) demonstrated that unique signal categories could be clustered with GMM of deep latent features learned from spectrograms, DEC clustering,45,46 or with clustering of handpicked features motivated by studies of coral reef fish.39–41 GMM applied to either deep latent features or handpicked features achieved the highest accuracy when there were two clusters of unequal size, and GMM of handpicked features had the highest accuracy when the clusters contained overlapping signals. Deep clustering learned features directly from spectrograms and was successful in separating fish and whale clusters. DEC trained with a joint clustering loss function increased the classification accuracy in simulations compared to clustering the deep latent features with GMM, except when the cluster sizes were significantly different. For unequal-sized simulated clusters, applying GMM to the learned latent features demonstrated higher classification accuracy.
Broadband bioacoustic events detected on a Hawaiian coral reef in February–March 2020 using a directional detector were analyzed with deep clustering and clustering of handpicked features. A labeled subset of these detections with whale/no whale indicated that about two-thirds of the detections contained primarily fish and one-third contained a whale signal. GMM with K = 2 applied to latent features learned from the signal spectrograms achieved the highest accuracy and tended to overpredict the larger cluster containing fish. After further training DEC with a joint clustering loss, the DEC recall increased but DEC classification accuracy was reduced, indicating that whale was overpredicted. DEC reconstructions of the input spectrograms demonstrate that the learned features are representative of spectral features identified by manual labelers. Clustering of handpicked features with K = 2 on the experimental, manually labeled data achieved lower accuracy, precision, and recall.
These results demonstrate that deep clustering is a promising method for classifying unlabeled bioacoustic signals with distinct spectral signatures. In simulated and experimental studies, both GMM applied to deep latent features and DEC trained with a joint clustering loss achieved competitive classification accuracy for a range of latent feature dimensions, without requiring the selection of handpicked features. Our results indicate that the accuracy of handpicked feature clustering depends strongly on the feature properties and the choice of clustering method. Finally, the choice of clustering algorithm is an important consideration for applications that are subject to signal clusters of unequal size.
ACKNOWLEDGEMENTS
Thank you to Greeneridge Sciences for providing the DASAR sensors and to Alex Conrad for assisting with DASAR post-processing. Thank you to Richard Walsh for assistance with hardware deployment and recovery. This work was supported by the Office of Naval research under Award No. N00014-18-1-2065.
APPENDIX A: CLUSTERING OF GAUSSIANS
For a continuous data variable, x, that is Gaussian with mean and covariance , the posterior probability of the cluster is given by57
where Ck and Cj represent the kth and jth classes and is the determinant of . The boundary between the two classes occurs when there is an equal probability of x belonging to either class,
The general solution for x in Eq. (A3) is a multi-dimensional parabola, which simplifies to a linear boundary if the distributions have a shared covariance such that . The solution in Eq. (A3) is extensible to K > 2 classes by considering the joint distributions of all classes (see Bishop57 Chap. 4.2.1 for details).
1. Clustering simulations
Three 2D Gaussian distributions, each with N = 10 000 points, were used to simulate overlapping clusters (Fig. 12). The true cluster means were −, and . The covariance of the first dataset was
with . For the first dataset [Fig. 12(a)], there were no off diagonal covariance terms.
(Color online) Clustering on two data distributions using (a), (b) GMM, (c), (d) K-means, and (e), (f) agglomerative clustering with Ward's method, with maximum likelihood boundaries shown by black lines.
(Color online) Clustering on two data distributions using (a), (b) GMM, (c), (d) K-means, and (e), (f) agglomerative clustering with Ward's method, with maximum likelihood boundaries shown by black lines.
The second dataset [Fig. 12(b)] was generated by rotating the data counterclockwise at θ, with
with = 3. The three clusters were rotated by and . The non-isotropic covariance is a result of correlation between variables and indicates feature correlations. A non-isotropic covariance matrix of a feature vector indicates correlation between features.
GMM assumes that the clusters are Gaussian distributed around a cluster mean. The K-means algorithm further assumes the Gaussians have the same diagonal covariance and uniform prior probability. Thus, GMM and K-means perform best for the first dataset [Figs. 12(a), 12(c), 12(e)] with identical, diagonal covariance for all clusters. By incorporating intercluster distance through the Ward metric, hierarchical agglomerative identifies 3 similar clusters.
When the cluster covariances are not of the form , K-means and agglomerative clustering provide solutions that are suboptimal to the true class boundaries. GMM achieves better accuracy [Figs. 12(b), 12(d), and 12(f)] but does not converge to the maximum likelihood solution due to estimation errors in the covariances. All three methods provide insight into the cluster memberships despite invalid class assumptions, but GMM is the most versatile clustering method.
2. K-means++ algorithm
K-means++ is an algorithm proposed to efficiently initialize parameters for Gaussian clustering methods, including GMM and K-means. K-means++ improves the runtime of the clustering algorithms and the quality of the final solution.55 For a set of data samples , the K-means++ algorithm follows:55
Initialize the priors .
Initialize the covariance matrices as , with var .
Select the first cluster mean as a random data sample, .
Assign all data samples to the cluster with the nearest mean. We denote that the mth sample belongs to cluster Ci as , satisfying . Note that when j = 1 (first step), , i.e., all data samples belong to one cluster.
- Select the j + 1-th cluster mean at random from the remaining data samples with probability(A6)d is the Mahalanobis distance,(A7)
Equation (A6) ensures that each mean is selected with a probability proportional to its distance from all existing means.
Repeat steps 4 and 5 until K cluster means are chosen.
3. Visualization of high-dimensional data
For data with more than two dimensions, for P > 2, clusters may be visualized by applying dimensionality reduction. In this study, 2D t-stochastic neighbor embedding (t-SNE)68 was used to visualize P–dimensional features.
The similarity of one point, , to another point, , is found from the conditional probability that the points are neighbors within a Gaussian density with mean ,68
for N points i, j = . The neighborhood of , as determined by σi, is defined implicitly in terms of the perplexity (Fig. 13),68,69
where , and H is the Shannon entropy. The optimal value of σi in Eq. (A8) for each point is solved with a binary search for a given value of perplexity.69
(Color online) Varying values of perplexity for t-SNE on N = 1000 randomly drawn points .
(Color online) Varying values of perplexity for t-SNE on N = 1000 randomly drawn points .
Then, a set of projected data is randomly initialized with zero-mean Gaussians of low variance,68 . The are iteratively updated until the Kullback-Leibler (KL) divergence between the conditional probability in Eq. (A8) and the Student's t-distribution of the projected data is minimized,
In contrast to classic SNE which uses only Gaussians, t-SNE's use of the Student t-distribution further penalizes outliers in the projected data,68 resulting in a more compact representation. As shown in Fig. 13, the value of perplexity should be varied according to user preference to obtain the desired visualization.
APPENDIX B: DIRECTIONAL DETECTOR
A directional event detector65 was developed to utilize the DASARs' directional capability by combining two DASARs. The complex spectrograms of the x– and y–particle velocity, matrices and , were generated identically to the spectrogram X with units m s−1.
The active intensity, a measure of in-plane energy, was used to determine the noise directionality
where the asterisk is the complex conjugate and the real component. a tan2 is a piecewise function that computes the elementwise angle between the elements of two matrices, with domain defined counterclockwise from the y axis ( = North). A is therefore the time-frequency representation of compass directionality. In the following, matrix is called the azigram70 for DASAR N, likewise for and .
The event detector makes the following signal assumptions:
An event arrives from a constant azimuthal sector for each DASAR.
Target events are broadband below 500 Hz. The minimum required bandwidth was set with an empirical threshold (see Appendix in Thode et al.65).
The detection algorithm is demonstrated in Fig. 14. First, the azigrams for the north- and southmost DASARs, and , were used to create binary maps and of time-frequency points within a fixed azimuthal sector [Fig. 14(a)],
and likewise for [Fig. 14(b)]. is the elementwise indicator function, with . Binary maps were generated for all combinations of azimuthal sectors . Here, . Next, overlapping signals on both DASARs were discovered by creating a combined map [Fig. 14(c)], which was summed across frequency to determine the detection timeseries,
d measures the bandwidth of an event. Event start and end times were determined for for threshold T. Events separated by less than were merged and events longer than were removed.
Directional detector for a fish call event on February 25, 07:12 HST, with DASAR N looking between 135° and 225° and DASAR S looking between 45° and 135° (clockwise from north). The overlap of the binary masks, summed across frequency, defines the detection timeseries.
Directional detector for a fish call event on February 25, 07:12 HST, with DASAR N looking between 135° and 225° and DASAR S looking between 45° and 135° (clockwise from north). The overlap of the binary masks, summed across frequency, defines the detection timeseries.