There is growing interest in the ability to detect and classify animal vocalizations in large-scale bioacoustic databases for the purposes of conservation and research. To aid in this, two methods are proposed for the quick and accurate detection of harmonic cetacean and fish vocalizations: normalized summation of sound harmonics and spectrogram masking. These methods utilize a normalization scheme that enables robust performance, achieving 30% more precision and recall than traditional spectrogram cross correlation in the presence of wideband noise and low signal-to-noise ratios. The proposed methods also perform up to 135 times faster than spectrogram cross correlation.
1. Introduction
The field of bioacoustics has enabled a closer inspection of the acoustic behavior of a diverse range of animal species, while also providing insight into the effect of anthropogenic sound on the animal kingdom.1,2 Passive acoustic monitoring (PAM) has made it possible to collect large volumes of sound data from the target species' acoustic environment in a non-invasive manner, aiding in the conservation of said species. PAM is particularly suited to underwater environments, where the low transparency of water hinders the visual monitoring of marine animals. As researchers attempt to analyze the passive acoustic recordings, they will often require a method to detect and classify the different animal calls present. This is a necessity, as there usually exist weeks or months of recording data containing significant amounts of ambient or anthropogenic noise. The annotation of passive acoustic data sets is consequently a very time-consuming endeavour. The detection and classification of cetaceans3 (and to a lesser extent, fish4,5) has garnered much attention, as the underwater acoustic environment poses many challenges to traditional detection methods such as voice activity detection and matched filtering. The variance of ambient noise in the ocean is quite large, causing the overall signal-to-noise ratio (SNR) of target vocalizations to decrease, resulting in many missed detections. In addition, shipping noise, seismic activity, airgun surveys,6 and other transient noise sources cause a plethora of false detections when analyzed by the aforementioned detection methods. The shortage of labeled bioacoustic data sets5,7 also poses a problem, as too little training data causes supervised detection methods, such as neural networks, to overfit and attain poor true detection rates.
Jiang et al.8 used a LeNet5 convolutional neural network (CNN) to detect and then classify killer whale and long-finned pilot whale whistles, achieving high true detection and true classification rates with less than two hours of training data. Allen et al.9 proposed the use of a ResNet-50 model to detect humpback whale songs and obtained high precision, although they required over 200 h of training data to yield these results. Riera et al.10 detected fish sounds by first segmenting acoustic events in a certain frequency band and then classifying the fish sounds using a random forest classifier, although their method garnered many false positives. Harakawa et al.5 proposed the use of a hybrid supervised learning method to detect the sounds of fish from the Sciaenidae family (commonly called drums or croakers), achieving good results even when the amount of labeled training data was small. These machine learning methods all have their respective merits in the detection and classification of cetacean vocalizations; however, the need for a suitable number of training samples is a problem for researchers with large quantities of unlabeled data. To address this problem, a fast and reliable detection approach is required, one that does not require a large amount of training data.
Many cetacean and fish calls exhibit not only a fundamental frequency, but also overtones (harmonics). In the past, the harmonic structure of human speech has been exploited for numerous purposes, such as voice activity detection11 and pitch estimation.12,13 The harmonic structure of many cetacean and fish calls can be used similarly, although very little research has been done in this regard. In their comparison of different tonal detectors, Bouffaut et al.7 utilised, among others, the harmonic product spectrum to test its effectiveness in the detection of Antarctic blue whale calls; however, Antarctic blue whales do not always produce harmonic sounds, so this method was not well suited to their study. Shapiro et al.14 proposed a method to track the pitch of both human speech and killer whale vocalizations using their discrete logarithmic Fourier transformation-pitch detection algorithm (DLFT-PDA), producing reliable results in both high and low SNR situations. Their work demonstrated the value of applying signal processing methods originally developed for human speech to underwater environments.
In this paper, two approaches to the detection of cetacean and fish sounds are proposed, aimed at providing fast, accurate, and robust results without the need for copious amounts of training data. The methods are based on the summation of sound harmonics and spectrogram masking, with a normalization scheme that ensures acceptable detection performance in noisy environments.
2. Methodology
2.1 Normalized summation of sound harmonics (NSSH)
This normalization ensures that if most of the signal's power lies in the harmonic bands, then the normalized harmonic sum will be large, as the harmonic sum will be very similar to Px, whereas if the signal's power is not present in the harmonic bands or if the signal is wideband, then the harmonic sum will be much smaller than Px, resulting in a smaller normalized value. If the harmonic sum spectrum is normalized by the total power instead of the average power (i.e., by N times the average power, where N is the number of samples in the PSD), it will yield the percentage of signal power contained in the harmonic bands. Note that if the sum is normalized by the power of the noise only, the resulting quantity would be the SNR. The maximum value of the normalized harmonic sum over the candidate fundamental frequencies is considered, and if this value exceeds a certain threshold, the signal is classified as a target vocalization. To minimize the impact of disturbances outside the desired frequency band, the signal should be band limited; otherwise the normalized harmonic sum will be too low, and false negatives can occur. This detection method will be referred to as the NSSH. It should perform well in detecting harmonic vocalizations, even if the vocalizations differ in duration, as the method only utilizes the power density spectrum of a signal frame.
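The detection rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that the harmonic sum adds the PSD bins lying within a small tolerance of each integer multiple of a candidate fundamental frequency, and the names `nssh_score`, `half_width_hz`, and `normalize_by`, as well as the default values, are hypothetical.

```python
import numpy as np

def nssh_score(psd, freqs, f0_candidates, n_harmonics=4, half_width_hz=5.0,
               normalize_by="average"):
    """Largest normalized harmonic sum over the candidate fundamentals.

    psd, freqs    : band-limited power density spectrum and its frequency axis (Hz)
    f0_candidates : candidate fundamental frequencies (Hz), an assumed search grid
    normalize_by  : "average" divides by the mean PSD value,
                    "total" divides by the summed PSD (fraction of power in the bands)
    """
    psd = np.asarray(psd, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    total = psd.sum()
    if total <= 0.0:
        return 0.0  # silent frame: nothing to detect
    norm = total / psd.size if normalize_by == "average" else total

    best = 0.0
    for f0 in f0_candidates:
        # Mark every bin that falls inside one of the harmonic bands of f0.
        in_band = np.zeros(psd.size, dtype=bool)
        for h in range(1, n_harmonics + 1):
            in_band |= np.abs(freqs - h * f0) <= half_width_hz
        best = max(best, psd[in_band].sum() / norm)
    return best
```

A frame would then be flagged as a target vocalization when the returned score exceeds a threshold chosen for the data set. With `normalize_by="total"` the score lies between 0 and 1, which can make the threshold easier to interpret.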
2.2 Spectrogram masking
If the target vocalization contains frequency characteristics that change over its duration, then the NSSH method would have to consider smaller frame lengths for the signal and the harmonic spectrum would span a larger fundamental frequency interval. This could make the method quite costly and decrease performance. To address this, the method in Sec. 2.1 can be extended from the two-dimensional power density spectrum to the three-dimensional spectrogram.
Consider a spectrogram image of the input signal and a spectrogram kernel that will mask this image, both indexed by l, the index of the PSD vector, and k, the index of the discrete frequency components. The kernel is constructed to resemble the target vocalization by taking a few spectrograms of the vocalization and averaging them. A threshold can then be used to create a binary mask.
The binary kernel can then be used to mask the input spectrogram through an element-wise multiplication. With this, only the frequency components present in the binary kernel are retained in the resulting masked spectrogram.
The quantity m is a ratio that compares the power contained in the selected frequency bands (per the masking kernel) to the total power in the spectrogram. This method utilizes NL floating point multiplications (one per spectrogram element). This is faster than a full spectrogram cross correlation,15 which must repeat these multiplications to derive the cross correlation at each time delay. The ratio m can be compared to a threshold to determine whether the spectrogram contains a target vocalization or only noise. For this normalization to work optimally, the power components outside the duration and frequency band of the target calls need to be minimized. This can be done by band-limiting the signals to retain only the frequency components in the target vocalizations, and by ensuring that the duration of the spectrograms is only as long as the duration of an average target call.
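The kernel construction and the ratio m can be illustrated with a short sketch. It is written under stated assumptions rather than taken from the original implementation: the binary threshold is expressed relative to the peak of the averaged spectrogram, and the names `build_binary_kernel`, `masking_ratio`, and `rel_threshold` are hypothetical.

```python
import numpy as np

def build_binary_kernel(example_spectrograms, rel_threshold=0.5):
    """Average a few example spectrograms and threshold the result.

    rel_threshold is an assumed way to choose the threshold:
    a fraction of the largest value in the averaged spectrogram.
    """
    mean_spec = np.mean(np.stack(example_spectrograms), axis=0)
    return mean_spec >= rel_threshold * mean_spec.max()

def masking_ratio(spectrogram, kernel):
    """Ratio m: power retained by the mask divided by total spectrogram power."""
    spectrogram = np.asarray(spectrogram, dtype=float)
    total = spectrogram.sum()
    if total <= 0.0:
        return 0.0
    masked = spectrogram * kernel  # element-wise multiplication with the binary kernel
    return masked.sum() / total
```

Wideband noise spreads its power over the whole spectrogram, so only a small share survives the mask and m stays low, whereas a call that matches the kernel keeps most of its power and pushes m toward 1.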
3. Experiments
This section describes the bioacoustic recordings and the use thereof in detection experiments. The performance of the NSSH and spectrogram masking methods will be compared to two other popular detection methods to gauge their effectiveness at discerning bioacoustic signals from stationary and wideband noise.
3.1 Data
Two sets of audio data were used for testing, both captured by hydrophones on the coast of False Bay, South Africa. The first set contained approximately 190 southern right whale (SRW) vocalizations and was one hour long. The SRW calls were captured at 96 kHz with a drifting buoy nine meters under the surface of the ocean. The drifting buoy was deployed during summer for two to four hours before being retrieved. The SRW data were downsampled to 8000 Hz and band limited to 100–700 Hz. The SRW vocalizations had a fundamental frequency between Hz and displayed four harmonics at most. Most of the calls displayed a good SNR (between 15 and 20 dB). The calls had durations that ranged between 250 and 700 ms.
The second set contained approximately 340 vocalizations from an unknown fish species and was 30 min long. The fish calls were assumed to be from the same species, possibly a type of croaker. The fish data set was recorded at 8 kHz using an Audiomoth,16 deployed 28 meters under the surface of the ocean. The Audiomoth was deployed for four to six weeks during winter and recorded continuously until it was retrieved. The fish data were downsampled to 2000 Hz and band limited to 100–500 Hz. The fish vocalizations had a fundamental frequency between Hz and had eight to nine visible harmonics (the harmonics below 150 Hz had a very low SNR, disappearing completely into the background noise). The fish recordings contain calls that have SNRs between 2 and 20 dB. The recordings also contain plenty of wideband disturbances, mostly due to noise from the anchor that moored the hydrophone. The fish calls had durations between 200 and 700 ms. Figure 1(a) contains example calls from the first data set and Fig. 1(b) contains example calls from the second data set.
Fig. 1. Spectrogram images of (a) southern right whale calls and (b) fish calls.
3.2 Detection methods
The vocalizations in both data sets were subject to detection by four methods, namely, NSSH, spectrogram masking, normalized spectrogram cross correlation (NSPCC), and RavenPro's band limited energy detector (BLED)17 (The Cornell Lab of Ornithology, Ithaca, NY).
The band limited energy detector takes an input recording and calculates the energy within a certain frequency bandwidth and then compares this energy to a detection threshold to determine if it contains a signal of interest. It succeeds in quickly detecting potential vocalizations in the presence of weak background noise and is often used as a detection benchmark.18–20
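The principle of a band limited energy detector can be sketched as follows. This is a generic illustration and not RavenPro's implementation, whose internals are not described here; the function names, the use of non-overlapping frames, and the fixed threshold are all assumptions.

```python
import numpy as np

def band_limited_energy(frame, sample_rate, f_low, f_high):
    """Energy of one frame restricted to the band [f_low, f_high] in Hz."""
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    band = (freqs >= f_low) & (freqs <= f_high)
    return float(np.sum(np.abs(spectrum[band]) ** 2))

def detect_frames(signal, sample_rate, frame_len, f_low, f_high, threshold):
    """Flag each non-overlapping frame whose in-band energy exceeds the threshold."""
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        flags.append(band_limited_energy(frame, sample_rate, f_low, f_high) > threshold)
    return flags
```

Because the decision depends only on the energy inside the band, any wideband disturbance that overlaps the band raises the energy in the same way a call does, which is consistent with the detector working best in weak background noise.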
For the NSSH method, frame lengths of 125 ms were taken to calculate the power density spectra for the SRW data set, using 512-point fast Fourier transforms (FFTs). For the fish data set, frame lengths of 200 ms and zero-padded 2048-point FFTs were used. The reason for the longer FFT length is that the fish harmonics are spaced much closer to each other than the SRW harmonics, requiring a higher frequency resolution. The power density spectra were calculated by time-averaging overlapping FFT frames, according to Welch's method.24 For both data sets, a frame overlap of 50% was used.
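As an illustration of the PSD estimate described above, the following sketch averages windowed, overlapping FFT segments within one signal frame. It is a simplified NumPy version of Welch's method under assumptions that are not stated in the text: a Hann window is used, segments shorter than the FFT length are zero-padded, and the one-sided doubling of the spectrum is omitted for brevity. The name `welch_psd` is hypothetical.

```python
import numpy as np

def welch_psd(frame, sample_rate, n_fft=512, overlap=0.5):
    """Average the periodograms of overlapping segments of one signal frame."""
    frame = np.asarray(frame, dtype=float)
    seg_len = min(n_fft, frame.size)              # short frames use a single segment
    hop = max(1, int(seg_len * (1.0 - overlap)))  # 50% overlap by default
    window = np.hanning(seg_len)
    scale = sample_rate * np.sum(window ** 2)     # density scaling

    periodograms = []
    for start in range(0, frame.size - seg_len + 1, hop):
        segment = frame[start:start + seg_len] * window
        spectrum = np.fft.rfft(segment, n=n_fft)  # zero-pads when seg_len < n_fft
        periodograms.append(np.abs(spectrum) ** 2 / scale)

    psd = np.mean(periodograms, axis=0)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return freqs, psd
```

With the SRW settings (a 125 ms frame at 8000 Hz is 1000 samples, with a 512-point FFT), this sketch averages two segments per frame and gives a frequency resolution of 15.625 Hz. With the fish settings (a 200 ms frame at 2000 Hz is 400 samples, with a 2048-point FFT), it uses one zero-padded segment with bins spaced just under 1 Hz apart.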
Signal frames of 687.5 ms and 700 ms were used for the SRW and the fish spectrograms, respectively. The spectrogram durations differ because of rounding from discretization. For spectrogram masking, a spectrogram overlap of 50% was used, while for NSPCC no overlap was used, as the correlation calculation ensured full coverage of the entire spectrogram.
The BLED was implemented using RavenPro software, whilst the other detection algorithms were implemented in C++.
3.3 Evaluation
To obtain the number of true positives, false positives, and false negatives from the detected segments, the ground truth labels (manual annotations) were discretized to contain the same number of samples as the automatic detections. After this, the ground truth labels and automatic detections were automatically compared to determine the number of true positives, false positives, and false negatives.
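The comparison can be sketched as a frame-by-frame count. This is one plausible reading of the procedure and not necessarily the exact matching rule used for the reported results; the helper names and the choice of marking every frame touched by an annotation are assumptions.

```python
import math

def discretize(segments, n_frames, frame_s):
    """Turn (start_s, end_s) annotations into one boolean label per frame."""
    labels = [False] * n_frames
    for start_s, end_s in segments:
        first = max(0, int(start_s / frame_s))
        last = min(n_frames, math.ceil(end_s / frame_s))
        for i in range(first, last):
            labels[i] = True
    return labels

def count_outcomes(truth, detected):
    """Count true positives, false positives, and false negatives per frame."""
    tp = sum(1 for t, d in zip(truth, detected) if t and d)
    fp = sum(1 for t, d in zip(truth, detected) if d and not t)
    fn = sum(1 for t, d in zip(truth, detected) if t and not d)
    return tp, fp, fn

def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-score, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```

Sweeping the detection threshold and recomputing these three values at each setting yields the points of a precision-recall curve.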
4. Results
The detection results for the SRW data set and the fish data set are captured in Figs. 2(a) and 2(b), respectively. It is clear that the detection algorithms performed well when applied to calls from the southern right whale, obtaining F-scores above 90%. The NSSH, NSPCC, and spectrogram masking methods all performed similarly, while the BLED performed slightly worse.
Fig. 2. Precision-recall curves of (a) southern right whale calls and (b) fish calls.
In Fig. 2(b), it is quite apparent that the methods had difficulty detecting the vocalizations present in the fish recording, largely due to the large amounts of wideband noise and the variable duration of the fish sounds. The NSSH and spectrogram masking methods produced the best results, reaching 80% precision and recall at the optimal operating point, whereas the NSPCC struggled to break 50% precision. The BLED fared even worse, struggling to reach 40% precision. This means that NSSH and spectrogram masking yield half the false positive rate of the BLED for the same true positive rate.
The timing results are displayed in Table 1. The NSSH method performed the fastest, with an execution speed 1348–1874 times faster than real time (computed by dividing the sound file duration by the algorithm execution time), while spectrogram cross correlation performed the slowest at 74.3–113 times faster than real time. In the case of NSSH, the FFT calculations take the longest to compute, while for spectrogram masking and spectrogram cross correlation, the algorithms themselves take longer than the FFT calculations. Both the algorithm times and the FFT times depend on the length of the signal; the other parameters that affect execution times are the frame length, overlap percentage, and FFT length. For example, if a spectrogram overlap of 75% were used for spectrogram masking, then the algorithm would take twice as long. The execution time of the BLED could not be estimated as accurately as that of the other methods, given that it is proprietary software and does not provide any means of accurately measuring its execution time. Its execution time (for both sound files) is estimated to be about two seconds.
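The real-time factors quoted above follow directly from the total execution times in Table 1 and the recording durations (3600 s for the SRW recording and 1800 s for the fish recording). A small worked check, with `real_time_factor` as a hypothetical helper name:

```python
def real_time_factor(recording_s, execution_ms):
    """Recording duration divided by execution time, both expressed in seconds."""
    return recording_s / (execution_ms / 1000.0)

# Total execution times taken from Table 1.
nssh_srw = real_time_factor(3600, 1921)     # about 1874
nssh_fish = real_time_factor(1800, 1335)    # about 1348
nspcc_srw = real_time_factor(3600, 31700)   # about 113
nspcc_fish = real_time_factor(1800, 24220)  # about 74.3

# Algorithm-only speed-up of NSSH over NSPCC on the SRW recording.
algorithm_speedup = 30000 / 221             # about 135
```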
Table 1. Execution times for NSSH, spectrogram masking, and normalized spectrogram cross correlation (NSPCC). The time to compute the FFTs (FFT time) and the time to execute the different algorithms (Algorithm time) on a one-hour southern right whale recording (sampled at 8000 Hz) and a 30 min fish recording (sampled at 2000 Hz) are shown. The total execution time for each method is given as well.
Method | SRW: FFT time (N = 512) | SRW: Algorithm time | SRW: Total | Fish: FFT time (N = 2048) | Fish: Algorithm time | Fish: Total
---|---|---|---|---|---|---
NSSH | 1700 ms | 221 ms | 1921 ms | 1000 ms | 335 ms | 1335 ms
Spectrogram masking | 1700 ms | 5906 ms | 7606 ms | 1000 ms | 7694 ms | 8906 ms
NSPCC | 1700 ms | 30 000 ms | 31 700 ms | 1000 ms | 23 220 ms | 24 220 ms
It might seem strange that the FFT computation time for the fish data set is shorter than that for the SRW data set, given that the fish data set has much longer FFT lengths. However, the fish recording is shorter in duration and has a lower sampling rate, which means there are fewer FFTs to calculate, hence the shorter computation time. The longer FFT lengths for the fish data set have a significant effect on the algorithm run time, as there are more FFT points being processed than for the SRW data set (hence more floating-point operations), which is why the algorithm run times are longer for the fish data set.
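The difference in workload can be made concrete by counting signal frames. The sketch below counts frames only, using the NSSH frame lengths and 50% overlap given in Sec. 3.2; it is an assumption that this count tracks the number of FFT calls in the actual implementation, and `frame_count` is a hypothetical helper.

```python
def frame_count(total_samples, frame_samples, hop_samples):
    """Number of complete frames that fit in the recording."""
    return (total_samples - frame_samples) // hop_samples + 1

# SRW: 3600 s at 8000 Hz, 125 ms frames (1000 samples), 50% overlap (500-sample hop).
srw_frames = frame_count(3600 * 8000, 1000, 500)   # 57599 frames

# Fish: 1800 s at 2000 Hz, 200 ms frames (400 samples), 50% overlap (200-sample hop).
fish_frames = frame_count(1800 * 2000, 400, 200)   # 17999 frames
```

The SRW recording therefore contains roughly three times as many frames as the fish recording, while each fish frame carries four times as many FFT points (2048 instead of 512).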
5. Discussion
When choosing a detector to automatically detect bioacoustic signals, it is important to have a good balance between the accuracy and the speed of the method. A method could perform very well, but if it takes longer than manual detection, then it serves no purpose. Conversely, if a method performs very quickly but returns a plethora of false positives while missing many true positives, then its execution speed is irrelevant. For the purposes of detecting harmonic bioacoustic signals, both the NSSH and spectrogram masking methods work well, yielding robust results in noisy conditions within an acceptable timeframe. NSSH works particularly well when the target vocalizations are known to be harmonic and time is of the essence, as it performs the fastest and achieves the highest true positive rate and lowest false positive rate.
Spectrogram masking achieves a precision and recall close to those of NSSH, although it is slower. Spectrogram masking can be utilized for more general call types, as it requires only a kernel of the target vocalization. If the target signal has a complex shape that would require the kernel to fit almost completely to ensure detection, then the overlap should be increased to improve detection performance. The masking would then be similar to cross correlation; the only difference is the normalization scheme. Normalizing by only the input signal's total power provides a good boost in detection performance, as there is only a small probability that a signal's frequency components will be concentrated in the target vocalization's frequency contour without it being a target vocalization. Spectrogram masking performs faster than spectrogram cross correlation while delivering more true positives and fewer false positives, making it a suitable alternative for the quick detection of bioacoustic signals.
Both NSSH and spectrogram masking succeeded in the quick and accurate detection of harmonic cetacean and fish vocalizations, achieving 30% more precision than normalized spectrogram cross correlation when applied to the fish data set and performing up to 135 times faster when applied to the SRW data set (not counting the FFT time). These results support their use as automatic detection tools that can aid in the detection of marine animal calls in large bioacoustic data sets. This is of particular interest in an age where supervised learning, particularly deep learning, is growing more prevalent and the need for labeled data is becoming ever more significant.
Acknowledgments
This work is based on research supported in part by the National Research Foundation of South Africa (Grant Number: 129416).
AUTHOR DECLARATIONS
Conflict of interest
There are no conflicts of interest to disclose.
Ethics Approval
The University of Stellenbosch has given ethics approval under the title “Passive acoustic recordings of Cetaceans,” protocol number ACU-2022-23512. We also have a research permit from the Department of Forestry, Fisheries and Environment (DFFE), permit number RES2023-47.
DATA AVAILABILITY
The audio data that support the findings of this study are available from the corresponding author upon reasonable request.