There is a growing interest in the ability to detect and classify animal vocalizations in large-scale bioacoustic databases for the purposes of conservation and research. To aid in this, two methods are proposed for the quick and accurate detection of harmonic cetacean and fish vocalizations: normalized summation of sound harmonics and spectrogram masking. These methods utilize a normalization scheme that enables robust performance, achieving 30% more precision and recall than traditional spectrogram cross correlation in the presence of wideband noise and low signal-to-noise ratios. The proposed methods also perform up to 135 times faster than spectrogram cross correlation.

The field of bioacoustics has enabled a closer inspection of the acoustic behavior of a diverse range of animal species, while also providing insight into the effect of anthropogenic sound on the animal kingdom.1,2 Passive acoustic monitoring (PAM) has made it possible to collect large volumes of sound data from the target species' acoustic environment in a non-invasive manner, aiding in the conservation of said species. PAM is particularly suited to underwater environments, where the low transparency of water obscures the visual monitoring of marine animals. Researchers analyzing passive acoustic recordings will often require a method to detect and classify the different animal calls present. This is a necessity, as there usually exist weeks or months of recorded data containing significant amounts of noise, whether ambient or anthropogenic, and the annotation of passive acoustic data sets is consequently a very time-consuming endeavour. The detection and classification of cetaceans3 (and, to a lesser extent, fish4,5) has garnered much attention, as the underwater acoustic environment poses many challenges to traditional detection methods such as voice activity detection and matched filtering. The variance of ambient noise in the ocean is quite large, causing the overall signal-to-noise ratio (SNR) of target vocalizations to decrease and resulting in many missed detections. At the same time, shipping noise, seismic activity, airgun surveys,6 and other transient noise sources cause a plethora of false detections when analyzed by the aforementioned detection methods. The shortage of labeled bioacoustic data sets5,7 also poses a problem, as too little training data causes supervised detection methods, such as neural networks, to overfit and attain poor true detection rates.

Jiang et al.8 used a LeNet-5 convolutional neural network (CNN) to detect and then classify killer whale and long-finned pilot whale whistles, achieving high true detection and true classification rates with less than two hours of training data. Allen et al.9 proposed the use of a ResNet-50 model to detect humpback whale songs and obtained high precision, although they required over 200 h of training data to yield these results. Riera et al.10 detected fish sounds by first segmenting acoustic events in a certain frequency band and then classifying the fish sounds using a random forest classifier, although their method garnered many false positives. Harakawa et al.5 proposed the use of a hybrid supervised learning method to detect the sounds of fish from the Sciaenidae family (commonly called drums or croakers), achieving good results even when the amount of labeled training data was small. These machine learning methods all have their respective merits in the detection and classification of cetacean vocalizations; however, the need for a suitable number of training samples is a problem for researchers with large quantities of unlabeled data. To address this problem, a fast and reliable detection approach is required, one that does not require a large amount of training data.

Many cetacean and fish calls exhibit not only a fundamental frequency, but also overtones (harmonics). In the past, the harmonic structure of human speech has been exploited for numerous purposes, such as voice activity detection11 and pitch estimation.12,13 The harmonic structure of many cetacean and fish calls can be used similarly, although very little research has been done in this regard. In their comparison of different tonal detectors, Bouffaut et al.7 utilized, among others, the harmonic product spectrum to test its effectiveness in the detection of Antarctic blue whale calls; however, Antarctic blue whales do not always produce harmonic sounds, and this method was therefore not well suited to their study. Shapiro et al.14 proposed a method to track the pitch of both human speech and killer whale vocalizations using their discrete logarithmic Fourier transformation-pitch detection algorithm (DLFT-PDA), producing reliable results in both high and low SNR situations. Their work demonstrated the value of applying signal processing methods originally developed for human speech to underwater environments.

In this paper, two approaches to the detection of cetacean and fish sounds are proposed, aimed at providing fast, accurate, and robust results without the need for copious amounts of training data. The methods are based on the summation of sound harmonics and spectrogram masking, with a normalization scheme that ensures acceptable detection performance in noisy environments.

A method developed by Hermes12 for pitch tracking, called subharmonic summation, is adapted here for the detection of cetacean and fish vocalizations. The method first computes the power spectral density (PSD) $S_{xx}[k]$ of an input signal frame $x[n]$, after which the power components at integer multiples of the fundamental frequency are summed to produce a harmonic sum spectrum $\Upsilon[k]$, expressed algebraically as
$$\Upsilon[k] = \sum_{h=1}^{H} S_{xx}[h \cdot k], \tag{1}$$
where $H$ is the number of harmonics to consider, $h$ the index for the harmonics, $k$ the index for discrete frequency components, and $n$ the sample index. The harmonic sum spectrum spans the fundamental frequencies between $f_{\min}$ and $f_{\max}$.
This spectrum is then normalized by the average power of the original signal ($P_x$),
$$\Upsilon_{\mathrm{norm}}[k] = \frac{\Upsilon[k]}{P_x}. \tag{2}$$

This normalization ensures that if most of the signal's power lies in the harmonic bands, the value of $\Upsilon_{\mathrm{norm}}[k]$ will be large, since $\Upsilon[k]$ will then account for most of the signal's power; whereas if the signal's power is not concentrated in the harmonic bands, or if the signal is wideband, $\Upsilon[k]$ will be much smaller, resulting in a small $\Upsilon_{\mathrm{norm}}[k]$. If the harmonic sum spectrum is normalized by the total power instead of the average power (i.e., $N P_x$, where $N$ is the number of samples in the PSD), it yields the fraction of signal power contained in the harmonic bands. Note that if the sum is normalized by the power of the noise only (i.e., $N P_x - \Upsilon[k]$), the resulting quantity would be the SNR. The maximum value of $\Upsilon_{\mathrm{norm}}[k]$ is taken, and if it exceeds a certain threshold, the signal is classified as a target vocalization. To minimize the impact of disturbances outside the desired frequency band (i.e., for $f < f_{\min}$ and $f > H f_{\max}$), the signal should be band limited; otherwise $\Upsilon_{\mathrm{norm}}[k]$ will be too low and false negatives can occur. This detection method will be referred to as normalized summation of sound harmonics (NSSH). It should perform well in detecting harmonic vocalizations, even if the vocalizations differ in duration, as it only utilizes the power density spectrum of a signal frame.
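To make the steps above concrete, here is a minimal Python sketch of the NSSH computation for a single frame. The function and parameter names are illustrative assumptions, and the Welch-averaged PSD described later is replaced by a plain periodogram for brevity:

```python
import numpy as np

def nssh_score(x, fs, f_min, f_max, H, nfft=512):
    """Maximum normalized harmonic sum for one signal frame (sketch)."""
    X = np.fft.rfft(x, n=nfft)
    Sxx = np.abs(X) ** 2 / nfft        # periodogram PSD estimate
    Px = Sxx.mean()                    # average power of the frame
    df = fs / nfft                     # frequency resolution per bin

    best = 0.0
    # Candidate fundamental bins between f_min and f_max (Eq. 1).
    for k in range(int(f_min / df), int(f_max / df) + 1):
        idx = np.arange(1, H + 1) * k  # bins of the H harmonics
        idx = idx[idx < Sxx.size]
        best = max(best, Sxx[idx].sum())

    return best / Px                   # normalization of Eq. (2)

# A harmonic frame scores far higher than a wideband-noise frame.
fs = 8000
t = np.arange(0, 0.125, 1 / fs)
harmonic = sum(np.sin(2 * np.pi * 120 * h * t) for h in range(1, 4))
noise = np.random.default_rng(0).normal(size=t.size)
s_harmonic = nssh_score(harmonic, fs, 100, 150, H=3)
s_noise = nssh_score(noise, fs, 100, 150, H=3)
```

Thresholding `s_harmonic`-style scores then yields the detection decision described above.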

If the target vocalization contains frequency characteristics that change over its duration, then the NSSH method would have to consider smaller frame lengths for the signal and the harmonic spectrum would span a larger fundamental frequency interval. This could make the method quite costly and decrease performance. To address this, the method in Sec. 2.1 can be extended from the two-dimensional power density spectrum to the three-dimensional spectrogram.

Let $S[l, k]$ be a spectrogram image of the input signal and $K[l, k]$ a spectrogram kernel that will mask the original image, where $l$ is the index of the PSD vector and $k$ is the index for the discrete frequency components. The kernel is constructed to resemble the target vocalization by taking a few spectrograms of the vocalization and averaging them. A threshold can then be used to create a binary mask.
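As a sketch, the kernel construction might look as follows. The dB threshold value and the helper name are assumptions, since the text does not specify a particular thresholding rule:

```python
import numpy as np

def binary_kernel(example_specs, threshold_db=-10.0):
    """Average example spectrograms and threshold into a 0/1 mask."""
    mean_spec = np.mean(np.stack(example_specs), axis=0)
    # Keep bins within threshold_db of the peak of the averaged image.
    level_db = 10 * np.log10(mean_spec / mean_spec.max() + 1e-12)
    return (level_db > threshold_db).astype(float)

# Two toy example spectrograms with call energy in frequency bins 3-4:
a = np.full((5, 8), 0.01); a[:, 3:5] = 1.0
b = np.full((5, 8), 0.02); b[:, 3:5] = 0.8
K = binary_kernel([a, b])
```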

The binary kernel can then be used to mask the input spectrogram S through an element-wise multiplication. With this, only the frequency components present in K are retained in the resulting spectrogram M.

The frequency components in M are summed and the result is normalized by the total power in the input spectrogram S ($P_S$),
$$m = \frac{\sum_{l=0}^{L-1} \sum_{k=0}^{N-1} S[l, k]\, K[l, k]}{P_S}, \tag{3}$$
where $N$ is the number of samples in the PSD (half the length of the Fourier transform) and $L$ is the number of PSDs in a spectrogram.

The quantity m is a ratio that compares the power contained in the selected frequency bands (per the masking kernel) to the total power in the spectrogram. This method utilizes $NL$ floating point multiplications. This is faster than a full spectrogram cross correlation,15 which utilizes $N(2L-1)$ floating point multiplications to derive the cross correlation at each time delay. The ratio m can be compared to a threshold to determine whether the spectrogram contains a target vocalization or only noise. For this normalization to work optimally, the power components outside the duration and frequency band of the target calls need to be minimized. This can be done by band-limiting the signals to retain only the frequency components of the target vocalizations, and by ensuring the duration of the spectrograms is only as long as the duration of an average target call.
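Equation (3) amounts to one element-wise multiply and a sum; a minimal sketch:

```python
import numpy as np

def mask_ratio(S, K):
    """Fraction of the spectrogram's power under the binary mask (Eq. 3)."""
    return (S * K).sum() / S.sum()   # denominator is P_S, the total power

# If half of a uniform spectrogram's bins fall under the mask, m = 0.5.
S = np.ones((4, 8))
K = np.zeros((4, 8)); K[:, :4] = 1.0
m = mask_ratio(S, K)
```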

This section describes the bioacoustic recordings and the use thereof in detection experiments. The performance of the NSSH and spectrogram masking methods will be compared to two other popular detection methods to gauge their effectiveness at discerning bioacoustic signals from stationary and wideband noise.

Two sets of audio data were used for testing, both captured by hydrophones on the coast of False Bay, South Africa. The first set was one hour long and contained approximately 190 southern right whale (SRW) vocalizations. The SRW calls were captured at 96 kHz with a drifting buoy nine meters below the surface of the ocean. The drifting buoy was deployed during summer for two to four hours before being retrieved. The SRW data were downsampled to 8000 Hz and band limited to 100–700 Hz. The SRW vocalizations had a fundamental frequency in the range $100 \le f_0 \le 150$ Hz and displayed four harmonics at most. Most of the calls displayed a good SNR (between 15 and 20 dB). The calls had durations that ranged between 250 and 700 ms.

The second set was 30 min long and contained approximately 340 vocalizations from an unknown fish species. The fish calls were assumed to be from the same species, possibly a type of croaker. The fish data set was recorded at 8 kHz using an Audiomoth,16 deployed 28 meters below the ocean surface. The Audiomoth was deployed for four to six weeks during winter and recorded continuously until it was retrieved. The fish data were downsampled to 2000 Hz and band limited to 100–500 Hz. The fish vocalizations had a fundamental frequency in the range $30 \le f_0 \le 40$ Hz and had eight to nine visible harmonics (the harmonics below 150 Hz had a very low SNR, disappearing completely into the background noise). The fish recordings contain calls with SNRs between 2 and 20 dB. The recordings also contain plenty of wideband disturbances, mostly due to noise from the anchor that moored the hydrophone. The fish calls had durations between 200 and 700 ms. Figure 1(a) contains example calls from the first data set and Fig. 1(b) contains example calls from the second data set.

Fig. 1. Spectrogram image of (a) southern right whale calls and (b) fish calls.


The vocalizations in both data sets were subject to detection by four methods, namely, NSSH, spectrogram masking, normalized spectrogram cross correlation (NSPCC), and RavenPro's band limited energy detector (BLED)17 (The Cornell Lab of Ornithology, Ithaca, NY).

The band limited energy detector takes an input recording and calculates the energy within a certain frequency bandwidth and then compares this energy to a detection threshold to determine if it contains a signal of interest. It succeeds in quickly detecting potential vocalizations in the presence of weak background noise and is often used as a detection benchmark.18–20 

Normalized spectrogram cross correlation calculates the correlation $C[\tau]$ between a spectrogram kernel $K$ and an input signal spectrogram $S$ for a time delay (or sample delay) $\tau$. It is normalized by the root mean square value of each spectrogram's power, i.e.,
$$C[\tau] = \frac{\sum_{l=0}^{L-1} \sum_{k=0}^{N-1} S[\tau + l, k]\, K[l, k]}{\sqrt{P_S P_K}}, \tag{4}$$
where $l$ is the index of the PSD vector, $k$ is the index for the discrete frequency components, $P_K$ is the power of the kernel spectrogram, and $P_S$ is the power of the input spectrogram. Spectrogram cross correlation sees widespread use in the field of bioacoustics,21–23 as it does not require training data, only a priori knowledge of the target vocalization's time-frequency characteristics. This method is known to struggle when vocalizations vary in frequency,22 as the target signal would then not match the kernel.
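A sketch of normalized spectrogram cross correlation, sliding the kernel along the input spectrogram's time axis; computing $P_S$ over the kernel-length window at each delay is an implementation assumption:

```python
import numpy as np

def nspcc(S, K):
    """Normalized spectrogram cross correlation over all delays (sketch)."""
    L, _ = K.shape
    P_K = K.sum()                        # power of the kernel spectrogram
    C = np.empty(S.shape[0] - L + 1)
    for tau in range(C.size):
        window = S[tau:tau + L]          # kernel-length slice of S
        P_S = window.sum()               # power under the window
        C[tau] = (window * K).sum() / np.sqrt(P_S * P_K)
    return C

# The correlation peaks where the kernel pattern occurs in S.
K = np.full((4, 8), 10.0)
S = np.full((20, 8), 0.1)
S[5:9] = K                               # embed the pattern at delay 5
C = nspcc(S, K)
```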

For the NSSH method, frame lengths of 125 ms were taken to calculate the power density spectra for the SRW data set, using 512-point fast Fourier transforms (FFTs). For the fish data set, frame lengths of 200 ms and zero-padded 2048-point FFTs were used. The reason for the longer FFT length is that the fish harmonics are spaced much closer to each other than the SRW harmonics, requiring a higher frequency resolution. The power density spectra were calculated by time-averaging overlapping FFT frames, according to Welch's method.24 For both data sets, a frame overlap of 50% was used.
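Under these settings, the Welch PSD for one SRW analysis frame can be sketched with SciPy; the random placeholder signal is an assumption:

```python
import numpy as np
from scipy.signal import welch

fs = 8000                                           # SRW rate after downsampling
frame = np.random.default_rng(1).normal(size=1000)  # one 125 ms analysis frame

# 512-point FFT segments with 50% overlap, time-averaged per Welch's method.
f, Sxx = welch(frame, fs=fs, nperseg=512, noverlap=256)
```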

Signal frames of 687.5 ms and 700 ms were used for the SRW and the fish spectrograms, respectively. The slightly different spectrogram durations result from rounding during discretization. For spectrogram masking, a spectrogram overlap of 50% was used, while for NSPCC no overlap was used, as the correlation calculation ensured full coverage of the entire spectrogram.

The BLED was implemented using RavenPro software, whilst the other detection algorithms were implemented in C++.

To evaluate the performance of the detection methods, the number of correct detections, called true positives (TP), is compared to the number of missed detections and false detections, called false negatives (FN) and false positives (FP), respectively. Precision and recall combine these three quantities, with precision defined as
$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \tag{5}$$
and recall defined as
$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}. \tag{6}$$
These measures work particularly well for imbalanced data sets, where the negative class represents the majority of the data. The harmonic mean of precision and recall is expressed through the F-score,
$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{7}$$
The optimal operating point of a detector is where the F-score is highest. A detector with good performance has a precision-recall curve with a large area under the curve, containing points where both precision and recall are high (the highest value for both is 1.0, at the top-right corner of the curve).
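For instance, Eqs. (5)–(7) in code, with hypothetical detection counts:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-score from detection counts, Eqs. (5)-(7)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# e.g., 80 correct detections, 20 false alarms, 10 missed calls:
p, r, f = precision_recall_f(80, 20, 10)
```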

To obtain the number of true positives, false positives, and false negatives from the detected segments, the ground truth labels (manual annotations) were discretized to contain the same number of samples as the automatic detections. After this, the ground truth labels and automatic detections were automatically compared to determine the number of true positives, false positives, and false negatives.
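The comparison step can be sketched as a per-segment match between two boolean vectors; the array-based representation of the discretized labels is an assumption:

```python
import numpy as np

def count_outcomes(truth, detected):
    """TP, FP, FN counts from equally discretized labels and detections."""
    truth = np.asarray(truth, dtype=bool)
    detected = np.asarray(detected, dtype=bool)
    tp = int(np.count_nonzero(truth & detected))    # hits
    fp = int(np.count_nonzero(~truth & detected))   # false alarms
    fn = int(np.count_nonzero(truth & ~detected))   # misses
    return tp, fp, fn

# Five segments: calls in segments 0, 1, 4; detections in 0, 3, 4.
tp, fp, fn = count_outcomes([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```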

The detection results for the SRW data set and the fish data set are captured in Figs. 2(a) and 2(b), respectively. It is clear that the detection algorithms performed well when applied to calls from the southern right whale, obtaining F-scores above 90%. The NSSH, NSPCC, and spectrogram masking methods all performed similarly, while the BLED performed slightly worse.

Fig. 2. Precision-recall curve of (a) southern right whale calls and (b) fish calls.


In Fig. 2(b), it is quite apparent that the methods had difficulty detecting the vocalizations present in the fish recording, largely due to the large amounts of wideband noise and the variable duration of the fish sounds. The NSSH and spectrogram masking methods produced the best results, reaching 80% precision and recall at the optimal operating point, whereas the NSPCC struggled to break 50% precision. The BLED fared even worse, struggling to reach 40% precision. This means that, for the same true positive rate, NSSH and spectrogram masking yield far fewer false positives than the BLED.

The timing results are displayed in Table 1. The NSSH method performed the fastest, with an execution speed 1348–1874 times faster than real time (computed by dividing the sound file duration by the algorithm execution time), while spectrogram cross correlation performed the slowest, at 74.3–113 times faster than real time. In the case of NSSH, the FFT calculations take the longest to compute, while for spectrogram masking and spectrogram cross correlation, the algorithms themselves take longer than the FFT calculations. Both the algorithm times and FFT times depend on the length of the signal; the other parameters that affect execution times are the frame length, overlap percentage, and FFT length. For example, if a spectrogram overlap of 75% were used for spectrogram masking, the algorithm would take twice as long. The execution time of the BLED could not be estimated as accurately as for the other methods, given that it is proprietary software without any means of accurately measuring its execution time. Its execution time (for both sound files) is estimated to be about two seconds.

Table 1.

Execution times for NSSH, spectrogram masking, and normalized spectrogram cross correlation. The time to compute the FFTs (FFT time) and to execute the different algorithms (Algorithm time) on a one hour southern right whale recording (sampled at 8000 Hz) and a 30 min fish recording (sampled at 2000 Hz) is shown, together with the total execution time for each method.

                     |       SRW (FFT: N = 512)         |       Fish (FFT: N = 2048)
                     | FFT time | Algorithm time | Total     | FFT time | Algorithm time | Total
NSSH                 | 1700 ms  | 221 ms         | 1921 ms   | 1000 ms  | 335 ms         | 1335 ms
Spectrogram masking  | 1700 ms  | 5906 ms        | 7606 ms   | 1000 ms  | 7694 ms        | 8906 ms
NSPCC                | 1700 ms  | 30 000 ms      | 31 700 ms | 1000 ms  | 23 220 ms      | 24 220 ms

It might seem strange that the FFT computation time for the fish data set is shorter than for the SRW data set, given that the fish data set uses a much longer FFT length. However, the fish recording is shorter in duration and has a lower sampling rate, so there are fewer FFTs to calculate, hence the shorter computation time. The longer FFT length does have a significant effect on the algorithm run time, as there are more FFT points to process than for the SRW data set (hence more floating-point operations), which is why the algorithm run times are longer for the fish data set.

When choosing a detector to automatically detect bioacoustic signals, it is important to have a good balance between the accuracy and the speed of the method. A method could perform very well but if it takes longer than manual detection, then it serves no purpose. Conversely, if a method performs very quickly but returns a plethora of false positives while missing many true positives, then its execution speed is irrelevant. For the purposes of detecting harmonic bioacoustic signals, both the NSSH and spectrogram masking methods work well, yielding robust results in noisy conditions within an acceptable timeframe. NSSH works particularly well when the target vocalizations are known to be harmonic and time is of the essence, as it performs the fastest and achieves the highest true positive rate and lowest false positive rate.

Spectrogram masking achieves a precision and recall close to that of NSSH, albeit at a slower execution speed. It can be utilized for more general call types, as it requires only a kernel of the target vocalization. If the target signal has a complex shape that requires the kernel to fit almost completely to ensure detection, the overlap should be increased to improve detection performance. The masking then becomes similar to cross correlation; the only difference is the normalization scheme. Normalizing by only the input signal's total power provides a good boost in detection performance, as there is a small probability that a signal's frequency components will be concentrated in the target vocalization's frequency contour without it being a target vocalization. Spectrogram masking performs faster than spectrogram cross correlation while delivering more true positives and fewer false positives, making it a suitable alternative for the quick detection of bioacoustic signals.

Both NSSH and spectrogram masking succeeded in the quick and accurate detection of harmonic cetacean and fish vocalizations, achieving 30% more precision than normalized spectrogram cross correlation when applied to the fish data set and performing 135 times faster when applied to the SRW data set (not counting the FFT time). These results support their use as automatic detection tools that can aid in the detection of marine animal calls in large bioacoustic data sets. This is of particular interest in an age where supervised learning, particularly deep learning, is growing more prevalent and the need for labeled data is becoming ever more significant.

This work is based on research supported in part by the National Research Foundation of South Africa (Grant Number: 129416).

There are no conflicts of interest to disclose.

The University of Stellenbosch has given ethics approval under the title “Passive acoustic recordings of Cetaceans,” protocol number ACU-2022-23512. We also have a research permit from the Department of Forestry, Fisheries and Environment (DFFE), permit number RES2023-47.

The audio data that support the findings of this study are available from the corresponding author upon reasonable request.

1. A. N. Popper and A. D. Hawkins, "An overview of fish bioacoustics and the impacts of anthropogenic sounds on fishes," J. Fish Biol. 94(5), 692–713 (2019).
2. P. Laiolo, "The emerging significance of bioacoustics in animal species conservation," Biol. Conserv. 143, 1635–1645 (2010).
3. A. M. Usman, O. O. Ogundile, and D. J. J. Versfeld, "Review of automatic detection and classification techniques for cetacean vocalization," IEEE Access 8, 105181–105206 (2020).
4. M. Malfante, M. Dalla Mura, J. Mars, and C. Gervaise, "Automatic fish sounds classification," J. Acoust. Soc. Am. 143(5), 2834–2846 (2018).
5. R. Harakawa, T. Ogawa, M. Haseyama, and T. Akamatsu, "Automatic detection of fish sounds based on multi-stage classification including logistic regression via adaptive feature weighting," J. Acoust. Soc. Am. 144(5), 2709–2718 (2018).
6. R. Williams, A. Wright, E. Ashe, L. Blight, R. Bruintjes, R. Canessa, C. Clark, S. Cullis-Suzuki, D. Dakin, C. Erbe, P. Hammond, N. Merchant, P. O'Hara, J. Purser, A. Radford, S. Simpson, L. Thomas, and M. Wale, "Impacts of anthropogenic noise on marine life: Publication patterns, new discoveries, and future directions in research and management," Ocean Coastal Manage. 115, 17–24 (2015).
7. L. Bouffaut, S. Madhusudhana, V. Labat, A.-O. Boudraa, and H. Klinck, "A performance comparison of tonal detectors for low-frequency vocalizations of Antarctic blue whales," J. Acoust. Soc. Am. 147, 260–266 (2020).
8. J.-J. Jiang, L.-R. Bu, F.-J. Duan, X.-Q. Wang, W. Liu, Z.-B. Sun, and C.-Y. Li, "Whistle detection and classification for whales based on convolutional neural networks," Appl. Acoust. 150, 169–178 (2019).
9. A. N. Allen, M. Harvey, L. Harrell, A. Jansen, K. P. Merkens, C. C. Wall, J. Cattiau, and E. M. Oleson, "A convolutional neural network for automated detection of humpback whale song in a diverse, long-term passive acoustic dataset," Front. Mar. Sci. 8, 607321 (2021).
10. A. Riera, R. Rountree, X. Mouy, J. Ford, and F. Juanes, "Effects of anthropogenic noise on fishes at the SGaan Kinghlas-Bowie Seamount Marine Protected Area," Proc. Mtgs. Acoust. 27, 010005 (2016).
11. L. N. Tan, B. J. Borgstrom, and A. Alwan, "Voice activity detection using harmonic frequency components in likelihood ratio test," in Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX (March 14–19, 2010), pp. 4466–4469.
12. D. Hermes, "Measurement of pitch by subharmonic summation," J. Acoust. Soc. Am. 83, 257–264 (1988).
13. T. Drugman and A. Alwan, "Joint robust voicing detection and pitch estimation based on residual harmonics," in Proceedings of Interspeech 2011, Florence, Italy (August 27–31, 2011), pp. 1973–1976.
14. A. D. Shapiro and C. Wang, "A versatile pitch tracking algorithm: From human speech to killer whale vocalizations," J. Acoust. Soc. Am. 126(1), 451–459 (2009).
15. D. K. Mellinger and C. W. Clark, "Recognizing transient low-frequency whale sounds by spectrogram correlation," J. Acoust. Soc. Am. 107(6), 3518–3529 (2000).
16. A. P. Hill, P. Prince, J. L. Snaddon, C. P. Doncaster, and A. Rogers, "Audiomoth: A low-cost acoustic device for monitoring biodiversity and the environment," HardwareX 6, e00073 (2019).
17. H. Mills, "Geographically distributed acoustical monitoring of migrating birds," J. Acoust. Soc. Am. 108(5), 2582 (2000).
18. K. G. Horton, W. G. Shriver, and J. J. Buler, "A comparison of traffic estimates of nocturnal flying animals using radar, thermal imaging, and acoustic recording," Ecol. Appl. 25(2), 390–401 (2015).
19. M. Rademan, D. Versfeld, and J. du Preez, "Soft-output signal detection for cetacean vocalizations using spectral entropy, k-means clustering and the continuous wavelet transform," Ecol. Inf. 74, 101990 (2023).
20. C. Erbe and A. R. King, "Automatic detection of marine mammals using information entropy," J. Acoust. Soc. Am. 124(5), 2833–2840 (2008).
21. S. Sawant, C. Arvind, V. Joshi, and V. V. Robin, "Spectrogram cross-correlation can be used to measure the complexity of bird vocalizations," Methods Ecol. Evol. 13(2), 459–472 (2022).
22. A. Širović, "Variability in the performance of the spectrogram correlation detector for North-east Pacific blue whale calls," Bioacoustics 25(2), 145–160 (2016).
23. Y. Lu and D. Mellinger, "Whistle classification by spectrogram correlation," J. Acoust. Soc. Am. 134(5), 3987 (2013).
24. P. Welch, "The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms," IEEE Trans. Audio Electroacoust. 15(2), 70–73 (1967).