African manatees (Trichechus senegalensis) are vulnerable, understudied, and difficult to detect. Areas where African manatees are found were acoustically sampled, and deep learning techniques were used to develop the first African manatee vocalization detector. A transfer learning approach was used to develop a convolutional neural network (CNN) from a pretrained CNN (GoogLeNet). The network was highly successful, even when applied to recordings collected from a different location. Vocal detections were more common at night and tended to occur within less than 2 min of one another.
1. Introduction
Sound propagates well in water and is an effective medium for communication utilized by many marine species (Au and Hastings, 2008). Species can use sound for a wide range of purposes including defending territories, finding conspecifics, identifying individuals, and coordinating activities (Garcia and Favaro, 2017; Penar et al., 2020; Tavolga, 1965). Detecting biological sound to learn about species' presence and behavior is called passive acoustic monitoring (PAM) and has been successful in the detection of many marine mammal species (e.g., Marques et al., 2013; Marian et al., 2021; MacIntyre et al., 2013; Rand et al., 2022; Romagosa et al., 2020; Rycyk et al., 2022).
African manatees (Trichechus senegalensis) are an understudied and vulnerable species (Keith Diagne, 2015). The data deficiency stems, in part, from their cryptic behavior that makes them difficult to visually detect (Mayaka et al., 2015; Takoukam Kamla, 2012). African manatees come to the surface only briefly to breathe, rarely break the water’s surface with their activity during the day, and are commonly found in environments with limited water clarity (Takoukam Kamla, 2012). While difficult to see, African manatees produce vocalizations that can be used to acoustically detect their presence (Rycyk et al., 2021).
One of the barriers to the widespread use of passive acoustic monitoring, across species, is the challenge of analyzing large amounts of data. This burden can be sharply curtailed by using deep learning techniques to develop an automated method of detecting sounds of interest, such as vocalizations. In particular, convolutional neural networks (CNNs) have grown in popularity and have been successful in extracting vocalizations of many species from large datasets (e.g., Escobar-Amado et al., 2022; Jiang et al., 2019; Merchan et al., 2020; Rasmussen and Širović, 2021; Ríos et al., 2021; Stowell, 2022; Usman et al., 2020). For example, Allen et al. (2021) trained a CNN to detect song produced by humpback whales (Megaptera novaeangliae) and applied it to more than 187 000 h of acoustic recordings. Training a CNN “from scratch” is time-consuming and requires significant knowledge of neural network architectures. This burden can be reduced by using transfer learning techniques in which a CNN trained on a large dataset is used as a starting point for developing a new CNN (Dufourq et al., 2022; Pan and Yang, 2010; Weiss et al., 2016).
To develop a CNN, a large training dataset is required. Manually finding a large number of vocalizations for training the CNN can be time-consuming. Additionally, building a training dataset from data collected from a single location can result in a similar soundscape across the training set. The resulting CNN may not be successful when applied to a new location with a different soundscape as new sounds may be incorrectly classified since the CNN was not trained on those sounds (Roch et al., 2015). In addition to the challenges of developing a CNN, there are time and financial costs associated with collecting large amounts of acoustic data. Deciding when, where, and for how long to collect acoustic data is particularly challenging for a species in which little is known about its vocal behavior. Descriptions of the timing and variability in vocal detections of a species are helpful for future studies of that species.
We address many of these challenges by (1) collecting acoustic data from multiple locations in areas where there is evidence that African manatees are present, (2) evaluating the effectiveness of a CNN trained to detect African manatee vocalizations, (3) validating the CNN using two datasets, and (4) summarizing temporal patterns in vocal detections.
2. Methods
2.1 Acoustic data collection
We used previously collected recordings from Lake Ossa, Cameroon (3.76810° N, 10.02566° E) and newly collected recordings from Lekki Lagoon (6° 25.137′ N, 4° 14.102′ E) and Badagry Lagoon (6° 26.713′ N, 2° 50.964′ E) in Nigeria. The Lake Ossa samples were from recordings collected in April 2020 with an LS1 underwater acoustic recorder (HTI-96-min hydrophone, sensitivity of −180.2 dB re 1 V/µPa, frequency response 2 Hz–30 kHz, and a sampling rate of 44.1 kHz; Loggerhead Instruments, Sarasota, FL). For more information about this dataset, see Rycyk et al. (2021). The Nigeria data were collected continuously with a Snap underwater acoustic recorder (HTI-96-min hydrophone, sensitivity of −180.6 dB re 1 V/µPa, frequency response 2 Hz–30 kHz, and a sampling rate of 44.1 kHz; Loggerhead Instruments). The Lekki Lagoon site was selected based on manatee feeding signs in an environment dominated by hippo grass (Vossia cuspidata) and white lotus (Nymphaea lotus). The recorder was deployed from March 23 to 31, 2022 (8.8 days) on a 40 kg concrete platform, oriented vertically approximately 50 cm off the bottom. At the time of deployment, Lekki Lagoon had an air temperature of 33.0 °C, a water temperature of 34.0 °C, water transparency of 1.4 m, salinity of 0.0‰, and a water depth of 1.4 m; Badagry Lagoon had an air temperature of 34.0 °C, a water temperature of 33.5 °C, water transparency of 1.6 m, salinity of 0.1‰, and a water depth of 1.6 m. The Badagry Lagoon site was selected for its environmental similarity to Lekki Lagoon, and the recorder was deployed from April 20 to 29, 2022 (10.0 days) on a 20 kg concrete platform, also oriented vertically, at a depth of 1.6 m.
2.2 Data processing
Recordings were split into 0.5 s clips and a spectrogram was created for each clip. Spectrograms were computed with short-time Fourier transforms using Kaiser windows with 64 Hz frequency resolution. The default MATLAB colormap was used, and color mapping was magnitude dependent, with the color range extents determined by the power range of the signal. Each spectrogram image was limited to 1–20 kHz and resized to 224 × 224 pixels with red-green-blue color channels using bicubic interpolation.
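As a rough illustration only, a clip could be converted into a GoogLeNet-ready spectrogram image in MATLAB as sketched below; the file name is hypothetical, and the window length, overlap, Kaiser beta, and dB scaling are our assumptions rather than the exact settings used here.

```matlab
% Sketch: convert one 0.5 s clip into a 224 x 224 RGB spectrogram image
% limited to 1-20 kHz. Window length, overlap, Kaiser beta, and dB scaling
% are illustrative assumptions, not the authors' exact settings.
[clip, fs] = audioread('example_clip.wav');   % hypothetical 0.5 s clip
clip = clip(:, 1);                            % keep one channel

nwin  = round(fs / 64);                       % ~64 Hz frequency resolution
win   = kaiser(nwin, 18);                     % Kaiser window (beta assumed)
nover = round(0.9 * nwin);                    % overlap assumed

[s, f, ~] = spectrogram(clip, win, nover, nwin, fs);
p = 10 * log10(abs(s).^2 + eps);              % power in dB

keep = f >= 1000 & f <= 20000;                % limit to 1-20 kHz
p = p(keep, :);

% Map magnitudes onto the default colormap over the clip's own power range,
% then resize to the 224 x 224 RGB input size used for GoogLeNet.
cmap = parula(256);
idx  = round(rescale(p, 1, 256));
rgb  = ind2rgb(flipud(idx), cmap);            % low frequencies at the bottom
img  = imresize(rgb, [224 224], 'bicubic');
```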
2.3 Convolutional neural network training
A transfer learning approach was used to train a convolutional neural network to classify spectrogram images as containing or not containing a manatee vocalization (see workflow in Fig. 1). A pretrained network, GoogLeNet, was used as the starting point, implemented in MATLAB with the Deep Learning Toolbox (Szegedy et al., 2014; MathWorks, 2022). This 22-layer network has been trained on more than 1 million images to classify object types. It was fine-tuned to classify the presence/absence of manatee vocalizations using a stochastic gradient descent algorithm with a learning rate of 0.0001 over 10 epochs, with shuffling between epochs. Clips with a predicted vocalization probability ≥ 0.5 were classified as manatee vocalizations.
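A minimal MATLAB sketch of this transfer learning step follows; the optimizer, learning rate, epoch count, shuffling, and 70/30 split mirror the text, while the folder layout, class names, and replaced layer names follow the standard Deep Learning Toolbox transfer learning pattern and are assumptions.

```matlab
% Transfer learning sketch: fine-tune GoogLeNet for two classes. Training
% images (224 x 224 x 3) are assumed to sit in class subfolders of
% 'training_images' ('vocalization' and 'no_vocalization'); names are hypothetical.
imds = imageDatastore('training_images', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
[imdsTrain, imdsVal] = splitEachLabel(imds, 0.7, 'randomized');  % 70/30 split

net    = googlenet;                 % pretrained 22-layer network
lgraph = layerGraph(net);

% Replace the final learnable and output layers so the network has two classes.
lgraph = replaceLayer(lgraph, 'loss3-classifier', ...
    fullyConnectedLayer(2, 'Name', 'fc_manatee'));
lgraph = replaceLayer(lgraph, 'output', ...
    classificationLayer('Name', 'class_manatee'));

opts = trainingOptions('sgdm', ...  % stochastic gradient descent (with momentum)
    'InitialLearnRate', 1e-4, ...
    'MaxEpochs', 10, ...
    'Shuffle', 'every-epoch', ...
    'ValidationData', imdsVal);

trainedNet = trainNetwork(imdsTrain, lgraph, opts);
```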
The training approach described in the previous paragraph was used twice: the first training session was used to build a larger set of manatee vocalization samples [example vocalizations in Fig. 2(A)]. For each training session, images were randomly split into two groups, with 70% used for training and the rest for validation. The first training session used an initial dataset containing a large number of samples from Lake Ossa (5885 vocalization samples, 2490 no-vocalization samples) but a small number from Lekki Lagoon (627 vocalization samples, 4022 no-vocalization samples). These samples were found by manually scanning spectrograms of a portion of the recordings. The network fine-tuned on this preliminary set of samples, termed the preliminary CNN, was applied to the full Lekki Lagoon set of recordings. Newly detected vocalizations were added to the initial vocalization sample set, and false positives were added to the initial no-vocalization sample set, to bolster sample size. The training approach was then repeated on this larger dataset (8613 vocalization samples, 8613 no-vocalization samples) to create the final CNN (see supplementary material for the MATLAB file of the final CNN).
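The bootstrapping step might look like the sketch below, which applies a fine-tuned network to every clip image from a deployment and keeps those with a predicted vocalization probability of at least 0.5 for manual review; trainedNet is assumed to come from a sketch like the one above, and the folder and class names are hypothetical.

```matlab
% Apply a fine-tuned network to all 0.5 s clip images from one deployment
% and flag likely vocalizations for manual review. Folder and class names
% are hypothetical.
clipStore = imageDatastore('lekki_all_clips');        % one image per clip
[~, scores] = classify(trainedNet, clipStore);        % class probabilities

classes = trainedNet.Layers(end).Classes;             % class order of the network
pVoc    = scores(:, classes == 'vocalization');       % vocalization probability

isDetection   = pVoc >= 0.5;                          % threshold from the text
detectedFiles = clipStore.Files(isDetection);         % candidate clips for review
```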
2.4 Validation
The final CNN was applied to the full Lekki Lagoon and Badagry Lagoon sets of recordings. Because the final CNN was trained on samples from Lekki Lagoon, it should perform well on that dataset. It was not trained on samples from Badagry Lagoon, so its performance on that dataset indicates whether the CNN was overfitted and whether it generalizes to other sites. After the final CNN was applied to each full dataset (Lekki Lagoon and Badagry Lagoon), 10% of the classifications from each location were manually validated: each spectrogram clip classified as containing a manatee vocalization was checked for whether that classification was correct, and likewise for each clip classified as not containing a manatee vocalization. The validation set was randomly selected, with 10% of the clips from each day included so that the validation set evenly represented each recording day. From the manual validation, true detection, miss, and false alarm rates were calculated for each location (Lekki and Badagry).
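One way to draw such a day-stratified validation sample is sketched below; clipStore and clipDays (a datetime vector giving each clip's recording day) are hypothetical variables, and the random seed is an assumption.

```matlab
% Draw a 10% validation sample stratified by recording day. clipDays is an
% assumed datetime vector (one entry per clip) giving each clip's recording
% day; clipStore is an imageDatastore of all classified clips.
rng(1);                                       % reproducible draw (assumption)
days   = unique(clipDays);
valIdx = [];
for d = 1:numel(days)
    idx = find(clipDays == days(d));          % clips recorded on this day
    n   = max(1, round(0.10 * numel(idx)));   % 10% of that day's clips
    valIdx = [valIdx; idx(randperm(numel(idx), n))]; %#ok<AGROW>
end
validationFiles = clipStore.Files(valIdx);    % clips to audit manually
```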
2.5 Analyses
The classifications from applying the final CNN to the full Lekki Lagoon and Badagry Lagoon datasets were evaluated for differences in the occurrence of vocalizations between sites (Lekki and Badagry) and for diel patterns. To compare acoustic activity between sites and by time of day, we calculated the mean ± SE (standard error) number of vocalizations pooled per hour and per day. To inform the selection of recording durations for future studies, we evaluated how close together in time vocalization detections occurred by calculating the number of seconds between successive vocalization detections in the full Lekki Lagoon and Badagry Lagoon datasets. Intervals of less than 1 s were excluded because such small intervals can indicate a single vocalization detected in neighboring 0.5 s clips. Cumulative frequency curves of the intervals between vocalization detections were created for each location to visualize how different interval thresholds affect the proportion of vocalizations captured; these curves can guide choices of duty cycle and sampling interval in future African manatee PAM research. Additionally, the 90th percentile interval was calculated for each location to identify an interval within which the strong majority of neighboring vocalizations occur. The 90th percentile was chosen as a guide, but a different threshold may be more appropriate when designing future sampling schemes, depending on resources, site accessibility, and the research question. All analyses were performed in MATLAB (MATLAB, 2021).
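A compact sketch of the interval analysis is given below, assuming detTimes is a sorted datetime vector of vocalization detection times from one location.

```matlab
% Interval analysis sketch: gaps between successive detections, the empirical
% cumulative frequency curve, and the 90th percentile interval. detTimes is an
% assumed, sorted datetime vector of detection times for one site.
ivals = seconds(diff(detTimes));        % seconds between successive detections
ivals = ivals(ivals >= 1);              % drop <1 s gaps (same call split across clips)

x = sort(ivals);                        % cumulative frequency curve
y = (1:numel(x))' / numel(x);
plot(x, y);
xlabel('Interval between detections (s)');
ylabel('Cumulative proportion of intervals');

p90 = prctile(ivals, 90);               % 90th percentile interval (s)
```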
3. Results
3.1 Convolutional neural network performance
The final CNN classified spectrograms containing African manatee vocalizations with high accuracy (see validation in Fig. 1). When tested on the Lekki Lagoon location, where part of the training data originated, the CNN correctly captured 95.9% of the samples with an African manatee vocalization, missed only 0.4% of vocalization samples, and had a false alarm rate of 0.0%. When tested on a new location, Badagry Lagoon, the CNN still performed with high accuracy, although lower than for the Lekki Lagoon dataset: it captured 90.2% of samples with an African manatee vocalization, missed 2.3% of samples with vocalizations, and had a false alarm rate of 0.1%.
3.2 Vocalization patterns
3.3 Interval between vocalization detections
The cumulative frequency curves of the interval between vocalization detections were similar for the Lekki Lagoon and Badagry Lagoon locations (Fig. 4). The 90th percentile interval between vocalization detections (excluding intervals < 1 s) was 112.6 s at Lekki Lagoon and 107.7 s at Badagry Lagoon.
4. Discussion
Our study is the first to train a CNN to detect the presence of African manatee vocalizations in acoustic recordings. The resulting CNN was highly successful at detecting African manatee vocalizations, as evidenced by the high true detection rates, low miss rates, and low false alarm rates (Fig. 1). Unsurprisingly, the final CNN performed better on the dataset from the location where some of the training data originated, Lekki Lagoon. Overfitting a CNN can lead to a false sense of model effectiveness; it occurs when the model is too complex and too specific to the training data. Overfitting can be evaluated by comparing model performance on the training data and on a set of new data. We tested whether the CNN was overfitted by applying it to a new location, Badagry Lagoon, and found that the CNN was still highly successful; had it been overfitted, it would have performed much worse on the Badagry Lagoon dataset. Our results suggest that when analyzing recordings from new sites, training data from another site can serve as a good foundation, provided the recordings capture frequencies up to 20 kHz. However, it is still important to incorporate samples from the new location into the training data to account for differences in soundscape and boost performance. This is especially important for soundscapes that differ substantially. Our two sites were relatively similar, as both were generally quiet locations, but there were still differences between their soundscapes: the largest were the insect choruses that lasted for hours each day at Lekki Lagoon and the more frequent bird vocalizations at Badagry Lagoon.
Building training datasets is commonly one of the most time-consuming hurdles in developing deep learning methods for extracting sounds of interest from acoustic recordings. We demonstrate that this burden can be lessened by starting with samples from another location and running a preliminary CNN to bolster the sample size (Yang et al., 2020). Adding samples from more locations will increase the representation of manatee vocalizations from more populations and incorporate a wider variety of soundscapes; both will result in a more robust CNN with broader applicability. Another advantage is that automated detection algorithms can reduce possible human error in manual auditing and support replicability. Developing and using automated detection algorithms, such as the CNN developed here, is crucial for analyzing the large datasets needed to explore larger temporal and spatial scales of African manatee abundance, distribution, and habitat use patterns.
The development of a CNN to extract African manatee vocalizations from recordings greatly decreases the time required to analyze acoustic recordings, but data collection still faces resource limitations. Acquiring acoustic recording devices, deploying and retrieving them, and storing and processing large amounts of data all carry time and financial costs. It is therefore important to select the lowest duty cycle and temporal resolution necessary to answer a given research question. These choices are difficult to make without knowing how often vocal detections are likely to occur and how much detection rates vary over time for the target species. We provide an analysis of the time between vocalization detections, temporal patterns, and a comparison between two locations to help inform these decisions.
On average, there were more than four times as many vocalization detections at the Badagry location as at Lekki (Fig. 2). These sites are 153.84 km apart but in similar environments, and the recordings were collected approximately a month apart, so the large difference in detections could stem from temporal and/or geographic differences. Additionally, the number of vocalization detections per day varied widely at both sites, which suggests that sampling only a small number of days could lead to missing manatee presence in an area. Both sites had far more vocalization detections at night than during the day (Fig. 3). This pattern agrees with African manatee vocalization detection patterns from a stationary recorder in Lake Ossa, Cameroon (Rycyk et al., 2021). All three datasets were recorded from stationary recorders, so it cannot be ruled out that manatees were simply not near the recorders during the day. However, finding a similar pattern at three locations adds to the evidence that African manatees are more vocally active at night. A possible reason for this diel behavior is reduced disturbance from human activity at night (Keith Diagne, 2015; Takoukam Kamla, 2012). The diel pattern in vocalization detection indicates that night sampling is crucial when acoustically monitoring African manatees. Vocalization detections tended to occur close together in time, with the majority of detections occurring within less than 2 min of one another (Fig. 4). Altogether, our findings suggest that passive acoustic monitoring of African manatees should include multiple locations, multiple consecutive days, and nighttime recording.
Passive acoustic monitoring of African manatees can help us understand their distribution and habitat preferences, and vocalization detections may be used in the future to acoustically estimate abundance (Rycyk et al., 2022). Here, we considered only vocalizations, but similar methods could be used to develop a CNN to detect feeding sounds produced by African manatees. Acoustic detection of feeding sounds has been used to monitor feeding behavior in Amazonian (Trichechus inunguis) and West Indian manatees (Trichechus manatus) (Kikuchi et al., 2014). Combining the detection of vocalizations and feeding sounds could increase the probability of acoustically detecting African manatees and provide information about how manatees use an area.
Acknowledgments
Data were collected in accordance with Institutional Animal Care and Use Committee Protocol No. IS00007646 from the University of South Florida. We thank the Save the Manatee Club for the manatee habitat monitoring grant awarded to D.A.B.
See supplementary material at https://www.scitation.org/doi/suppl/10.1121/10.0016543 for the MATLAB file of the final CNN.