Although many fish are soniferous, few of their sounds have been identified, which limits the effectiveness of passive acoustic monitoring (PAM). To start addressing this issue, a portable 6-hydrophone array combined with a video camera was assembled to catalog fish sounds in the wild. Sounds are detected automatically in the acoustic recordings and localized in three dimensions using time differences of arrival and linearized inversion. Localizations are then combined with the video to identify the species producing the sounds. Uncertainty analyses show that fish near the array are localized with uncertainties of less than 50 cm. The proposed system was deployed off Cape Cod, MA, and used to identify sounds produced by tautog (Tautoga onitis), demonstrating that the methodology can be used to build a catalog of fish sounds for PAM and fisheries management.
1. Introduction
Passive acoustic monitoring (PAM) of fish (i.e., monitoring fish in the wild by listening to the sounds they produce) is a research field of growing interest and importance (Rountree et al., 2006). The types of sounds fish produce vary among species and regions but typically consist of low-frequency (<1 kHz) pulses and amplitude-modulated grunts or croaks lasting from a few hundred milliseconds to several seconds (Kasumyan, 2008). As is the case for marine mammal vocalizations, fish sounds can typically be associated with specific species and behaviors (Kasumyan, 2008). Consequently, the temporal and spectral characteristics of these sounds in underwater recordings could be used to identify, non-intrusively, which species are present in a particular habitat, to deduce their behavior, and thus to characterize critical habitats. Unfortunately, many fish sounds have not been identified, which reduces the usefulness of PAM. Many studies carried out in laboratory settings attempt to catalog fish sounds (e.g., Širović and Demer, 2009; Hawkins and Amorim, 2000). However, behavior-related sounds produced in natural habitats are often difficult or impossible to induce in captivity (e.g., spawning or interaction with conspecifics; Rountree et al., 2006). Consequently, there is a need to record and identify fish sounds in their natural habitat. Because there is no control over biological and environmental variables (e.g., the number of fish vocalizing), in situ measurements are challenging and require accurate localization of the soniferous fish, both acoustically and visually (Rountree, 2008). Although numerous methods have been developed for the large-scale localization of marine mammals based on their vocalizations (see review in Zimmer, 2011), only a handful of studies have been published to date on the fine-scale localization of individual fish (Parsons et al., 2009; Parsons et al., 2010; Locascio and Mann, 2011).
To our knowledge, no studies combining underwater acoustic localization and video recording to catalog fish sounds have been published. This letter develops and demonstrates the use of a compact hydrophone and video-camera array designed to record fish sounds, localize the source (acoustically), and identify the species (visually).
2. Methods
2.1 Array and data collection
The acoustic components of the array developed here consist of six Cetacean Research C55 hydrophones, denoted H1–H6, placed on each vertex of an octahedron constructed from a foldable aluminum frame, as shown in Fig. 1. Each hydrophone is located approximately 1 m from the center of the octahedron (considered the origin of the array coordinate system) and is connected by cable to a TASCAM DR-680mkII multi-track recorder (TEAC Corporation, Japan) to collect data continuously at a sampling frequency of 48 kHz with a quantization of 16 bits. A downward-facing AquaVu II fishcam (Crosslake, MN) underwater video camera is attached to the central pole of the frame below the top hydrophone and records continuously on a Sony portable DVD recorder (model VRD MC6) during the acoustic recordings. A floating light is deployed on top of the frame to improve visibility for the video recordings.
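For reference, an idealized version of this geometry is easy to write down. The sketch below (Python/NumPy, not part of the original system) assumes a regular octahedron with 1-m arms and a hypothetical assignment of H1–H6 to vertices; the as-built hydrophone positions would be measured on the frame itself.

```python
import numpy as np

# Idealized hydrophone coordinates (m) relative to the array center for a
# regular octahedron with 1-m arms. The H1-H6 vertex assignment here is a
# hypothetical example, not the paper's surveyed geometry.
ARM = 1.0
hydrophones = ARM * np.array([
    [ 0.0,  0.0,  1.0],   # H1 (top of the central pole)
    [ 1.0,  0.0,  0.0],   # H2 (reference hydrophone for the TDOAs)
    [ 0.0,  1.0,  0.0],   # H3
    [-1.0,  0.0,  0.0],   # H4
    [ 0.0, -1.0,  0.0],   # H5
    [ 0.0,  0.0, -1.0],   # H6 (bottom)
])

# Each hydrophone sits 1 m from the origin of the array coordinate system.
print(np.linalg.norm(hydrophones, axis=1))  # -> [1. 1. 1. 1. 1. 1.]
```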
(Color online) Array configuration. (a) Hydrophones (black dots) with the downward-looking camera (cylinder). (b) System on the dock before deployment. (c) System once deployed. H1–H6 indicate the locations of the hydrophones.
Underwater acoustic and video data were collected with this array during the night of 18 October, 2010, off the Cotuit town dock in Cape Cod, MA (41° 36.969′ N, 70° 26.000′ W). The array was deployed on the sea bottom off the dock in 3 m of water, while both the video and acoustic recorders stayed on the dock. A chum can was also deployed on the sea bottom to attract fish. A total of 7.5 h of continuous acoustic and video data were collected. All data collected were processed after array recovery.
2.2 Automated detection of acoustic events
Acoustic events (transient signals) were detected automatically in recordings from hydrophone H2. First, the spectrogram of the recordings was calculated (4096-sample Blackman window zero-padded to 8192 samples for FFT, with a time step of 480 samples or 10 ms) and normalized from 5 to 2000 Hz using a split-window normalizer (Struzinski and Lowe, 1984; 4-s window, 0.5-s notch) to increase the signal-to-noise ratio of acoustic events in the frequency band of typical fish sounds. Second, the spectrogram was segmented by calculating the local energy variance over a two-dimensional kernel of size 0.01 s by 50 Hz. Events were defined in time and frequency by connecting the adjacent bins of the spectrogram with a local normalized energy variance of 0.5 or higher using the Moore neighborhood algorithm (Moore, 1968). All acoustic events with a frequency bandwidth less than 100 Hz or a duration less than 0.02 s were discarded. All detection parameters were defined empirically to capture acoustic events whose time and frequency properties correspond to typical fish sounds. An illustration of the detection process can be found in Riera et al. (2016).
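The detection chain can be sketched as follows in Python with SciPy. This is an illustrative re-implementation rather than the authors' MATLAB code; the function name and a few numerical details (e.g., scaling the local variance by its maximum before applying the 0.5 threshold) are assumptions.

```python
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import label, uniform_filter

def detect_events(x, fs=48000, fmin=5.0, fmax=2000.0):
    """Sketch of the automated detector: spectrogram, split-window
    normalization, local-variance segmentation, Moore-neighborhood
    connection, and minimum-event criteria from the text."""
    # 1. Spectrogram: 4096-sample Blackman window zero-padded to an
    #    8192-point FFT, 480-sample (10-ms) time step.
    f, t, S = spectrogram(x, fs=fs, window='blackman', nperseg=4096,
                          noverlap=4096 - 480, nfft=8192, mode='magnitude')
    band = (f >= fmin) & (f <= fmax)
    f, S = f[band], S[band, :]

    # 2. Split-window normalization along time (Struzinski and Lowe, 1984):
    #    divide each bin by the mean over a 4-s window with a 0.5-s notch.
    dt, df = t[1] - t[0], f[1] - f[0]
    win, notch = int(round(4.0 / dt)), int(round(0.5 / dt))
    kernel = np.ones(win)
    kernel[win // 2 - notch // 2 : win // 2 + notch // 2 + 1] = 0.0
    kernel /= kernel.sum()
    bg = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode='same'), 1, S)
    Sn = S / np.maximum(bg, 1e-12)

    # 3. Local energy variance over a 0.01-s x 50-Hz kernel, scaled to [0, 1]
    #    (the scaling is an assumption about how "normalized" was meant).
    size = (max(int(round(50.0 / df)), 1), max(int(round(0.01 / dt)), 1))
    mean = uniform_filter(Sn, size=size)
    var = uniform_filter(Sn ** 2, size=size) - mean ** 2
    var = var / var.max()

    # 4. Connect adjacent bins above the 0.5 threshold with 8-connectivity
    #    (the Moore neighborhood) and keep events with bandwidth >= 100 Hz
    #    and duration >= 0.02 s.
    labels, n = label(var >= 0.5, structure=np.ones((3, 3)))
    events = []
    for q in range(1, n + 1):
        rows, cols = np.nonzero(labels == q)
        bw, dur = f[rows.max()] - f[rows.min()], t[cols.max()] - t[cols.min()]
        if bw >= 100.0 and dur >= 0.02:
            events.append((t[cols.min()], t[cols.max()],
                           f[rows.min()], f[rows.max()]))
    return events
```

Each returned tuple gives the start time, end time, minimum frequency, and maximum frequency of one detected event.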
2.3 Acoustic localization by linearized inversion
The time difference of arrival (TDOA) of acoustic events between hydrophone 2 and each of the other hydrophones was used to localize the sound source in three dimensions (3D). Given their low source levels, fish sounds are typically detectable at distances of a few tens of meters (Amorim et al., 2015). In this case, the problem can be formulated by assuming that the effects of refraction are negligible and propagation can be modeled along straight-line paths with a constant sound velocity v. The TDOA Δt_ij between hydrophones i and j is then defined by

\Delta t_{ij} = \frac{1}{v}\left[\sqrt{(X - x_i)^2 + (Y - y_i)^2 + (Z - z_i)^2} - \sqrt{(X - x_j)^2 + (Y - y_j)^2 + (Z - z_j)^2}\right],  (1)

where x_i, y_i, z_i and x_j, y_j, z_j are the known 3D Cartesian coordinates of hydrophones i and j relative to the array center [Fig. 1(a)], and X, Y, Z are the unknown coordinates of the acoustic source (M = 3 unknowns). The 6-hydrophone array provides measurements of a maximum of N = 5 TDOA data, assuming the signal can be identified on all hydrophones. Localizing the acoustic source is a non-linear problem defined by

\mathbf{d} = g(\mathbf{m}),  (2)

where d represents the measured data and g(m) the modeled data, with m = [X, Y, Z]^T (in the common convention adopted here, bold lower-case symbols represent vectors and bold upper-case symbols represent matrices). The expansion of Eq. (2) in a Taylor series to first order about an arbitrary starting model m_0 can be written

\mathbf{d} = g(\mathbf{m}_0) + \mathbf{A}(\mathbf{m} - \mathbf{m}_0),  (3)

or

\delta\mathbf{d} = \mathbf{A}\,\delta\mathbf{m},  (4)

where δd = d − g(m_0), δm = m − m_0, and A is the N × M Jacobian matrix of partial derivatives with elements

A_{nm} = \left.\frac{\partial g_n(\mathbf{m})}{\partial m_m}\right|_{\mathbf{m} = \mathbf{m}_0}.  (5)

This is an over-determined linear problem (N = 5, M = 3). Assuming errors in the data are identically and independently Gaussian distributed, the maximum-likelihood solution is

\delta\hat{\mathbf{m}} = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\,\delta\mathbf{d}.  (6)

The location of the acoustic source can be estimated by solving for δm and redefining iteratively

\mathbf{m}_{l+1} = \mathbf{m}_l + \varepsilon\,\delta\hat{\mathbf{m}}_l, \quad l = 0, 1, \ldots, L - 1,  (7)

until convergence (i.e., appropriate data misfit and stable m). In Eq. (7), ε is a step-size damping factor and L is the number of iterations until convergence. Localization uncertainties can be estimated from the diagonal elements of the model covariance matrix about the final solution, defined by

\mathbf{C}_m = (\mathbf{A}^T\mathbf{C}_d^{-1}\mathbf{A})^{-1},  (8)

where C_d = σ²I is the data covariance matrix, with σ² the variance of the TDOA measurement errors and I the identity matrix. The 3D localization uncertainty is defined as the square root of the sum of the variances along each axis (diagonal elements of C_m). All localizations were performed using the starting model m_0, a constant sound velocity v = 1484 m/s, and step-size damping factor ε.
The TDOAs in d were obtained by cross-correlating acoustic events detected on the recording from hydrophone 2 with the recordings from the other five hydrophones (search window: ±2.5 ms). Before performing the cross-correlation, each recording was band-pass filtered in the frequency band determined by the detector, using an eighth-order zero-phase Butterworth filter (filtfilt function in matlab, MathWorks, Inc., Natick, MA). Only detections with a sharp maximum peak in the normalized cross-correlation were considered for localization (peak correlation amplitude > 0.3, kurtosis > 14). The TDOA measurement errors were estimated by subtracting the measured TDOAs at each hydrophone pair (N = 5) from the predicted TDOAs for the estimated source location using Eq. (1). The variance of the measurement errors was then estimated as

\sigma^2 = \frac{1}{QN}\sum_{q=1}^{Q}\sum_{n=1}^{N} e_{qn}^2,  (9)

where e_qn is the TDOA error for hydrophone pair n of event q, and Q is the total number of acoustic events that were localized.
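A single TDOA measurement of this kind can be sketched as below (Python/SciPy). The ±2.5-ms search window and the quality thresholds follow the text; the exact kurtosis convention, and whether the eighth filter order is counted before or after the forward-backward pass, are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, correlate
from scipy.stats import kurtosis

def measure_tdoa(ref, other, fs, band, max_lag_s=2.5e-3,
                 min_peak=0.3, min_kurt=14.0):
    """Sketch of one TDOA measurement between the reference channel and
    another hydrophone. Returns the TDOA in seconds, or None if the
    cross-correlation peak is not sharp enough."""
    # Band-pass filter in the band found by the detector. filtfilt runs the
    # 4th-order filter forward and backward: zero phase, 8th-order magnitude
    # response (the order bookkeeping here is an assumption).
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype='band')
    x, y = filtfilt(b, a, ref), filtfilt(b, a, other)

    # Normalized cross-correlation; positive lag means `other` lags `ref`.
    c = correlate(y, x, mode='full')
    c = c / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
    lags = (np.arange(c.size) - (x.size - 1)) / fs

    # Keep only sharp, unambiguous peaks inside the physical search window.
    w = np.abs(lags) <= max_lag_s
    k = int(np.argmax(c[w]))
    if c[w][k] < min_peak or kurtosis(c, fisher=False) < min_kurt:
        return None
    return float(lags[w][k])
```

Running this on five hydrophone pairs for one event yields the data vector d used in the inversion.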
2.4 Video processing
To facilitate the visualization of fish in the video data, the recordings were processed to detect any movements that occurred in the camera's field of view. Each frame of the video recording was converted to a gray scale and normalized to a maximum of 1. An image representing the background scene was defined as the median of each pixel over a 5-min recording and was subtracted from each frame of the video. Finally, temporal smoothing was performed using a moving average of pixel values over 10 consecutive frames. Pixels with values greater than 0.6 were set to 1, and the others were set to zero. Each binarized image was overlaid in red on the original video image. All the processing of the acoustic and video data was performed using matlab 2017a (MathWorks, Inc., Natick, MA).
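The frame-processing steps above can be sketched compactly. The following Python/NumPy function is an illustrative re-implementation (the authors used MATLAB); the 10-frame average and 0.6 threshold come from the text, while the function name and array layout are assumptions.

```python
import numpy as np

def motion_mask(frames, n_avg=10, thresh=0.6):
    """Sketch of the motion-detection chain: median background over the
    clip, background subtraction, a moving average over n_avg consecutive
    frames, and binarization at `thresh`.
    frames: (n_frames, height, width) array of grayscale images in [0, 1]."""
    background = np.median(frames, axis=0)      # static-scene estimate
    diff = frames - background                  # background subtraction
    # Temporal smoothing: moving average of each pixel's time series.
    kernel = np.ones(n_avg) / n_avg
    smooth = np.apply_along_axis(
        lambda p: np.convolve(p, kernel, mode='same'), 0, diff)
    # Pixels above the threshold are flagged as motion (1), others 0.
    return (smooth > thresh).astype(np.uint8)
```

The resulting binary mask is what would be overlaid in red on the original video frames.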
3. Results
This paper shows results from one 8-min data file. Of the 185 acoustic events detected in this recording from hydrophone 2, nine had a sufficiently high cross-correlation peak with the other hydrophones to be localized. The detections not selected for the localization stage were most often due to mechanical sounds from crabs crawling on the array frame or to sounds that were too faint to be received on all hydrophones. The standard deviation of the TDOA measurement errors was estimated as σ = 0.12 ms [Eq. (9), Q = 9]. The localization capabilities of the hydrophone array were assessed by calculating and mapping the localization uncertainties of hypothetical sound sources located every 10 cm within a 3 × 3 × 3 m cubic volume centered at [0, 0, 0] m. Figure 2 shows the localization uncertainties of the hydrophone array calculated for this 3D grid using Eq. (8). The localization uncertainty in the middle of the water volume spanned by the arms of the array is less than 50 cm and increases progressively for sound sources farther from the center (Fig. 2). A 3D visualization of Fig. 2 is shown in Mm. 1. Localization uncertainties for sources outside the hydrophone array are generally greater than 1 m.
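An uncertainty map of this kind can be computed as in the Python/NumPy sketch below. This is an illustrative re-implementation of the Fig. 2 computation; the handling of grid points that coincide with a hydrophone, and the idealized octahedral geometry used in the example, are assumptions.

```python
import numpy as np

def uncertainty_map(hyd, sigma, v=1484.0, ref=1, half=1.5, step=0.1):
    """Predicted 3D localization uncertainty sqrt(trace(C_m)), with
    C_m = sigma^2 (A^T A)^-1, for hypothetical sources on a grid spanning
    a cube of half-width `half` (m) centered on the array."""
    axis = np.arange(-half, half + step / 2, step)
    unc = np.full((axis.size,) * 3, np.nan)
    for i, X in enumerate(axis):
        for j, Y in enumerate(axis):
            for k, Z in enumerate(axis):
                m = np.array([X, Y, Z])
                r = np.linalg.norm(hyd - m, axis=1)
                if np.any(r < 1e-6):
                    continue  # grid point sits on a hydrophone: skip (NaN)
                u = (m - hyd) / r[:, None]
                # Jacobian rows for TDOAs relative to the reference hydrophone.
                A = (np.delete(u, ref, axis=0) - u[ref]) / v
                try:
                    unc[i, j, k] = sigma * np.sqrt(
                        np.trace(np.linalg.inv(A.T @ A)))
                except np.linalg.LinAlgError:
                    continue  # degenerate geometry at this grid point
    return axis, unc
```

With σ = 0.12 ms and an idealized 1-m octahedral array, the predicted uncertainty at the array center is a few tens of centimeters and grows toward the edges of the cube, consistent with Fig. 2.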
(Color online) Localization uncertainties of the hydrophone array in the (a) XY, (b) XZ, and (c) YZ plane.
3D visualization of the localization uncertainties. This is a file of type “avi” (4071 KB).
Figure 3 shows the acoustic localization results when a tautog (Tautoga onitis) was swimming in the field of view of the camera. Identification of the species was performed visually from the top camera and from an additional non-recording side-view camera deployed on the side of the array. The location of the tautog from the video [highlighted with red pixels in Fig. 3(a)] coincides with the acoustic localization [Fig. 3(b)] of the five low-frequency grunts detected in the acoustic recording [labeled G1–G5 in Fig. 3(c)]. Grunts G1 and G2 were detected as one acoustic event by the automated detector and were consequently localized together (i.e., one localization for both grunts). The small localization uncertainties [blue lines in Fig. 3(b)] leave no ambiguity that these grunts were produced by the tautog. A video combining the simultaneous video recording, sound localization, and sound detections while the tautog was swimming inside the array is shown in Mm. 2. Note that the five other sounds that were automatically detected and localized could not be attributed to a specific fish species because their sources were outside the field of view of the camera.
Identification of sounds produced by a tautog. (a) Image from the video camera showing the tautog swimming in the middle of the array (red pixels). (b) Simultaneous acoustic localization (red dots) with uncertainties on each axis (blue lines). (c) Spectrogram of the sounds recorded on hydrophone 2. Red boxes indicate the sounds automatically detected by the detector that were used for the localization.
Video showing simultaneously the video, localization results and sound detection. This is a file of type “avi” (1726 KB).
Figure 4 provides the spectrogram, waveform, and spectrum for each of the identified tautog grunts. All grunts are composed of one (G3–G5), two (G2), or three (G1) double-pulses. The component pulses of a double-pulse are separated by 11.25 ± 0.7 ms (n = 8). Grunts have a peak frequency of 317 ± 28 Hz (n = 8) and a duration from 22 ms (G3) to 81 ms (G1). Most of the energy for all tautog grunts was below 800 Hz. All time and frequency measurements were performed using the waveform (band-pass filtered between 100 and 1200 Hz with an eighth order zero-phase Butterworth filter, middle column in Fig. 4), and the average periodogram (spectral resolution of 3 Hz, 2048-sample Hanning window zero-padded to 16 384 samples for FFT, with a time step of 102 samples or 2.1 ms; right column in Fig. 4), respectively.
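The spectral part of these measurements can be reproduced approximately with standard tools. The sketch below (Python/SciPy) band-passes a grunt waveform and estimates its peak frequency; the function name, and the use of a Welch estimate for the averaged periodogram, are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, welch

def grunt_peak_frequency(x, fs=48000):
    """Band-pass the waveform between 100 and 1200 Hz (filtfilt gives a
    zero-phase response; a 4th-order design run forward and backward has
    an 8th-order magnitude response) and estimate the peak frequency from
    an averaged periodogram (2048-sample Hann window zero-padded to a
    16384-point FFT, 102-sample hop, ~3-Hz spectral resolution)."""
    b, a = butter(4, [100 / (fs / 2), 1200 / (fs / 2)], btype='band')
    xf = filtfilt(b, a, x)
    f, P = welch(xf, fs=fs, window='hann', nperseg=2048,
                 noverlap=2048 - 102, nfft=16384)
    return xf, float(f[np.argmax(P)])
```

The filtered waveform can then be used for the time-domain measurements (pulse separation and duration), read off directly from the envelope peaks.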
(Color online) Spectrogram (left column), waveform (middle column), and spectrum (right column) of the five localized tautog grunts G1–G5 (each row corresponds to a tautog grunt).
4. Discussion
Compact hydrophone arrays such as the one used in this study, combined with underwater cameras, provide the ability to catalog fish sounds non-intrusively in the wild. Their small footprint makes such systems portable and easy to deploy. The system described here is cabled to the surface, which precludes deployment in remote areas for extended periods. An autonomous system that can record acoustic and video data for several weeks is currently being developed. In addition to cataloging fish sounds and supporting research on soniferous behavior, such an array can be used to document the source levels of fish sounds, which is critical information for assessing the impact of anthropogenic noise on fish communication.
The tautog is an important fisheries species whose stock is overfished (ASMFC, 2017). Its sounds had previously been reported only by Fish and Mowbray (1970); unfortunately, their description of the calls provides insufficient detail to positively identify tautog sounds in acoustic recordings. While more measurements are needed to fully characterize the vocal repertoire of the tautog, this paper shows that the proposed combination of instruments and automated processing methods provides a systematic and efficient way to identify fish sounds in large datasets. The methodology described here promises to become a valuable tool for developing fish and invertebrate sound libraries, as well as for in situ observations of soniferous behavior. This will help to continue the cataloging effort initiated by Fish and Mowbray (1970) and make PAM a more viable tool for fish monitoring and fisheries management.
Acknowledgments
This research is supported by the NSERC Canadian Healthy Oceans Network and its Partners: Department of Fisheries and Oceans Canada and INREST (representing the Port of Sept-Îles and City of Sept-Îles), JASCO Applied Sciences, the Natural Sciences and Engineering Research Council (NSERC) Postgraduate Scholarships-Doctoral Program, and MITACS. The data collection was funded by the MIT Sea Grant College Program Grant No. 2010-R/RC-119 to R.R. and F.J.