Detecting marine mammal vocalizations in underwater acoustic environments and classifying them to species level is typically an arduous manual analysis task for skilled bioacousticians. In recent years, machine learning and other automated algorithms have been explored for quickly detecting and classifying all sound sources in an ambient acoustic environment, but many of these still require a large training dataset compiled through time-intensive manual pre-processing. Here, an application of the signal decomposition technique Empirical Mode Decomposition (EMD) is presented, which does not require a priori knowledge and quickly detects all sound sources in a given recording. The EMD detection process extracts the possible signals in a dataset for minimal quality control post-processing before moving on to the second phase: the EMD classification process. The EMD classification process uniquely identifies and labels most sound sources in a given environment. Thirty-five recordings containing different marine mammal species and mooring hardware noises were tested with the new EMD detection and classification processes. Ultimately, these processes can be applied to acoustic index development and refinement.

In recent pursuits to classify the myriad of biological, geological, and anthropogenic sound sources in datasets obtained from underwater passive acoustic recorders, a large investment of manual effort has traditionally been required. With current detection and classification algorithms, large portions of the datasets need to be manually labeled by a skilled analyst to create a training dataset of extracted spectrographic features. These features are then fed into machine learning or classification algorithms such as CART and random forest (Bittle and Duncan, 2013). Using the resultant feature groupings, sound sources of interest in the test dataset(s) are then detected and classified. Precision and recall are quantified to determine the accuracy of the algorithm(s), with human analyses as the reference. Recall is how well a model finds all relevant cases within a dataset, regardless of how many incorrect cases it also finds. It usually trades off against precision—how well a model identifies only the relevant cases of that dataset. In other techniques, such as Gaussian Mixture Models, an unsupervised learning algorithm can be used to classify multiple features of signals based on their statistics (Pan et al., 2013). However, such techniques require iterative trials, preprocessing, and parameter alteration for better performance. Convolutional neural networks and deep neural networks are other classification examples in which the algorithms learn from the dataset features (Liu et al., 2017). The accuracy of these techniques is tied to extensive amounts of training data, which requires many human analyst hours.

If a process existed that both eliminated the need for manual analysis for the training set feature extraction and quickly detected all sound sources with no prior knowledge of the data, marine mammal bioacoustical analysis would be more efficient. In this work, we present a detection process using empirical mode decomposition (EMD). EMD is an adaptive tool that breaks down time-domain signals into amplitude modulated and frequency modulated (AM-FM) components called intrinsic mode functions (IMFs). EMD is the foundation of the Hilbert-Huang Transform (HHT) (Huang, 2014), but our use does not proceed with the usual final steps such as applying the Hilbert transform on the generated IMFs. The detected sound sources can then be labeled uniquely and quickly verified and grouped into categories manually. Subsequently, the unique labels (referred to as “EMD identities”) can be used to classify all detected sound sources, eliminating preliminary manual analysis and reducing the amount of time needed for manual verification tasks.

Manual analysis of biological signals in passive acoustic recordings has historically focused on assessment in the frequency domain by trained individuals. This is because spectrograms naturally convey a visualization of the waveform in such a way that humans can perform image perception discrepancy tasks. HHT outputs are similar to spectrograms only with respect to the claim that they have higher spectro-temporal resolution than the wavelet transform (Peng et al., 2005). In other words, for all the cases studied to date, HHT has given sharper results than any of the traditional analysis methods in time-frequency-energy representations (Huang, 2005). Ultimately, HHT output would still need a bioacoustician to appropriately label the content of the pre-processed data samples. What we propose is an innovative way to perform unsupervised detection and classification in the time domain by relying only on EMD-type processing, removing the need for both HHT and the labeling of pre-processed data by trained individuals. In this paper, we demonstrate that EMD is able to detect and classify a variety of unique sound sources. While the HHT process has so far been applied only to sperm whales and killer whales (Adam, 2006, 2008), our EMD processes show promise for the detection and classification of a greater number of transient sound sources in a dataset from the Bering Sea. This dataset is a good case study for the processes, which are currently being tested for robustness across a variety of datasets.

The goal of this manuscript is that by introducing the EMD algorithms and their promise, they can subsequently be made robust enough for application in the growing field of acoustic indices. The development of acoustic indices to quantify each sound source's relative contribution to the soundscape began in terrestrial studies and was first attempted by subtracting pairs of amplitude envelopes and average frequency spectra (Sueur et al., 2008). This created the temporal and spectral dissimilarity subindices (Dt and Df), respectively, that when multiplied and scaled between 0 and 1 made an index named D (Lellouch et al., 2014). Towsey et al. (2013) has since compared 14 acoustic indices and showed that combining sets of them are more reflective of a soundscape than a single index alone. The four most common indices in the literature tend to be the Acoustic Complexity Index (Pieretti et al., 2011), Acoustic Diversity Index (ADI) (Villanueva-Rivera et al., 2011), Bioacoustic Index (Boelman et al., 2007), and Normalized Difference Soundscape Index (Kasten et al., 2012) because they were designed to be robust to anthropogenic noise.

By 2014, at least 21 α acoustic indices and 7 β diversity indices had been reported in the literature (Sueur et al., 2014). In the ecological theory of diversity, alpha, beta, and gamma diversities are defined differently than here. The α indices are a group that estimates amplitude, evenness, richness, and/or heterogeneity of a soundscape (Sueur et al., 2014). In the underwater realm, various proximities of vocalizing marine mammals to a recorder and anthropogenic masking are two instances where α indices can fail. The β diversity indices are a group that compares amplitude envelopes or, more often, frequency spectral profiles (Sueur et al., 2014). Any species that overlap spectrally, such as most baleen whales with each other and wind noise, or odontocete clicks with each other and rain or ice cracking noise, would make it difficult to reliably separate each sound source's spectral profile from others. Noise from passing boat engines can completely skew any spectrum-based indices, suggesting that an index based on time-domain analysis may be more appropriate. The EMD process does just this, thereby avoiding the pitfalls of frequency-domain analyses.

As the world's oceans increasingly urbanize, masking may be the critical variable to overcome in any acoustic index development (Fairbrass et al., 2017). Parks et al. (2014) applied the first ADI to global marine recordings. Even though their acoustic entropy values did not correspond to biological patterns, they showed that a simple background noise removal technique applied to raw recordings did provide an index that was able to compensate for noise from seismic air-gun activity that had previously masked biological signals. Their technique was proven only for low-frequency recordings, so it is possible that higher-frequency masking noises occur in recordings of wider bandwidths and will need to be compensated for in the raw data one by one. The EMD detection process, which sifts out noise as part of its algorithm, incorporates this critical aspect without any additional or one-by-one steps in the data processing.

Given the short acoustic index history, the goals of ecoacoustics are first to avoid masking anthropogenic sources and then to measure and understand the relative contribution of each acoustic component to the physical environment (Parsons et al., 2016). Therefore, should acoustic indices not strive to label each sound source, assign it to an acoustic class or species, and quantify the relative time that each source is present in the simplest way possible?

In what follows, we illustrate how our EMD process can (1) blindly detect and provide classification differences for signals that can be difficult for human analysts to visually parse out (e.g., humpback whale song vs killer whale whistles vs beluga whale whistles), and (2) detect unknown echolocation buzzes and click trains that have interclick intervals that are unstable (non-stereotyped) and thus do not conform to many detection and classification algorithms. In Bittle and Duncan (2013), the authors state that “no single algorithm is ideal for detecting and classifying all species concurrently, so any automated system requires a suite of these algorithms.” With a dataset recorded by Passive Acoustic Listeners (Nystuen, 1998) from the Bering Sea, only two algorithms instead of a suite have been used to detect and assign unique EMD identities to nearly every present sound source, despite overlaps in time and frequency.

In recent years, EMD, the key part of the HHT, has attracted attention from researchers in different signal processing fields (Huang, 2014). It has been used to generate historical earthquake time series (Ni et al., 2011) and to decompose electrocardiographic signals where Fourier analysis has failed (Stork, 2012). EMD is an adaptive and fully data-driven tool that decomposes non-linear and non-stationary signals into a finite number of components called IMFs. IMFs are AM-FM portions of the original signal, zero-mean functions that represent the parameter features of decomposed signals, and are arranged by frequency content (Hz) from highest to lowest. The modes (IMFs) are collected through an iterative process called sifting. The sifting process eliminates most of the signal anomalies and makes the signal wave profile more symmetric. The frequency content embedded in the processed modes reflects the physical meaning, or order of dominance, of the underlying frequencies. EMD works as a dyadic filter bank similar to the behavior of wavelets (Flandrin et al., 2004). In order to work on real-world signals, oversampling is required and is independent of Shannon's criterion (Rilling and Flandrin, 2006).

The filtering properties of EMD are used to analyze acoustic signals for feature extractions. The acoustic signals consist of multiple components that range from high to low frequency with respect to their IMF indices. The EMD algorithm for sifting the time-domain acoustic signal can be summarized as follows:

  1. Identify all extrema (maxima and minima) in the signal and interpolate the maxima and minima with cubic splines [e.g., via the Matlab routine envelope(x, n, "peak")] to form the upper and lower envelopes, respectively.

  2. Calculate the mean of the envelopes from step (1) and subtract the mean from the input signal to obtain the residual.

  3. If the residual does not satisfy the stoppage criterion,1 the process is repeated with the residual as input to step (1). Otherwise, the residual is the ith IMF, and the remainder after subtracting the IMF from the input signal is processed as the new input signal [steps (1)–(3)].
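The sifting loop above can be sketched in code. The following is a minimal Python/NumPy/SciPy illustration rather than the Matlab routines used in this work; a normalized squared-difference test stands in for the Cauchy convergence stoppage criterion, and all function names and default parameters are our own assumptions:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_imf(x, max_iter=100, sd_tol=0.2):
    """Extract one IMF via steps (1)-(3): envelope, subtract mean, repeat."""
    h = np.asarray(x, dtype=float).copy()
    t = np.arange(len(h))
    for _ in range(max_iter):
        maxima = argrelextrema(h, np.greater)[0]    # step (1): locate extrema
        minima = argrelextrema(h, np.less)[0]
        if len(maxima) < 2 or len(minima) < 2:
            break                                   # too few extrema to spline
        upper = CubicSpline(maxima, h[maxima])(t)   # cubic-spline envelopes
        lower = CubicSpline(minima, h[minima])(t)
        mean_env = 0.5 * (upper + lower)            # step (2): envelope mean
        h_new = h - mean_env
        # step (3): normalized squared difference as a simple stoppage test
        sd = np.sum((h - h_new) ** 2) / (np.sum(h ** 2) + 1e-12)
        h = h_new
        if sd < sd_tol:
            break
    return h

def emd(x, max_imfs=10):
    """Decompose x into IMFs plus a final trend T."""
    imfs, residual = [], np.asarray(x, dtype=float).copy()
    for _ in range(max_imfs):
        if len(argrelextrema(residual, np.greater)[0]) < 2:
            break                                   # residual is the trend
        imf = sift_imf(residual)
        imfs.append(imf)
        residual = residual - imf
    return imfs, residual
```

Because the trend is defined as whatever remains after subtracting all extracted IMFs, the IMFs plus the trend sum back to the input signal exactly.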

The resulting IMFs can be reconstructed to form the original input signal such that

Ŷ = Σ_{i=1}^{M} IMF_i + T,
(1)

where Ŷ is the reconstructed signal from the original input signal Y, M is the total number of IMFs, and T is the trend of Ŷ.

In this section, the proposed EMD-based detection and classification processes are presented (see Fig. 1 for further illustration). The proposed EMD-based detection process is based on a dependency or similarity measure (correlation) between consecutive IMFs. The correlation represents the linear relationship between two IMFs: if the IMFs are dependent on each other (i.e., have similar structure), they have a higher correlation; if they do not depend on each other, they have a lower correlation. The classifier then identifies the features of a given source based on the findings of the detector.

FIG. 1.

(Color online) Flow chart of how to start with raw acoustic data, refine the EMD detection algorithm to the dataset under consideration (pink boxes), and analyze the dataset through EMD detection and classification processes (gray boxes). Subplot references are to Figs. 3, 4, and 5.


The sifting-based properties of EMD are used to propose a detector that is able to capture various sound sources blindly. Because EMD works by obtaining the upper and lower envelopes of the sound wave, this property is the key part of the proposed detector. The detector takes advantage of the EMD process, which crudely separates a sound source into different modes ranging from high to low frequency in an AM-FM pattern. Working with the resulting IMFs therefore makes feature detection easier than tackling the original sound source directly. The underwater environment is very noisy, however, so envelope overshoots and undershoots are unavoidable. The presence of these anomalies in the sound source directly affects the resulting IMFs because they lead to misinterpreted envelopes.

In this paper, we adopted the root-mean-square (RMS) window to obtain the upper and lower envelopes, denoted by Cw_i^U and Cw_i^L, respectively, of the ith IMF. Further, the difference of the obtained upper and lower envelopes, denoted by Cw_i^D, is calculated such that

Cw_i^D = Cw_i^U − Cw_i^L,   i = 1, …, M.
(2)
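Equation (2) can be sketched as follows. This is a minimal NumPy stand-in for the Matlab routine envelope(x, w, "rms"); the sliding-window RMS implementation, reflection padding, and default window length are our own assumptions:

```python
import numpy as np

def rms_envelopes(imf, w=64):
    """Sliding-window RMS envelopes of a (roughly zero-mean) IMF.
    Upper envelope is +RMS, lower is -RMS, cf. Matlab envelope(x, w, 'rms')."""
    x = np.asarray(imf, dtype=float)
    pad = w // 2
    xp = np.pad(x, pad, mode="reflect")
    # moving average of the squared signal, then square root
    sq = np.convolve(xp ** 2, np.ones(w) / w, mode="same")[pad:pad + len(x)]
    rms = np.sqrt(sq)
    return rms, -rms

def envelope_difference(imf, w=64):
    """Cw_i^D = Cw_i^U - Cw_i^L [Eq. (2)]; here simply twice the moving RMS."""
    upper, lower = rms_envelopes(imf, w)
    return upper - lower
```

For a unit-amplitude sinusoid whose period divides the window length, the interior envelope difference is 2/√2 = √2, which is a quick sanity check on the implementation.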

As EMD works like a dyadic filter, determining the IMF at which the useful components start to dominate the noise components is essential for filtering out the unwanted components (noise). In order to find this particular IMF, a similarity measure (correlation coefficient) is computed between consecutive modes such that

ρ(Cw_i^D, Cw_{i+1}^D) = Σ (Cw_i^D − C̄w_i^D)(Cw_{i+1}^D − C̄w_{i+1}^D) / √[Σ (Cw_i^D − C̄w_i^D)² Σ (Cw_{i+1}^D − C̄w_{i+1}^D)²],
(3)

where C̄w_i^D and C̄w_{i+1}^D are the means of the current and the next mode, respectively.

For example, the resultant correlation coefficients take the form ρ(Cw_1^D, Cw_2^D), ρ(Cw_2^D, Cw_3^D), and so on. These correlation coefficients provide an indicator of which IMFs are relevant and should be included in the next steps, while irrelevant IMFs can be discarded. Here, "relevant" refers to the IMFs carrying marine mammal features and mooring self-noise, which are useful for parsing sound sources out from one another and from background noise. This step is essential in obtaining the index R in the next step. The index R is used as a reference to carry out the partial reconstruction of the relevant IMFs. A threshold is then applied to the partially reconstructed sound source in order to detect the embedded features.
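The consecutive-mode correlation of Eq. (3) and the selection of the index R can be sketched as below. Here np.corrcoef computes the same coefficient as Eq. (3); the helper names and the argmax rule for R are our own assumptions:

```python
import numpy as np

def consecutive_correlations(env_diffs):
    """rho(Cw_i^D, Cw_{i+1}^D) for each pair of consecutive modes [Eq. (3)]."""
    return np.array([np.corrcoef(env_diffs[i], env_diffs[i + 1])[0, 1]
                     for i in range(len(env_diffs) - 1)])

def reconstruction_index(rhos):
    """Index R: the (1-based) mode at which the highest consecutive
    correlation occurs; modes 1..R are kept for partial reconstruction."""
    return int(np.argmax(rhos)) + 1
```

For instance, if modes 3 and 4 are the most similar pair, reconstruction_index returns R = 3, matching the example given in footnote 4.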

The EMD detection process applied to an audio file as a blind method to detect useful features can be summarized in Algorithm 1.

Algorithm 1: Detection algorithm

  1. Decompose the data sample with EMD to generate a set of IMFs (refer to Sec. I A).

  2. Apply an RMS window to the ith IMF to form the upper and the lower envelopes.2

  3. Calculate the difference3 of the two envelopes in step (2) using Eq. (2).

  4. Calculate the correlation coefficients between consecutive modes using Eq. (3).

  5. Carry out the partial reconstruction, in which the IMF with the highest correlation is denoted by Cw_R^D, such that:

    • Sum together all modes with ascending correlation coefficients through Cw_R^D such that
      Ŷ_R = Σ_{i=A}^{R} Cw_i^D,
      (4)

      where Ŷ_R is the partially reconstructed signal and A is the index at which the ascending correlation coefficients start.

    • Establish a rule4 that if R = 1, then the partially reconstructed signal, Ŷ_R, will equal Cw_1^D.

  6. Calculate the detection threshold, denoted by TD, as follows:
    T_D = δ × max(Ŷ_R),
    (5)

    where δ is the detector tolerance.5

  7. If any set of samples of the partially reconstructed signal exceeds the assigned threshold in step (6), count those samples, denoted by Ŷ_R^E, as indicative of a signal's presence and use them to assign unique EMD identities later.
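Steps (4)–(7) of Algorithm 1 might be sketched as follows, where env_diffs is the list of per-mode envelope differences Cw_i^D. Following the example in footnote 4, the partial reconstruction here takes A = 1 (summing modes 1 through R); the interval-grouping code and function names are our own assumptions:

```python
import numpy as np

def detect_intervals(env_diffs, delta=0.01):
    """Steps (4)-(7): correlate consecutive modes, partially reconstruct
    modes 1..R, threshold, and return the Detected Intervals."""
    rhos = [np.corrcoef(env_diffs[i], env_diffs[i + 1])[0, 1]
            for i in range(len(env_diffs) - 1)]
    R = int(np.argmax(rhos)) + 1              # 1-based index of highest rho
    # Eq. (4) with A = 1; when R = 1 this reduces to the first mode alone
    y_R = np.sum(env_diffs[:R], axis=0)
    T_D = delta * np.max(y_R)                 # Eq. (5): threshold at tolerance delta
    above = y_R > T_D
    # group consecutive above-threshold samples into (start, stop) indices
    edges = np.flatnonzero(np.diff(above.astype(int)))
    bounds = np.concatenate(([0], edges + 1, [len(above)]))
    return [(int(a), int(b)) for a, b in zip(bounds[:-1], bounds[1:])
            if above[a]], y_R
```

Each returned (start, stop) pair is one Detected Interval; its length gives the sample count N_E used later by the classifier.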

Example 1

In this example, the proposed detector is applied to the sound source illustrated in Fig. 2. Figure 2(a) shows the spectrogram of the sound source, in which different features are labeled, while Fig. 2(b) shows the partially reconstructed signal after applying the proposed detector. From Fig. 2(b), the underwater blow is selected to explain how the detector, and later the classifier, works. The time between 1.1 and 1.3 s is identified as the first set of samples exceeding the threshold (at δ = 1%), where points A and B can be pinpointed and projected onto the time axis (the "Detected Interval") to determine the number of samples (denoted by N_E).

FIG. 2.

(Color online) (a) The spectrogram of the original sound source, and (b) the “normalized” partially reconstructed signal.


The IMF indices generated as part of our EMD detection process were visually compared to their corresponding spectrograms to confirm that each signal known to exist in the audio files was, in fact, detected by the steps outlined in Sec. II A. The sound-source classification algorithm (outlined below) is applied to each Detected Interval determined in Algorithm 1, in which the power of each Detected Interval of the ith mode, Cw_i^D, is computed. The EMD-based classifier uses measurements of power (variance) to build an "EMD library" (an IMF lookup table). The EMD classification is illustrated in Algorithm 2.

Algorithm 2: Classification Algorithm

  1. For the ith mode, Cw_i^D, the power of the kth set of samples (see Example 1) exceeding the threshold [step (7) in Sec. II A] that contains a sound source is calculated as follows:
    P_k = (1/N_E) Σ (Ŷ_R^E − μ)²,
    (6)

    where P_k is the power of the kth set of samples, N_E is the number of samples in that set, and μ is the mean of Ŷ_R^E.

  2. The two highest power values from step (1) in Algorithm 2 are selected.6

  3. The identity for each unique sound source is written as the two IMFs having the largest variance (in descending order of power) inside brackets.

For example, killer whale clicks have the highest power value at the second IMF and the second highest power value at the first IMF. This can be denoted by “[2,1].” Continuing in this way, each sound source will have a two-digit EMD identity where order matters. Compiling these two-digit identities yields an EMD library. It is possible in environments with fewer or more sound sources that one- or three-digit identities, respectively, may suit the dataset better. For the Bering Sea dataset used as an example in this paper, two-digit identities worked well.
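Algorithm 2 can be sketched under the same assumptions as the detection sketch (env_diffs holds the per-mode envelope differences). The two-digit identity is the 1-based indices of the two most variant modes over one Detected Interval; the function name is our own:

```python
import numpy as np

def emd_identity(env_diffs, interval, n_digits=2):
    """For one Detected Interval, compute the power (variance, Eq. 6) of
    each mode's envelope difference and rank the modes by power."""
    a, b = interval
    powers = [np.mean((cw[a:b] - np.mean(cw[a:b])) ** 2)   # P_k = variance
              for cw in env_diffs]
    # identity = 1-based indices of the n_digits most powerful modes,
    # in descending order of power (order matters)
    order = np.argsort(powers)[::-1][:n_digits] + 1
    return [int(i) for i in order]
```

A source whose second mode carries the most power and whose first mode carries the second most would be labeled [2,1], matching the killer whale click example above.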

It was necessary to use EMD to find the timestamps of the beginning and end of a detected sound source [points A and B in Fig. 2(b)] because these times cannot be pinpointed visually in a spectrogram or waveform. Having the EMD-defined points, though, makes the period containing the most important features of the original signal definitive and allows the IMF variances to be calculated within it. Of note, variance could be replaced by another measurement (such as entropy) to create a different kind of EMD library. Variance was found to support the manual analysis results better than entropy for this dataset. Regardless, assembling an EMD library using statistics is more robust than creating arbitrary EMD identities from manual inspection of peaks in the IMF plots alone.

Thirty-five PAL WAVE files were used to test the EMD detection process and to assign eight unique EMD identities. The eight example EMD identities presented here are mooring chain noise, Acoustic Water Column Profiler (AWCP) pings, underwater blows from a large marine mammal, sperm whale clicks, unidentified odontocete buzzes, beluga whale whistles, humpback whale song units, and walrus knocks. The two humpback whale song units in the "moderately difficult" example below were difficult to detect, likely because they are faint and continuous and so could be misinterpreted as noise to be sifted out. While these EMD identities could have been assigned with fewer WAVE files, several files of the same sound source were used for the sake of redundancy. The three figures below represent the easiest, moderately difficult, and most difficult sound sources to assign EMD identities. All EMD identities discussed in this section are included in Table I.

TABLE I.

List of EMD identities established with 35 audio files using the EMD detection process and IMF variance statistical measurements.

Sound source              | 1st most variant IMF | 2nd most variant IMF | EMD identity
Underwater blow           | 3                    | 4                    | [3,4]
Beluga whale whistle      | 3                    | 5                    | [3,5]
Sperm whale clicks        | 2                    | 1                    | [2,1]
Mooring chain noise       | 1                    | 2                    | [1,2]
Humpback whale song unit  | 7                    | 6                    | [7,6]
Walrus knocks             | 5                    | 6                    | [5,6]
Unid odontocete buzz      | 1                    | 2                    | [1,2]
AWCP pings                | 1                    | 2                    | [1,2]
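The EMD library of Table I can be encoded as a simple lookup table. This is a hypothetical sketch; note that the [1,2] identity is shared by three sources (mooring chain noise, AWCP pings, and the unidentified odontocete buzz), so it is kept as one ambiguous entry here:

```python
# Hypothetical encoding of the EMD library (Table I) keyed by identity tuple.
EMD_LIBRARY = {
    (3, 4): "underwater blow",
    (3, 5): "beluga whale whistle",
    (2, 1): "sperm whale clicks",
    (7, 6): "humpback whale song unit",
    (5, 6): "walrus knocks",
    (1, 2): "mooring chain noise / AWCP ping / unid odontocete buzz",
}

def lookup_identity(identity):
    """Map a two-digit EMD identity (order matters) to its sound-source label."""
    return EMD_LIBRARY.get(tuple(identity), "unknown")
```

Because identity order matters, (2, 1) and (1, 2) are distinct keys, which is why sperm whale clicks do not collide with the mooring noise entry.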

The seventh PAL data sample on Julian day 286 in 2010 contained four different sound sources, as displayed in its spectrogram [Fig. 3(a)]. It also had 10 bands of electrical noise at 5, 10, 13, 15, 20, 25, 30, 33, 40, and 45 kHz. The electrical noise bands did not show up in any IMF indices [Fig. 3(b)], indicating that the time-domain processing inherent in EMD was insensitive to, or able to "filter out," this type of noise. Because electrical noise is continuous and of lower amplitude relative to the other signal features in the time series, EMD ignored this component in the sifting process for this dataset. Other datasets are being analyzed to determine whether EMD will robustly sift electrical noise out. The four different sound sources did show up in various IMF indices. The underwater blow at 1.2 s was evident in IMFs 2, 3, 4, 5, and somewhat in 6 [Fig. 3(b)] as the bumps just past the first second in the red, yellow, purple, green, and cyan lines, respectively. The most obvious (most powerful) IMFs of the underwater blow were 3 and 4. Its statistical EMD identity (i.e., the first and second largest IMF variances) is therefore [3,4]. The beluga whistle from 1.7 to 2.0 s has an EMD identity of [3,5]. (Note that the beluga whistle also shows up in IMFs 2, 4, 6, and 7 on the red, purple, cyan, and maroon lines.) The set of sperm whale clicks from 2.5 to 4.2 s has an EMD identity of [2,1]. The mooring chain noise that overlaps a sperm whale click at 3.1 s has an EMD identity of [1,2].

FIG. 3.

(Color online) (a) The spectrogram of the data sample from Julian day 286, (b) the set of IMFs represented by the difference of envelopes, (c) the correlation coefficients between consecutive IMFs, and (d) the partially reconstructed signal.


The fifth data sample on Julian day 10 in 2010 contained two different sound sources [Fig. 4(a)]. The two humpback whale song units (0–0.6 and 1.0–1.9 s) were difficult to detect, likely because they are faint enough not to exceed the threshold and so are sifted out as noise. They do show up in IMFs 6 and 7, though [Fig. 4(b)], providing them with a statistical EMD identity of [7,6]. The six walrus knocks between 1.8 and 3 s were primarily detected in IMFs 1, 2, 3, and 5, and their EMD identity was calculated as [5,6].

FIG. 4.

(Color online) Same as Fig. 3 except for data sample from Julian day 10.


The various kinds of pulses in a dataset are usually the most difficult to detect and classify when they overlap in time and frequency (Herzig, 2014). It is not surprising, therefore, that this was also the biggest challenge of the EMD detection process. The second data sample from Julian day 291 in 2010 contained two different sound sources [Fig. 5(a)]. The unidentified odontocete buzzes appear at 0.4 s (a short buzz) and 1.4–1.5 s (a longer buzz) and then decompose into a very faint click train. Not only are these fainter than the AWCP pings at 1.95, 3.4, and 4.0 s, but the second buzz overlaps entirely with an AWCP ping. Looking at IMFs 1, 2, 3, and 4, the AWCP pings are clearly the larger peaks, while the buzzes spread out into lower peaks [Fig. 5(b)]. The slowing click trains are only faintly visible in IMFs 3 and 4 and mainly fade into the background. Regardless, the reconstructed signal plot [Fig. 5(d)] clearly shows the two buzzes as lower, wider peaks than the three AWCP pings. This represents the most difficult case we encountered when trying to assign EMD identities because both the AWCP pings and the odontocete buzzes resulted in [1,2] EMD identities. The overlap is partly beneficial in that the AWCP pings, like the mooring chain noise, lead with IMF 1 ([1,–]), and both are contaminating noises. (A dash is used here when an IMF is not being discussed or needed.) But parsing out the odontocete buzzes may require an additional step after the correlation reconstruction. In future work, we propose to use a secondary threshold, such as a peak finder between 0.05 and 0.2 amplitude [Fig. 5(d)], to refine this EMD identity assignment.

FIG. 5.

(Color online) Same as Fig. 3 except for data sample from Julian day 291.


The power of EMD was used as an adaptive tool with dyadic-like filter characteristics to propose a system for detecting and classifying sound sources in a marine acoustic dataset. Using only 35 WAVE files to verify the functionality of the EMD detection process for eight sound sources anecdotally requires fewer analysis hours and files than compiling training and testing sets for other techniques such as random forest or cluster analysis. In practice, the EMD detection process algorithm does not require a manually analyzed dataset—a manually annotated dataset was only used here to illustrate how the EMD detection process works and how its output can then be post-processed for the EMD classification process. Because the EMD classification process relied only on a statistical measure of IMF features (i.e., the variance), this EMD classification step was as automatable as the EMD detection step. The only manual analysis that is still required is to (1) understand which EMD identities are being assigned to which sound source in case they need to be grouped by species and (2) check that both processes' performances are satisfactory when applied to data from a new environment/ecosystem.

The EMD-based detector performs well in terms of finding the underlying features of most signals. Performance does degrade when the signal contains extreme values, though. EMD performs poorly on signals with extreme values because of the nature of the EMD sifting process. When there is at least one very high amplitude peak in an audio sample, it will dominate the smaller peaks (lower amplitudes), making them more difficult to detect. Because the important features of our signals are usually contained in the lower amplitude peaks, the extreme amplitudes (like AWCP pings) need to be compensated for. In future work, the impact of large amplitude components could be addressed with simple pre-processing steps, such as trimming extreme values of the input time-series.
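One such pre-processing step might be percentile clipping of the raw time series. This is our own sketch, not part of the method above; the percentile value is an assumption an analyst would tune:

```python
import numpy as np

def trim_extremes(x, pct=99.5):
    """Clip extreme amplitudes (e.g., loud AWCP pings) to a high percentile
    of |x| so low-amplitude biological peaks are not dominated during sifting."""
    x = np.asarray(x, dtype=float)
    lim = np.percentile(np.abs(x), pct)
    return np.clip(x, -lim, lim)
```

Clipping preserves the timing of the extreme events (useful if they must still be detected) while preventing them from dwarfing the quieter features in the envelope means.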

Once the EMD classification process is refined and tested in other ecosystems, it can inform the growing sector of ecoacoustics that develops and employs acoustic indices to understand and compare soundscapes. If the goals of acoustic indices, as mentioned in Sec. I, are to label each sound source, assign it to an acoustic class or species, and quantify the relative time that each source is present in the simplest way possible, EMD can offer this straightforward approach. The EMD process described here can decompose a signal such that sounds of interest are detected and unique identities are compiled for every sound source in a dataset. In addition to our EMD detection process sifting out noise, the EMD classification process also assigns unique identities to any noise sources that were not sifted out. If a section of a recording was overwhelmingly classified as masking-type sources, it would be an indication that the raw data should be cleaned for noise and run through the EMD process again. In this way, masking sources would be accounted for twofold. Ultimately, the proposed EMD process will yield a presence/absence time series for every sound source, to which simple and fundamental diversity indices like Hilsenhoff or Shannon's could then be applied. No new indices need to be developed; the simplest ones that already exist need cleaner input—like a time series of signals from all species during even the most intense masking conditions.

The proposed EMD detection process semi-blindly and clearly separates sound sources without relying on the frequency domain. With this information, a human analyst does not need to pre-process spectrograms and can quickly assign the statistically generated EMD identities to species, physical processes, and anthropogenic noises. Assembling an EMD identity library and running an EMD classification process can then track how often each sound source is present over any given time frame in any environment. The EMD detection and classification processes may be sensitive to different ambient noise conditions. Any stable ambient acoustic environment with historical data, however, could have an EMD identity library established for it. This environment-specific EMD process could then be applied, in real time, as an algorithm for research and naval vessels as data are recorded on site. In this way, the EMD process would quickly determine which species are present nearby.

The next steps in the development of the methods presented in this paper are to (1) build a larger EMD library to encompass all sound sources in the Bering Sea dataset, (2) test the EMD process on datasets from other biomes, and (3) demonstrate how the EMD library can better inform acoustic indices by providing presence/absence time series robust to the effects of masking. Using signals beyond the ones in this dataset, it will be useful to determine the lowest-frequency signals that IMFs can detect. EMD fills a gap in the signal processing bioacoustics toolbox because it is immediately multi-species capable regardless of the types of species and background sound sources in any dataset, and it has flexibility in assigning EMD identities to uniquely label any transient sound sources present. If a Graphical User Interface could be developed on naval and other ships to quickly and cleanly visualize the IMF and EMD plots, there would be high potential for real-time marine mammal acoustic observations to perform distribution studies or to adhere more closely to the guidelines of the Marine Mammal Protection Act.

Thank you to all researchers involved with the NOAA PMEL Eco-FOCI program who have so graciously assisted in data collection since 2007. We are also grateful to everyone who contributed to the development of the EMD algorithm over the years at the University of New Hampshire. Data collection was supported under ONR Award No. N00014-08-1-0391.

1. The stoppage criterion used in this paper is the Cauchy convergence test.

2. The MATLAB routine envelope(x, w, "rms") is used in this work. The RMS envelope (window length w) is computed rather than the peak envelope to avoid peaks that overshoot or undershoot the threshold but are actually noise sources, which would create false features.
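As a rough illustration of what MATLAB's envelope(x, w, "rms") computes, the following is a minimal Python sketch of a moving-RMS envelope; the function name and padding choice are ours, not the paper's.

```python
import numpy as np

def rms_envelope(x, w):
    """Moving-RMS envelope: each output sample is the RMS of x over a
    centered w-sample window (a sketch analogous to envelope(x, w, 'rms'))."""
    x = np.asarray(x, dtype=float)
    pad = w // 2
    # pad the squared signal at the edges so the output keeps the input length
    xp = np.pad(x ** 2, pad, mode="edge")
    mean_sq = np.convolve(xp, np.ones(w) / w, mode="same")[pad:pad + len(x)]
    return np.sqrt(mean_sq)
```

The envelope-difference feature of the next footnote could then be sketched as `rms_envelope(x, w_short) - rms_envelope(x, w_long)` for two hypothetical window lengths.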

3. The difference of two envelopes is proposed to amplify the sound features and suppress the remaining background sound.

4. R = 1 is a special case in which the greatest similarity is between Cw1D and Cw2D, i.e., the highest correlation coefficient (the only increasing trend) is between the first two data points; this special case's "summation" is therefore only Cw1D. As a more general example, if the highest correlation coefficient is between Cw3D and Cw4D, then the index is R = 3 and the partial reconstruction is the summation of Cw1D, Cw2D, and Cw3D.
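Our reading of this reconstruction rule can be sketched in Python as follows; the function name, the IMF array layout, and the use of NumPy's correlation coefficient in place of the paper's statistic are assumptions for illustration only.

```python
import numpy as np

def partial_reconstruction(imfs):
    """Find the pair of consecutive IMFs with the highest correlation
    coefficient; if that pair is (IMF R, IMF R+1), sum IMFs 1 through R.

    imfs: array of shape (n_imfs, n_samples), ordered from the first IMF."""
    imfs = np.asarray(imfs, dtype=float)
    # correlation coefficient between each consecutive pair of IMFs
    corrs = [np.corrcoef(imfs[i], imfs[i + 1])[0, 1]
             for i in range(len(imfs) - 1)]
    R = int(np.argmax(corrs)) + 1      # 1-based index of the best pair
    return R, imfs[:R].sum(axis=0)     # summation of IMFs 1 ... R
```

In the special case R = 1, the returned "summation" is simply the first IMF, matching the footnote.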

5. Increasing δ may increase the probability of a missed detection (a signal feature fails to exceed the threshold), whereas decreasing δ may increase the probability of false alarms (unwanted noise features exceed the threshold). For this reason, δ should be chosen carefully by an analyst for each ambient acoustic environment, balancing the trade-off between false alarms and missed detections that the project's particular hypothesis requires. A detector with a large δ tolerance might sacrifice some features in the reconstructed signal, especially those with lower amplitudes. In this paper, δ is suggested to be at or below 10% to balance the probability of false alarm against the probability of missed detection.
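The trade-off above can be illustrated with a toy thresholding step. This is a hypothetical sketch, assuming δ is expressed as a fraction of the peak feature level; the paper's exact thresholding rule may differ.

```python
import numpy as np

def detect_features(feature_signal, delta=0.10):
    """Flag samples whose feature level exceeds delta times the peak level.
    Larger delta -> more missed detections (low-amplitude features fail
    the test); smaller delta -> more false alarms from noise features."""
    feature_signal = np.asarray(feature_signal, dtype=float)
    threshold = delta * feature_signal.max()
    return feature_signal > threshold   # boolean detection mask
```

Raising δ from 0.10 toward, say, 0.60 drops mid-level features from the mask, which is exactly the missed-detection behavior the footnote describes.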

6. The two largest power values contain the most relevant information about each sound source and therefore can be used to differentiate one source from another.
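One plausible way to extract such a two-value descriptor is sketched below; the function name, FFT length, and normalization are our assumptions, not the paper's implementation.

```python
import numpy as np

def two_largest_powers(imf, nfft=1024):
    """Keep the two largest values of an IMF's power spectrum as a
    compact descriptor for distinguishing one sound source from another."""
    spec = np.abs(np.fft.rfft(imf, n=nfft)) ** 2 / nfft  # power spectrum
    return np.sort(spec)[-2:][::-1]                      # largest value first
```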

1. Adam, O. (2006). "Advantages of the Hilbert Huang transform for marine mammals signals analysis," J. Acoust. Soc. Am. 120(5), 2965–2973.
2. Adam, O. (2008). "Segmentation of killer whale vocalizations using the Hilbert-Huang transform," EURASIP J. Adv. Signal Process. 10, 245936.
3. Bittle, M., and Duncan, A. (2013). "A review of current marine mammal detection and classification algorithms for use in automated passive acoustic monitoring," in Proceedings of Acoustics, Australian Acoustical Society, Victor Harbor (November 17–20, 2013), pp. 1–8.
4. Boelman, N. T., Asner, G. P., Hart, P. J., and Martin, R. E. (2007). "Multi-trophic invasion resistance in Hawaii: Bioacoustics, field surveys, and airborne remote sensing," Ecol. Appl. 17, 2137–2144.
5. Fairbrass, A. J., Rennert, P., Williams, C., Titheridge, H., and Jones, K. E. (2017). "Biases of acoustic indices measuring biodiversity in urban areas," Ecol. Ind. 83, 169–177.
6. Flandrin, P., Rilling, G., and Goncalves, P. (2004). "Empirical mode decomposition as a filter bank," IEEE Signal Process. Lett. 11, 112–114.
7. Herzing, D. L. (2014). "Clicks, whistles and pulses: Passive and active signal use in dolphin communication," Acta Astronautica 105, 534–537.
8. Huang, N. E. (2005). "Introduction to the Hilbert–Huang transform and its related mathematical problems," in Hilbert-Huang Transform and its Applications (World Scientific Publishing Company, London, England), pp. 1–26.
9. Huang, N. E. (2014). Hilbert-Huang Transform and its Applications (World Scientific Publishing Company, London, England), Vol. 16.
10. Kasten, E. P., Gage, S. H., Fox, J., and Joo, W. (2012). "The remote environmental assessment laboratory's acoustic library: An archive for studying soundscape ecology," Ecol. Inf. 12, 50–67.
11. Lellouch, L., Pavoine, S., Jiguet, F., Glotin, H., and Sueur, J. (2014). "Monitoring temporal change of bird communities with dissimilarity acoustic indices," Methods Ecol. Evol. 5(6), 495–505.
12. Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., and Alsaadi, F. E. (2017). "A survey of deep neural network architectures and their applications," Neurocomputing 234, 11–26.
13. Ni, S.-H., Xie, W.-C., and Pandey, M. (2011). "Application of Hilbert-Huang transform in generating spectrum-compatible earthquake time histories," ISRN Signal Process. 2011, 1–17.
14. Nystuen, J. A. (1998). "Temporal sampling requirements for autonomous rain gauges," J. Atmos. Oceanic Tech. 15, 1254–1261.
15. Pan, W., Shen, X., and Liu, B. (2013). "Cluster analysis: Unsupervised learning via supervised learning with a non-convex penalty," J. Mach. Learn. Res. 14, 1865–1889.
16. Parks, S. E., Miksis-Olds, J. L., and Denes, S. L. (2014). "Assessing marine ecosystem acoustic diversity across ocean basins," Ecol. Inf. 21, 81–88.
17. Parsons, M., Erbe, C., McCauley, R., McWilliam, J., Marley, S., Gavrilov, A., and Parnum, I. (2016). "Long-term monitoring of soundscapes and deciphering a usable index: Examples of fish choruses from Australia," Proc. Mtgs. Acoust. 27(1), 010023.
18. Peng, Z. K., Tse, P. W., and Chu, F. L. (2005). "A comparison study of improved Hilbert–Huang transform and wavelet transform: Application to fault diagnosis for rolling bearing," Mech. Syst. Signal Process. 19, 974–988.
19. Pieretti, N., Farina, A., and Morri, D. (2011). "A new methodology to infer the singing activity of an avian community: The Acoustic Complexity Index (ACI)," Ecol. Indic. 11, 868–873.
20. Rilling, G., and Flandrin, P. (2006). "On the influence of sampling on the Empirical Mode Decomposition," in IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.
21. Stork, M. (2012). "Hilbert-Huang transform and its applications in engineering and biomedical signal analysis," in Recent Researches in Circuits and Systems, Recent Advances in Electrical Engineering Series 3 (WSEAS Press).
22. Sueur, J., Farina, A., Gasc, A., Pieretti, N., and Pavoine, S. (2014). "Acoustic indices for biodiversity assessment and landscape investigation," Acta Acust. Acust. 100, 772–781.
23. Sueur, J., Pavoine, S., Hamerlynck, O., and Duvail, S. (2008). "Rapid acoustic survey for biodiversity appraisal," PLoS One 3(12), 1–9.
26. Towsey, M., Wimmer, J., Williamson, I., and Roe, P. (2013). "The use of acoustic indices to determine avian species richness in audio-recordings of the environment," Ecol. Inf. 21, 110–119.
27. Villanueva-Rivera, L. J., Pijanowski, B. C., Doucette, J., and Pekin, B. (2011). "A primer of acoustic analysis for landscape ecologists," Landsc. Ecol. 26, 1233–1246.