This letter proposes an empirical mode decomposition (EMD) based hidden Markov model (HMM) approach for detecting the pulse calls of mysticetes such as the Bryde's whale. The detection capability of an HMM depends on the feature extraction (FE) technique deployed. The EMD is proposed as a performance-efficient alternative to the popular Mel-scale frequency cepstral coefficient (MFCC) and linear predictive coefficient (LPC) FE techniques. The amplitude modulation–frequency modulation components derived from the EMD process are modified to form feature vectors for the HMM. The ensemble EMD (EEMD) is adapted in the same way as the EMD. The proposed EMD-HMM and EEMD-HMM approaches achieved better performance than the MFCC-HMM and LPC-HMM approaches.

Some mysticetes, such as the Bryde's whale, produce short pulse calls for communication or echolocation. These characteristic pulse calls are difficult to study by traditional visual inspection. Therefore, reliable algorithms that can accurately detect these pulse calls with a low or zero false positive rate, F, are necessary. Numerous detection algorithms have been put forward in recent times to address this need. Among others, machine learning (ML) tools such as k-means clustering,1 Gaussian mixture models (GMMs),2 support vector machines (SVMs),3 and hidden Markov models (HMMs)4 are widely used in the literature because of their ability to model the characteristics of mysticetes' pulse calls.5 Likewise, these ML tools have been used to detect and analyse the movement of other obscure but vocal marine species.6

HMMs have been widely used for the detection and classification of cetaceans' sounds in general. The HMM is an efficient and reliable ML tool that can be easily used to model cetaceans' vocal repertoire.7 It can detect and classify sounds from a defined series of observations, thereby achieving a high sensitivity, S, and a low false positive rate, F.5,6 Regardless, the performance of the HMM depends on the feature vector, which we consider a reliability vector, R, of the HMM. In most cases, the more reliable the R, the better the S and F performance of the HMM. The R fed into the HMM can be derived using different feature extraction (FE) techniques. However, the Mel-scale frequency cepstral coefficient (MFCC)8 and linear predictive coefficient (LPC)9 FE techniques are widely used for cetaceans' vocal signals in general, and especially for the detection and classification of the pulse calls of mysticetes (a suborder of cetaceans).

In this letter, we introduce an empirical mode decomposition (EMD)10 based HMM approach for the detection of mysticetes' pulse calls. The EMD is a data-driven multi-resolution decomposition tool suitable for the analysis of non-linear and non-stationary datasets. It decomposes a signal over an arbitrary duration of time into amplitude-modulation–frequency-modulation components, called intrinsic mode functions (IMFs). These IMFs are modified in this letter to form reliability vectors, R, which are fed into an HMM. Furthermore, because of the mode mixing drawback of the EMD, the ensemble EMD (EEMD)11 is adapted in the same way as the EMD to form a reliability vector that can be combined with the HMM. Although the EMD has been used in sound and speech recognition in recent times,12,13 it has not been adapted as a feature vector for the HMM as proposed in this letter. The performance of the proposed EMD-HMM and EEMD-HMM approaches was evaluated on an acoustic dataset of a continuous recording of inshore Bryde's whale pulse calls sampled at 9600 Hz, recorded in the far southwest of South Africa, as shown in Fig. 1 (image produced using Sonic Visualiser). The contributions of this letter are twofold. First, the EMD-HMM and EEMD-HMM produce good sensitivity, S, and negligible false positive rate, F, in comparison with traditional techniques such as the MFCC-HMM and LPC-HMM. Second, the EMD-HMM and EEMD-HMM reduce the computational load added to the HMM as compared to the MFCC-HMM and LPC-HMM. However, the EEMD-HMM approach captured the features of the pulse call better than the EMD-HMM because of the mode mixing problem associated with the EMD. Though we verified the EMD-HMM and EEMD-HMM results with the Bryde's whale pulse calls in this letter, the technique can be extended to other mysticetes that produce characteristic pulse calls.

The HMM is an ML tool popularly used for human and animal sound analysis and recognition, as discussed in a tutorial by Rabiner.4 The HMM is a stochastic state machine in which a change in the hidden state ends with the emission of a symbol. It is flexible and can readily model and classify the features of a set of sound waveforms. The HMM process can be viewed in two steps: a training step and a detection step. During training, the HMM estimates three major parameters from the R of a sound waveform: (1) the start probability, s, (2) the transition matrix, T, and (3) the emission distribution, ψ = {ν, Σ, ϕ}, where ν is the mean, Σ is the covariance matrix, and ϕ is the mixture weight. Thus, the HMM represents the signal as a sequence of states across time as defined by T. The entries of T are the probabilities of moving from one state to another. The ψ parameters are computed using the Baum-Welch (BM) algorithm.14 Although the parameters ν, Σ, and ϕ can assume flat starts or random values at the beginning of the BM process, HMMs are quite sensitive to flat starts because they limit the performance of the model. Here, k-means clustering coupled with the GMM is used to initialise the emission distribution parameters, as shown in Fig. 2. In the detection step, the trained ψ parameters are matched to the sound to be recognised using the Viterbi algorithm (V-alg).15 The V-alg takes s, T, and the R of the sound to be recognised (note that this R has been processed with the trained ψ parameters as indicated in Fig. 2), and it outputs the most closely matched sequence of hidden states. The HMM approach has been shown to be viable in marine mammal sound detection and classification; recent works include Refs. 5 and 16.
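The detection step can be illustrated with a minimal log-domain Viterbi decoder. This is only a sketch under stated assumptions: the function name `viterbi` and its arguments are hypothetical, and the emission log-likelihoods (the R already scored against the trained ψ parameters) are assumed to be supplied as a precomputed array rather than evaluated from a GMM here.

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit):
    """Most likely hidden-state sequence.

    log_start: (n_states,) log start probabilities (log s).
    log_trans: (n_states, n_states) log transition matrix (log T), row = from-state.
    log_emit:  (n_frames, n_states) log-likelihood of each feature vector per state
               (assumed precomputed from the trained emission parameters).
    """
    n_frames, n_states = log_emit.shape
    delta = np.empty((n_frames, n_states))            # best log score ending in each state
    back = np.zeros((n_frames, n_states), dtype=int)  # best predecessor of each state
    delta[0] = log_start + log_emit[0]
    for t in range(1, n_frames):
        scores = delta[t - 1][:, None] + log_trans    # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = np.empty(n_frames, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n_frames - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```

Working in the log domain avoids numerical underflow when many frames are decoded, which is why the probabilities are passed as logarithms.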

The combination of the HMM with the BM algorithm and the V-alg has been used in several applications, as mentioned. Notwithstanding, the R plays an important role when the HMM approach is employed in sound detection and classification. A reliable R aids the detection capability of the HMM. We emphasise that the more reliable the R, the better the performance of the HMM in terms of both S and F. Therefore, this letter focuses on improving the reliability of the feature vector. Several FE methods have been used with the HMM, but the MFCC and LPC methods are the most widely used for marine sounds. The MFCC extracts the feature coefficients from the sound signal by converting it from the time domain into the Mel frequency scale. This process can be achieved in seven sequential steps, as discussed in Ref. 8: (1) pre-emphasis, (2) framing, (3) windowing, (4) fast Fourier transform, (5) Mel-scale filter bank, (6) logarithm operation, and (7) discrete cosine transform. On the other hand, the LPC calculates the feature coefficients by using a linear combination of previous samples of the sound signal to predict an approximate value for the current sample.9 Typically, both the MFCC and LPC use between 10 and 14 coefficients; we assume 12 coefficients in this letter. That is, the R is 12-dimensional. Note that the higher the dimension of R, the greater the computational load of the HMM. In this regard, we introduce the EMD and EEMD as performance-efficient alternatives to the MFCC and LPC FE methods for the pulse calls of mysticetes such as the Bryde's whale.
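To make the LPC description concrete, the sketch below estimates the prediction coefficients of one frame with the autocorrelation method. It is an illustration only: the function name `lpc` is hypothetical, the default order of 12 follows the dimension assumed in this letter, and the normal equations are solved directly rather than with the Levinson-Durbin recursion often used in practice.

```python
import numpy as np

def lpc(frame, order=12):
    """Linear prediction coefficients a[0..order-1] such that
    frame[n] is approximated by sum_k a[k] * frame[n - 1 - k]."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz normal equations: R a = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])
```

For a frame that decays geometrically, for example `0.9 ** np.arange(200)`, a first-order fit returns a single coefficient very close to 0.9, which matches the intuition that each sample is predicted from the one before it.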

The EMD, introduced in Ref. 10, has been used by different researchers to solve a number of signal processing problems. Recent signal processing applications of the EMD include radar signal emitter recognition,11 speech analysis,13 and marine mammal sound detection.12 The EMD is a completely data-driven tool that decomposes a signal z(t) over an arbitrary duration of time t into IMFs. The IMFs are zero-mean functions representing the features of the decomposed signal. They are estimated in an iterative process referred to as sifting and are arranged from the highest frequency content in Hz to the lowest. The sifting process iteratively removes the large-scale features of the signal until only the fine-scale features remain. An IMF must satisfy two important conditions: (1) the number of extrema of the signal z(t) and the number of zero crossings must be equal or differ by at most one and (2) the mean value of the extrema envelopes must be equal to zero.10 In summary, the sifting process to decompose a signal z(t) is as follows.

  1. Calculate all the extrema of z(t) over a time t. Then, use a cubic spline to interpolate the maxima and the minima to form the upper and lower envelopes, respectively.

  2. Subtract the mean U of the extrema envelopes from the original signal z(t) to obtain the first component of z(t). That is, $y(t) = z(t) - U$.

  3. Replace the original signal z(t) with y(t), and perform further decomposition on y(t) by iteratively repeating steps (1) and (2) until the two IMF conditions hold.

  4. Subtract the first IMF ($I_1$) obtained from steps (1)–(3) from the original signal z(t) to obtain a residual. That is, $r = z(t) - I_1$.

  5. Take the residual r as the original signal z(t), and perform further decomposition as in steps (1)–(3) until the specified number of IMFs is obtained.

Note that the obtained IMFs can be combined to reconstruct the original signal as

$$\bar{z}(t) = \sum_{n=1}^{j} I_n + r_j, \tag{1}$$

where $\bar{z}(t)$ is an approximation of z(t) and $r_j$ is the residual signal after j IMFs have been extracted.
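The sifting steps and the reconstruction in Eq. (1) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, both envelopes are pinned to the first and last samples (an assumed boundary treatment), and sifting stops when the mean envelope carries little energy (an assumed stand-in for checking the two IMF conditions). It relies on `scipy.interpolate.CubicSpline` for the cubic spline interpolation.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def envelope_mean(y, t):
    """Mean of the cubic-spline envelopes through the maxima and minima of y,
    or None when y has too few extrema to build them."""
    d = np.diff(y)
    maxima = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] >= 0))[0] + 1
    if len(maxima) < 2 or len(minima) < 2:
        return None
    ends = [0, len(y) - 1]  # assumption: pin both envelopes to the end samples
    up = np.sort(np.concatenate((maxima, ends)))
    lo = np.sort(np.concatenate((minima, ends)))
    upper = CubicSpline(t[up], y[up])(t)
    lower = CubicSpline(t[lo], y[lo])(t)
    return 0.5 * (upper + lower)

def sift(z, t, max_iter=50, tol=1e-3):
    """Steps (1)-(3): repeatedly subtract the mean envelope to extract one IMF."""
    y = z.copy()
    for _ in range(max_iter):
        mean_env = envelope_mean(y, t)
        if mean_env is None:
            break
        y = y - mean_env
        # Assumed stopping rule: the mean envelope is negligible.
        if np.sum(mean_env ** 2) < tol * np.sum(y ** 2):
            break
    return y

def emd(z, t, n_imfs=6):
    """Steps (4)-(5): peel off up to n_imfs IMFs; returns (IMF matrix, residual).
    The IMF matrix has one row per sample and one column per IMF."""
    residual = np.asarray(z, dtype=float).copy()
    imfs = []
    for _ in range(n_imfs):
        if envelope_mean(residual, t) is None:
            break
        imf = sift(residual, t)
        imfs.append(imf)
        residual = residual - imf
    cols = np.column_stack(imfs) if imfs else np.empty((len(residual), 0))
    return cols, residual
```

Because each IMF is subtracted from the running residual, summing the returned columns and adding the residual recovers the input up to floating-point error, which is the relation stated in Eq. (1).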

The EEMD addresses the mode mixing issue associated with the EMD. An IMF in the EEMD is the mean of the IMFs obtained over an ensemble of trials, X, using the EMD. Unlike the EMD, additive white Gaussian noise (AWGN) with a unique amplitude that is randomly varied for each trial is added to z(t) before decomposition.11,17 The EEMD steps are summarised as follows.

  1. Specify the number of trials, X, and set the trial index, x = 1.

  2. Add AWGN, $N_x(t)$ [the amplitude of $N_x(t)$ should differ from the previous amplitudes], to z(t) to obtain a modified signal. That is, $w_x(t) = z(t) + N_x(t)$.

  3. Decompose $w_x(t)$ using the EMD steps in Sec. 4 to obtain j IMFs.

  4. Increment x and repeat steps (2) and (3) until x = X + 1.

  5. Compute the mean of each of the j IMFs over the X trials to obtain the ensemble IMFs, eI. Note that the eI can also be combined to reconstruct an approximation of z(t), as in the case of the EMD.
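The ensemble averaging in the EEMD steps can be sketched as below. The sketch is deliberately independent of any particular EMD routine: `decompose` is a hypothetical callable that must return a matrix with one row per sample and exactly j columns on every call. The noise amplitude range, expressed as a fraction of the signal's standard deviation, and the fixed seed are also assumptions, not values taken from this letter.

```python
import numpy as np

def eemd(z, decompose, n_trials=50, amp_range=(0.1, 0.3), seed=0):
    """Ensemble IMFs: the mean, over n_trials noisy copies of z, of decompose(copy).

    decompose: hypothetical callable mapping a 1-D signal to an
               (n_samples, j) IMF matrix with the same j on every call.
    amp_range: assumed range for the per-trial noise amplitude, as a
               fraction of the standard deviation of z.
    """
    z = np.asarray(z, dtype=float)
    rng = np.random.default_rng(seed)
    scale = z.std()
    total = None
    for _ in range(n_trials):
        # Step 2: a fresh, randomly drawn amplitude for every trial.
        amp = rng.uniform(*amp_range) * scale
        w = z + amp * rng.standard_normal(z.shape)
        # Step 3: decompose the noisy copy.
        imfs = decompose(w)
        total = imfs if total is None else total + imfs
    # Step 5: average each IMF over the ensemble.
    return total / n_trials
```

Drawing the amplitude from a continuous range makes repeated values practically impossible, which is one simple way to meet the requirement that each trial uses a different noise amplitude.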

In this letter, the IMFs obtained from the EMD (I) and EEMD (eI) processes are modified to compute R. Given a sound waveform z(t), the estimated I or eI usually take the matrix form

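$$I_{i,j} = \begin{bmatrix} I_{1,1} & I_{1,2} & \cdots & I_{1,j} \\ I_{2,1} & I_{2,2} & \cdots & I_{2,j} \\ \vdots & \vdots & \ddots & \vdots \\ I_{i,1} & I_{i,2} & \cdots & I_{i,j} \end{bmatrix}, \tag{2}$$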

where i is the number of points at which the signal is sampled over time t and j is the number of specified IMFs. Suppose $I_{i,j}$ is divided into τ windows, each containing approximately l sampling points. Each window can then be represented as

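$$I_{l,j} = \begin{bmatrix} I_{1,1} & I_{1,2} & \cdots & I_{1,j} \\ I_{2,1} & I_{2,2} & \cdots & I_{2,j} \\ \vdots & \vdots & \ddots & \vdots \\ I_{l,1} & I_{l,2} & \cdots & I_{l,j} \end{bmatrix}. \tag{3}$$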

Hence, we propose that R can be calculated for each of the τ windows with l sampling points as follows.

Step 1: Calculate the mean of each column of $I_{l,j}$ to form a 1 × j vector as

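$$\mu_{I_{l,j}} = \left[ \left( \frac{1}{l} \sum_{a=1}^{l} I_a \right)_{1}, \left( \frac{1}{l} \sum_{a=1}^{l} I_a \right)_{2}, \ldots, \left( \frac{1}{l} \sum_{a=1}^{l} I_a \right)_{j} \right]. \tag{4}$$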

Step 2: Subtract the value in each column of $\mu_{I_{l,j}}$ from each element in the corresponding column of $I_{l,j}$ to obtain a filtered $I_{l,j}$ as

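$$I^{f}_{l,j} = I_{l,j} - \mu_{I_{l,j}} = \begin{bmatrix} I^{f}_{1,1} & I^{f}_{1,2} & \cdots & I^{f}_{1,j} \\ I^{f}_{2,1} & I^{f}_{2,2} & \cdots & I^{f}_{2,j} \\ \vdots & \vdots & \ddots & \vdots \\ I^{f}_{l,1} & I^{f}_{l,2} & \cdots & I^{f}_{l,j} \end{bmatrix}. \tag{5}$$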

Step 3: Compute the root mean square (rms) of each row of $I^{f}_{l,j}$ to obtain an l × 1 vector as

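$$I^{\mathrm{rms}}_{1,l} = \begin{bmatrix} \sqrt{\dfrac{1}{j} \sum_{u=1}^{j} \left( I^{f}_{1,u} \right)^{2}} \\ \sqrt{\dfrac{1}{j} \sum_{u=1}^{j} \left( I^{f}_{2,u} \right)^{2}} \\ \vdots \\ \sqrt{\dfrac{1}{j} \sum_{u=1}^{j} \left( I^{f}_{l,u} \right)^{2}} \end{bmatrix}. \tag{6}$$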

Step 4: Find the upper (U) and lower (L) envelopes of $I_{l,j}$ as

$$U = \mu_{I_{l,j}} + I^{\mathrm{rms}}_{1,l} \quad \text{and} \quad L = \mu_{I_{l,j}} - I^{\mathrm{rms}}_{1,l}. \tag{7}$$

Step 5: Amplify the features while suppressing all unwanted sound sources as

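$$K = UL = \begin{bmatrix} K_{1,1} & K_{1,2} & \cdots & K_{1,j} \\ K_{2,1} & K_{2,2} & \cdots & K_{2,j} \\ \vdots & \vdots & \ddots & \vdots \\ K_{l,1} & K_{l,2} & \cdots & K_{l,j} \end{bmatrix}. \tag{8}$$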

Step 6: Thus, R is obtained by finding the mean of each column of K, that is

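$$R = \left[ \left( \frac{1}{l} \sum_{a=1}^{l} K_a \right)_{1}, \left( \frac{1}{l} \sum_{a=1}^{l} K_a \right)_{2}, \ldots, \left( \frac{1}{l} \sum_{a=1}^{l} K_a \right)_{j} \right] = [\sigma_1, \sigma_2, \ldots, \sigma_j]. \tag{9}$$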

Therefore, over all τ windows, the extracted features for the sound signal z(t) form a matrix defined as

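$$R_{\tau,j} = \begin{bmatrix} \sigma_{1,1} & \sigma_{1,2} & \cdots & \sigma_{1,j} \\ \sigma_{2,1} & \sigma_{2,2} & \cdots & \sigma_{2,j} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{\tau,1} & \sigma_{\tau,2} & \cdots & \sigma_{\tau,j} \end{bmatrix}. \tag{10}$$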
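Steps 1 to 6 and the windowing that leads to Eq. (10) can be sketched with array broadcasting. The function names are hypothetical, and incomplete trailing windows are simply dropped, which is an assumption. Most importantly, the way U and L are combined in Eq. (8) is not spelled out in the text, so the sketch takes the combining operation as a parameter; the elementwise product used as the default is purely a placeholder assumption, not the authors' stated choice.

```python
import numpy as np

def envelopes(window):
    """Steps 1-4 for one window of shape (l, j): rows are samples, columns are IMFs."""
    mu = window.mean(axis=0, keepdims=True)                      # Step 1: 1 x j column means
    filtered = window - mu                                       # Step 2: remove column means
    rms = np.sqrt((filtered ** 2).mean(axis=1, keepdims=True))   # Step 3: l x 1 row rms
    upper = mu + rms                                             # Step 4: broadcasts to l x j
    lower = mu - rms
    return upper, lower

def feature_vector(window, combine=np.multiply):
    """Steps 5-6: combine the envelopes into K, then average each column of K."""
    upper, lower = envelopes(window)
    k = combine(upper, lower)   # Step 5: operator is an assumption (see lead-in)
    return k.mean(axis=0)       # Step 6: 1 x j feature vector R

def feature_matrix(imfs, l, combine=np.multiply):
    """Stack the per-window feature vectors into a (tau, j) matrix.
    Only complete windows of l rows are used (an assumption)."""
    tau = imfs.shape[0] // l
    return np.array([feature_vector(imfs[w * l:(w + 1) * l], combine)
                     for w in range(tau)])
```

With j = 6 IMFs, each window contributes one six-element row, so the matrix passed to the HMM has τ rows and six columns.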

Note that the R extracted using the EMD and EEMD methods is j-dimensional. Thus, the number of IMFs, j, determines the dimension of the R. In this letter, the number of IMFs is fixed to six (j = 6) because we observed that increasing the number of IMFs does not necessarily improve the performance of the detection process. Rather, it increases the computational load of the HMM. As a result, with respect to the computational complexity added to the HMM, the six-dimensional EMD and EEMD feature vectors compare favourably with the 12-dimensional MFCC and LPC feature vectors.

The filtered sound dataset was annotated and grouped into two classes: (1) identified Bryde's whale pulse calls and (2) unwanted signals, which include noise. This implies that there are two trained HMMs, each assuming four states and two mixtures: one HMM for the unwanted signals (1,T1, ψ1) and the other for the pulse calls (2,T2, ψ2). The number of test samples, s, is varied while training the two models. The test samples are divided into τ windows with approximately l sampling points, and we varied l for a detailed performance comparison. The separately trained HMMs are combined to form an eight-state, four-mixture HMM that detects an unknown waveform as either a pulse call or an unwanted signal with the aid of the V-alg.
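How the two trained models are merged is not detailed here. One common construction, shown purely as an assumption, places the two transition matrices on the diagonal blocks of a larger matrix and allows a small switching probability between the models. The function name `combine_hmms`, the `switch` parameter, and the equal split of the start probabilities are all hypothetical; the sketch covers only the start and transition parameters, with each state keeping its own emission distribution.

```python
import numpy as np

def combine_hmms(s1, T1, s2, T2, switch=0.01):
    """Merge two HMMs into one whose states are the union of both state sets.

    s1, s2: start probabilities of each model.
    T1, T2: row-stochastic transition matrices of each model.
    switch: assumed probability of jumping from one model to the other.
    """
    s1, s2 = np.asarray(s1, dtype=float), np.asarray(s2, dtype=float)
    T1, T2 = np.asarray(T1, dtype=float), np.asarray(T2, dtype=float)
    n1, n2 = len(s1), len(s2)
    T = np.zeros((n1 + n2, n1 + n2))
    T[:n1, :n1] = (1 - switch) * T1   # stay inside model 1
    T[:n1, n1:] = switch * s2         # jump to model 2, entering via its start distribution
    T[n1:, n1:] = (1 - switch) * T2   # stay inside model 2
    T[n1:, :n1] = switch * s1         # jump to model 1
    s = np.concatenate((0.5 * s1, 0.5 * s2))  # assumed equal prior on the two models
    return s, T
```

With two four-state models the result has eight states, and decoding it with the V-alg labels each window by which block of states the best path visits.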

The results of the HMM detection process are presented in Tables 1–3 in terms of the sensitivity, S, and the false positive rate, F. The S is a measure of the detection accuracy of a detector, while the F measures its reliability. A high value is desired for the S and a low value for the F. The tables show the performance of the MFCC, LPC, EMD, and EEMD FE methods with respect to the HMM as a function of l and s. In Table 1, the EMD FE method offered superior S performance as compared to the MFCC and LPC. The EMD is also the most reliable of the three, as it has the lowest F in Table 1. Similarly, the S and F measures of the EMD method outperform those of the MFCC and the LPC in Tables 2 and 3. Also, observe that the performance of all the FE methods improves as s increases and l reduces. Moreover, the EEMD gave a better performance than the EMD method because of the mode mixing drawback of the EMD. Yet, both the EMD and EEMD methods can be used as performance-efficient alternatives, as shown in the results presented. Aside from reducing the computational load of the HMM due to their reduced R dimension, the EMD and EEMD methods offer better S and F performance than the MFCC and LPC methods.
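For readers who want to reproduce the two measures from detection counts, a minimal sketch follows. It assumes the standard confusion-matrix definitions, which the letter does not state explicitly; the function name and argument names are hypothetical.

```python
def sensitivity_and_fpr(tp, fn, fp, tn):
    """Return (S, F) under the standard definitions (an assumption):
    S = TP / (TP + FN), the fraction of true pulse calls that are detected;
    F = FP / (FP + TN), the fraction of unwanted signals flagged as pulse calls."""
    s = tp / (tp + fn)
    f = fp / (fp + tn)
    return s, f
```

For example, 90 detected calls out of 100 annotated calls and 5 false alarms out of 100 unwanted segments give S = 0.9 and F = 0.05.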

This letter introduced the EMD and EEMD as performance-efficient alternative FE methods for HMM detection of pulse calls. The proposed EMD-HMM and EEMD-HMM approaches were shown to be effective in comparison to the conventional MFCC-HMM and LPC-HMM approaches. The proposed methods offered better sensitivity and false positive rate performance while reducing the computational load of the HMM. Therefore, we suggest that the EMD-HMM and EEMD-HMM approaches can be useful in real-time detection of the pulse calls of mysticetes such as the Bryde's whale.

The financial assistance of the National Research Foundation (NRF) towards this research is hereby acknowledged. Opinions expressed and conclusions arrived at are those of the author and are not necessarily to be attributed to the NRF (Grant No. 116036).

1. Forgy, E. W. (1965). “Cluster analysis of multivariate data: Efficiency versus interpretability of classification,” Biometrics 21(3), 768–769.
2. Reynolds, D. (2009). Gaussian Mixture Models (Springer U.S., Boston, MA), pp. 659–663.
3. Roch, M., Soldevilla, M., Hoenigman, R., Wiggins, S., and Hildebrand, J. (2008). “Comparison of machine learning techniques for the classification of echolocation clicks from three species of odontocetes,” Can. Acoust. 36, 41–47.
4. Rabiner, L. R. (1989). “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE 77(2), 257–286.
5. Putland, R., Ranjard, L., Constantine, R., and Radford, C. (2018). “A hidden Markov model approach to indicate Bryde's whale acoustics,” Ecol. Ind. 84, 479–487.
6. Yao, R., Johnson, M., Clemins, P., Michael, D., Glaeser, S., Osiejuk, T., and Ebenezer, O.-N. (2009). “A framework for bioacoustic vocalization analysis using hidden Markov models,” Algorithms 2, 1410–1428.
7. Jurafsky, D. and James, H. M. (2019). “Speech and language processing,” https://web.stanford.edu/~jurafsky/slp3/edbook_oct162019.pdf (Last viewed 15 April, 2019), pp. 1–621.
8. Majeed, S., Husain, H., Abdul Samad, S., and Idbeaa, T. (2015). “Mel frequency cepstral coefficients (MFCC) feature extraction enhancement in the application of speech recognition: A comparison study,” J. Theor. Appl. Inf. Technol. 79(1), 38–56.
9. Makhoul, J. (1975). “Linear prediction: A tutorial review,” Proc. IEEE 63(4), 561–580.
10. Huang, N. E., Shen, Z., Long, S. R., Wu, M. C., Shih, H. H., Zheng, Q., Yen, N.-C., Tung, C. C., and Liu, H. H. (1998). “The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis,” Proc. R. Soc. London, Ser. A: Math. Phys. Eng. Sci. 454(1971), 903–995.
11. Liu, B., Feng, Y., Yin, Z., and Fan, X. (2019). “Radar signal emitter recognition based on combined ensemble empirical mode decomposition and the generalized S-transform,” Math. Probl. Eng. 2019, 1–15.
12. Seger, K., Al-Badrawi, M., Miksis-Olds, J., Kirsch, N., and Lyons, A. (2018). “An empirical mode decomposition-based detection and classification approach for marine mammal vocal signals,” J. Acoust. Soc. Am. 144, 3181–3190.
13. Sharma, R., Vignolo, L., Schlotthauer, G., Colominas, M., Rufiner, H. L., and Prasanna, S. (2017). “Empirical mode decomposition for adaptive AM-FM analysis of speech: A review,” Speech Commun. 88, 39–64.
14. Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains,” Ann. Math. Stat. 41(1), 164–171.
15. Viterbi, A. (1967). “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Trans. Inf. Theory 13(2), 260–269.
16. Brown, J. and Miller, P. J. (2007). “Automatic classification of killer whale vocalizations using dynamic time warping,” J. Acoust. Soc. Am. 122, 1201–1207.
17. Wu, Z. and Huang, N. (2009). “Ensemble empirical mode decomposition: A noise-assisted data analysis method,” Adv. Adapt. Data Anal. 1, 1–41.