A compressive-sensing approach called the Sparse Representation Classifier (SRC) is applied to the classification of bottlenose dolphin whistles by type. The SRC algorithm constructs a dictionary of whistles from the collection of training whistles. In the classification phase, an unknown whistle is represented sparsely as a linear combination of the training whistles, and the call class is then determined with an *l*_{1}-norm optimization procedure. Experimental studies conducted in this research reveal the advantages and limitations of the proposed method relative to existing techniques such as *K*-Nearest Neighbors and Support Vector Machines in distinguishing different vocalizations.

## I. Introduction

The Sparse Representation Classifier (SRC),^{1,2} originally developed for face recognition, has since been found to be useful for classification of a wide range of signals from speech and music^{3,4} to biomedical signals^{5} and bird phrases.^{6} This paper reports on the research done to develop its adaptation to the classification of dolphin whistles represented as spectrographic images in the time-frequency domain.

The sparseness exploited in the SRC method relies on the training data set being a near complete representation of its class, so that any test data vector can be expressed as a sparse linear combination of the training data. If the coefficient sparseness can be satisfied by the training and the test data, then the exact nature of the features is no longer critical as long as they are adequate in quantity. The claimed liberty in the choice of features and the robustness to occlusion are appealing for the classification of any underwater recording, and certainly apply to dolphin whistles.

Face recognition with the SRC uses multiple images for every person in the training pool. The test image is expected to be written as a weighted sum of all the images so that the coefficients corresponding to the matching person's images are much larger than all the others. One can then conclude that the representation is sparse in the space of coefficients and use the *l*_{1}-norm to match the concentration of the coefficients to the person. Deviations from the ideal case of zero coefficients for the "non-matching" persons' images are treated as noise. The coefficients are typically determined by an inner product or a zero-lag correlation measurement, so that if a facial image is sufficiently cropped, that is, if the face occupies most of the image space, then a high correlation is expected within the class. In contrast, dolphin whistles are narrowband signals, so the identifying parts of the whistle spectrograms have very small support on the time-frequency plane. They are embedded in ambient noise and mixed with echolocation clicks. That the whistles are not dense in the spectrogram prevents the training set of spectrogram images from being a near complete representation of its class. Such sparseness also begets a very low signal-to-noise ratio (SNR) and signal-to-interference ratio (SIR) overall. Here we distinguish the underwater ambient noise, which is present nearly everywhere in the time-frequency domain, from the localized echolocation clicks, which we model as interference. The fortunate circumstance has to do with the localized versions of the SNR and SIR: The former is very high (20 to 30 dB) in the vicinity of the whistles, and the echolocation interference does not overlap the whistles except at a negligible number of narrow regions. Cropping the spectrogram to a region that contains only the defining segment of a whistle is akin to analyzing the signal over a sub-band.

To prepare the whistle spectrogram data for the SRC classifier, we chose to preprocess it using the Local Binary Pattern (LBP)^{7} operator. Most preprocessing procedures for this task involve contour tracing,^{8–11} but the LBP technique does not rely on whistle contours to obtain salient information. The LBP operator encodes both the global and the local characteristics of the calls into a compact representation and eliminates the need for tedious formulations, parameter derivations, denoising, and other prior processing. To establish identifying feature vectors, the operator creates binary pattern templates of contours by exploiting the difference between connected, line-forming pixels and diffuse textures. LBP operates on the spectrogram defined over the time-frequency domain to extract the important features, which are fed directly to the SRC algorithm. Classes of dolphin calls can then be determined by the basis pursuit algorithm or other procedures that minimize the *l*_{1}-norm of the coefficient vector. Introducing refinements to the simple SRC implementation can significantly improve the classification performance. The results of our experimental studies demonstrate that the SRC method coupled with LBP features is capable of distinguishing classes of vocalizations with nearly perfect accuracy.

## II. SRC

The application of compressive sensing to classification has been reported by Wright *et al.*,^{1} where the test sample's sparseness over the training data is exploited. Let us consider a dictionary $U=[U_1,U_2,\ldots,U_d]$ with *d* classes, where each class $U_i\in\mathbb{R}^{m\times n_i}$ consists of $n_i$ training vectors as

$$U_i=[u_{i1},u_{i2},\ldots,u_{in_i}]. \tag{1}$$

The hypothesis is that a test sample $y\in\mathbb{R}^{m}$ from the *i*th class can be represented as a linear combination of training samples of the same class as

$$y=\beta_{i1}u_{i1}+\beta_{i2}u_{i2}+\cdots+\beta_{in_i}u_{in_i}, \tag{2}$$

where $\beta_{ij},\ j=1,2,\ldots,n_i$, are weights or coefficients evaluated via a correlation-measuring inner product of the test sample $y$ with the training vectors $u_{ij}$. Under ideal circumstances, Eq. (2) can also be rewritten using all the training samples of the *d* distinct classes as

$$y=\sum_{k=1}^{d}\sum_{j=1}^{n_k}\beta_{kj}u_{kj}. \tag{3}$$

Organizing all the weights in order of class into the vector $\alpha_0\in\mathbb{R}^{n_1+n_2+\cdots+n_d}$, we may write $y=U\alpha_0$. If Eq. (2) is satisfied exactly, then all the elements of the weight vector $\alpha_0$ are zero except those associated with the *i*th class, i.e., $\alpha_0=[0,\ldots,0,\beta_{i1},\ldots,\beta_{in_i},0,\ldots,0]^{T}$.

We deduce that the test sample $y$ is sparse in the domain of $U$, and the objective of a convex optimization problem with a constraint establishing a bound on the residual energy is

$$\bar{\alpha}=\arg\min\|\alpha_0\|_1\quad\text{subject to}\quad\|y-U\alpha_0\|_2\le\epsilon, \tag{4}$$

where $\epsilon>0$ is the error tolerance and $\bar{\alpha}$ is obtained using a linear programming procedure called the Basis Pursuit algorithm, whose computational complexity is proportional to $n^{3}$; this solution forms the basis of the SRC algorithm. Class assignment then follows by matching the indices of the nonzero elements of $\bar{\alpha}$ to the corresponding columns of $U$.

1. Construct the training matrix $U=[U_1,U_2,\ldots,U_d]$ and take the test signal $y$.

2. Solve the convex optimization problem $\bar{\alpha}=\arg\min\|\alpha_0\|_1$ subject to $\|y-U\alpha_0\|_2\le\epsilon$.

3. Compute the residual for each class: $r_i=\|y-U_i\bar{\alpha}_i\|_2$ for $i=1,2,\ldots,d$.

4. Assign the class: $\mathrm{class}(y)=\arg\min_i r_i$.
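As a self-contained illustration of these steps, the following Python sketch (our own toy example, not the paper's MATLAB implementation) solves the basis pursuit program with $\epsilon \to 0$ as a linear program via `scipy.optimize.linprog` and classifies by minimum per-class residual:

```python
import numpy as np
from scipy.optimize import linprog

def src_classify(U_list, y):
    """Sparse Representation Classifier: solve the basis pursuit program
    min ||alpha||_1  subject to  U alpha = y  (Eq. (4) with eps -> 0)
    as a linear program, then assign y to the class whose training
    columns give the smallest residual."""
    U = np.hstack(U_list)                     # full dictionary [U_1, ..., U_d]
    m, n = U.shape
    # LP variables x = [alpha; t]; minimize sum(t) with -t <= alpha <= t
    c = np.concatenate([np.zeros(n), np.ones(n)])
    I = np.eye(n)
    A_ub = np.block([[I, -I], [-I, -I]])      # alpha - t <= 0; -alpha - t <= 0
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([U, np.zeros((m, n))])   # equality constraint U alpha = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    alpha = res.x[:n]
    residuals, start = [], 0
    for Ui in U_list:                         # r_i = ||y - U_i alpha_i||_2
        ni = Ui.shape[1]
        residuals.append(np.linalg.norm(y - Ui @ alpha[start:start + ni]))
        start += ni
    return int(np.argmin(residuals)), residuals

# Toy example: y is built from class-0 atoms, so class 0 should win
rng = np.random.default_rng(0)
U1 = rng.standard_normal((15, 10))            # class 0: 10 training vectors
U2 = rng.standard_normal((15, 10))            # class 1: 10 training vectors
y = U1[:, 0] + 0.5 * U1[:, 3]
cls, r = src_classify([U1, U2], y)
```

With $\epsilon > 0$, one would instead solve the second-order cone program of Eq. (4), e.g., with a convex-optimization package; the class assignment step is unchanged.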


## III. Results and discussion

Recordings of free-ranging bottlenose dolphins resident in Sarasota Bay, on the west coast of Florida, were made nearly annually during brief capture-release events.^{12,13} Custom-built suction-cup hydrophones were attached to the forehead of each individual, allowing researchers to identify the vocalizing dolphins unequivocally. Thus the SNR and general background noise were similar in all the recordings. The hydrophones were not calibrated because amplitude values were not being measured. Whistles were recorded onto either Marantz PMD-430 (Marantz, Itasca, IL) or Sony TC-D5M stereo-cassette recorders (approximate frequency response 30 to 20 000 Hz; Sony Electronics Inc., New York) or Panasonic AG-6400 or AG-7400 video-cassette recorders (approximate frequency response 20 to 32 000 Hz; Panasonic Corp. of North America, Secaucus, NJ).

To evaluate the performance of the proposed algorithms, a collection of 100 dolphin whistles, each belonging to one of four different types, was extracted from underwater passive acoustic recordings of bottlenose dolphins. Training and testing samples were randomly selected to avoid any bias: Half of the collection was declared the training set and the remaining half used as testing data. Figure 1 shows the spectrogram trajectories of the four distinctive whistle types, defined based on their shapes: Upswing, convex-up, convex-down, and up-and-down.^{14} All spectrograms in this work were obtained using a Hamming window of length 1024 (∼13 ms) with 50% overlap at a sampling rate of 80 kHz.
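These spectrogram settings can be reproduced with standard tools. The following Python sketch (the paper's processing was done in MATLAB; the chirp here is a synthetic stand-in, not one of the recordings) applies the stated window, overlap, and sampling rate with `scipy.signal.spectrogram`:

```python
import numpy as np
from scipy import signal

fs = 80_000                       # sampling rate reported in the paper (80 kHz)
t = np.arange(fs) / fs            # one second of signal
# Synthetic upswing-like chirp (5 -> 9 kHz) standing in for a recorded whistle
x = np.sin(2 * np.pi * (5_000 * t + 2_000 * t ** 2))

# Hamming window of length 1024 (1024 / 80000 = 12.8 ms) with 50% overlap
f, frames, S = signal.spectrogram(x, fs=fs, window='hamming',
                                  nperseg=1024, noverlap=512)
```

The frequency axis then has 513 bins from 0 Hz to the 40 kHz Nyquist frequency; the band-limited 615-bin spectrogram discussed later results from further processing.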

It is observed that clicks are mixed with whistles and that their frequency range extends well beyond that of the fundamental whistle. In addition, the frequency range of the fundamental whistles in our data set is between 4 and 16 kHz. Therefore, a cascade of high- and low-pass filters enforcing the desired frequency band was designed to abate low-frequency environmental noise and the higher harmonics.
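A minimal sketch of such a cascade, assuming Butterworth sections (the paper does not specify the filter family or order), is:

```python
import numpy as np
from scipy import signal

fs = 80_000
t = np.arange(fs) / fs
# Test signal: an in-band 8 kHz tone plus an out-of-band 1 kHz component
x = np.sin(2 * np.pi * 8_000 * t) + np.sin(2 * np.pi * 1_000 * t)

# Cascade of a 4 kHz high-pass and a 16 kHz low-pass section enforcing
# the 4-16 kHz fundamental-whistle band reported in the paper
sos_hp = signal.butter(4, 4_000, btype='highpass', fs=fs, output='sos')
sos_lp = signal.butter(4, 16_000, btype='lowpass', fs=fs, output='sos')
y = signal.sosfilt(sos_lp, signal.sosfilt(sos_hp, x))
```

After filtering, the 8 kHz component passes nearly unattenuated while the 1 kHz component is suppressed by roughly two orders of magnitude.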

To compare the performance of the SRC with other classification algorithms, we conducted two experiments: One uses the LBP features to extract salient information directly from the whistle spectrograms, and the other uses the Time-Frequency Parameters^{15} (TFPs). The TFP algorithm measures several temporal and spectral values from the isolated fundamental whistle contour, such as maximum and minimum frequency, number of inflection points, etc.^{16} Since the LBP algorithm is relatively new, a brief discussion of it is given next; for a more detailed description, readers are referred to the relevant literature.^{7,17}

An LBP label is a *P*-bit binary number created for each pixel: Each bit is set to 1 or 0 by thresholding one of the *P* neighboring pixels at distance *R* against the value of the center pixel (here *P* = 8, giving an 8-bit label). The feature vector assignment algorithm is illustrated in Fig. 2. The algorithm consists of (a) labeling all the pixels, excluding the borders, using the LBP operator, (b) dividing the image into $m$ small, equal-sized rectangular regions $R_0,R_1,\ldots,R_{m-1}$, (c) obtaining the histogram of the labels for each region, and (d) concatenating all the histograms into one feature vector. The feature vector size is reduced by counting the bit transitions in each circular binary pattern: Patterns with many transitions are merged into a single bin, on the grounds that they carry no significant pattern or texture information, while the remaining "uniform" patterns each retain their own bin.
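As a concrete illustration, here is a minimal Python implementation of the *P* = 8, *R* = 1 labeling and the 59-bin uniform-pattern histogram (this is our sketch of the standard LBP formulation, not the paper's code):

```python
import numpy as np

def lbp_labels(img):
    """Basic P = 8, R = 1 LBP: each interior pixel gets an 8-bit label,
    one bit per neighbour thresholded against the centre pixel."""
    c = img[1:-1, 1:-1]
    # Neighbour offsets, clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    labels = np.zeros(c.shape, dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        labels |= (nb >= c).astype(int) << bit
    return labels

def uniform_hist(labels):
    """59-bin histogram: one bin per 'uniform' label (at most two bit
    transitions around the circular 8-bit code) plus one catch-all bin."""
    def transitions(p):
        bits = [(p >> i) & 1 for i in range(8)]
        return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
    uniform = [p for p in range(256) if transitions(p) <= 2]   # 58 patterns
    index = {p: i for i, p in enumerate(uniform)}
    hist = np.zeros(59, dtype=int)
    for p in labels.ravel():
        hist[index.get(int(p), 58)] += 1                       # bin 58 = rest
    return hist

# Toy 5x5 patch: labels cover the 3x3 interior, histogram has 59 bins
img = np.arange(25).reshape(5, 5)
labels = lbp_labels(img)
hist = uniform_hist(labels)
```

With *P* = 8 there are exactly 58 uniform patterns, which together with the catch-all bin gives the 59-bin histogram used below.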

Our simulation specifics are as follows: The band-limited spectrogram has 615 bins along the frequency axis. The number of temporal bins varies from 21 up to 42 pixels depending on the signal length. To create a dictionary matrix, all spectrograms are cropped along the temporal dimension to a common size of 32 pixels. The 615 × 32 pixel spectrogram is then divided into 60 segments of 41 × 8 pixels (see the top right image of Fig. 2), and each segment is operated on by a 59-bin LBP. A feature vector of length 3540 is created by concatenating the 60 LBP histograms of length 59. The LBP parameters and the segment size were selected empirically after considering the trade-off between feature length and recognition accuracy, and bounds on probable errors. The histograms of the four whistles depicted in Fig. 1 are given in Fig. 3(a). To illustrate our earlier statement that the LBP method does not need contour tracing, we create from the LBP labels what we call the "LBP image" of the four whistles and show them in Fig. 3(b). It is evident by inspection that the LBP images preserve the texture information of the whistles. All of our algorithms were developed in matlab 7.14. The classifier uses the leave-one-out method, and $\epsilon$ in Eq. (4) was chosen heuristically to be 0.45. Classification accuracy is defined as the ratio of correctly classified cases to the total number of cases.
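The dimension bookkeeping above can be checked directly from the stated sizes:

```python
# Dimensions stated in the paper: 615 frequency bins, spectrograms cropped
# to 32 time bins, sub-regions of 41 x 8 pixels, a 59-bin LBP per sub-region
freq_bins, time_bins = 615, 32
seg_f, seg_t = 41, 8

n_segments = (freq_bins // seg_f) * (time_bins // seg_t)   # 15 * 4 regions
feature_length = n_segments * 59                           # concatenated histograms
```

This reproduces the 60 segments and the feature vector length of 3540.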

Table I displays the percentage of correct call classifications achieved by combining LBP features with three different classifiers. With no misclassified whistles, the LBP-SRC combination shows evidence of robust performance. The *K*-nearest neighbor (KNN)^{18} and Support Vector Machine (SVM)^{19} classifiers coupled with the LBP feature vectors achieved overall accuracies of 94% and 98%, respectively. In another test, the TFP algorithm was applied to extract features from the contours of the dolphin whistles. Table II shows that the KNN, SVM, and SRC classifiers then achieved accuracies of 96%, 100%, and 94%, respectively.

Table I. Percent of correct classifications using LBP features.

| | KNN | SVM | SRC |
| --- | --- | --- | --- |
| First class | 92 | 100 | 100 |
| Second class | 100 | 94 | 100 |
| Third class | 100 | 100 | 100 |
| Fourth class | 86 | 100 | 100 |


Table II. Percent of correct classifications using TFP features.

| | KNN | SVM | SRC |
| --- | --- | --- | --- |
| First class | 100 | 100 | 85 |
| Second class | 100 | 100 | 100 |
| Third class | 67 | 100 | 83 |
| Fourth class | 100 | 100 | 100 |


The following conclusions can be reached by analyzing the results of whistle type classification: LBP-SRC outperforms the combination of either KNN or SVM with LBP. Clearly, the texture information encoded in the LBP vectors is an advantage to the SRC algorithm, whereas the contour-based TFP features degrade its performance. We conjecture that the nonlinear nature of the TFP features, derived from the whistle curves, is not suitable for the SRC algorithm, which seeks the best linear combination of the training whistle features. The texture information contained in the LBP feature vectors, on the other hand, may be conjectured to be accumulative and therefore conducive to the performance of the SRC, though any linearity of the LBP representation is at best approximate.

Feature vectors with high dimensions are usually encountered in computer vision applications. For instance, Wright *et al.*^{1} used feature vector dimensions of 32, 56, 120, and 504, and Min and Dugelay^{2} tested feature vector dimensions of 4096, 1024, and 256 for face recognition. To study the effect of reducing the feature vector dimension on the classification accuracy, we first reduced the resolution of the LBP operator: Enlarging each sub-region from 41 × 8 to 77 × 16 pixels reduced the classification accuracy to 98%. Classification accuracy continued to degrade as the length of the feature vector was reduced further.

We also tested the effect of dimension reduction by applying Principal Component Analysis (PCA) to the feature vector set. We kept only the eigenvectors corresponding to the significant eigenvalues of the relevant correlation matrix; the retained eigenvalues accounted for 98% of the total in norm. The classification accuracy dropped to 96%.
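This reduction can be sketched as follows; the function below is a generic implementation that keeps leading eigenvectors up to a given energy fraction (the 98% figure above), not the paper's exact code, and it uses the covariance matrix of the centered features:

```python
import numpy as np

def pca_reduce(X, energy=0.98):
    """Project feature vectors (rows of X) onto the leading eigenvectors
    of their covariance matrix, keeping enough eigenvectors to account
    for the given fraction of the total eigenvalue mass."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / (len(X) - 1)
    w, V = np.linalg.eigh(C)              # eigenvalues in ascending order
    w, V = w[::-1], V[:, ::-1]            # reorder to descending
    k = int(np.searchsorted(np.cumsum(w) / w.sum(), energy)) + 1
    return Xc @ V[:, :k]

# Toy rank-2 feature set: at most two components carry all the energy
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 10))
Y = pca_reduce(X)
```

For rank-deficient data like this toy example, at most two components survive the 98% threshold, illustrating how aggressively PCA can shrink the 3540-dimensional LBP vectors.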

It has to be emphasized that the dimension of the feature vectors is independent of the number of data points; it depends only on the resolution of the LBP operator on the spectrograms. This means that even if the number of data points were increased to many thousands, which could occur in up-call detection, the size of the feature vector would remain the same.

## IV. Conclusion

Through experimental studies, we have explored the efficacy of the SRC for classifying various types of dolphin whistles. On limited data, we have shown that the SRC performance can be improved using LBP feature vectors. We attribute the improvement to the accumulative nature of the textures encoded in the LBP vectors. In contrast, the TFP vectors, which are constructed from curves, are highly nonlinear; moreover, in the process of extracting the TFP vectors, rich texture information is lost. Therefore their use degrades the SRC accuracy. The significant result of this work is that there exists a simple preprocessing procedure (LBP) that prepares dolphin whistles to be classified accurately with the SRC algorithm. A disadvantage of the SRC algorithm is its high feature dimensionality. However, the algorithm is scalable in the sense that the length of the feature vector is independent of the number of data points. One may also apply a data reduction algorithm to reduce the feature dimension, though the accuracy may suffer somewhat. We understand that the data set for testing the proposed algorithms is small, but nevertheless the proof of concept is encouraging. In a future study, we will explore ways to test and refine the LBP-SRC combination on a large data set of vocalizations from a diverse set of marine mammals.

## ACKNOWLEDGMENTS

This research was supported by funds from the Southeast National Marine Renewable Energy Center and by a SEED grant from Florida Atlantic University. We thank Dr. George Frisk and Dr. Edmund Gerstein of Florida Atlantic University for their valuable comments, and Dr. Laela Sayigh and Mary Ann Daher from WHOI for supplying the acoustic data in this research.

## REFERENCES AND LINKS
