This paper reports on an openly available tool for automated acoustic analysis and transcription of primate calls, which takes raw field recordings and outputs call labels time-aligned with the audio. The system predicts the start times of the majority of calls to within 200 ms. The tool requires no manual acoustic analysis or hand selection of spectral features by the researcher.

A central topic in bioacoustics is the description of animal call repertoires: what the calls are and how they are combined and used. Traditional acoustic analysis of calls, however, requires substantial manual work, which means that only a fraction of the data collected in the field is ever analyzed, and most of the otherwise useful recordings never contribute to answering scientific questions (Kobayasi and Riquimaroux, 2012). Recently, techniques from speech processing have been applied to animal vocalizations. The key advance they offer is to bypass the step in which researchers extract preselected acoustic features, such as durations or peak frequencies: standard speech processing tools represent signals using rich, general-purpose spectral representations, with no hand selection of acoustic features. Previous work used such representations to automatically classify isolated calls by call type, species, and caller (Mielke and Zuberbühler, 2013). Our system, in addition to labeling isolated calls, detects and labels calls in raw field recordings. We apply it to three primate species with acoustically diverse calls (see Fig. 1): Blue monkeys (Cercopithecus mitis), Titi monkeys (Callicebus nigrifrons), and Colobus monkeys (Colobus guereza).

Fig. 1. Spectrograms of calls. Left: Blue monkey Hack (top) and Pyow (bottom) calls; center: Titi A (top) and B (bottom); right: Colobus Roar sequence (top) and Snort (bottom).


Recordings of three species were obtained from several field researchers, for a total of 5.58 h of audio. A trained primatologist marked the start and end times of calls (which typically do not overlap) and labeled each Blue monkey call as Hack (also called Ka) or Pyow (Papworth et al., 2008), each Colobus call as Roar or Snort (Marler, 1972), and each Titi call as A or B (Cäsar et al., 2012). Table 1 documents the length of the audio recordings for each data set, the percentage of that time taken up by calls, and the token count for each type of call. Estimated signal-to-noise ratios for these data sets (Vondrášek and Pollák, 2005) were low (between 0.5 and 5.3), typical of field recordings in primatology.

Table 1. Length of recordings, source, % of signal with calls present, counts of labeled calls.

Species  Source (Location)                 Recorder        Microphone          Duration (% calls)  Types (N)
Blue     Murphy (Budongo Reserve, Uganda)  Marantz PMD660  Sennheiser ME66-K6  1:56:45 (0.33%)     Hack (145), Pyow (108)
Blue     Fuller (Kakamega Forest, Kenya)   Marantz PMD660  Sennheiser ME67     0:59:15 (4.31%)     Hack (510), Pyow (364)
Titi     Cäsar (Serra do Caraça, Brazil)   Marantz PMD660  Sennheiser ME66-K6  0:11:58 (3.24%)     A (125), B (539)
Colobus  Schel (Budongo Reserve, Uganda)   Sony TCD D8     Sennheiser ME66-K6  2:27:02 (5.09%)     Roar (739), Snort (141)

Acoustic features were automatically extracted from the audio recordings using a standard speech feature extraction pipeline, adapted minimally. Since the recordings had non-zero mean and varying average amplitudes (due to recording conditions and manual gain adjustment by field scientists), we removed the DC component with a high-pass (notch) filter. We increased the ratio between calls and noise with a five-point temporal median filter (i.e., replacing each sample with the median of a five-sample window in the time domain), followed by a two-dimensional three-point median filter pass in the spectral domain. The first enhances the amplitude of the calls relative to the noise, and the second flattens the spectrum during passages of low transient noise and enhances the contrast with calls. We then estimated a noise signature from the spectral components of the first half second of each audio file (which never included a call) and subtracted this noise signature from the rest of the audio stream. We calculated spectral representations of the signal using short-term Fourier transforms on overlapping windows of 25 ms shifted by 10 ms, and transformed the frequency components through a set of 40 filters evenly spaced on the Mel scale. This filter distribution is common in speech processing and is adopted here unchanged for generality. Finally, each filter output was mean-variance normalized independently.
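
As a rough illustration, this pipeline can be approximated with standard Python audio tooling. The sketch below uses librosa and scipy; the filter implementations and the placement of the spectral subtraction are simplifications of the steps described above, and all parameter names are ours rather than the released tool's.

```python
import numpy as np
import scipy.signal
import librosa

def extract_features(wav_path):
    # Load the recording at its native sampling rate.
    y, sr = librosa.load(wav_path, sr=None)
    y = y - np.mean(y)  # remove the DC component

    # Five-point temporal median filter to raise the call-to-noise ratio.
    y = scipy.signal.medfilt(y, kernel_size=5)

    # Short-term Fourier transform: 25 ms windows shifted by 10 ms.
    n_fft = int(0.025 * sr)
    hop = int(0.010 * sr)
    D = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)).astype(np.float64) ** 2

    # Three-point median filter pass in the spectral domain.
    D = scipy.signal.medfilt2d(D, kernel_size=3)

    # Noise signature from the first half second (assumed call-free),
    # subtracted from the rest of the spectrogram.
    noise = D[:, : int(0.5 * sr / hop)].mean(axis=1, keepdims=True)
    D = np.maximum(D - noise, 1e-10)

    # 40 filters evenly spaced on the Mel scale, with log compression.
    S = np.log(librosa.feature.melspectrogram(S=D, sr=sr, n_mels=40))

    # Mean-variance normalize each filter independently.
    S = (S - S.mean(axis=1, keepdims=True)) / (S.std(axis=1, keepdims=True) + 1e-10)
    return S.T  # frames x 40
```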

In this section, we describe three experiments classifying isolated calls using the generic acoustic features just described. Each call was represented by concatenating the first 50 frames from the call onset (a 40 × 50 = 2000-dimensional vector, corresponding to 515 ms), capturing the full length of 84% of calls. In experiment 1, we assessed the ability to classify call types within each species based on these representations. In experiment 2, we assessed classification of species. In experiment 3, we assessed the six-way labeling of species and call type required when all three species are pooled.
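
A minimal sketch of this fixed-length representation is given below, assuming the frames x 40 matrix produced by the extraction sketch above. The text does not specify how the 16% of calls longer than the window are truncated or how shorter calls are padded; the zero-padding here is an assumption.

```python
import numpy as np

def call_vector(features, onset_frame, n_frames=50):
    """Concatenate the first 50 frames from the call onset into one vector.

    features: (num_frames, 40) array of normalized Mel filter outputs.
    onset_frame: frame index of the annotated call onset.
    """
    segment = features[onset_frame : onset_frame + n_frames]
    if segment.shape[0] < n_frames:
        # Assumption: calls shorter than 50 frames are zero-padded.
        pad = np.zeros((n_frames - segment.shape[0], segment.shape[1]))
        segment = np.vstack([segment, pad])
    return segment.reshape(-1)  # 40 x 50 = 2000-dimensional vector
```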

To predict the calls, we used a sparse radial basis function support vector machine (SVM) trained with block coordinate descent, using a squared hinge loss and L1 regularization. This is a standard statistical approach to classification problems that are not amenable to linear models. Instead of computing the full Gram matrix of the kernel, we employed the Nyström approximation to significantly speed up training (Williams and Seeger, 2001). The approximation computes the eigendecomposition of a small random subset of the Gram matrix and scales the result up to the original number of dimensions (the number of samples). We achieved good results with a 500-component approximation. Experiments 2 and 3 involve more than two classes, so we employed a one-versus-rest strategy (training N individual binary classifiers, where N is the number of classes). Training was on 80% of the data, with evaluation on the remaining, unseen 20%. Three hyperparameters (the weight of the loss term, C, the weight of the penalty term, λ, and the kernel coefficient, γ) were optimized using the sequential model-based algorithm configuration (SMAC) technique (Hutter et al., 2011) by fivefold cross-validation within the training set.
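
A close analogue of this classifier can be sketched in scikit-learn: a 500-component Nystroem map standing in for the RBF kernel, followed by a linear SVM with squared hinge loss and L1 penalty (LinearSVC handles multiclass problems one-versus-rest by default). Random search over log-uniform ranges is shown in place of SMAC, and the single C parameter collapses the paper's separate loss and penalty weights; treat this as a sketch under those assumptions, not the released implementation.

```python
import scipy.stats
from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC
from sklearn.model_selection import RandomizedSearchCV

# RBF kernel approximated by a 500-component Nystroem map, followed by a
# sparse linear SVM (squared hinge loss, L1 regularization).
model = make_pipeline(
    Nystroem(kernel="rbf", n_components=500),
    LinearSVC(penalty="l1", loss="squared_hinge", dual=False),
)

# Hyperparameter search by fivefold cross-validation within the training set
# (the paper uses SMAC; random search is a stand-in here).
search = RandomizedSearchCV(
    model,
    {"nystroem__gamma": scipy.stats.loguniform(1e-4, 1e1),
     "linearsvc__C": scipy.stats.loguniform(1e-2, 1e3)},
    n_iter=50, cv=5,
)
# search.fit(X_train, y_train)   # 80% training split
# search.score(X_test, y_test)   # held-out 20%
```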

Table 2 shows the results of experiments 1–3. We give precision (positive predictive value: among the calls the classifier labels x, the fraction that are actually x and not false positives), recall (sensitivity: among the calls that should be labeled x, the fraction that are labeled x and not false negatives), and F1 [= 2 × precision × recall / (precision + recall)]. Classification was good, with average F1 scores between 0.91 and 0.99. Experiment 1 extends previous findings using different methodology and new species (Mielke and Zuberbühler, 2013). Experiments 1 and 3 were repeated with subsets of increasing size drawn from the full (i.e., 80%) training set. Figure 2 shows the F1 score on the test set as a function of the number of annotated calls given for training.

Table 2. Classification results for experiments 1 (call type, within species), 2 (species only), and 3 (species and call type).

Labels          Precision  Recall  F1 score  Test support

Experiment 1
Blue Hack       0.97       0.99    0.98      131
Blue Pyow       0.99       0.96    0.97       95
Average         0.98       0.98    0.98
Colobus Roar    0.94       1.00    0.97      148
Colobus Snort   1.00       0.68    0.81       28
Average         0.95       0.95    0.94
Titi A          0.89       0.68    0.77       25
Titi B          0.93       0.98    0.95      108
Average         0.92       0.92    0.92

Experiment 2
Blue            0.99       0.98    0.98      226
Titi            0.99       0.98    0.98      176
Colobus         0.98       1.00    0.99      133
Average         0.99       0.99    0.99

Experiment 3
Blue Hack       0.99       0.95    0.97      131
Blue Pyow       0.95       0.95    0.95       95
Colobus Roar    0.86       0.97    0.91      148
Colobus Snort   0.92       0.43    0.59       28
Titi A          0.85       0.68    0.76       25
Titi B          0.88       0.94    0.91      108
Average         0.91       0.91    0.91

Fig. 2. (Color online) Classification performance (y axis) by species, as a function of the number of annotated calls provided in the training set (x axis).


In experiment 4, we trained a call transcription system whose input is raw, unsegmented field recordings. It predicts frame-level call labels using a support vector machine and corrects unlikely label sequences using a conditional random field (CRF).

The SVM was trained on annotated data to predict call labels from individual frames. Input features consisted of MFCC features (13 cepstral coefficients with first and second derivatives) concatenated with activations from a voice activity detection (VAD) system (Lee and Hasegawa-Johnson, 2007). The classifier was trained within species to predict one of the two call types or a third class indicating the absence of a call. The sequence of Platt-calibrated predictions of the SVM was used as input to a linear-chain CRF. The CRF's predictions are also sequences of frame labels, but the CRF takes into account statistical dependencies between adjacent frames and smooths the predictions in the time domain. The hyperparameters of the SVM and the CRF were optimized using SMAC. We evaluated on a 10% held-out test set. The third label (absence of any call) was removed from the output for evaluation.
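
A sketch of this two-stage architecture is given below, using scikit-learn's SVC (whose probability=True option applies Platt scaling) and the third-party sklearn-crfsuite package. The feature dictionaries, function names, and training interface are illustrative assumptions, not the authors' code.

```python
from sklearn.svm import SVC
import sklearn_crfsuite  # third-party linear-chain CRF wrapper

def train_transcriber(X_frames, y_frames, train_recordings, train_labels):
    """X_frames: (n, d) frame features (MFCCs + deltas + VAD activations);
    y_frames: one of the two call types or 'nocall' per frame;
    train_recordings: list of per-recording frame feature arrays;
    train_labels: list of per-recording frame label sequences."""
    # Stage 1: frame-level SVM with Platt-calibrated posteriors.
    svm = SVC(kernel="rbf", probability=True).fit(X_frames, y_frames)

    def posteriors(frames):
        # One feature dict per frame: the calibrated class probabilities.
        return [{f"p_{c}": p for c, p in zip(svm.classes_, row)}
                for row in svm.predict_proba(frames)]

    # Stage 2: a linear-chain CRF smooths the frame labels in time.
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1)
    crf.fit([posteriors(rec) for rec in train_recordings], train_labels)
    return svm, posteriors, crf

# Decoding: crf.predict([posteriors(recording)]) yields smoothed frame labels;
# runs of identical labels are then merged into time-aligned call intervals.
```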

The system outputs call sequences time-aligned with an audio file. We evaluated these transcriptions on the held-out test data. Considering the sequences of calls (not their alignment with the audio), we evaluated using word error rate (WER) and match error rate (MER), as used in speech recognition (Morris et al., 2004). Results are in Table 3. The majority of calls are correctly identified. Most errors are deletions (missed calls) for Blue and Colobus monkeys and insertions (noise identified as calls) for Titis, perhaps because Titi calls are high in frequency and thus resemble the noise.
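
For reference, WER and MER follow directly from the alignment counts reported in Table 3 (Morris et al., 2004); the helper below reproduces the Blue monkey row.

```python
def error_rates(H, S, D, I):
    """Word and match error rates from alignment counts (Morris et al., 2004).

    H: hits, S: substitutions, D: deletions, I: insertions.
    The number of reference calls is N = H + S + D.
    """
    errors = S + D + I
    wer = errors / (H + S + D)   # WER: errors per reference call
    mer = errors / (H + errors)  # MER: fraction of alignment matches in error
    return wer, mer

# Blue monkeys (Table 3): H=213, S=6, D=69, I=26, N=288.
print(error_rates(213, 6, 69, 26))  # -> approximately (0.351, 0.321)
```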

Table 3. Evaluation of the transcriber: word and match error rate (WER, MER), number of hits (H), deletions (D), substitutions (S), insertions (I), and number of calls (N).

Species  WER    MER    H    D   S  I   N
Blue     35.1%  32.1%  213  69  6  26  288
Colobus  34.4%  33.8%  106  47  4   3  157
Titi     32.9%  28.1%   68         14   82

To evaluate how well the predicted calls are time-aligned, we match each call in the gold transcription to the nearest predicted call whose onset and offset are both within a 200 ms tolerance of the real onset and offset; a gold call counts as a true positive only if it has such a match and that match is correctly labeled, and otherwise counts as a false negative. Similarly, for each predicted call, we look for the nearest such match among the calls in the gold transcription and count a false positive if there is no match or if the match is mislabeled. Since it is likely easier to accurately mark the onsets of calls than their offsets, both for our human annotator and for the transcription system, we also compute an alternative scoring in which only call onsets need to be matched within the 200 ms tolerance. For both scorings, we compute precision, recall, and F1, as shown in Table 4. The results show that call onsets are indeed much easier to match to the annotation than offsets, particularly for Colobus monkeys, where performance is relatively poor when offsets are also required to be correctly marked.
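
The onset-only scoring can be sketched as follows. The greedy nearest-match search and all variable names are our simplification of the matching procedure described above, under the assumption that each gold call may be matched at most once.

```python
def onset_scores(gold, predicted, tol=0.2):
    """Precision, recall, F1 for onset-only matching within a 200 ms tolerance.

    gold, predicted: lists of (onset_in_seconds, label) pairs.
    Simplification: greedy one-to-one matching, nearest onset first.
    """
    matched_gold = set()
    true_positives = 0
    for p_onset, p_label in predicted:
        # Nearest unmatched, correctly labeled gold call within the tolerance.
        candidates = [(abs(g_onset - p_onset), i)
                      for i, (g_onset, g_label) in enumerate(gold)
                      if i not in matched_gold
                      and abs(g_onset - p_onset) <= tol
                      and g_label == p_label]
        if candidates:
            matched_gold.add(min(candidates)[1])
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if true_positives else 0.0)
    return precision, recall, f1
```

For the stricter call-detection scoring, the candidate filter would additionally require the predicted offset to lie within the tolerance of the gold offset.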

Table 4. Evaluation of predicted calls versus the nearest gold transcribed call with both its onset and offset (left) or just its onset (right) within 200 ms.

         Call detection (onset + offset)    Onset detection only
Species  Precision  Recall  F1              Precision  Recall  F1
Blue     0.76       0.65    0.70            0.85       0.72    0.78
Colobus  0.46       0.33    0.38            0.74       0.54    0.62
Titi     0.63       0.68    0.66            0.71       0.77    0.74

General-purpose acoustic features and voice activity detection techniques, as used in speech recognition, can automate the labeling of primate calls, both in isolation and in unannotated recordings, using data representative of field recordings. The system needs to be bootstrapped with a set of annotated examples, but we showed that good isolated-call labeling requires fewer than 200 labeled examples. The system accurately transcribes around 90% of the frames in an audio file, vastly reducing the amount of manual work required.

The results also imply that generic acoustic features, rather than specialized acoustic measurements taken manually by the researcher, can be used for detailed analysis. For example, there are competing descriptions of the call repertoires of certain species, and previous analyses have appealed to clustering analyses over hand-selected acoustic features as evidence (Fuller, 2014; Keenan et al., 2013). The results here validate an automated feature extraction process that can serve as the input to such analyses. Both results allow much larger field data sets to be used for research than at present and make it easier to create shared databases among researchers. Our tools can be downloaded at http://github.com/mwv/mcr.

This research received funding from the European Research Council under the EU's Seventh Framework Programme (FP/2007–2013), Grant Nos. 324115-FRONTSEM (Schlenker) and ERC-2011-AdG-295810 BOOTPHON (Dupoux), and from the Agence Nationale pour la Recherche (Grants ANR-10-IDEX-0001-02 PSL* and ANR-10-LABX-0087 IEC).

Cäsar, C., Byrne, R. W., Young, R. J., and Zuberbühler, K. (2012). "The alarm call system of wild black-fronted titi monkeys, Callicebus nigrifrons," Behav. Ecol. Sociobiol. 66, 653–667.

Fuller, J. L. (2014). "The vocal repertoire of adult male blue monkeys (Cercopithecus mitis stulmanni): A quantitative analysis of acoustic structure," Am. J. Primatol. 76, 203–216.

Hutter, F., Hoos, H. H., and Leyton-Brown, K. (2011). "Sequential model-based optimization for general algorithm configuration," in Proceedings of LION-5, pp. 507–523.

Keenan, S., Lemasson, A., and Zuberbühler, K. (2013). "Graded or discrete? A quantitative analysis of Campbell's monkey alarm calls," Anim. Behav. 85(1), 109–118.

Kobayasi, K. I., and Riquimaroux, H. (2012). "Classification of vocalizations in the Mongolian gerbil, Meriones unguiculatus," J. Acoust. Soc. Am. 131, 1622–1631.

Lee, B., and Hasegawa-Johnson, M. (2007). "Minimum mean-squared error a posteriori estimation of high variance vehicular noise," in Proc. Biennial on DSP for In-Vehicle and Mobile Systems, Istanbul, Turkey, June 2007.

Marler, P. (1972). "Vocalizations of East African monkeys II," Behaviour 42, 175–197.

Mielke, A., and Zuberbühler, K. (2013). "A method for automated individual, species and call type recognition in free-ranging animals," Anim. Behav. 86(2), 475–482.

Morris, A. C., Maier, V., and Green, P. (2004). "From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition," in INTERSPEECH 2004, pp. 2765–2768.

Papworth, S., Böse, A. S., Barker, J., Schel, A. M., and Zuberbühler, K. (2008). "Male blue monkeys alarm call in response to danger experienced by others," Biol. Lett. 4(5), 472–475.

Vondrášek, M., and Pollák, P. (2005). "Methods for speech SNR estimation: Evaluation tool and analysis of VAD dependency," Radioengineering 14(1), 6–11.

Williams, C., and Seeger, M. (2001). "Using the Nyström method to speed up kernel machines," in Proceedings of the 14th Annual Conference on Neural Information Processing Systems, EPFL-CONF-161322, pp. 682–688.