Source separation is an important step in studying signals that are difficult or impossible to record individually. Common methods such as deep clustering, however, cannot be applied to signals from an unknown number of sources and/or signals that overlap in time and/or frequency, a common problem in bioacoustic recordings. This work presents an approach, using a supervised learning framework, to parse individual sources from a spectrogram of a mixture that contains a variable number of overlapping sources. The method isolates individual sources in the time-frequency domain using only one function applied in two separate steps: a detection step that estimates the number of sources and their bounding boxes, and a segmentation step that extracts masks of individual sounds. This approach handles the full separation of sources that overlap in both time and frequency using deep neural networks, in a manner applicable to other tasks such as bird audio detection. This paper presents the method and reports on its performance in parsing individual bat signals from recordings containing hundreds of overlapping bat echolocation signals. The method can be extended to other bioacoustic recordings with a variable number of sources and signals that overlap in time and/or frequency.
I. INTRODUCTION
The separation of sources in acoustic mixtures is an inevitable step in the study of many environments when recordings of isolated signals of interest are not available.1–3 This problem has been studied in the context of many different applications, and proposed solutions typically depend on the relative information contained in sources4,5 and mixtures.6,7 Recovering individual sources in a complex and noisy environment (sometimes referred to as the “cocktail party problem”) has been an active area of research for decades and is still in progress.8–12 Popular approaches in speech separation, singing-voice separation, and signal denoising rest on underlying assumptions that make them inapplicable to mixtures of other sounds, such as bioacoustic recordings. For example, it is often assumed that the number of sources is known and fixed and/or that sources are independent.2,13,14 When the number of sources exceeds the number of recording channels (an underdetermined system), sources are similar, the number of sources is unknown, or sources overlap heavily in time and/or frequency, as in many bioacoustic recordings, common methods such as independent component analysis (ICA),15 nonnegative matrix factorization (NMF),16 or deep clustering (DC)17 are not applicable. Solutions based on NMF aim to represent the additive linear interpolation of a known number of sources through a basis (or template) matrix. ICA methods are developed for the determined system and rely on the assumption of the independence of sources. The DC methods train a neural network to generate discriminative embeddings for time-frequency bins where target sources are complementary masked spectrograms of the mixture. Mapping-based methods, unlike masking-based ones, look for complementary masks in a continuous range, but these approaches are designed for dereverberation and denoising, not separation of multiple sources.
The specific problem of detecting sound events may vary based on the amount of information in the input and the level of details in the output, but typical approaches are not designed to estimate the number of sources or to discriminate overlapped similar sources. Furthermore, masking-based methods with complementary outputs cannot fully represent overlapping sources because each time-frequency bin is assigned to only one source. Estimating the number of sources is a front-end processing step while estimating the separation function is usually defined as a back-end learning process, with the design of the separation function relying on a fixed number of sources.
This paper describes an approach to separate the bioacoustic signals of an unknown number of overlapping sources. Specifically, the mixtures of interest are recordings of many echolocation signals collected within swarms of bats, and the target representations of sources are single-component connected binary regions in the time-frequency domain. This technical problem is motivated by our study of the behavior of individual bats in a swarm. Some bat species form large groups in a cave and emerge from the cave in a dense swarm each evening.18 The primary sensory modality of these bats is echolocation, in which individuals make frequency-modulated ultrasonic signals.19–21 When flying together in small groups, bats may adjust their echolocation to avoid interference by switching between active and passive mode, adjusting the time-frequency characteristics of signals, or spending substantial time in silence.22–24 It is currently unknown, however, how bats in extremely dense swarms, such as during cave emergence, avoid the problem of mutual interference or sonar jamming.24–29 Pilot data indicate individual bats may produce signals with discrete time-frequency structure,30 but this has not been verified from recordings of dense swarms due to the challenge of overlapping echolocation signals. Two spectrograms of these overlapping signals are depicted in Fig. 1, where Fig. 1(a) is a mixture simulated by adding ten single-call recordings and Fig. 1(b) is a real recording within a swarm of approximately 30 echolocating bats. Understanding how individual bats can adapt and use their sonar signaling techniques to navigate in complex environments is the primary incentive of this work. To gain this understanding, complex acoustic mixtures need to be represented in terms of individual sources.31–34 In this work, individual sources are defined to be single-harmonic bat calls. One can extract bat-wise signals by sequentially associating predicted calls to a specific bat. However, this post-processing step, known as data association, is outside the scope of this work.
FIG. 1. (Color online) Spectrograms of two mixtures of bat echolocation calls. (a) The spectrogram of an artificial mixture of ten single-call recordings. (b) The spectrogram of a recording within a swarm of approximately 30 echolocating bats.
The approach presented in this paper is built on ideas that have been used to address the segmentation problem found in computer vision systems. Instance segmentation is the task of assigning every pixel into a category and discriminating between individual object instances that are spatially large and sparse. A common approach to instance segmentation is to split the task into two separate problems: First, detecting all areas that potentially contain an object, known as region proposals; then, second, running a mask generator on each proposal for final segmentation. In segmentation by detection methods, once an object is detected, segmentation can be relatively easy using characteristics of the object. Most of the instance segmentation methods such as mask regional-convolutional neural networks (Mask R-CNN)35 either show poor performance on overlapping objects due to the local non-maximum suppression or stack multiple computationally expensive neural networks.
In this paper, a deep neural network (DNN) with the DenseUNet architecture is proposed to model the separation function, and the network is applied in two steps. First, a detection step detects signal measurements that are large enough to allow them to be segmented from the remaining signal content. Then, with the observation reduced to just a few target candidates, the problem becomes one of mapping the most probable candidates to the representation of individual sources. The model operates on the time-frequency representation of mixtures. The outputs of the detection and segmentation steps are bounding boxes and binary masks of individual sources, respectively. Time-domain predicted sources can be obtained by an inverse time-frequency transformation. The performance of the proposed method is evaluated in the time-frequency domain through common metrics including the F1-score and false-negative rate (FNR).
II. METHOD
Let $x$ denote the mixture of $S$ sources, where $x \in \mathbb{R}^{T}$ and $T$ is the length of the recording. The source separation problem is to estimate the separation function $g$ that minimizes a distance between the ground-truth sources and the predictions, where $\hat{y}_s$ and $\hat{S}$ are the predicted sources and their quantity, respectively.
This section first reviews different assumptions on the mixing process. The second part then explains the importance of signal representation and provides a transformation option. The model section proposes a DNN which models the separation function g and presents the process of estimating parameters—training—and an algorithm for preparing ground-truth sources used in the training.
A. Mixing system
Source separation is considered as an inverse problem, and the first assumption underlying most of the proposed solutions is the mixing function that represents the environment and process of mixing sources. The separation function g is one of the (possibly many) inverse functions of the mixing function and should reflect the properties of the mixing environment.
A linear, time-invariant system that adds up sources instantaneously does not fully express the properties of non-stationary environments, in which mixing variables are time-dependent. Furthermore, in reverberant environments, reflections of sources persist after they are produced, and the mixing process should be modeled as convolutive to account for these reflections. However, a sufficiently large environment is usually approximated with a linear system. In this paper, the mixing function is assumed to be $x = \sum_{s=1}^{S} \alpha_s y_s + \epsilon$, where $y_s$ is the $s$th source and $\alpha_s$ and $\epsilon$ capture the relative power of sources and the environment noise (including non-source signals), respectively.
B. Signal representation
Source separation is often performed in the time-frequency domain. Signal transformations such as Fourier analysis are not only used as feature extractors that improve the separation performance but also serve as a visualization scheme that provides insight into the analysis of the signal. An appropriate signal transformation is reversible and projects time-domain samples to an interpretable space that highlights the contrast between different sources and similarities between identical sources. Depending on the application, some transformations have shown to be more appropriate. Time-frequency representations like the short-time Fourier transform (STFT) are among the most common transformations used in source separation.
Let $\mathcal{T}: \mathbb{R}^{T} \to \mathbb{R}^{F \times N}$ be a two-dimensional (2D) transformation that projects $T$ time-domain samples into $N$ features of dimension $F$. Parameters of the transformation, including sampling frequency, window size, and hop size, play an important role in the resolution of the time-frequency space ($F \times N$), such that increasing $F$ reduces $N$ and vice versa. Due to this inherent trade-off of short-time analysis, various joint distributions have been proposed to control the interplay between time and frequency by adding more parameters to the transformation.36 Furthermore, useful information for separating sources is primarily embodied in the signal energy and not in its phase.37 Therefore, most separation methods rely on energy-based representations known as spectrograms. In some cases, applying other transformations to spectrograms, such as logarithmic and Mel scales and the recently proposed per-channel energy normalization (PCEN),38 can help suppress non-source components. The log and PCEN compressions reduce the dynamic range of the Mel-bank energy, which scales down a large range of low-level signals such as silence and small noise (non-source). Since a log function is loudness dependent and devotes a large range to weak signals, PCEN has been shown to be more effective in suppressing non-stationary noise.
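As an illustration of the compression discussed above, the following is a minimal NumPy sketch of PCEN in its commonly published form: a first-order smoother per channel followed by adaptive gain control and root compression. The function name `pcen` and the default parameter values are illustrative assumptions, not the settings used in this paper.

```python
import numpy as np

def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a non-negative (F, N) array.

    All default values are illustrative assumptions.
    """
    energy = np.asarray(energy, dtype=float)
    # First-order IIR smoother along the time axis, one filter per channel.
    smooth = np.empty_like(energy)
    smooth[:, 0] = energy[:, 0]
    for n in range(1, energy.shape[1]):
        smooth[:, n] = (1.0 - s) * smooth[:, n - 1] + s * energy[:, n]
    # Adaptive gain control followed by root compression.
    return (energy / (eps + smooth) ** alpha + delta) ** r - delta ** r
```

Because the gain is normalized by a smoothed version of each channel, two stationary channels whose energies differ by a large factor end up with nearly the same output level, which is the loudness independence that the log function lacks.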
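Using the spectrogram of a mixture $X$ and sources $Y_s$ for $s \in \{1, \dots, S\}$, the separation function $g$ extracts sources $\hat{Y}_{\hat{s}}$ for $\hat{s} \in \{1, \dots, \hat{S}\}$, and the source separation problem is to estimate a function $g$ such that the distance

$$D = \frac{1}{S} \sum_{s=1}^{S} \min_{\hat{s}} \left\| Y_s - \hat{Y}_{\hat{s}} \right\| + \frac{1}{\hat{S}} \sum_{\hat{s}=1}^{\hat{S}} \min_{s} \left\| Y_s - \hat{Y}_{\hat{s}} \right\| \tag{1}$$

is minimized, where the $S$ ground-truth and $\hat{S}$ predicted sources are denoted by $Y_s$ and $\hat{Y}_{\hat{s}}$, respectively. For an $F \times N$-dimensional $A$, the norm is the average over the norms of the elements of $A$, i.e., $\|A\| = \frac{1}{FN} \sum_{f, n} |A(f, n)|$. In a matrix of distances $\|Y_s - \hat{Y}_{\hat{s}}\|$ for $s \in \{1, \dots, S\}$ and $\hat{s} \in \{1, \dots, \hat{S}\}$, the first term of Eq. (1) averages over the minimums of rows, while the second term averages over the minimums of columns. Considering unique predictions, Eq. (1) is minimized if the predictions are a permutation of the $S$ ground-truth sources. A missing prediction increases the first term of Eq. (1), while the second term quantifies extra predictions.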
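The row-minimum and column-minimum structure of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the element-wise norm is the mean absolute value; the function name `separation_distance` is hypothetical.

```python
import numpy as np

def separation_distance(truth, pred):
    """Distance between S ground-truth and S_hat predicted (F, N) arrays.

    Assumption: the norm of a matrix is its mean absolute element.
    """
    truth = np.asarray(truth, dtype=float)  # shape (S, F, N)
    pred = np.asarray(pred, dtype=float)    # shape (S_hat, F, N)
    # pairwise[i, j] = mean |truth[i] - pred[j]| over all bins.
    pairwise = np.abs(truth[:, None] - pred[None]).mean(axis=(2, 3))
    missing = pairwise.min(axis=1).mean()  # row minimums: unmatched truths
    extra = pairwise.min(axis=0).mean()    # column minimums: unmatched predictions
    return missing + extra
```

With this form, any permutation of the ground-truth sources yields a distance of zero, dropping a prediction raises only the first term, and adding a spurious prediction raises only the second.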
Instead of directly looking for the source representations within an $F \times N$-dimensional real-valued space, it can be useful to reduce the range of the outputs to the binary-valued $F \times N$-dimensional space.37 The time-frequency masking approach was proposed to identify sources with the largest amplitude in each individual time-frequency bin. The ideal binary mask (IBM),39 inspired by the auditory masking phenomenon, is defined as $M_s = \mathbb{1}_{\omega}(Y_s)$, where $\mathbb{1}_{\omega}(\cdot)$ is the element-wise indicator function with an amplitude threshold of $\omega$. Let $M_s$ be the mask of the $s$th source for $s \in \{1, \dots, S\}$. Having a mask $M_s$ and the phase of the mixture, one can reconstruct the time-domain source $s$ as $\hat{y}_s = \mathcal{T}^{-1}(M_s \odot X)$, where $\mathcal{T}^{-1}$ is the inverse time-frequency transformation applied with the mixture phase and $\odot$ denotes the element-wise multiplication.
C. Model
Conventional separating functions are dependent on the number of sources (or an estimation of the number of sources when it is unknown) and are typically composed of source-independent and source-dependent parts. The source-independent part projects the mixture to a latent space in which sources are easily separable. This part aims to maximize the distance between samples from different sources and the similarity between identical sources. Finally, for each source, a source-dependent function projects the latent vector onto the corresponding hyperplane of the source. In a separation function modeled with a DNN, the last layer with S-channels operates as the source-dependent part. However, when the number of sources is unknown and variable, the last layer needs to operate adaptively.
To model a separation function that handles a variable number of overlapping sources, three intermediate variables are defined to capture the location of sources and their bounding boxes in the spectrogram. The first variable, $C$, marks the centers of sources and consequently indicates the number of sources. The other two variables, $H$ and $W$, capture the extent of sources, i.e., their heights and widths in the spectrogram. The separation function therefore operates in two steps, detection and segmentation. During the detection step, the centers of all sources and their corresponding bounding boxes are predicted. The segmentation step extracts the mask of each detected source from its corresponding resized version. For a mixture of $S$ sources, the separation function ideally operates in $S + 1$ steps: the $S$ centers and bounding boxes are detected in the first step, and the $S$ masks $M_s$ for $s \in \{1, \dots, S\}$ are predicted in the next $S$ steps of segmentation. Examples of the inputs and outputs of the separation function during training and testing are depicted in Fig. 2. In the detection step, Fig. 2(a), the resized version is not available and the first three output variables are of interest. Only the resized version is involved in the segmentation step, Fig. 2(b), which extracts the corresponding mask. The first and last outputs are binary variables, while the other two have positive values indicating the heights and widths of all sources. During training, Fig. 2(c), the function is forced to map the mixture $X$ and a resized version $Z_s$ to the four aforementioned variables $C$, $H$, $W$, and $M_s$. Figure 3 shows the results of the proposed method for the two steps of detection and segmentation. The input, a mixture of an unknown number of bat echolocation calls, is depicted in Fig. 3(a).
FIG. 2. (Color online) Schematics of the inputs and outputs of the separation function during (a) the detection, (b) the segmentation, and (c) the training.
FIG. 3. (Color online) A mixture of an unknown number of echolocation calls (a) and the outputs of the proposed method after the detection (b) [applying the function g as in Fig. 2(a) only once] and segmentation (c) [applying the function g as in Fig. 2(b) once per detected source] steps. The extracted binary masks are shown in different colors.
The detection step (equivalent to the source-independent part) manages the variable number of sources. Any estimator of the number of sources captures an underlying structure in which sources are different. When sources are characterized by their position (temporal and spectral) and extent, one can estimate the number of sources by detecting the position of sources in the spectrogram. Therefore, a mapping from the mixture $X$ to the space $[0, 1]^{F \times N}$ can be interpreted as the probability of the presence of a source center in each time-frequency bin, i.e., $C(f, n) = 1$ indicates the presence of a source centered at $(f, n)$ in the spectrogram. Accordingly, this variable, $C$, not only provides the number of sources, since $S = \sum_{f, n} C(f, n)$, but also specifies the position of sources in the spectrogram. When the function is modeled with a neural network, to ensure the binary range of $C$, the last layer is followed by a mapping function to $[0, 1]$, such as the sigmoid function, and an indicator function.
Since knowing the positions of the centers of sources is not enough to uniquely identify sources, the extent of sources along the time and frequency axes needs to be predicted. To record the extent of sources, two variables, $H$ and $W$, are assigned to describe the height and width of sources. For the source centered at the time-frequency bin $(f, n)$, i.e., $C(f, n) = 1$, having frequency range $h_s$ and duration $w_s$, the two time-frequency bins $H(f, n)$ and $W(f, n)$ are assigned accordingly, i.e., $H(f, n) = h_s$ and $W(f, n) = w_s$. Two mappings from the mixture $X$ to $H \in \mathbb{R}_{+}^{F \times N}$ and $W \in \mathbb{R}_{+}^{F \times N}$ model the extent of sources, where $\mathbb{R}_{+}$ denotes the set of non-negative real numbers. The three variables $C$, $H$, and $W$ are equivalent to $S$ bounding boxes on the spectrogram.
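The equivalence between the three maps and a list of bounding boxes can be made concrete with a short sketch. The helper names `encode_boxes` and `decode_boxes` and the 0.5 threshold are hypothetical choices for illustration; boxes are given as (center frequency bin, center time bin, height, width).

```python
import numpy as np

def encode_boxes(boxes, shape):
    """Write (f, n, height, width) boxes into the three maps C, H, and W."""
    C = np.zeros(shape)
    H = np.zeros(shape)
    W = np.zeros(shape)
    for f, n, h, w in boxes:
        C[f, n] = 1.0  # a source is centered at bin (f, n)
        H[f, n] = h    # frequency extent stored at the center bin
        W[f, n] = w    # time extent stored at the center bin
    return C, H, W

def decode_boxes(C, H, W, threshold=0.5):
    """Recover the boxes from the maps; threshold is an assumed value."""
    centers = np.argwhere(C > threshold)
    return [(int(f), int(n), float(H[f, n]), float(W[f, n])) for f, n in centers]
```

Summing `C` gives the number of sources, and decoding the maps returns the boxes in row-major order of their centers.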
Having the position and extent of sources only specifies rectangular areas of the spectrogram, not their detailed bin-level masks; see Fig. 3(b). Thus, another step is needed to fully segment sources in the time-frequency domain. The segmentation step, which extracts masks of individual sources from a mixture, is conditioned on the results of the detection step, i.e., the center, height $h_s$, and width $w_s$ of each source for $s \in \{1, \dots, S\}$, and should be independent of the position and extent of sources. The mapping from $X$ to $M_s$ given the bounding box of the $s$th source is equivalent to the mapping from $Z_s$ to $M_s$, where $Z_s$ is the $s$th resized version of the mixture. One can obtain the $s$th resized version of the mixture $X$ by first cropping the area of $X$ that contains the source $Y_s$ and then resizing the rectangular spectrogram of size $h_s \times w_s$ to $F \times N$. The $s$th rectangular region of $X$ is the smallest rectangular window outside which the source $Y_s$ is zero. Without this resizing operation, the segmentation part would need to learn the already available position and extent of the source unnecessarily. This operation takes advantage of having similar sources that differ only in position and size in the spectrogram. The result of the segmentation step is shown in Fig. 3(c). When a neural network is used, a post-processing step is needed to reverse the resizing process: first, the output of the last layer is resized from $F \times N$ to $h_s \times w_s$; then, the spectrogram of size $h_s \times w_s$ is padded with zeros so that it is centered at the location of the $s$th source. Also, since the desired output $M_s$ is binary, the last layer of the network is followed by a mapping function to $[0, 1]$.
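The crop-and-resize operation and its reversal can be sketched as follows. This is a minimal illustration under two assumptions that are not taken from the paper: boxes are given by their top-left corner and size rather than their center, and resizing uses nearest-neighbor interpolation. All function names are hypothetical.

```python
import numpy as np

def resize_nearest(a, out_shape):
    """Nearest-neighbor resize of a 2D array (assumed interpolation)."""
    rows = (np.arange(out_shape[0]) * a.shape[0]) // out_shape[0]
    cols = (np.arange(out_shape[1]) * a.shape[1]) // out_shape[1]
    return a[np.ix_(rows, cols)]

def crop_and_resize(X, box):
    """Crop the box (f0, n0, h, w) from X and stretch it to X's full size."""
    f0, n0, h, w = box
    return resize_nearest(X[f0:f0 + h, n0:n0 + w], X.shape)

def paste_back(mask_resized, box, shape):
    """Shrink a full-size mask back to the box and zero-pad it in place."""
    f0, n0, h, w = box
    out = np.zeros(shape, dtype=mask_resized.dtype)
    out[f0:f0 + h, n0:n0 + w] = resize_nearest(mask_resized, (h, w))
    return out
```

When the box dimensions divide the spectrogram dimensions evenly, `paste_back` applied to the output of `crop_and_resize` restores the cropped region exactly and leaves zeros everywhere else.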
There are two approaches to implementing the detection and segmentation steps in a supervised fashion. One can search for two separate functions using the ground-truth outputs $(C, H, W)$ and $M_s$, and then use the detection function to predict the position and extent of sources and run the segmentation function $\hat{S}$ times to reconstruct individual sources. The approach in this paper is to find a single function that projects a mixture $X$ to masks $M_s$ for $s \in \{1, \dots, \hat{S}\}$, operating in $\hat{S} + 1$ steps: one step of detection, $X \mapsto (C, H, W)$, and $\hat{S}$ steps of segmentation, $Z_s \mapsto M_s$. This function is presented as $g: (X, Z_s) \mapsto (C, H, W, M_s)$, where $Z_s$ is the resized version of the $s$th source in $X$ and $M_s$ is its corresponding mask. The function $g$ operates in $\hat{S} + 1$ steps: $g(X, X)$ in the first step, and $g(X, Z_s)$ in the next $\hat{S}$ steps. In the following, a recipe for training this function is provided first, and then its usage at evaluation and test time is explained. Finally, a DNN is suggested for implementing this function.
D. Train
Training, or estimating the parameters of the function in a supervised manner, is an optimization process in which a loss function between predictions and ground-truth values is minimized. Given the input $(X, Z_s)$, the ground-truth labels are $(C, H, W, M_s)$; equivalently, a mixture $X$ has the four labels $C$, $H$, $W$, and $\{M_s\}_{s=1}^{S}$, indicating the bounding boxes (the first three) and the masks of all sources, respectively. It is worth noting that, given the first three labels, it is trivial to find the $S$ resized versions of $X$ as $Z_s$ for $s \in \{1, \dots, S\}$. Also, the input and output of the separation function have 2 and 4 channels, respectively. During training, as shown in Fig. 2(c), the function takes $(X, Z_s)$ for a random $s$ from $\{1, \dots, S\}$ as input and outputs a prediction $\hat{Y} = (\hat{C}, \hat{H}, \hat{W}, \hat{M}_s)$. The parameters of $g$ are estimated by minimizing the loss function

$$L(Y, \hat{Y}) = \lambda_1 \, \mathrm{BCE}(C, \hat{C}) + \lambda_2 \left[ \mathrm{MSE}(H, \hat{H}) + \mathrm{MSE}(W, \hat{W}) \right] + \lambda_3 \, \mathrm{BCE}(M_s, \hat{M}_s), \tag{2}$$

which measures the distance between the prediction $\hat{Y}$ and the ground-truth $Y = (C, H, W, M_s)$, where the parameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ are regularization coefficients. The first and last terms of Eq. (2) are binary cross-entropies (BCE) between the ground-truth binary variables and their predictions, whereas the other term contains the mean squared errors (MSE) between the ground-truth heights and widths of bounding boxes and their predictions. Since $H$ and $W$ are positive and real-valued, the mean squared error is used to measure the model performance. However, cross-entropy has been shown to be a more effective distance on categorical predictions like binary values.
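The three-term structure of Eq. (2) can be written out as a small NumPy sketch. It assumes that the height and width errors share the middle coefficient and that predictions for the binary variables are probabilities in [0, 1]; the function names are hypothetical, and the default coefficients are taken from one row of the regularization table purely for illustration.

```python
import numpy as np

def bce(target, prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped for stability."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))

def mse(target, pred):
    """Mean squared error over all time-frequency bins."""
    return float(np.mean((target - pred) ** 2))

def separation_loss(truth, pred, lambdas=(0.2, 0.2, 0.6)):
    """Weighted loss over (C, H, W, M) tuples of equally shaped arrays."""
    C, H, W, M = truth
    C_hat, H_hat, W_hat, M_hat = pred
    l1, l2, l3 = lambdas
    return (l1 * bce(C, C_hat)
            + l2 * (mse(H, H_hat) + mse(W, W_hat))
            + l3 * bce(M, M_hat))
```

A perfect prediction drives all three terms to (numerically) zero, while an error in any one map raises only the term weighted by its coefficient, which is what allows the coefficients to trade detection accuracy against mask accuracy.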
E. Test
During evaluation or test time, the separation function operates $\hat{S} + 1$ times in the two steps of detection and segmentation. Since no resized version of the mixture is available during the detection step, the mixture is accompanied by itself, i.e., $(X, X)$ is the initial input to the network. The output of the network during the detection step is denoted by $(\hat{C}, \hat{H}, \hat{W}, \hat{M})$, where the last dimension of the output is not useful. To prepare resized versions for the segmentation step, post-processing operations are applied to this output. Since $\hat{C} \in [0, 1]^{F \times N}$, a thresholding operation is applied to the first dimension of the output, i.e., bins of $\hat{C}$ above the detection threshold are set to one and the rest to zero. This operation chooses bins with a high chance of being centers of sources and excludes the remaining bins of $\hat{C}$, $\hat{H}$, and $\hat{W}$. The output of the detection step can therefore be represented as $\hat{S}$ bounding boxes. From the first three dimensions of the output, one can form $\hat{S}$ inputs of the form $(X, \hat{Z}_s)$ and obtain outputs $\hat{M}_s$ for $s \in \{1, \dots, \hat{S}\}$. As with the prediction of source centers, the final output is obtained by thresholding, $\hat{M}_s > \psi$, where $\psi$ is the segmentation threshold. Since threshold-based predictions are cluttered with non-source segments, a final post-processing step fills the holes of the predicted binary mask and then extracts the largest segment as the final mask.
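The mask clean-up described above (threshold, fill holes, keep the largest segment) can be sketched with NumPy and a breadth-first search. The 4-connectivity, the default threshold of 0.5, and all function names are assumptions made for this illustration.

```python
from collections import deque
import numpy as np

def label_components(mask):
    """Label 4-connected True regions; returns (labels, count)."""
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue
        count += 1
        labels[start] = count
        queue = deque([start])
        while queue:
            f, n = queue.popleft()
            for df, dn in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                g, m = f + df, n + dn
                inside = 0 <= g < mask.shape[0] and 0 <= m < mask.shape[1]
                if inside and mask[g, m] and not labels[g, m]:
                    labels[g, m] = count
                    queue.append((g, m))
    return labels, count

def fill_holes(mask):
    """Fill background regions that are not connected to the border."""
    background, _ = label_components(~mask)
    border = np.concatenate(
        [background[0], background[-1], background[:, 0], background[:, -1]])
    outside = np.isin(background, border[border > 0])
    return mask | (~mask & ~outside)

def clean_mask(prob, psi=0.5):
    """Threshold at psi (assumed value), fill holes, keep the largest segment."""
    mask = fill_holes(np.asarray(prob) > psi)
    labels, count = label_components(mask)
    if count == 0:
        return mask
    sizes = np.bincount(labels.ravel())[1:]
    return labels == (np.argmax(sizes) + 1)
```

Filling holes before selecting the largest segment means that a call whose interior was partly missed is still counted at its full size when segments are compared.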
F. Architecture
An auto-encoder convolutional neural network (CNN) from the UNet family of architectures is used to model the separation function. The UNet architecture, initially proposed for segmentation in medical images, has been shown to be successful in the encoding and decoding of spatial features. The encoding part represents images in a low-dimensional space, known as the latent space, and the decoding part reconstructs the outputs of interest using features in the latent space.40 In this paper, a neural network with the DenseUNet architecture41 (a slightly modified UNet) is used to model the separation function that detects and segments individual sources in the mixture. The DenseUNet is an auto-encoder neural network whose paths alternate downsample/upsample units, in the encoding/decoding parts respectively, with blocks of bottleneck layers. Downsample layers consist of a convolution followed by an average pooling, and upsample layers are transposed convolutions. Bottleneck layers in both the encoding and decoding paths include two convolution operations, and the output of each layer is dropped out with a small probability. A batch normalization followed by a rectified linear unit (ReLU) is applied to the inputs of both convolutions and transposed convolutions. The parameters of this function (weights of the convolutions, transposed convolutions, and batch normalizations) are estimated through the training process.
G. Data
In order to minimize the loss function of Eq. (2), input mixtures $X$ and their corresponding ground-truth labels are required. Regarding the input mixture $x$, a common approach in the development of source separation algorithms is to mix $S$ individual sources and estimate a function that predicts binary masks of individual sources based on their dominance in the spectrogram $X$, the transformation of $x$. To make a mixture $x$ of $S$ sources, one can mix $S$ individual sources $y_s$ using

$$x = \sum_{s=1}^{S} \alpha_s y_s, \tag{3}$$

where the gains $\alpha_s$ are set by the chosen signal-to-noise ratios (SNRs) of the sources. Therefore, a mixture $X$ is labeled with $(C, H, W, \{M_s\}_{s=1}^{S})$, where the mask of $Y_s$ is denoted by $M_s$.
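One way to build such artificial mixtures is sketched below. The SNR convention is an assumption for illustration only: each source is scaled so that its power relative to a reference noise recording equals the requested SNR in decibels. The function name `mix_at_snrs` and the presence of an explicit noise signal are hypothetical.

```python
import numpy as np

def mix_at_snrs(sources, snrs_db, noise):
    """Scale each source to a target SNR against `noise` and sum everything.

    Returns the mixture and the gain applied to each source.
    """
    noise = np.asarray(noise, dtype=float)
    noise_power = np.mean(noise ** 2)
    mixture = noise.copy()
    gains = []
    for y, snr_db in zip(sources, snrs_db):
        y = np.asarray(y, dtype=float)
        target_power = noise_power * 10.0 ** (snr_db / 10.0)
        alpha = np.sqrt(target_power / np.mean(y ** 2))
        gains.append(alpha)
        mixture += alpha * y
    return mixture, np.array(gains)
```

Because the gains are returned alongside the mixture, the scaled sources can be transformed individually afterwards to derive their ground-truth masks.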
For the ground-truth labels $M$, since manually labeling bioacoustic signals in the time-frequency domain is expensive, a simple and fast labeling algorithm is needed. Given a recording $x_1$ assumed to contain non-overlapping calls and its transformation $X_1$, the single-channel binary mask $M$ is extracted using the iterative thresholding of Algorithm 1 for a maximum of $U$ iterations, where $\omega$ is the amplitude threshold and the algorithm counts the connected regions in the mask $M$.
III. EXPERIMENTS
Experiments of the proposed method are conducted on real and artificial datasets, and the performance is evaluated in terms of common metrics used in binary decision problems.
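For reference, the two metrics reported most often in the tables, the F1 score and the false-negative rate, can be computed from binary decisions as in the following sketch. The function name `f1_and_fnr` is hypothetical, and returning zero when a denominator is empty is an assumed convention.

```python
import numpy as np

def f1_and_fnr(truth, pred):
    """F1 score and false-negative rate for binary decisions of any shape."""
    truth = np.asarray(truth).astype(bool).ravel()
    pred = np.asarray(pred).astype(bool).ravel()
    tp = int(np.sum(truth & pred))    # hits
    fp = int(np.sum(~truth & pred))   # false alarms
    fn = int(np.sum(truth & ~pred))   # misses
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return f1, fnr
```

The same function applies at the bin level (flattened masks) and at the source level (one decision per source), which mirrors the two granularities reported in the tables.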
A. Datasets
The artificial dataset is made by mixing single-source recordings that are extracted, using Algorithm 1, from recordings of bat echolocation calls sampled at 256 kHz. The time-frequency representation is the PCEN38 spectral magnitude of a Mel-scaled spectrogram with F = 128 bands between 20 and 60 kHz. Spectrograms are obtained using a 1024-point short-time Fourier transform (STFT) with window and hop sizes of 256 and 32 samples, respectively. By framing time-domain recordings in segments of length T = 32 ms, the inputs to Algorithm 1 are N = 256 features of dimension F = 128, with U = 5. From three field recordings with a total duration of one hour, around 2700 bat echolocation calls are extracted and divided into five folds of sources used in making development and test sets of mixtures. For each set of individual sources, up to a maximum number of sources is chosen and mixed at randomly selected SNRs (see Table I). The number of mixtures made out of each set of sources is proportional to the number of sources, and a total of 100–5000 mixtures are generated for each fold (see Table II). Figure 4 depicts a sample of the artificial dataset, i.e., a mixture of ten echolocation calls and its corresponding label, ten masks.
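The relation between the stated framing parameters can be checked with a few lines of arithmetic. The variable names are illustrative, and the exact frame count of a given STFT implementation depends on its padding convention, which is not specified here.

```python
sample_rate_hz = 256_000  # sampling rate of the recordings
frame_ms = 32             # segment length T
n_fft = 1024              # STFT size
window = 256              # window size in samples
hop = 32                  # hop size in samples

samples_per_frame = sample_rate_hz * frame_ms // 1000  # samples in one segment
features = samples_per_frame // hop                    # hops per segment
bin_width_hz = sample_rate_hz / n_fft                  # linear frequency resolution
window_ms = 1000 * window / sample_rate_hz             # window duration
hop_ms = 1000 * hop / sample_rate_hz                   # time step between features
```

A 32 ms segment therefore holds 8192 samples, and dividing by the 32-sample hop gives the 256 features stated as N. Each feature advances by 0.125 ms with a 1 ms analysis window, and the linear STFT bins are 250 Hz wide before the Mel mapping.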
TABLE I. Impacts of different maximum numbers of sources in mixtures of the development (SD) and test (ST) sets. The results are reported in terms of the average FNR, EER (when appropriate), and F1 score over five test folds at fixed detection and segmentation thresholds. For each step of detection and segmentation, source-level and bin-level metrics are reported, with the FNR of primary interest. The number of bottleneck layers in the encoding/decoding path and the regularization coefficients are held fixed.
SD | ST | Detection source-level F1 | Detection source-level FNR | Detection source-level EER | Detection bin-level F1 | Detection bin-level FNR | Segmentation source-level F1 | Segmentation source-level FNR | Segmentation source-level EER | Segmentation bin-level F1 | Segmentation bin-level FNR |
---|---|---|---|---|---|---|---|---|---|---|---|
5 | 5 | 0.63 | 0.14 | 0.64 | 0.82 | 0.22 | 0.52 | 0.29 | 1.17 | 0.73 | 0.27 |
5 | 10 | 0.62 | 0.34 | 0.78 | 0.81 | 0.22 | 0.47 | 0.50 | 1.21 | 0.73 | 0.27 |
10 | 5 | 0.74 | 0.32 | 0.51 | 0.90 | 0.10 | 0.45 | 0.54 | 1.04 | 0.74 | 0.25 |
10 | 10 | 0.72 | 0.33 | 0.58 | 0.89 | 0.08 | 0.45 | 0.59 | 1.07 | 0.79 | 0.22 |
TABLE II. Impacts of different sizes of the training dataset.
# mixtures | Detection source-level F1 | Detection source-level FNR | Detection source-level EER | Detection bin-level F1 | Detection bin-level FNR | Segmentation source-level F1 | Segmentation source-level FNR | Segmentation source-level EER | Segmentation bin-level F1 | Segmentation bin-level FNR |
---|---|---|---|---|---|---|---|---|---|---|
500 | 0.50 | 0.77 | 0.79 | 0.83 | 0.14 | 0.26 | 0.81 | 1.00 | 0.73 | 0.26 |
1000 | 0.54 | 0.56 | 0.69 | 0.82 | 0.12 | 0.33 | 0.73 | 1.18 | 0.72 | 0.25 |
2000 | 0.72 | 0.33 | 0.58 | 0.89 | 0.08 | 0.45 | 0.59 | 1.07 | 0.79 | 0.22 |
5000 | 0.76 | 0.31 | 0.53 | 0.94 | 0.06 | 0.46 | 0.50 | 0.93 | 0.81 | 0.18 |
FIG. 4. (Color online) A sample of the artificial dataset. A mixture of multiple single-call recordings and their corresponding masks.
The real dataset contains 13 recordings (with a total duration of around 4 s) made within swarms of up to 300 echolocating bats. Recordings are framed into overlapping segments so that some sources appear in more than one segment. Using a DenseUNet trained on mixtures of at most ten sources (with fixed bottleneck-layer counts and regularization coefficients), more than 900 sources are extracted. Figure 3(a) shows a spectrogram of a 32 ms frame from this dataset, Fig. 3(b) depicts the outputs of the detection step, and the extracted masks are presented in Fig. 3(c).
B. Model
The DenseUNet has four dense blocks of bottleneck layers in both the encoding and decoding paths (see Table III). Each downsample layer is composed of a 3 × 3 convolution with a stride of 1 and a 2 × 2 average pooling with a stride of 2, while upsample layers have 2 × 2 transposed convolutions with a stride of 2. In bottleneck layers, 1 × 1 and 3 × 3 convolutions with strides of 1 are followed by a dropout with p = 0.1. The number of channels in the DenseUNet repeatedly increases and decreases with growth and reduction rates of 16 and 0.5, respectively. To estimate the parameters of this network, the development set is split into validation and training sets with a proportion of 1/4, and the Adam algorithm42 is used to minimize the loss function of Eq. (2) over the training set, with a learning rate that decays linearly from $10^{-4}$ to $10^{-5}$ (see Table IV). The total number of epochs is set to 100, with an early stopping rule based on no change in the value of the loss function on the validation set for ten epochs.
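The training schedule described above can be sketched in plain Python. Two details are assumptions made for illustration: the learning rate is updated once per epoch, and "no change" in the validation loss is interpreted as no improvement over the best value seen so far. Both function names are hypothetical.

```python
def linear_lr(epoch, total_epochs=100, start=1e-4, end=1e-5):
    """Learning rate at a given epoch under a linear decay from start to end."""
    fraction = epoch / max(total_epochs - 1, 1)
    return start + fraction * (end - start)

def early_stop_epoch(val_losses, patience=10, min_delta=0.0):
    """Return the epoch at which training stops, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None
```

For example, a validation loss that improves for two epochs and then stays flat triggers the stop ten epochs after the last improvement.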
TABLE III. Impact of the number of bottleneck layers in each dense block (blocks 1 to 4). Det. = detection; Seg. = segmentation; source = source-level; bin = bin-level.
Block 1 | Block 2 | Block 3 | Block 4 | Det. source F1 | Det. source FNR | Det. source EER | Det. bin F1 | Det. bin FNR | Seg. source F1 | Seg. source FNR | Seg. source EER | Seg. bin F1 | Seg. bin FNR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2 | 2 | 2 | 0.61 | 0.30 | 0.72 | 0.66 | 0.34 | 0.38 | 0.62 | 1.13 | 0.66 | 0.35 |
2 | 4 | 6 | 8 | 0.72 | 0.33 | 0.58 | 0.89 | 0.08 | 0.45 | 0.59 | 1.07 | 0.79 | 0.22 |
8 | 6 | 4 | 2 | 0.65 | 0.28 | 0.65 | 0.75 | 0.30 | 0.42 | 0.59 | 1.09 | 0.72 | 0.32 |
2 | 5 | 5 | 8 | 0.71 | 0.37 | 0.64 | 0.87 | 0.15 | 0.42 | 0.60 | 1.10 | 0.74 | 0.26 |
TABLE IV. Impact of different regularization coefficients. Det. = detection; Seg. = segmentation; source = source-level; bin = bin-level.
λ1 | λ2 | λ3 | Det. source F1 | Det. source FNR | Det. source EER | Det. bin F1 | Det. bin FNR | Seg. source F1 | Seg. source FNR | Seg. source EER | Seg. bin F1 | Seg. bin FNR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.2 | 0.2 | 0.6 | 0.72 | 0.33 | 0.58 | 0.89 | 0.08 | 0.45 | 0.59 | 1.07 | 0.79 | 0.22 |
0.3 | 0.1 | 0.6 | 0.64 | 0.35 | 0.68 | 0.72 | 0.21 | 0.37 | 0.65 | 1.12 | 0.67 | 0.39 |
0.1 | 0.3 | 0.6 | 0.66 | 0.38 | 0.67 | 0.75 | 0.13 | 0.41 | 0.68 | 1.16 | 0.69 | 0.33 |
0.1 | 0.1 | 0.8 | 0.71 | 0.36 | 0.59 | 0.89 | 0.07 | 0.47 | 0.59 | 1.07 | 0.79 | 0.22 |
C. Evaluation
The performance of the proposed method is evaluated in the time-frequency domain. Without any extra post-processing decision steps, all predicted binary masks and their corresponding bounding boxes are compared with the ground-truth ones. Since the proposed model operates in two separate steps and the output of the first step affects the final prediction, the results are evaluated for both the detection and segmentation steps. For each step, the performance is measured at two levels: sources and time-frequency bins. A source-level metric compares the detected or segmented sources with the ground-truth ones and counts a prediction as correct if its intersection over union (IOU) with the ground truth is larger than 0.7. Within correctly predicted sources, the bin-level metric similarly compares the time-frequency bins of the spectrogram in the prediction and the ground truth. Source-level and bin-level metrics are reported separately for each step. The results of different experiments on the artificial dataset are reported in terms of the average false negative rate (FNR), event error rate (EER) (when appropriate), and F1 score (F1) over five test folds for detection and segmentation thresholds of and , respectively. All threshold values are chosen based on the FNR. Because the thresholding of Algorithm 1 is not accurate enough to capture all sources in a recording, the F1 score reflects the real performance of the proposed method less faithfully than the FNR does. In addition, the F1 score takes false positive predictions into account, which are not the primary concern, whereas the FNR represents the ratio of missed sources and time-frequency bins.
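The source-level scoring described above can be illustrated with a short sketch. It is not the authors' evaluation code: the helper names `mask_iou` and `source_level_scores` and the greedy one-to-one matching are assumptions made here, and only the IOU criterion (larger than 0.7) and the FNR and F1 definitions follow the text. The EER is omitted because its definition is not given in this section.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection over union of two binary time-frequency masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


def source_level_scores(pred_masks, true_masks, iou_threshold=0.7):
    """Greedily match predictions to ground-truth sources; return (F1, FNR).

    A ground-truth source counts as found if an unmatched prediction has
    IOU strictly larger than iou_threshold with it.
    """
    unmatched = list(range(len(pred_masks)))
    tp = 0
    for truth in true_masks:
        best_j, best_iou = None, iou_threshold
        for j in unmatched:
            iou = mask_iou(pred_masks[j], truth)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            unmatched.remove(best_j)
            tp += 1
    fn = len(true_masks) - tp  # missed sources
    fp = len(pred_masks) - tp  # predictions without a matching source
    fnr = fn / len(true_masks) if len(true_masks) > 0 else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0
    return f1, fnr


# Toy 8 x 8 spectrogram grid with two ground-truth sources.
true_a = np.zeros((8, 8), dtype=bool); true_a[0:4, 0:4] = True
true_b = np.zeros((8, 8), dtype=bool); true_b[4:8, 4:8] = True
pred_a = true_a.copy()                                      # IOU = 1.0, matched
pred_b = np.zeros((8, 8), dtype=bool); pred_b[4:8, 6:8] = True  # IOU = 0.5, missed
f1, fnr = source_level_scores([pred_a, pred_b], [true_a, true_b])
```

In this toy case one of two sources is recovered, so the FNR is 0.5; the unmatched prediction also counts as a false positive, which lowers F1 but leaves the FNR unchanged, mirroring the reason given above for preferring the FNR.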
The results of the experiment exploring the complexity of the training dataset are shown in Table I. Two datasets composed of mixtures of at most five and at most ten sources are used for training, and each is evaluated against test sets of mixtures of at most five and at most ten sources. As Table I shows, testing mixtures of at most ten sources on a function trained on mixtures of at most five sources results in the worst FNR. Training on mixtures of more sources is therefore necessary for detecting highly overlapped sources, because the neural network learns to recognize the source of interest in the presence of other sources if the cropped version is more complicated. However, since frame size and duration of sources and overlap of sources are inversely related, training on mixtures of more than 20 sources does not provide a better model.
Table II shows the performance for various sizes of the training dataset. The larger the training dataset, the faster the convergence rate. However, since all mixtures are generated similarly, the model reaches its maximum capacity after 2000 mixtures.
To examine the size of the DNN, which is mainly defined by the number of encoding layers (set to 4) and the number of bottlenecks in each layer, training is performed with different combinations of bottleneck layers. Increasing layers consistently improves the FNR up to a total of around 20 layers and shows the best performance (Table III). Other orders like have decreased the FNR substantially. Furthermore, minor changes in the architecture such as do not yield considerable enhancement.
Since, during training, the first three variables of C, H, and W are repeated in proportion to the number of sources, weighting λ3 should result in better performance. Table IV shows the effects of the various regularization coefficients. FNR decreases consistently when λ1 and λ2 values are larger than λ3. However, although training with regularization coefficients of results in better FNRs at the source level, the last variable of the output overfits to the average mask, so the shape of sources is not preserved well.
IV. CONCLUSION
This work proposes a novel source separation algorithm for extracting an unknown number of overlapped bat echolocation calls. The suggested method operates in two steps, detection and segmentation, where the first finds bounding boxes potentially containing a source and the second extracts the final masks of sources. Experiments on real mixtures of bat echolocation calls demonstrate that extracted segments maintain essential features of the original shape. Different components of the proposed method, such as the DNN architecture, loss function, and size of the training dataset, were found through a numerical search over a wide range of parameters. This method performs best on signals with high SNR and low-to-moderate levels of overlap. The primary motivation for this algorithm was the separation of bat echolocation calls, but this method can easily be applied and/or extended to other bioacoustic recordings with overlapping calls. We developed this method for the separation of sources without harmonics, but an association function could be applied to merge harmonics of identical sources at the end. Future directions can include modifications to the approach to separate more complex overlapping bioacoustic signals, such as those from dolphins, birds, or frogs vocalizing in large groups.
ACKNOWLEDGMENTS
This project was funded by an Office of Naval Research Young Investigator Award (N00014-16-1-2478) and ONR N00014-18-1-2522, both awarded to L.N.K. The authors would like to thank Nicole Blaha, Morgan Kinniry, Kathryn McGowan, Allison Pudlo, and Lilias Zusi for assistance with data collection and extraction.