Source separation is an important step in the study of signals that are difficult or impossible to record individually. Common methods such as deep clustering, however, cannot be applied to mixtures with an unknown number of sources or with signals that overlap in time and/or frequency, a common situation in bioacoustic recordings. This work presents a supervised learning approach that parses individual sources from the spectrogram of a mixture containing a variable number of overlapping sources. The method isolates individual sources in the time-frequency domain using a single function operating in two separate steps: a detection step that estimates the number of sources and their corresponding bounding boxes, and a segmentation step that extracts the masks of individual sounds. The approach handles the full separation of sources overlapping in both time and frequency using deep neural networks and is applicable to related tasks such as bird audio detection. This paper presents the method and reports its performance in parsing individual bat signals from recordings containing hundreds of overlapping bat echolocation signals. The method can be extended to other bioacoustic recordings with a variable number of sources and signals that overlap in time and/or frequency.

## I. INTRODUCTION

The separation of sources in acoustic mixtures is an unavoidable step in the study of many environments when recordings of the isolated signals of interest are not available.^{1–3} This problem has been studied in the context of many different applications, and proposed solutions typically depend on the relative information contained in the sources^{4,5} and mixtures.^{6,7} Recovering individual sources in a complex and noisy environment (sometimes referred to as the "cocktail party problem") has been an active area of research for decades and remains so.^{8–12} Popular approaches in speech separation, singing-voice separation, and signal denoising suffer from underlying assumptions that make them inapplicable to mixtures of other sounds, such as bioacoustic recordings. For example, it is often assumed that the number of sources is known and fixed and/or that the sources are independent.^{2,13,14} When the number of sources exceeds the number of recording channels (an underdetermined system), sources are similar, the number of sources is unknown, or sources are highly overlapped in time and/or frequency, as in many bioacoustic recordings, common methods such as independent component analysis (ICA),^{15} nonnegative matrix factorization (NMF),^{16} and deep clustering (DC)^{17} are not applicable. Solutions based on NMF aim to represent the additive linear interpolation of a known number of sources through a basis (or template) matrix. ICA methods are developed for determined systems and rely on the assumption of independent sources. DC methods train a neural network to generate discriminative embeddings for time-frequency bins, where the target sources are complementary masked spectrograms of the mixture. Mapping-based methods, unlike masking-based ones, look for complementary masks in a continuous range, but these approaches are designed for dereverberation and denoising, not for the separation of multiple sources.
The specific problem of detecting sound events may vary in the amount of information in the input and the level of detail in the output, but typical approaches are not designed to estimate the number of sources or to discriminate between overlapping similar sources. Furthermore, masking-based methods with complementary outputs cannot fully represent overlapping sources because each time-frequency bin is assigned to only one source. Estimating the number of sources is a front-end processing step, while estimating the separation function is usually defined as a back-end learning process, with the design of the separation function relying on a fixed number of sources.

This paper describes an approach to separate the bioacoustic signals of an unknown number of overlapping sources. Specifically, the mixtures of interest are recordings of many echolocation signals collected within swarms of bats, and the target representations of sources are single-component connected binary regions in the time-frequency domain. This technical problem is motivated by our study of the behavior of individual bats in a swarm. Some bat species form large groups in a cave and emerge from the cave in a dense swarm each evening.^{18} The primary sensory modality of these bats is echolocation, in which individuals emit frequency-modulated ultrasonic signals.^{19–21} When flying together in small groups, bats may adjust their echolocation to avoid interference by switching between active and passive mode, adjusting the time-frequency characteristics of their signals, or spending substantial time in silence.^{22–24} It is currently unknown, however, how bats in extremely dense swarms, such as during cave emergence, avoid the problem of mutual interference or sonar jamming.^{24–29} Pilot data indicate that individual bats may produce signals with discrete time-frequency structure,^{30} but this has not been verified from recordings of dense swarms due to the challenge of overlapping echolocation signals. Two spectrograms of these overlapping signals are depicted in Fig. 1: Fig. 1(a) is a mixture simulated by adding ten single-call recordings, and Fig. 1(b) is a real recording within a swarm of approximately 30 echolocating bats. Understanding how individual bats adapt and use their sonar signaling to navigate in complex environments is the primary incentive of this work. To gain this understanding, complex acoustic mixtures need to be represented in terms of individual sources.^{31–34} In this work, individual sources are defined to be single-harmonic bat calls. One can extract bat-wise signals by sequentially associating predicted calls with a specific bat.
Performing this post-processing step, the so-called data association, is, however, beyond the scope of this work.

The approach presented in this paper builds on ideas used to address the segmentation problem in computer vision. Instance segmentation is the task of assigning every pixel to a category while discriminating between individual object instances that are spatially large and sparse. A common approach to instance segmentation is to split the task into two separate problems: first, detecting all areas that potentially contain an object, known as region proposals; second, running a mask generator on each proposal for the final segmentation. In segmentation-by-detection methods, once an object is detected, segmentation can be relatively easy using characteristics of the object. Most instance segmentation methods, such as mask regional-convolutional neural networks (Mask R-CNN),^{35} either show poor performance on overlapping objects due to local non-maximum suppression or stack multiple computationally expensive neural networks.

In this paper, a deep neural network (DNN) with a DenseUNet architecture is proposed to model the separation function, where the network operates in two steps. First, a detection step identifies signal measurements that are large enough to be segmented from the remaining signal content. Then, by reducing the observation to just a few target candidates, the problem reduces to mapping the most probable candidates to the representations of individual sources. The model operates on the time-frequency representation of mixtures. The outputs of the detection and segmentation steps are bounding boxes and binary mappings of individual sources, respectively. Time-domain predicted sources can be obtained by an inverse time-frequency transformation. The performance of the proposed method is evaluated in the time-frequency domain through common metrics including the *F*_{1}-score and the false-negative rate (FNR).

## II. METHOD

Let $x \in \mathbb{R}^{T}$ denote the mixture of $S$ sources $y = [y_s]_{s=1}^{S} \in \mathbb{R}^{S \times T}$, where $y_s \in \mathbb{R}^{T}$ and $T$ is the length of the recording. The source separation problem is to estimate the separation function $\hat{g}$ that minimizes a distance $D(y, \hat{y})$, where $\hat{y} = \hat{g}(x) \in \mathbb{R}^{\hat{S} \times T}$ and $\hat{S}$ denote the predicted sources and their quantity, respectively.

This section first reviews different assumptions on the mixing process. The second part then explains the importance of signal representation and provides a transformation option. The model section proposes a DNN that models the separation function *g*, presents the process of estimating its parameters (training), and gives an algorithm for preparing the ground-truth sources used in training.

### A. Mixing system

Source separation is an inverse problem, and the first assumption underlying most proposed solutions concerns the mixing function $\mathbb{R}^{S \times T} \to \mathbb{R}^{T}$ that represents the environment and the process of mixing sources. The separation function *g* is one of the (possibly many) inverse functions of the mixing function and should reflect the properties of the mixing environment.

A linear, time-invariant system that adds up sources instantaneously does not fully express the properties of non-stationary environments in which the mixing variables are time-dependent. Furthermore, in reverberant environments, reflections of sources persist after they are produced, and the mixing process should be convolutive to capture these reflections. However, a sufficiently large environment is usually approximated with a linear system. In this paper, the mixing function is assumed to be $x = Ay + b$, where $A \in \mathbb{R}^{S}$ and $b \in \mathbb{R}^{T}$ capture the relative power of the sources and the environment noise (including non-source signals), respectively.

### B. Signal representation

Source separation is often performed in the time-frequency domain. Signal transformations such as Fourier analysis are not only used as feature extractors that improve separation performance but also serve as a visualization scheme that provides insight into the analysis of the signal. An appropriate signal transformation is reversible and projects time-domain samples to an interpretable space that highlights the contrast between different sources and the similarities between identical sources. Depending on the application, some transformations have been shown to be more appropriate than others. Time-frequency representations such as the short-time Fourier transform (STFT) are among the most common transformations used in source separation.

Let $\mathcal{T}: \mathbb{R}^{T} \to \mathbb{R}^{F \times N}$ be a two-dimensional (2D) transformation that projects $T$ time-domain samples into $N$ features of dimension $F$. Parameters of the transformation, including the sampling frequency, window size, and hop size, play an important role in the resolution of the time-frequency space ($F \times N$), such that increasing $F$ reduces $N$ and vice versa. Due to this inherent trade-off of short-time analysis, various joint distributions have been proposed to control the interplay between time and frequency by adding more parameters to the transformation $\mathcal{T}$.^{36} Furthermore, the information useful for separating sources is primarily embodied in the signal energy, not in its phase.^{37} Therefore, most separation methods rely on energy-based representations known as spectrograms. In some cases, applying further transformations to spectrograms, such as logarithmic and Mel scales or the recently proposed per-channel energy normalization (PCEN),^{38} can help suppress non-source components. Log and PCEN compression reduce the dynamic range of the Mel-band energy, scaling down a large range of low-level signals such as silence and weak noise (non-source). Since the log function is loudness dependent and devotes a large range to weak signals, PCEN has been shown to be more effective in suppressing non-stationary noise.
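The PCEN compression mentioned above can be sketched directly from its published recursion: a first-order IIR filter smooths each band over time, each bin is divided by its smoothed energy (automatic gain control), and the result is compressed with an offset root. This is a minimal NumPy sketch; the parameter values (`s`, `alpha`, `delta`, `r`) are illustrative defaults from the PCEN literature, not the settings used in this paper.

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a (F, N) Mel-band energy array.

    M smooths each frequency band over time with a first-order IIR filter;
    each bin is then normalized by the smoothed energy and compressed
    with an offset root nonlinearity.
    """
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]
    for n in range(1, E.shape[1]):
        M[:, n] = (1 - s) * M[:, n - 1] + s * E[:, n]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

# Example on random non-negative "energies" of the paper's F x N size.
E = np.abs(np.random.default_rng(0).normal(size=(128, 256))) ** 2
P = pcen(E)
```

Because the inner expression is bounded below by `delta`, the output is non-negative, which matches its use as a spectrogram-like feature.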

Using the spectrogram of a mixture, $X = \mathcal{T}(x)$, and of the sources, $Y_s = \mathcal{T}(y_s)$ for $s \in \{1, \dots, S\}$, the separation function $\hat{g}$ extracts $\hat{S}$ sources $[\hat{Y}_s]_{s=1}^{\hat{S}} = \hat{g}(X)$, and the source separation problem is to estimate a function $\hat{g}$ such that the distance

$$D(Y, \hat{Y}) = \frac{1}{S} \sum_{s=1}^{S} \min_{\hat{s}} \|Y_s - \hat{Y}_{\hat{s}}\|_2^2 + \frac{1}{\hat{S}} \sum_{\hat{s}=1}^{\hat{S}} \min_{s} \|Y_s - \hat{Y}_{\hat{s}}\|_2^2 \tag{1}$$

is minimized, where the $S$ ground-truth and $\hat{S}$ predicted sources are denoted by $Y_s$ and $\hat{Y}_{\hat{s}}$, respectively. For an $F \times N$-dimensional $A$, the norm $\|A\|_2^2$ is the average over the norms of the elements of $A$, i.e., $\|A\| = \sum_i^F \sum_j^N \|a_{ij}\| / FN$. In the $S \times \hat{S}$ matrix of distances $\|Y_s - \hat{Y}_{\hat{s}}\|_2^2$ for $s = 1, \dots, S$ and $\hat{s} = 1, \dots, \hat{S}$, the first term averages over the minimums of the rows, while the second term averages over the minimums of the columns. For unique predictions, Eq. (1) is minimized if the predictions are a permutation of the $S$ ground-truth sources. A missing prediction increases the first term of Eq. (1), while the second term quantifies extra predictions.

Instead of directly looking for the source representations within the $F \times N$-dimensional real-valued space, it can be useful to reduce the range of the outputs to the binary-valued $F \times N$-dimensional space.^{37} The time-frequency masking approach was proposed to identify the source with the largest amplitude in each individual time-frequency bin. The ideal binary mask (IBM),^{39} inspired by the auditory masking phenomenon, is defined as $M = \mathbb{1}_{\omega}(Y)$, where $\mathbb{1}_{\omega}(\cdot)$ is the element-wise indicator function with an amplitude threshold of $\omega$. Let $M_s$ be the mask of the $s$th source for $s \in \{1, \dots, S\}$. Given a mask $M_s$ and the phase of the mixture, one can reconstruct the time-domain source as $y_s = \mathcal{T}^{-1}(M_s \odot X)$, where $\odot$ denotes element-wise multiplication.
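The two-sided distance of Eq. (1), averaging row minima and column minima of the pairwise error matrix, can be computed directly; a minimal NumPy sketch (the array shapes are assumptions for illustration):

```python
import numpy as np

def separation_distance(Y, Y_hat):
    """Eq. (1): mean of the row-minima plus mean of the column-minima
    of the S x S_hat matrix of average squared errors.

    Y     : (S, F, N) ground-truth source spectrograms.
    Y_hat : (S_hat, F, N) predicted source spectrograms.
    """
    # D[s, s_hat] = average squared error between Y[s] and Y_hat[s_hat]
    D = ((Y[:, None] - Y_hat[None]) ** 2).mean(axis=(2, 3))
    return D.min(axis=1).mean() + D.min(axis=0).mean()

rng = np.random.default_rng(1)
Y = rng.random((3, 8, 8))
# A permutation of the ground truth gives zero distance.
assert separation_distance(Y, Y[::-1]) == 0.0
# A missing prediction leaves the second term at zero but
# increases the first (missing-source) term.
assert separation_distance(Y, Y[:2]) > 0.0
```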

### C. Model

Conventional separating functions depend on the number of sources (or an estimate of it when unknown) and are typically composed of source-independent and source-dependent parts. The source-independent part projects the mixture to a latent space in which the sources are easily separable. This part aims to maximize the distance between samples from different sources and the similarity between identical sources. Finally, for each source, a source-dependent function projects the latent vector onto the corresponding hyperplane of the source. In a separation function modeled with a DNN, the last layer, with $S$ channels, operates as the source-dependent part. However, when the number of sources is unknown and variable, the last layer needs to operate adaptively.

To model a separation function that handles a variable number of overlapping sources, three intermediate variables are defined to capture the location of sources and their bounding boxes in the spectrogram. The first variable, *C*, determines the centers of sources, which consequently indicates the number of sources. The other two variables, *H* and *W*, are responsible for the extent of sources, i.e., the heights and widths of sources in the spectrogram. Therefore, the separation function operates in two steps: detection and segmentation. During the detection step, the centers of all sources and their corresponding bounding boxes are predicted. The segmentation step extracts the mask of each source detected in the first step from its corresponding resized version. For a mixture of *S* sources, the separation function ideally operates in $S + 1$ steps, where the $S$ centers and bounding boxes of sources are detected in the first step and the $S$ masks $M_s$ for $s \in \{1, \dots, S\}$ are predicted in the next $S$ steps of segmentation. An example of inputs and outputs of the separation function during training and testing is depicted in Fig. 2. In the detection step, Fig. 2(a), the resized version is not available and the first three output variables are of interest. Only the resized version is involved in the segmentation step, Fig. 2(b), which extracts the corresponding mask. The first and last outputs are binary variables, while the other two have positive values indicating the heights and widths of all sources. During training, Fig. 2(c), the function is forced to map the mixture $X$ and a resized version $Z_s$ to the four aforementioned variables $C$, $H$, $W$, and $M_s$. Figure 3 shows the results of the proposed method for the two steps of detection and segmentation. The input, a mixture of an unknown number of bat echolocation calls, is depicted in Fig. 3(a).

The detection step (equivalent to the source-independent part) manages the variable number of sources. Any estimator of the number of sources captures an underlying structure in which sources differ. When sources are characterized by their position (temporal and spectral) and extent, one can estimate the number of sources by detecting the positions of sources in the spectrogram. Therefore, a mapping from the mixture $X$ to the $C \in [0, 1]^{F \times N}$ space can be interpreted as the probability of the presence of a source in each time-frequency bin, i.e., $C(f, n) = 1$ indicates the presence of a source centered at $(f, n)$ in the spectrogram. Accordingly, the variable $C$ not only provides information about the number of sources, since $\sum_{f,n} C(f, n) = S$, but also specifies the positions of sources in the spectrogram. When the function is modeled with a neural network, to ensure the binary range of $C$, the last layer is followed by a mapping function to $[0, 1]$, such as the sigmoid function, and an indicator function.

Since knowing the positions of the centers of sources is not enough to uniquely identify them, the extent of sources along the time and frequency axes also needs to be predicted. To record the extent of sources, two variables, $H$ and $W \in \mathbb{N}^{F \times N}$, are assigned to describe the heights and widths of sources. For the source $s \in \{1, \dots, S\}$ centered at the time-frequency bin $(f, n)$, i.e., $C(f, n) = 1$, with frequency range $h_s$ and duration $w_s$, the two time-frequency bins $H(f, n)$ and $W(f, n)$ are assigned accordingly, i.e., $H(f, n) = h_s$ and $W(f, n) = w_s$. Two mappings from the mixture $X$ to $H$ and $W \in \mathbb{R}_{+}^{F \times N}$ model the extent of sources, where $\mathbb{R}_{+}$ denotes the set of non-negative real numbers. The three variables $C$, $H$, and $W$ are equivalent to $S$ bounding boxes on the spectrogram.
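Reading the bounding boxes off ideal $C$, $H$, and $W$ maps is then a matter of looking up $H$ and $W$ at each active center; a small sketch (the corner-based box convention and edge clipping are assumptions, not specified in the paper):

```python
import numpy as np

def boxes_from_maps(C, H, W):
    """Return one (f0, n0, h, w) box per detected center C(f, n) = 1.

    Each box is centered at its (f, n) bin with height H(f, n) and
    width W(f, n), clipped to the spectrogram edges.
    """
    F, N = C.shape
    boxes = []
    for f, n in np.argwhere(C == 1):
        h, w = H[f, n], W[f, n]
        f0 = max(0, f - h // 2)
        n0 = max(0, n - w // 2)
        boxes.append((f0, n0, min(h, F - f0), min(w, N - n0)))
    return boxes

# Two active centers yield two boxes.
C = np.zeros((16, 16), dtype=int); H = np.zeros_like(C); W = np.zeros_like(C)
C[4, 5] = 1; H[4, 5] = 6; W[4, 5] = 4
C[10, 12] = 1; H[10, 12] = 4; W[10, 12] = 8
assert len(boxes_from_maps(C, H, W)) == int(C.sum())
```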

Having the position and extent of sources only specifies rectangular areas of the spectrogram, not their detailed bin-level masks; see Fig. 3(b). Thus, another step is needed to fully segment sources in the time-frequency domain. The segmentation step, which extracts masks of individual sources from a mixture, is conditioned on the results of the detection step, i.e., $X; C, H, W, s \to M_s$ for $s \in \{1, \dots, \hat{S}\}$, and should be independent of the position and extent of sources. It is trivial to show that the mapping from $X$ to $M_s$ given $C, H, W, s$ is equivalent to the mapping from $Z_s$ to $M_s$, where $Z_s$ is the $s$th resized version of the mixture. One can obtain the $s$th resized version of the mixture $X$ by first cropping the area of $X$ that contains the source $Y_s$ and then resizing the rectangular spectrogram of size $h_s \times w_s$ to $F \times N$. The $s$th rectangular region of $X$ is the smallest rectangular window in which $M_s(f, n)$ is not zero, i.e., $\forall f, n: \arg\min_{f,n} M_s \le f, n \le \arg\max_{f,n} M_s$. Without this resizing operation, the segmentation part would need to learn the already available position and extent of the source unnecessarily. This operation takes advantage of having similar sources that differ only in position and size in the spectrogram. The result of the segmentation step is shown in Fig. 3(c). In the case of using a neural network, a post-processing step is needed to reverse the resizing process: first, the output of the last layer is resized from $F \times N$ to $h_s \times w_s$; then, the spectrogram of size $h_s \times w_s$ is padded with zeros so that it is centered at the location of the $s$th source. Also, since the desired output $M_s$ is binary, the last layer of the network is followed by a mapping function to $[0, 1]$.
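The crop-and-resize that produces a resized version of the mixture, and its inverse that places a predicted full-size mask back into the spectrogram, can be sketched with nearest-neighbor resizing (the interpolation scheme is an assumption; the paper does not specify one):

```python
import numpy as np

def resize_nn(A, shape):
    """Nearest-neighbor resize of a 2D array to the given (F, N) shape."""
    F, N = shape
    fi = (np.arange(F) * A.shape[0] / F).astype(int)
    ni = (np.arange(N) * A.shape[1] / N).astype(int)
    return A[np.ix_(fi, ni)]

def crop_resize(X, box, shape):
    """Cut the (f0, n0, h, w) box out of X and stretch it to `shape`."""
    f0, n0, h, w = box
    return resize_nn(X[f0:f0 + h, n0:n0 + w], shape)

def uncrop(mask, box, shape):
    """Inverse: shrink a full-size mask back to the box and zero-pad
    it into a spectrogram of the given shape."""
    f0, n0, h, w = box
    out = np.zeros(shape)
    out[f0:f0 + h, n0:n0 + w] = resize_nn(mask, (h, w))
    return out

X = np.random.default_rng(2).random((128, 256))
box = (30, 40, 20, 50)          # f0, n0, h, w
Z = crop_resize(X, box, (128, 256))
M = uncrop(np.ones((128, 256)), box, X.shape)
assert Z.shape == (128, 256) and M.sum() == 20 * 50
```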

There are two approaches to implementing the detection and segmentation steps in a supervised fashion. One can search for two separate functions using the ground-truth outputs $[C, H, W]$ and $[M_s]_{s=1}^{S}$, then use the detection function to predict the position and extent of $\hat{S}$ sources and run the segmentation function $\hat{S}$ times to reconstruct the individual sources. The approach in this paper is instead to find a single function that projects a mixture $X$ to $\hat{S}$ masks $M_s$ for $s \in \{1, \dots, \hat{S}\}$, operating in $\hat{S} + 1$ steps: one step of detection, $X \to [C, H, W]$, and $\hat{S}$ steps of segmentation, $Z_s \to M_s$. This function is presented as $g: [X, Z_s] \to [C, H, W, M_s]$, where $Z_s$ is the resized version of the $s$th source in $X$ and $M_s$ is its corresponding mask. $g$ operates in $1 + \hat{S}$ steps: $g: [X, \cdot] \to [C, H, W, \cdot]$ in the first step, and $g: [\cdot, Z_s] \to [\cdot, \cdot, \cdot, M_s]$ in the next $\hat{S}$ steps. In the following, a recipe for training this function is provided first, and then its usage at evaluation and test time is explained. Finally, a DNN is suggested for implementing this function.

### D. Train

Training, or estimating the parameters of the function in a supervised manner, is an optimization process in which a loss function between predictions and ground-truth values is minimized. Given the input $[X, Z_s]$, the ground-truth labels are $[C, H, W, M_s]$; equivalently, a mixture $X$ has four labels, $C$, $H$, $W$, and $M = [M_s]_{s=1}^{S}$, indicating the bounding boxes (the first three) and the masks of all sources, respectively. It is worth noting that, given the first three labels, it is trivial to find the $S$ resized versions of $X$, $Z_s$ for $s \in \{1, \dots, S\}$. Also, the input and output of the separation function have two and four channels, respectively. During training, as shown in Fig. 2, the function receives $[X, Z_s]$ for a random $s$ from $\{1, \dots, S\}$ and outputs a prediction of $[C, H, W, M_s]$. Parameters of $g$ are estimated by minimizing the loss function

$$L(Y, \hat{Y}) = \mathrm{BCE}(C, \hat{C}) + \lambda_1 \, \mathrm{MSE}(H, \hat{H}) + \lambda_2 \, \mathrm{MSE}(W, \hat{W}) + \lambda_3 \, \mathrm{BCE}(M_s, \hat{M}_s), \tag{2}$$

which measures the distance between the prediction $\hat{Y}$ and the ground truth $Y$, where the parameters $\lambda = \{\lambda_1, \lambda_2, \lambda_3\}$ are regularization coefficients. The first and last terms of Eq. (2) are the binary cross-entropy (BCE) between the ground-truth binary variables and their predictions, whereas the middle terms are the mean squared errors (MSE) between the ground-truth heights and widths of the bounding boxes and their predictions. Since $\hat{H}$ and $\hat{W}$ are positive and real-valued, the mean squared error is used to measure the model performance, whereas cross-entropy has been shown to be a more effective distance for categorical predictions such as binary values.
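The loss of Eq. (2) combines binary cross-entropy on the binary outputs with mean squared error on the box dimensions; a minimal NumPy sketch (the placement of the λ coefficients and the clipping constant `eps` are assumptions consistent with the three regularization coefficients reported in Sec. III):

```python
import numpy as np

def bce(y, p, eps=1e-7):
    """Binary cross-entropy averaged over all time-frequency bins."""
    p = np.clip(p, eps, 1 - eps)          # numerical safety for log
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def mse(y, p):
    """Mean squared error averaged over all time-frequency bins."""
    return float(((y - p) ** 2).mean())

def loss(C, H, W, M, C_hat, H_hat, W_hat, M_hat, lam=(0.1, 0.1, 0.5)):
    """Eq. (2): BCE on centers and mask, MSE on box heights and widths,
    weighted by the regularization coefficients lambda_1..lambda_3."""
    l1, l2, l3 = lam
    return (bce(C, C_hat) + l1 * mse(H, H_hat)
            + l2 * mse(W, W_hat) + l3 * bce(M, M_hat))

rng = np.random.default_rng(3)
C = (rng.random((8, 8)) > 0.9).astype(float)
H = 6.0 * C; W = 4.0 * C; M = C.copy()
perfect = loss(C, H, W, M, C, H, W, M)        # near zero
noisy = loss(C, H, W, M, C, H + 1.0, W, M)    # biased height prediction
```

With a unit bias on every height bin, the loss increases by exactly $\lambda_1 \cdot 1 = 0.1$ over the perfect prediction.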

### E. Test

During evaluation or test time, the separation function operates $\hat{S} + 1$ times, in the two steps of detection and segmentation. Since no resized version of the mixture exists during the detection step, the initial input to the network is the mixture alone, i.e., $[X, \cdot]$, where the second variable is not available. The output of the network during the detection step is denoted by $[\hat{C}, \hat{H}, \hat{W}, \cdot]$, where the last dimension of the output is not used. To prepare resized versions for the segmentation step, a couple of post-processing operations are applied to $\hat{Y}$. Since $\hat{C} \in [0, 1]$, a thresholding operation is applied to the first dimension of the output, i.e., $\hat{S} = \sum_{f,n} \mathbb{1}_{\varphi}(\hat{C}(f, n))$, where $\varphi$ is the detection threshold. This operation chooses the $\hat{S}$ bins with high probabilities of being centers of sources and excludes the remaining $FN - \hat{S}$ bins of $\hat{C}$, $\hat{H}$, and $\hat{W}$. Therefore, the output of the detection step can be represented as $[\mathbb{1}_{\varphi}(\hat{C}), \mathbb{1}_{\varphi}(\hat{C}) \odot \hat{H}, \mathbb{1}_{\varphi}(\hat{C}) \odot \hat{W}, \cdot]$. With the first three dimensions of the output $\hat{Y}$, one can form $\hat{S}$ inputs of the form $[\cdot, \hat{Z}_s]$ and obtain $\hat{S}$ outputs of $[\cdot, \cdot, \cdot, \hat{M}_s]$ for $s \in \{1, \dots, \hat{S}\}$. In a similar manner as the prediction of the centers of sources, the final output is obtained as $\hat{Y} = [\cdot, \cdot, \cdot, \mathbb{1}_{\psi}(\hat{M}_s)]$, where $\psi$ is the segmentation threshold. Since threshold-based predictions are cluttered with non-source segments, as a post-processing step, after filling the holes of the predicted binary mask, the largest segment is extracted as the final mask.
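The mask post-processing described above (threshold, fill holes, keep the largest connected segment) can be sketched with `scipy.ndimage`; the 4-connectivity default of `ndimage.label` is an assumption, as the paper does not state the connectivity used:

```python
import numpy as np
from scipy import ndimage

def clean_mask(M_hat, psi=0.5):
    """Binarize a soft mask at the segmentation threshold psi, fill its
    interior holes, and keep only the largest connected segment."""
    M = M_hat > psi                      # segmentation threshold
    M = ndimage.binary_fill_holes(M)     # close interior holes
    labels, n = ndimage.label(M)         # connected components
    if n == 0:
        return M                         # empty mask: nothing to keep
    sizes = ndimage.sum(M, labels, index=range(1, n + 1))
    return labels == (1 + int(np.argmax(sizes)))

# A blob with a hole plus an isolated speck -> one filled blob survives.
M_hat = np.zeros((16, 16))
M_hat[2:8, 2:8] = 0.9; M_hat[4, 4] = 0.0   # hole inside the blob
M_hat[12, 12] = 0.9                         # isolated non-source speck
M = clean_mask(M_hat)
assert M[4, 4] and not M[12, 12]
```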

### F. Architecture

An auto-encoder convolutional neural network (CNN) from the UNet family of architectures is used to model the separation function. The UNet architecture, initially proposed for segmentation in medical images, has been shown to be successful in the encoding and decoding of spatial features. The encoding part represents images in a low-dimensional space, known as the latent space, and the decoding part reconstructs the outputs of interest using features in the latent space.^{40} In this paper, a neural network with the DenseUNet architecture^{41} (a slightly modified UNet) is used to model the separation function that detects and segments individual sources in the mixture. The DenseUNet is an auto-encoder neural network with alternating downsample/upsample units in the encoding/decoding paths, respectively, and blocks of bottleneck layers. Downsample layers consist of a convolution followed by an average pooling, and upsample layers are transposed convolutions. Bottleneck layers in both the encoding and decoding paths include two convolution operations, and the output of each layer is dropped out with a small probability. Batch normalization followed by a rectified linear unit (ReLU) is applied to the inputs of both the convolutions and the transposed convolutions. Parameters of this function (the weights of the convolutions, transposed convolutions, and batch normalizations) are estimated through the training process.

### G. Data

In order to minimize the loss function of Eq. (2), input mixtures $X$ and their corresponding ground-truth labels $[M_s]_{s=1}^{S}$ are required. Regarding the input mixture $x$, a common approach in the development of source separation algorithms is to mix $S$ individual sources $[x_s^1]_{s=1}^{S}$ and estimate a function that predicts the binary masks of individual sources based on their dominance in the spectrogram $X$, the transformation of $x$. To make a mixture $x$ of $S$ sources, one can mix the $S$ individual sources $[x_s^1]_{s=1}^{S}$ recursively, adding one scaled source at a time starting from $x_0 = 0$, at signal-to-noise ratios (SNRs) of $(\alpha_s)_{s=2}^{S}$. Therefore, a mixture $X$ is labeled with $[M_s]_{s=1}^{S}$, where the mask of $X_s^1$ is denoted by $M_s$.
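One way to realize this mixing is to add the sources one at a time, scaling each new source against the running mixture to hit the requested SNR; a sketch under that assumption (the recursive scheme and the conversion of each SNR into an amplitude scale are assumptions consistent with $x_0 = 0$, not a procedure stated in the paper):

```python
import numpy as np

def mix_at_snrs(sources, snrs_db):
    """Mix sources recursively: starting from x_0 = 0, each source
    s >= 2 is scaled so that its power relative to the running mixture
    matches the requested SNR in dB before being added."""
    x = np.asarray(sources[0], dtype=float)   # x_1 enters unscaled
    for y, snr in zip(sources[1:], snrs_db):
        # choose a with a^2 * P_y / P_x = 10^(snr / 10)
        a = np.sqrt((x ** 2).mean() / (y ** 2).mean() * 10 ** (snr / 10))
        x = x + a * y
    return x

rng = np.random.default_rng(4)
a1, a2 = rng.normal(size=8192), rng.normal(size=8192)
x = mix_at_snrs([a1, a2], [0.0])
# At 0 dB the scaled second source has the same power as the first.
scale = np.sqrt((a1 ** 2).mean() / (a2 ** 2).mean())
assert np.allclose(x, a1 + scale * a2)
```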

For the ground-truth labels $M$, since manually labeling bioacoustic signals in the time-frequency domain is expensive, a simple and fast labeling algorithm is needed. Given a recording $x^1$, assumed to contain non-overlapping calls, and its transformation $X^1$, the single-channel binary mask $M$ is extracted using the iterative thresholding of Algorithm 1 for a maximum of $U$ iterations, where $\omega$ is the amplitude threshold and $\langle M \rangle$ is the number of connected regions in the mask $M$.

**Algorithm 1.**

Input: $X^1 \in \mathbb{R}^{F \times N}$, $U$
Initialize: $\omega \leftarrow 0$
while $\omega < \max(X^1)$ do
  $M = \mathbb{1}_{\omega}(X^1)$
  if $\langle M \rangle = 1$ and $\sum_{f,n} M \neq FN$ then
    Output: $M$
  else
    $\omega \leftarrow \omega + \left(\max(X^1) - \min(X^1)\right)/U$
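Algorithm 1 translates almost line for line into Python; a sketch using `scipy.ndimage.label` to count connected regions (8-connectivity is an assumption, as the paper does not state which connectivity defines a region):

```python
import numpy as np
from scipy import ndimage

def extract_mask(X1, U=5):
    """Iterative thresholding (Algorithm 1): raise the amplitude
    threshold omega in (max - min)/U steps until the binary mask is a
    single connected region that does not cover the whole spectrogram."""
    lo, hi = X1.min(), X1.max()
    omega = 0.0
    struct = np.ones((3, 3))            # 8-connected regions (assumed)
    while omega < hi:
        M = X1 > omega
        _, n = ndimage.label(M, structure=struct)
        if n == 1 and M.sum() != M.size:
            return M
        omega += (hi - lo) / U
    return None                          # no single-region threshold found

# One bright call over low-level noise is isolated within a few steps.
X1 = np.zeros((32, 32)); X1[10:20, 8:14] = 5.0
X1 += 0.1 * np.abs(np.random.default_rng(5).normal(size=(32, 32)))
M = extract_mask(X1)
assert M is not None and M.sum() == 60
```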


## III. EXPERIMENTS

Experiments of the proposed method are conducted on real and artificial datasets, and the performance is evaluated in terms of common metrics used in binary decision problems.

### A. Datasets

The artificial dataset is made by mixing single-source recordings that are pulled out of recordings of bat echolocation calls, sampled at 256 kHz, using Algorithm 1. The time-frequency representation is the PCEN^{38} spectral magnitude of a Mel-scaled spectrogram with $F = 128$ bands between 20 and 60 kHz. Spectrograms are obtained using a 1024-point short-time Fourier transform (STFT) with window and hop sizes of 256 and 32 samples, respectively. By framing the time-domain recordings in segments of length $T = 32$ ms, inputs to Algorithm 1 are $N = 256$ features of dimension $F = 128$, with $U = 5$. From three field recordings with a total duration of one hour, around 2700 bat echolocation calls are pulled out and divided into five folds of sources used in making the development and test sets of mixtures. For each set of individual sources, a maximum of $S \in \{5, 10\}$ sources is chosen and mixed at SNRs selected randomly from $[-5, 5]$ (see Table I). The number of mixtures made out of each set of sources is proportional to the number of sources, and a total of 100–5000 mixtures are generated for each fold (see Table II). Figure 4 depicts a sample of the artificial dataset, i.e., a mixture of ten echolocation calls and its corresponding label, ten masks.

|   |   | Detection: source-level |   |   | Detection: bin-level |   | Segmentation: source-level |   |   | Segmentation: bin-level |   |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SD | ST | F₁ | FNR | EER | F₁ | FNR | F₁ | FNR | EER | F₁ | FNR |
| 5 | 5 | 0.63 | 0.14 | 0.64 | 0.82 | 0.22 | 0.52 | 0.29 | 1.17 | 0.73 | 0.27 |
| 5 | 10 | 0.62 | 0.34 | 0.78 | 0.81 | 0.22 | 0.47 | 0.50 | 1.21 | 0.73 | 0.27 |
| 10 | 5 | 0.74 | 0.32 | 0.51 | 0.90 | 0.10 | 0.45 | 0.54 | 1.04 | 0.74 | 0.25 |
| 10 | 10 | 0.72 | 0.33 | 0.58 | 0.89 | 0.08 | 0.45 | 0.59 | 1.07 | 0.79 | 0.22 |


|   | Detection: source-level |   |   | Detection: bin-level |   | Segmentation: source-level |   |   | Segmentation: bin-level |   |
|---|---|---|---|---|---|---|---|---|---|---|
| # mixtures | F₁ | FNR | EER | F₁ | FNR | F₁ | FNR | EER | F₁ | FNR |
| 500 | 0.50 | 0.77 | 0.79 | 0.83 | 0.14 | 0.26 | 0.81 | 1.00 | 0.73 | 0.26 |
| 1000 | 0.54 | 0.56 | 0.69 | 0.82 | 0.12 | 0.33 | 0.73 | 1.18 | 0.72 | 0.25 |
| 2000 | 0.72 | 0.33 | 0.58 | 0.89 | 0.08 | 0.45 | 0.59 | 1.07 | 0.79 | 0.22 |
| 5000 | 0.76 | 0.31 | 0.53 | 0.94 | 0.06 | 0.46 | 0.50 | 0.93 | 0.81 | 0.18 |


The real dataset contains 13 recordings (with a total duration of around 4 s) within swarms of up to 300 echolocating bats. The recordings are framed into overlapping segments so that some sources appear in more than one segment. Using a DenseUNet of $2/4/8/4$ bottleneck layers, trained with mixtures of at most ten sources and regularization coefficients of $0.1/0.1/0.5$, more than 900 sources are extracted. Figure 3(a) shows a spectrogram of a 32 ms frame from this dataset; Fig. 3(b) depicts the outputs of the detection step, and the extracted masks are presented in Fig. 3(c).

### B. Model

The DenseUNet has four dense blocks of bottleneck layers in both the encoding and decoding paths (see Table III). Each downsample layer is composed of a 3 × 3 convolution with a stride of 1 followed by 2 × 2 average pooling with a stride of 2, while upsample layers use 2 × 2 transposed convolutions with a stride of 2. In bottleneck layers, 1 × 1 and 3 × 3 convolutions with strides of 1 are followed by dropout with *p* = 0.1. The number of channels in the DenseUNet repeatedly increases and decreases with growth and reduction rates of 16 and 0.5, respectively. To estimate the parameters of this network, the development set is split into validation and training sets with a ratio of 1/4, and the Adam algorithm^{42} is used to minimize the loss function of Eq. (2) over the training set, with a learning rate that decays linearly from 10^{−4} to 10^{−5} (see Table IV). The total number of epochs is set to 100, with an early-stopping rule triggered when the loss on the validation set does not change for ten epochs.
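A minimal PyTorch sketch of these building blocks is given below. The layer types, kernel sizes, strides, dropout probability, growth/reduction rates, and the linear learning-rate decay follow the text; the example channel counts, ReLU placement, and the 4× width of the 1 × 1 convolution are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck layer: 1x1 then 3x3 convolutions (stride 1), dropout p=0.1.

    Dense connectivity: the input is concatenated with the `growth`
    new feature maps, so channels increase by the growth rate of 16.
    """
    def __init__(self, in_ch, growth=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 4 * growth, kernel_size=1, stride=1),  # 4x width: assumption
            nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth, growth, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.1),
        )

    def forward(self, x):
        return torch.cat([x, self.net(x)], dim=1)

class Downsample(nn.Module):
    """3x3 convolution (stride 1) + 2x2 average pooling (stride 2);
    the reduction rate of 0.5 halves the channel count."""
    def __init__(self, in_ch, reduction=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, int(in_ch * reduction), kernel_size=3, stride=1, padding=1),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.net(x)

def upsample(in_ch, out_ch):
    """2x2 transposed convolution with stride 2, as in the decoding path."""
    return nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)

# One encoding stage (illustrative channel counts), plus the optimizer
# and a linear learning-rate decay from 1e-4 toward 1e-5 over 100 epochs.
model = nn.Sequential(Bottleneck(8), Downsample(8 + 16))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=1.0, end_factor=0.1, total_iters=100)
```

Each dense block would stack several `Bottleneck` layers (e.g., 2/4/6/8 across the four blocks) before each `Downsample`.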

**Table III.** Detection and segmentation performance for different numbers of bottleneck layers in dense blocks 1–4 (src = source-level; bin = bin-level).

| Block 1 | Block 2 | Block 3 | Block 4 | Det. *F*_{1} (src) | Det. FNR (src) | Det. EER (src) | Det. *F*_{1} (bin) | Det. FNR (bin) | Seg. *F*_{1} (src) | Seg. FNR (src) | Seg. EER (src) | Seg. *F*_{1} (bin) | Seg. FNR (bin) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | 2 | 2 | 0.61 | 0.30 | 0.72 | 0.66 | 0.34 | 0.38 | 0.62 | 1.13 | 0.66 | 0.35 |
| 2 | 4 | 6 | 8 | 0.72 | 0.33 | 0.58 | 0.89 | 0.08 | 0.45 | 0.59 | 1.07 | 0.79 | 0.22 |
| 8 | 6 | 4 | 2 | 0.65 | 0.28 | 0.65 | 0.75 | 0.30 | 0.42 | 0.59 | 1.09 | 0.72 | 0.32 |
| 2 | 5 | 5 | 8 | 0.71 | 0.37 | 0.64 | 0.87 | 0.15 | 0.42 | 0.60 | 1.10 | 0.74 | 0.26 |


**Table IV.** Detection and segmentation performance for different regularization coefficients (src = source-level; bin = bin-level).

| *λ*_{1} | *λ*_{2} | *λ*_{3} | Det. *F*_{1} (src) | Det. FNR (src) | Det. EER (src) | Det. *F*_{1} (bin) | Det. FNR (bin) | Seg. *F*_{1} (src) | Seg. FNR (src) | Seg. EER (src) | Seg. *F*_{1} (bin) | Seg. FNR (bin) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2 | 0.2 | 0.6 | 0.72 | 0.33 | 0.58 | 0.89 | 0.08 | 0.45 | 0.59 | 1.07 | 0.79 | 0.22 |
| 0.3 | 0.1 | 0.6 | 0.64 | 0.35 | 0.68 | 0.72 | 0.21 | 0.37 | 0.65 | 1.12 | 0.67 | 0.39 |
| 0.1 | 0.3 | 0.6 | 0.66 | 0.38 | 0.67 | 0.75 | 0.13 | 0.41 | 0.68 | 1.16 | 0.69 | 0.33 |
| 0.1 | 0.1 | 0.8 | 0.71 | 0.36 | 0.59 | 0.89 | 0.07 | 0.47 | 0.59 | 1.07 | 0.79 | 0.22 |


### C. Evaluation

The performance of the proposed method is evaluated in the time-frequency domain. Without any extra post-processing decision steps, all predicted binary masks and their corresponding bounding boxes are compared against the ground truth. Since the proposed model operates in two separate steps and the output of the first step affects the final prediction, results are evaluated for both the detection and segmentation steps. For each step, performance is measured at two levels: sources and time-frequency bins. The source-level metric compares the number of detected or segmented sources with the ground truth and considers a prediction correct if its intersection over union (IoU) with the ground truth is larger than 0.7. Within correctly predicted sources, the bin-level metric similarly compares the time-frequency bins of the predicted and ground-truth spectrograms. The results of the different experiments on the artificial dataset are reported in terms of the average false negative rate (FNR), event error rate (EER) (when appropriate), and *F*_{1} score over five test folds, for detection and segmentation thresholds of $\varphi = 0.1$ and $\psi = 0.5$, respectively. All threshold values are chosen based on the FNR. Because the thresholding Algorithm 1 is not accurate enough to capture all sources in a recording, the *F*_{1} score does not reflect the real performance of the proposed method as well as the FNR does. Moreover, the *F*_{1} score takes false positive predictions into account, which are of less concern than the missed sources and missed time-frequency bins represented by the FNR.
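The source-level scoring can be sketched as follows. The 0.7 IoU threshold comes from the text; the greedy one-to-one matching of predictions to ground-truth masks is an assumption, since the paper does not specify the matching procedure.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two binary masks over the spectrogram."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def source_level_scores(preds, truths, iou_thresh=0.7):
    """Source-level F1 and FNR under greedy IoU matching (an assumption).

    A prediction counts as correct when its IoU with a still-unmatched
    ground-truth mask exceeds iou_thresh (0.7 in the paper).
    """
    matched = set()
    tp = 0
    for p in preds:
        for i, t in enumerate(truths):
            if i not in matched and iou(p, t) > iou_thresh:
                matched.add(i)
                tp += 1
                break
    fp = len(preds) - tp          # predictions with no matching source
    fn = len(truths) - tp         # missed sources
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    fnr = fn / len(truths) if truths else 0.0
    return f1, fnr
```

The bin-level metric would apply the same comparison to individual time-frequency bins, but only within the correctly matched sources.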

The results of the experiment exploring the complexity of the training dataset are shown in Table I. Two datasets composed of mixtures of at most five and at most ten sources are used for training, and each is evaluated against test sets of mixtures of at most five and ten sources. As Table I shows, testing mixtures of at most ten sources with a function trained on mixtures of at most five sources results in the worst FNR. Training on mixtures with more sources is therefore necessary for detecting highly overlapped sources: when the cropped training examples are more complicated, the neural network learns to recognize the source of interest in the presence of other sources. However, since the frame size, the duration of sources, and the overlap of sources are inversely related, training on mixtures of more than 20 sources does not yield a better model. Table II shows the performance for various sizes of the training dataset. The larger the training dataset, the faster the convergence rate; however, since all mixtures are generated in the same way, the model reaches its maximum capacity after 2000 mixtures. To examine the size of the DNN, which is mainly determined by the number of encoding layers (set to 4) and the number of bottlenecks in each layer, training is performed with different combinations of bottleneck layers. Adding layers consistently improves the FNR up to a total of around 20 layers, and $2/4/6/8$ shows the best performance (Table III). Reversed orderings such as $8/6/4/2$ substantially degrade the bin-level FNR, and minor changes to the architecture such as $2/5/5/8$ do not yield considerable improvement. Since during training the first three output variables, *C*, *H*, and *W*, are repeated in proportion to the number of sources, weighting *λ*_{3} should result in better performance. Table IV shows the effects of various regularization coefficients: the FNR decreases consistently when *λ*_{1} and *λ*_{2} are larger than *λ*_{3}. Training with regularization coefficients of $1/1/10$ yields better FNRs at the source level, but the last output variable then overfits to the average mask, so the shape of sources is not preserved well.
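To make the role of the regularization coefficients concrete, the sketch below shows a generic three-term weighted loss. It does not reproduce Eq. (2): the decomposition into box-location, box-size, and mask terms, and the dictionary keys `'boxes'` and `'masks'`, are hypothetical; only the weighting by *λ*_{1}, *λ*_{2}, and *λ*_{3} reflects the text.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, lambdas=(0.2, 0.2, 0.6)):
    """Hedged sketch of a three-term weighted loss; NOT the paper's Eq. (2).

    `pred`/`target` are dicts with hypothetical keys:
      'boxes': (N, 4) tensors of box parameters,
      'masks': (N, H, W) tensors of mask probabilities in [0, 1].
    lambda_3 weights the mask term; lambda_1/lambda_2 weight box terms.
    """
    l1, l2, l3 = lambdas
    box_loc = F.l1_loss(pred['boxes'][:, :2], target['boxes'][:, :2])
    box_size = F.l1_loss(pred['boxes'][:, 2:], target['boxes'][:, 2:])
    mask = F.binary_cross_entropy(pred['masks'], target['masks'])
    return l1 * box_loc + l2 * box_size + l3 * mask
```

Under such a scheme, driving the mask weight much higher (e.g., 1/1/10) pushes the optimizer to satisfy the mask term on average, which is consistent with the reported overfitting to the average mask shape.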

## IV. CONCLUSION

This work proposes a novel source separation algorithm for extracting an unknown number of overlapping bat echolocation calls. The method operates in two steps, detection and segmentation: the first finds bounding boxes that potentially contain a source, and the second extracts the final masks of the sources. Experiments on real mixtures of bat echolocation calls demonstrate that the extracted segments maintain the essential features of the original call shapes. The different components of the proposed method, such as the DNN architecture, loss function, and size of the training dataset, were determined through a numerical search over a wide range of parameters. The method performs best on signals with high SNR and low-to-moderate levels of overlap. The primary motivation for this algorithm was the separation of bat echolocation calls, but it can readily be applied and/or extended to other bioacoustic recordings with overlapping calls. The method was developed for the separation of sources without harmonics, but an association function could be applied afterward to merge the harmonics of identical sources. Future directions include modifying the approach to separate more complex overlapping bioacoustic signals, such as those from dolphins, birds, or frogs vocalizing in large groups.

## ACKNOWLEDGMENTS

This project was funded by an Office of Naval Research Young Investigator Award (N00014-16-1-2478) and ONR N00014-18-1-2522, both awarded to L.N.K. The authors would like to thank Nicole Blaha, Morgan Kinniry, Kathryn McGowan, Allison Pudlo, and Lilias Zusi for assistance with data collection and extraction.