This work demonstrates that automated mine countermeasure (MCM) tasks are greatly facilitated by characterizing the seafloor environment in which the sensors operate as a first step within a comprehensive strategy for how to exploit information from available sensors, multiple detector types, measured features, and target classifiers, depending on the specific seabed characteristics present within the high-frequency synthetic aperture sonar (SAS) imagery used to perform MCM tasks. This approach is able to adapt as environmental characteristics change and includes the ability to recognize novel seabed types. Classifiers are then adaptively retrained through active learning in these unfamiliar seabed types, resulting in improved mitigation of challenging environmental clutter as it is encountered. Further, a segmentation constrained network algorithm is introduced to enable enhanced generalization abilities for recognizing mine-like objects from underrepresented environments within the training data. Additionally, a fusion approach is presented that allows the combination of multiple detectors, feature types spanning both measured expert features and deep learning, and an ensemble of classifiers for the particular seabed mixture proportions measured around each detected target. The environmentally adaptive approach is demonstrated to provide the best overall performance for automated mine-like object recognition.
The complexity of the natural underwater environment creates a challenging arena in which to find underwater mines, particularly through autonomous processes. The wide variety of seafloor textures, depth offsets, and sediment types contributes greatly to the difficulty that sensors, detectors, and classification algorithms have in performing mine countermeasure (MCM) tasks. Despite the challenges presented by this environment, including the difficulty of collecting clean imagery,1 continued innovation in automated MCM approaches remains a priority for the research and defense community.2 In this paper, we propose a comprehensive fusion of information to maximize machine learning (ML)-driven MCM performance. This brings together information characterizing the seabed, colocated synthetic aperture sonar (SAS) imagery formed from different sonar frequency bands, multiple target detectors, numerous feature measurements, and a collection of classifiers. Our overall approach provides a clear performance gain in mine-like object recognition tasks, is adaptable to changing environments, and is able to discover new seabed environments as they are encountered while using active learning to train classifiers with targets in new environments.
Previous approaches in mine detection and classification have included matched filter designs to capture the target structure within sonar imagery3 and canonical correlation analysis (CCA) using sonar backscatter.4 More recently, graph-based approaches that build low-dimensional embedded spaces have been explored to see if manifolds that better separate mines from clutter can be found.5 Rather than isolate a particular technique, other approaches find MCM performance gain by considering a wider range of frequencies used by the sonar. Early examples include an adaptive clutter filter, which combines three sonar images formed with different frequencies and bandwidths,6 a dual-band SAS constructed with high-frequency (HF) and low-frequency (LF) sonar,7 and a multiband approach to segment the object, shadow, and seafloor within the side-scan sonar imagery.8 More recently, novel types of mine detectors using multiband SAS imagery include a subspace-based method with a fast dictionary learning algorithm9 and a Mondrian detector that assesses pixel intensities in a set of shaped boxes with various spatial arrangements.10 Other work has explored MCM gains possible by constructing SAS imagery with disparate sonar platforms,11 bringing in new information from multiple simultaneous views, rather than extending the frequency range.
Along with investigating a broader extent of the sonar spectrum, incorporating the environmental context within the decision process for MCM tasks has previously been explored. Recent work extends the available information for the MCM task to include electroencephalography signals from human operators as input within a mine-like object classifier along with SAS imagery.12 Environmental characteristics of the seafloor are indirectly used in work that explores a resampling algorithm to assess the classifier performance to determine an optimal feature group and classifier pairing.13 The environment is used more explicitly in work that trains a classifier for a set of seabed types and assesses the performance within an ensemble of these classifiers.14 Recent work also demonstrates the detector performance gain by adaptively using seabed characteristics within the detector framework.15
Deep learning approaches have also been a common theme in recent MCM work. Focused on the classification of mines within SAS imagery, convolutional neural networks (CNNs) have been shown to be quite effective in mine recognition.16 Recent work with CNNs has also been done to train detectors as a replacement for constant false alarm rate (CFAR)-type detectors, where image segmentation, isolating targets from the background, allows for a comparison between the foreground and background within an observation to be used to train object detectors.17 One drawback of this approach is the number of training samples needed. In recent work, a reduction in the training sample size required for CNNs developed for sonar image classification has been demonstrated by using priors for image scene context and image structural similarity, but this still required training with over 20 000 samples.18 Although these approaches are encouraging, CNNs pose limitations due to the quantity of required training data (tens to hundreds of thousands of observations), whereas the approach we have adopted within this work is able to train with only hundreds of samples over a wide range of seabed types. In particular, this work introduces a segmentation constrained network (SCN) that is able to combine a CFAR-like foreground/background segmentation with a deep network effectively with fewer than 300 targets, whereas Akhtar and Olsen17 used 40 000 samples for CNN training with a CFAR-type detector. These other deep learning approaches present a full deep learning solution, whereas our approach uses deep learning only to perform feature extraction in between the detection and classification stages. Doing so allows us to bring in a distinct seabed characterization front end and an active learning portion to the back end, which can be developed without the requirement of tens of thousands of training samples.
This paper investigates how to combine all available information to significantly improve underwater mine-like object recognition and provides a framework for learning when sensing environments change significantly. Our approach includes using a wide range of sonar frequencies along with seabed characterization, as well as both Bayesian and deep learning architectures. Building on previous work that showed the benefits of environmental information to assist mine detection15 and classification,14 we develop a technique to use seabed characterization to enhance a complete automatic target recognition (ATR) process. One of the primary contributions of this work is to use our adaptive seabed characterization model19 to enhance ATR, enabling the discovery of new seabed types as they are encountered during data collection to facilitate mine recognition. Another primary contribution of this work is in our use of active learning to retrain seabed-based fusion when observations with sufficiently large novel information content are discovered, e.g., when new targets or challenging clutter types are encountered within a novel seabed environment. As a minor contribution, this work also introduces a SCN, which uses a foreground/background image segmentation to more efficiently train our deep learning architecture for feature discovery for underwater mine-like objects. We demonstrate the utility of our full approach across a variety of data collection sites.
The remainder of the text is organized as follows. A system overview of our approach is provided in Sec. II. The seabed characterization model is described in Sec. III. The detector algorithms used are described in Sec. IV. Details of the features used for classification are provided in Sec. V. Details of the classifier ensemble, seabed-informed fusion, and active learning are given in Sec. VI. The end-to-end system performance is described in Sec. VII, and concluding thoughts are provided in Sec. VIII.
II. APPROACH OVERVIEW
Our MCM approach begins with characterizing the seafloor using HF SAS imagery. By first characterizing the environment in which mines are sought, different sensors, signal features, detectors, and classifiers can be employed more precisely and effectively for specific MCM tasks. The environmental characterization can be applied to individual SAS chips so that multiple sensing and classification models can be fused on a per-SAS-chip basis, maximizing the benefit of using environmental characterization for MCM tasks.
A system overview of our approach, the lifelong environmentally adaptive probabilistic recognition (LEAPR) for MCM, is shown in Fig. 1. The flow chart of our approach begins at the left green box, describing our adaptive environmental characterization,20 where SAS images are assessed based on local textures to train an unsupervised clustering model. Our environmental characterization model uses an adaptive online hierarchical Dirichlet process (aoHDP),19 which performs unsupervised clustering in a dynamic setting, allowing the model to adapt to a changing environment and discover novel seabed types as they are first encountered in new data collections. Described in Sec. III, our model works in the context of lifelong learning, where the model can be used to process data once it has undergone training with an initial data set; however, the model also continues to learn as new data become available, creating a more robust statistical representation of the sensing environment and lending itself to a wider extent of applicability.
Because the environmental characterization uses a Dirichlet mixture-modeling approach, the characterization ascribed to each SAS chip under investigation consists of a mixture of seabed textures proportionally determined by the model. Rather than segmenting an image, by assigning a seabed type to each pixel, each SAS chip is assigned a vector of probabilities (summing to one) of mixture proportions for each seabed type contained within the SAS chip. These proportions are then available for use in our downstream processing, particularly, in our final step of probabilistic decision fusion.
Once the seabed mixtures are assessed, the set of sensor data, features, and detectors to use for the MCM tasks is selected; these are depicted in the brown box in Fig. 1. The system processes the SAS data collected with both HF and broadband (BB) sonar. Although the process readily scales to include additional sensor types, such as LF data, to date, we have only implemented it with HF and BB sonar. LEAPR uses two detectors in parallel, one of which is a fused raw energy and Kullback-Leibler (KL)-divergence (FRED) detector described in Sec. IV. The other detector is the adaptive object segmentation (ASEG) detector,21,22 initially developed for target detection in synthetic aperture radar (SAR) imagery, which we have applied here to SAS object detection. We merge overlapping detections from each and filter detections using an upfront pre-screener stage to de-weight the clutter background. For each resulting detection, we then extract a collection of expertly defined features that encode the Bayesian priors used by human analysts in parallel with target-centric features discovered using deep learning, implemented with our developed SCN architecture described in Sec. V. For sonar image chips produced by our upfront detection algorithm, these complementary feature sets provide a rich representation for LEAPR's downstream automated mine-like object recognition processing.
The feature sets are passed along to a collection of classifiers, depicted in the orange boxes in Fig. 1. We use an ensemble of Bayesian classifiers spanning from sparse to non-sparse representations achieved via different combinations of prior distribution types and objective functions, providing a more comprehensive target classification than using an individual classifier. The information from the environmental characterization, detector algorithms, and classifier ensemble is passed along to our probabilistic decision fusion process, depicted in the upper blue box in Fig. 1. Our seabed-informed fusion process employs a model-stacking approach to classify each detection based on a nonlinear kernel representation of ensemble classifier probabilities, detector scores, and environmental context features extracted from the available sensor data used in the MCM task. Further details of our classifier ensemble and seabed-informed fusion approach are described in Sec. VI.
In conjunction with our classifier ensemble fusion, we also employ an active learning algorithm,23 where the novel information content of the incoming detections is assessed, and an expert analyst can provide labels for detections that are sufficiently informative to the algorithm. This allows the human analyst to directly refine the classifier's maturation and is an important function, particularly, when novel seabed types are encountered to maximize performance in the presence of challenging environmental factors. Further details of our active learning approach are provided in Sec. VI.
III. SEABED CHARACTERIZATION
The first step in our processing chain is to assess the environment in which we are sensing. Because the targets of interest are on the seafloor, we turn to the seabed characterization to provide categories of environmental context with which to look for mine-like objects. Our seabed characterization is performed by assessing the textures within the SAS imagery of the seafloor in an unsupervised clustering fashion, using our variant of the hierarchical Dirichlet process, the aoHDP.19 The aoHDP model learns to cluster observations into categories of distinct seabed types using each batch of collected data. However, the model also works in a lifelong learning capacity, allowing the model to continue to discover new categories of seabed types as new data arrive. This is accomplished by processing in a batch mode, allowing the model inference to complete with each batch (data collection). A subset of samples representative of each learned seabed type is then stored in memory. When a new batch of data becomes available, the model inference is restarted without re-initializing learned cluster statistics and by using both the new batch data and the samples in memory.
As a review, the aoHDP generative model is structured as follows.
Each observation is indexed with respect to a full SAS image, about 50 m × 50 m, and a smaller chip within the image, about 2 m × 2 m. As described in Brandes and Ballard,19 the features extracted from each SAS chip are lacunarity, a texture feature coding number, and a rotationally invariant histogram of gradients feature. These features are modeled as draws from a Gaussian distribution, parameterized by its mean and precision. The cluster index for each chip in each SAS image is drawn from a multinomial distribution whose parameter is drawn, as a conjugate prior, from a Dirichlet distribution. This pair of clustering variables, the cluster index and the multinomial parameter, provides the clustering at the low level within the SAS image. These are linked to the globally learned clusters across all SAS chips through a set of global cluster proportions, here, modeled with a stick-breaking representation. In the stick-breaking representation for clustering, a full set of cluster proportions of the data is treated as a stick of length one. A portion of the stick is broken off, drawn from a Beta distribution, representing the proportion of samples in a given cluster. Subsequent clusters are broken off from the remaining portion of the stick.
Of particular importance to our use of this model for seabed characterization, the inference process assigns all observations to a cluster at the end of processing each batch, normalizing the stick length over the clusters learned so far and effectively discarding the last length of the remaining stick. However, upon processing a new batch, we bring back the previously discarded portion of the stick to allow the possibility of a new cluster being discovered within the unseen data. Finally, the Gaussian statistics for each cluster are drawn from a pair of conjugate distributions: the mean as a draw from a Gaussian and the precision as a draw from a Wishart distribution. Details of the model setup and exact parameterization are provided in Brandes and Ballard.19
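The stick-breaking bookkeeping described above can be illustrated with a minimal sketch. It is not the aoHDP inference itself: the Beta(1, concentration) parameterization, the function names, and the cluster count are assumptions made here for illustration, and the exact parameterization is the one given in Brandes and Ballard.19

```python
import random


def stick_breaking(num_clusters, concentration, rng):
    """Break a unit-length stick into `num_clusters` pieces plus a leftover.

    Assumption: each break fraction is a Beta(1, concentration) draw.
    """
    weights = []
    remaining = 1.0
    for _ in range(num_clusters):
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)  # piece broken off the remaining stick
        remaining *= 1.0 - fraction           # what is left for later clusters
    return weights, remaining


def close_batch(weights):
    """End of a batch: drop the leftover stick and renormalize to sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def reopen_for_new_batch(normalized_weights, leftover):
    """Start of a new batch: restore the leftover so a new cluster can appear."""
    scaled = [w * (1.0 - leftover) for w in normalized_weights]
    return scaled + [leftover]


rng = random.Random(0)
weights, leftover = stick_breaking(num_clusters=4, concentration=2.0, rng=rng)
closed = close_batch(weights)                       # proportions over 4 known clusters
reopened = reopen_for_new_batch(closed, leftover)   # 4 known clusters + 1 open slot
```

In this sketch, the final entry of `reopened` plays the role of the restored stick portion that a novel seabed type could claim in the next batch.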
Applied in our setting to provide environmental context for an ATR-process, the seabed characterization model serves as a way to quickly categorize the background landscape with which to detect and classify mine-like objects. In practice, the aoHDP model provides learned category labels for the seabed imagery used for finding targets. With each batch of imagery, the model provides seabed categories for each image chip. When the model discovers new seabed types, those labels are applied to the new imagery and provide information relevant to the active learning portion of the system as they can add to the novel content assessment used by active learning to select which detections to request labels for. Many patches of the seafloor contain mixtures of seabed types, and what is provided to the active learning is a proportional mixture of seabed types for each image chip as a vector that sums to one.
IV. DETECTION OF MINE-LIKE OBJECTS
Our detection process uses multiple detectors followed by a Bayesian pre-screener stage to provide a fusion of the detector outputs as depicted in the multi-detector fusion block in Fig. 2. The employment of multiple detectors provides an increased probability of detection over either of the individual detector algorithms, and the Bayesian pre-screener stage provides effective filtering for clutter rejection. Here, we use this framework to combine two detectors, although it readily scales to include more detectors if desired. One detector we use is the ASEG detector.21 The ASEG detector uses a wavelet decomposition and adaptive noise threshold estimation to segment the SAS image into foreground pixels representative of targets and background pixels of the underlying scene. Unlike a traditional CFAR stencil,24,25 ASEG measures the statistical representation of the foreground pixels as part of the candidate object segmentations it produces. ASEG then discriminates targets from clutter using appearance features based on gamma distributions fit to both foreground objects and local background pixels extracted via dilation of the object segmentations.
A. FRED detector
A second detector that we use is the FRED detector. The FRED detector is a fusion of two simple and efficient detection algorithms: the raw energy detector and the KL-divergence Reed-Xiaoli (RX) detector. Individually, the simplistic nature of each can lead to very high false alarm rates (FARs). Whereas the constituent algorithms generally agree on actual targets, they tend to include different false alarms. Therefore, the number of false alarms can be significantly reduced by keeping only those detections for which both algorithms agree.
For pre-processing, each frequency band of data is resampled to the same lower resolution. Then, the log of the data is taken so that the resultant image pixel values are roughly normally distributed.26,27 Image values are then clipped to dampen outliers and normalized to have a unity background.
The first of the fused multiband detectors is the raw energy detector, which uses the energy accumulated across both frequency bands of the sensor to make detections. For this, we use a weighted sum to combine both bands of the sonar data into a single image. The idea is that man-made objects generally persist across different frequency bands. Conversely, natural objects tend to be less coherent across the band. Therefore, these man-made objects will most likely accumulate more energy in the combined image. Further, HF data provide fine object detail, whereas lower-frequency data provide more object substance. Combining the frequency bands tends to make objects appear whole and easier to detect at the cost of image blurring. In the work presented here, the HF and BB band weights that we used were 0.4 and 0.6, respectively. Because of the band properties explained above, the performance is not very sensitive to the weighting choice. Once the bands have been combined, the resultant image is filtered to remove noise, and objects that are less than 10% of the smallest target size are culled. Next, we remove local energy differences by subtracting the mean and dividing by the variance of values around each pixel of the image. The detections are then produced by thresholding this image to find areas with abnormally high energy.
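The core steps of the raw energy detector can be sketched as follows. This is a simplified illustration, not the implementation used in this work: it omits the noise filtering and small-object culling, and the window half-width, threshold level, and function names are assumptions. It follows the text in using band weights of 0.4 (HF) and 0.6 (BB) and in dividing by the local variance.

```python
def combine_bands(hf, bb, w_hf=0.4, w_bb=0.6):
    """Weighted sum of co-registered HF and BB images (lists of rows)."""
    return [[w_hf * h + w_bb * b for h, b in zip(hf_row, bb_row)]
            for hf_row, bb_row in zip(hf, bb)]


def local_normalize(image, half_width=2, eps=1e-6):
    """Subtract the local mean and divide by the local variance at each pixel."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [image[i][j]
                      for i in range(max(0, r - half_width), min(rows, r + half_width + 1))
                      for j in range(max(0, c - half_width), min(cols, c + half_width + 1))]
            mean = sum(window) / len(window)
            var = sum((v - mean) ** 2 for v in window) / len(window)
            out[r][c] = (image[r][c] - mean) / (var + eps)  # eps avoids division by zero
    return out


def threshold(image, level):
    """Return (row, col) locations whose normalized energy exceeds `level`."""
    return [(r, c) for r, row in enumerate(image)
            for c, v in enumerate(row) if v > level]


# Toy 7 x 7 scene: unity background with one bright pixel present in both bands.
hf = [[1.0] * 7 for _ in range(7)]
bb = [[1.0] * 7 for _ in range(7)]
hf[3][3], bb[3][3] = 5.0, 4.0

combined = combine_bands(hf, bb)
normalized = local_normalize(combined)
detections = threshold(normalized, level=3.0)  # illustrative threshold
```

Because the bright pixel persists in both bands, it accumulates energy in the combined image and is the only location that survives the threshold in this toy scene.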
The second multiband detector to be fused is the KL divergence RX detector. The RX detector28 is a simple detection algorithm that finds local anomalies in an image by comparing windowed data with surrounding contextual data. An anomaly is declared when a large difference in statistical measurements occurs between the inner and outer windowed data. A variety of metrics can be used for the comparison; KL divergence is used in our implementation. KL divergence29 is a comparison of two probability density functions (PDFs) that measures the information lost when using one PDF to approximate another. For two Gaussian PDFs, the KL calculation takes the standard closed form
$$D_{\mathrm{KL}} = \frac{1}{2}\left[\operatorname{tr}\left(\Sigma_1^{-1}\Sigma_0\right) + (\mu_1-\mu_0)^{\mathsf{T}}\Sigma_1^{-1}(\mu_1-\mu_0) - k + \ln\frac{\det\Sigma_1}{\det\Sigma_0}\right],$$
where $\mu_i$ and $\Sigma_i$ are the estimates of the mean and covariance for the $i$th distribution, $k$ is the dimension of the data, $\operatorname{tr}$ is the trace operation, and $\det$ is the determinant operator. A small $D_{\mathrm{KL}}$ indicates that the two PDFs are very similar.
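As an illustration of the KL metric, the following sketch evaluates the standard closed-form KL divergence between two bivariate Gaussians, one dimension per frequency band. The function name and the example statistics are hypothetical, and the direction of the divergence (which window plays the role of the approximating PDF) is an assumption.

```python
import math


def kl_gaussian_2d(mu0, cov0, mu1, cov1):
    """KL divergence D(N0 || N1) between two 2-D Gaussians.

    mu0, mu1: length-2 means; cov0, cov1: 2 x 2 covariances as nested sequences.
    """
    (a, b), (c, d) = cov1
    det1 = a * d - b * c
    det0 = cov0[0][0] * cov0[1][1] - cov0[0][1] * cov0[1][0]
    inv1 = [[d / det1, -b / det1], [-c / det1, a / det1]]  # inverse of cov1

    # tr(inv1 @ cov0) = sum_ij inv1[i][j] * cov0[j][i]
    trace = sum(inv1[i][j] * cov0[j][i] for i in range(2) for j in range(2))

    # Mahalanobis term (mu1 - mu0)^T inv1 (mu1 - mu0)
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    maha = sum(diff[i] * inv1[i][j] * diff[j] for i in range(2) for j in range(2))

    return 0.5 * (trace + maha - 2 + math.log(det1 / det0))


identity = [[1.0, 0.0], [0.0, 1.0]]
same = kl_gaussian_2d((0.0, 0.0), identity, (0.0, 0.0), identity)   # identical PDFs
score = kl_gaussian_2d((0.0, 0.0), identity, (1.0, 0.0), identity)  # unit mean shift
```

Identical statistics give a divergence of zero, and a unit shift in one band mean with identity covariances gives 0.5, so larger scores flag windows whose statistics depart from their surroundings.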
For our KL divergence calculation, the two PDFs that are being compared are generated from the two windows shown in Fig. 3. A PDF is estimated for both the green and blue regions by assuming the data follows a Gaussian distribution and computing the mean and covariance of the high and low bands. These windows are moved over the entire image, and the metric is computed for every location. The image locations with the highest score are then selected as detections to be fused.
The window sizes can significantly change the score values and ability of the detector to detect the targets. The general rule is that the inner window should be slightly bigger than the targets of interest. The outer window provides the environmental context for the anomaly detection and is much less definitive in its optimal setting. For its size, we must consider the scale of the environmental features that we do not want to be anomalies (potential targets). For example, small holes distributed across an image are a naturally occurring environmental feature in some environments. If we have a small outer window size, we are less likely to have other holes in our local window context. This will make every hole appear more anomalous. However, if we have a large outer window size, we are likely to have other holes in our window context. This will make every hole appear less anomalous. Hence, the outer window size should be set according to the desired environmental sensitivity and/or interaction of the targets of interest with the environment; however, in our use, it does not need to be finely tuned and is set for a general use based on the SAS image. In our application, the inner window is set for a area and the outer window is set for a area.
For the fusing of detections from each detector, detections that are close to each other from different detectors, less than approximately 1.5 m, are fused into a final detection call for FRED. All other detections are discarded.
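The agreement rule for FRED can be sketched as a simple distance gate. The 1.5 m gate comes from the text; the greedy nearest-neighbor pairing, the use of the midpoint as the fused location, and the function name are assumptions made for illustration.

```python
import math


def fuse_detections(energy_dets, rx_dets, max_dist_m=1.5):
    """Keep only detections on which both detectors agree.

    Each detection is an (x, y) position in meters. A raw energy detection is
    paired with the closest unused RX detection within `max_dist_m`; unpaired
    detections from either detector are discarded.
    """
    fused = []
    used = set()
    for ex, ey in energy_dets:
        best_index, best_dist = None, max_dist_m
        for k, (rx, ry) in enumerate(rx_dets):
            if k in used:
                continue
            dist = math.hypot(ex - rx, ey - ry)
            if dist < best_dist:
                best_index, best_dist = k, dist
        if best_index is not None:
            used.add(best_index)
            rx, ry = rx_dets[best_index]
            fused.append(((ex + rx) / 2.0, (ey + ry) / 2.0))  # assumed: midpoint
    return fused


energy_dets = [(10.0, 5.0), (30.0, 12.0)]
rx_dets = [(10.6, 5.8), (40.0, 40.0)]
fred_dets = fuse_detections(energy_dets, rx_dets)
```

In this example only the first pair, separated by 1.0 m, survives; the remaining detection from each detector has no partner within the gate and is dropped.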
B. Multi-detector fusion
Given the union set of detections from both FRED and ASEG, our multi-detector fusion first merges the overlapping bounding boxes. In practice, we find merging any detections with a two-thirds overlap to be effective at minimizing redundant (duplicate) object detections. We then extract the lacunarity30 and texture features via a texture feature coding method (TFCM)31 from the SAS image chips centered on each merged detection with each parameterized as detailed in Brandes and Ballard.19 Lacunarity is a measure of how patterns fill space and is extracted efficiently via integral image calculations. The TFCM measures the local texture based on the multidirectional rising and falling of pixel intensity primitives and is likewise fast to calculate for large images using a series of comparative threshold operations and coding via lookup tables. Both of these features can also be thought of as filters because they yield outputs that match the size of the image chips that they measure.
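The merging of overlapping bounding boxes described above can be sketched as follows. The text specifies a two-thirds overlap but not how overlap is measured; this sketch assumes intersection area divided by the area of the smaller box and replaces merged boxes with their enclosing box. Both choices, and the function names, are assumptions.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap_fraction(a, b):
    """Intersection area relative to the smaller of the two boxes (assumed)."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min(box_area(a), box_area(b))
    return (width * height) / smaller if smaller > 0 else 0.0


def merge_boxes(boxes, min_overlap=2.0 / 3.0):
    """Greedily merge boxes whose overlap fraction reaches `min_overlap`."""
    merged = []
    for box in boxes:
        box = tuple(box)
        changed = True
        while changed:  # a grown box may now overlap other merged boxes
            changed = False
            for k, other in enumerate(merged):
                if overlap_fraction(box, other) >= min_overlap:
                    box = (min(box[0], other[0]), min(box[1], other[1]),
                           max(box[2], other[2]), max(box[3], other[3]))
                    del merged[k]
                    changed = True
                    break
        merged.append(box)
    return merged


detections = [(0, 0, 4, 4), (1, 0, 5, 4), (10, 10, 12, 12)]
merged = merge_boxes(detections)
```

Here the first two boxes overlap by three-quarters of their area and collapse into one enclosing box, whereas the isolated third box is kept unchanged.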
These lacunarity and TFCM features are then passed to a Bayesian pre-screener model, which provides target likelihood probabilities for each merged detection. This Bayesian pre-screener model consists of a relevance vector machine (RVM) that is trained on the statistical Hu moments32 (seven features) and gamma distribution parameter point estimates (two features) of each lacunarity, TFCM, HF, and BB chip, which are extracted at the location of each detection. This results in a set of 36 features (9 for each of these 4 image chips) that are used by the pre-screener model in the detector fusion process. The pre-screener serves as an initial false alarm mitigation stage before performing more computationally intensive feature extraction, classification, and seabed-informed fusion processing to further refine the accuracy of the automated mine recognition described in Secs. V and VI.
As shown in Fig. 4, our multi-detector fusion algorithm provides a higher probability of detection of targets than using either the FRED or ASEG detectors independently. This result also highlights a trade-off between the ASEG and FRED detectors in their abilities to filter out clutter while recognizing potential targets that our fusion algorithm can exploit. For instance, our LEAPR multi-detector fusion algorithm obtains at least a 4% higher probability of detection than either of the individual detectors and more than a 32% reduction in the FAR for probability of detection levels similar to those of the best individual detector. In practice, we configure our detection threshold such that we maintain close to the maximum probability of detection (approximately 0.95 from Fig. 4) while mitigating as many false alarms as possible at that high probability of detection (this results in around 0.82 false alarms per image from Fig. 4). Of note in Fig. 4, the performance is measured based on the targets found within the SAS imagery, and reducing the threshold on an acceptable FAR does not guarantee all targets get detected within the imagery. In the results shown here, the combined detectors still miss about 5% of the targets. To allow a fair comparison among classifiers without any bias from the detectors, any targets missed by the detectors are added back in for the classification performance comparisons in Secs. VI and VII.
V. FEATURE REPRESENTATIONS
Given the final set of candidate mine-like objects produced by our upfront LEAPR detector algorithm, we then calculate more computationally intensive feature representations toward the objective of more accurate, automated mine-like object recognition. To this end, we consider a combination of feature extraction methods, including both expert-defined features and features learned through a neural network, as illustrated in the multifeature extraction block in Fig. 2. One expert-based feature set that we use is described as computer aided detection/computer aided classification (CADCAC), which combines a nonlinear matched filter and a large set of pixel measurements from a target-sized window as detailed in Dobeck et al.33 Additionally, we use Haar wavelet filters for the object feature extraction as described in Viola and Jones,34 along with a CCA for underwater targets as described in Tucker and Azimi-Sadjadi.11 Last, we measure the point statistics (PStats) within the SAS image chips centered on each detection, which consist of measurements of the pixel values: mean, variance, median, mode, quartiles, and mean absolute deviation.
A. Deep network feature learning
In parallel with the expert-defined features, we also employ feature discovery through a deep architecture using a residual neural network (ResNet).35 We find that the features learned through our deep learning architecture are more useful for target classification if we filter out much of the background imagery surrounding the detected mine-like objects. We achieve this through using what we call a SCN, which uses the foreground/background pixel segmentation mask generated by the ASEG detector for each SAS detection chip, forming a prefilter for the neural network to focus the deep learning architecture to learn features on detected objects rather than on background imagery. The SCN uses accurate pixel-level segmentations provided by ASEG to focus the network on the on-target pixels and mitigates the problem of over-fitting to undesired background characteristics by incorporating segmentations into the input layer and propagating them through subsequent layers of the network. This facilitates improved generalization to new environments and clutter backgrounds. An example of the image segmentation boundary is shown in Fig. 5.
B. ResNet architecture for mine-like object recognition
The construction details of our ResNet-based SCN architecture described in this section are shown in Fig. 6. The initial layers of the network (which we refer to as the stem) consist, in order, of a max-pooling layer, two convolution layers, a max-pooling layer, and a final convolution layer. A rectified linear unit (ReLU) activation is applied after the first two convolutions, and no activation is applied after the third convolution. Given the high resolution of the imagery, the max-pooling layers quickly reduce the spatial dimensions of the data with a minimal loss of information. The input to the network is a two-channel image containing the segmented and centered HF and BB channels.
The network contains both down-sampling and residual blocks. The down-sampling block follows the principle suggested in Szegedy et al.36 of reducing the grid-size while expanding the filter bank: the input is fed through both a convolution layer with no activation and an average pooling layer, the outputs of which are then concatenated to form the spatially down-sampled filter bank. The residual block follows the proposed block in He et al.,37 consisting of two repetitions of a batch normalization layer, a ReLU activation, and a convolution layer with no activation. The input to the residual block is then added to the output of the last convolution layer in the residual block.
The output of the initial layers is then fed into two repetitions of a down-sampling block followed by a residual block. Following the convention in Szegedy et al.,36 the output of the last residual block is then average pooled across the spatial dimensions to obtain a final feature vector. This feature vector is then fed into a two-class Softmax layer, which classifies the input as a target (e.g., mine-like) or not.
As described in the literature, the Adam optimizer38 is used with the parameters $\beta_1$ = 0.9 and $\beta_2$ = 0.99. The initial learning rate is 1E-5 and is decayed by a factor of 0.9 after every 18 epochs. The network is trained for 360 epochs with a batch size of 50.
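The learning rate schedule can be written out explicitly. This sketch assumes a staircase decay, in which the rate is held constant within each 18-epoch interval and multiplied by 0.9 at each interval boundary; the function name is hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-5, decay=0.9, step=18):
    """Staircase schedule: multiply the rate by `decay` after every `step` epochs."""
    return initial_lr * decay ** (epoch // step)


# One learning rate per training epoch (epochs 0 through 359).
schedule = [learning_rate(epoch) for epoch in range(360)]
```

Under this assumption, the 360 training epochs span 20 distinct learning rate values, with 19 decay steps applied by the final epoch.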
The data set is parsed into an equal number of true positive target detections and false positive clutter imagery so that training uses a balanced data set, which avoids biasing the network toward classifying everything as clutter (the more numerous class). Because we have a small training set, we use data augmentation and weight decay during training to mitigate overfitting. The data set is augmented by flipping the imagery along the cross-range direction (left/right flip), taking advantage of the symmetry of SAS imagery in the cross-range/azimuth direction. Weight decay is applied to all weights with a penalty coefficient of 1E-6.
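A minimal sketch of the balancing and augmentation steps follows. It assumes that balancing is done by random subsampling of the larger class and that image chips are stored as nested lists indexed `chip[channel][row][column]`, with columns running along cross-range. Both are assumptions made for illustration.

```python
import random

def balance(targets, clutter, seed=0):
    """Subsample the larger class so that both classes have the same count."""
    rng = random.Random(seed)
    n = min(len(targets), len(clutter))
    return rng.sample(targets, n), rng.sample(clutter, n)

def flip_cross_range(chip):
    """Mirror a chip left/right (assumes columns run along cross-range)."""
    return [[row[::-1] for row in channel] for channel in chip]

def augment(chips):
    """Return the original chips followed by their mirrored copies."""
    return chips + [flip_cross_range(chip) for chip in chips]
```

Flipping doubles the number of training chips without changing the class balance, because every chip in both classes gains exactly one mirrored copy.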
We also explored the use of the segmentation mask with two variations of ResNet, one with and one without the mask. The network not using the segmentation mask tended to overfit the training data, which was not the case with the network using the mask. We found that the segmentation reduces overfitting of the networks to the background features in the training data set, and our SCN exhibits improved spatial sensitivity for bright areas in the image even at lower contrast with the background, which facilitates an increased generalization irrespective of newer background textures and target categories.
VI. SEABED-INFORMED MINE RECOGNITION
Given the feature representations of the objects we detect from the sonar imagery (described in Sec. IV), our LEAPR approach performs seabed-informed mine-like object recognition using an ensemble of Bayesian classifiers consisting of a variety of approaches that key in on different salient features,39 the mixture proportions of the seabed type from the aoHDP model, and classifier fusion achieved via model stacking. We then use active learning, which enables a human-in-the-loop setting to provide target labels to mature classifier training. This is particularly important when new seabed types are encountered and little to no labeled data are available for classifier training on unseen combinations of mine type and context.
In the work presented here, the seabed characterization model finds 17 seabed types, each of which is considered novel at one collection site or another. Some have very little coverage within the collection sites in which they are found, and not all contain targets. In testing our full ATR system, we rotate through all collection sites in a leave-one-site-out fashion, shifting the order in which the seabed types are discovered and which types remain novel within the site left out for testing. To use the seabed types in a more statistically sound way within our fusion process, in which the seabed type is used directly, we treat as distinct only those seabed types that cover at least 5% of at least one collection site in which they are found. The seabed types that do not reach this coverage are grouped together in a category labeled “other.” As shown in Fig. 7, only seven types have enough site coverage to be categorized as distinct, with the eighth category labeled as other.
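The 5% coverage rule can be sketched as follows. The data layout (a dictionary of per-site coverage fractions) and the function names are hypothetical. The second function shows one plausible way to fold the mixture proportions of a detection into the grouped categories.

```python
def group_seabed_types(coverage, min_fraction=0.05):
    """Map each seabed type to itself or to "other".

    coverage[site][seabed_type] is the fraction of that site's area covered by
    the type. A type stays distinct if it reaches `min_fraction` at any site.
    """
    best = {}
    for site_coverage in coverage.values():
        for seabed, fraction in site_coverage.items():
            best[seabed] = max(best.get(seabed, 0.0), fraction)
    return {s: (s if f >= min_fraction else "other") for s, f in best.items()}

def regroup_proportions(proportions, mapping):
    """Sum the per-detection mixture proportions into the grouped categories."""
    grouped = {}
    for seabed, p in proportions.items():
        key = mapping.get(seabed, "other")  # unseen types also fall into "other"
        grouped[key] = grouped.get(key, 0.0) + p
    return grouped
```

Applied to the 17 inferred types, a rule of this form yields the seven distinct categories plus "other" reported in Fig. 7.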
A. Classifier ensemble
Our classifier ensemble serves as a way to package the output from a collection of classifiers for evaluation by the fusion process. As shown in Fig. 2 and detailed in Sec. V, two categories of features are extracted: expert features (EF) and deep learning (DL) features. The classifier ensemble consists of a set of four Bayesian classifiers spanning from sparse to non-sparse representations, including both the ridge multinomial logistic regression (RMLR) and sparse multinomial logistic regression (SMLR)40 variants, RVM,41 and sparse probit regression (SPR).42 In a comparative analysis of classifiers on an image set [which also included a probable neighborhood-based classifier (PNBC) and naive Bayes], these were the top four models, each providing complementary performance, including being the best classifier for different samples. In the analysis presented here, each of the two feature sets (EF and DL) is used both directly and as a nonlinear kernel achieved via a radial basis function (RBF) representation, resulting in four different feature representations, which are sent to the classifier ensemble independently. We also whiten the feature data to be zero mean and unit variance prior to training. The Bayesian classifier ensemble produces 16 classifier outputs, 1 for each feature-classifier pair. These 16 outputs are then sent to the fusion process.
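The bookkeeping behind the 16 outputs, together with the whitening and RBF steps, can be sketched as below. The classifiers themselves are not implemented here. The representation labels, the choice of RBF centers, and the `gamma` width are assumptions made for illustration.

```python
import math
from itertools import product

CLASSIFIERS = ["RMLR", "SMLR", "RVM", "SPR"]
REPRESENTATIONS = ["EF", "EF-RBF", "DL", "DL-RBF"]  # labels are ours
PAIRS = list(product(REPRESENTATIONS, CLASSIFIERS))  # one output per pair

def whiten(rows):
    """Scale each feature column to zero mean and unit variance.

    Constant columns are mapped to zeros rather than dividing by zero.
    """
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def rbf_features(rows, centers, gamma=1.0):
    """Gaussian RBF representation: one feature per center."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(row, center)))
             for center in centers]
            for row in rows]
```

In practice the whitening statistics would be estimated on the training sites and then reused unchanged on the held-out site.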
In assessing the performance of each feature-classifier model individually, we observe that different feature-classifier pairings obtain the highest stand-alone performance for different seabed types and data collection sites (i.e., geographic locations). For instance, the best performing model on data collected from site 1 uses deep learning features (11% improvement over the expert feature models), whereas the best performing model for data from site 2 relies on expert features (16.5% improvement over the deep learning feature models). This motivates an environmentally adaptive approach (such as our seabed-informed classifier fusion algorithm) over conventional, single model approaches.
We investigated the performance variation among different environments by assessing the performance of the SMLR across seabed types using our environmental characterization model on our available data. The eight seabed categories and the number of targets and clutter objects in each type are shown in Fig. 7. The classifier results for the SMLR on targets in these eight seabed categories using the expert features are shown in Fig. 8. The variation in the classifier performance suggests the importance of including the seabed information in the classification process because not all environments are equally difficult for mine-like object recognition. More noteworthy, though, is that the single best performing feature-classifier pair varies among the different seabed types, as it did for the different geographic locations of data collection sites (for full details on how the inferred seabed types are distributed among collection sites, see the results of Brandes and Ballard19). This provides additional motivation for using a fusion process with the 16 class probability outputs from the classifier ensemble to guide the blending of the feature-classifier pairs, along with the seabed characterization, for an optimized target classification.
B. Classifier fusion
We consider two fusion approaches. The first approach uses a generalized Chernoff fusion (GCF),43,44 which is applied to the outputs of the Bayesian classifier ensemble, shown in cyan in Fig. 9. A well-established model, GCF performs a weighted average of distributions in log-space, taking into account information-theoretic dependencies between the models. Unlike a naive Bayes approach to fusion, GCF mitigates overconfidence due to simplifying independence assumptions, yet has a tendency to result in under-confidence of correct decisions. A second fusion approach, which is shown in green in Fig. 9, uses a model-stacking approach to combine all available information from the seabed characterization, detector scores, and classifier ensemble to recognize the mine-like objects. We describe this second approach as seabed-informed fusion. As shown in Fig. 2, the seabed-informed fusion takes all of the available input and forms a second order polynomial kernel. This is then used by a sparse Bayesian classifier (SMLR), our best performing classifier, to provide the final classification for the underwater targets. The polynomial kernel that is sent into the SMLR is formed with observation vectors for each detected object containing elements from the following inputs: 8 seabed mixture proportions, 1 detector score (from ASEG), and 16 classifier outputs. The polynomial kernel expands this set of linear terms to include pairwise products (i.e., interaction terms), squared terms, and a bias term. All of the classifier outputs are whitened (zero mean, unit variance) prior to constructing the kernel.
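Both fusion ideas can be illustrated in a few lines. The first function writes out the second-order expansion explicitly (bias, linear, squared, and pairwise-product terms). The second performs a weighted average of two-class distributions in log-space with renormalization. How GCF derives its weights from the dependencies between models is not reproduced here: the weights are simply taken as given, which is an assumption of this sketch.

```python
import math
from itertools import combinations

def quadratic_expansion(x):
    """Bias, linear, squared, and pairwise-product terms of a feature vector."""
    squares = [v * v for v in x]
    cross = [a * b for a, b in combinations(x, 2)]
    return [1.0] + list(x) + squares + cross

def log_linear_fusion(probs, weights):
    """Weighted log-space average of P(target) values, renormalized.

    probs:   P(target) from each model.
    weights: nonnegative weights summing to 1 (assumed to be given).
    """
    eps = 1e-12  # guards against log(0)
    log_t = sum(w * math.log(max(p, eps)) for p, w in zip(probs, weights))
    log_c = sum(w * math.log(max(1.0 - p, eps)) for p, w in zip(probs, weights))
    m = max(log_t, log_c)  # subtract the max for numerical stability
    t, c = math.exp(log_t - m), math.exp(log_c - m)
    return t / (t + c)

# 8 seabed mixture proportions + 1 detector score + 16 classifier outputs
n_inputs = 8 + 1 + 16
n_terms = len(quadratic_expansion([0.0] * n_inputs))
```

For the 25 inputs listed above, the expansion has 1 + 25 + 25 + 300 = 351 terms, which is the dimensionality seen by the final classifier under this explicit form of the kernel.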
C. Active learning
To enable our LEAPR classifier fusion to be more fully adaptive, we additionally employ active learning,23 where the novel content of the incoming detections is assessed and a human-in-the-loop can provide labels for detected mine-like objects that are sufficiently novel. The classifiers are then retrained to refine the models with this new knowledge. Our active learning approach performs a greedy search over the detections, assessing the degree of new information within each detection, to generate a basis that maximizes the gain in the Fisher information.45,46 Thus, a small subset of detections whose environmental and target-centric signatures are sufficiently novel to the classifier is determined, and this subset is presented to the human analyst for labeling. An active learning strategy can be either sequential or parallel, depending on how the novelty of the detection signature information is assessed. The sequential strategy assesses the novel content within detections in a streaming context, where detections are assessed in the order received, and any detection with a sufficiently high amount of new information can be sent to a human operator for immediate analysis and labeling (e.g., target or clutter). This sequential strategy is, however, not optimized and can lead to an operator providing labels for more detections than are necessary because the most informative samples may arrive late in the data stream.
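One common way to realize a greedy Fisher-information search is sketched below for a logistic-type classifier. It is an illustration under stated assumptions rather than the criterion used in this work: the information matrix is taken to be a ridge term plus the sum of p(1 - p) x xᵀ over the selected detections, the gain is the increase in its log-determinant, and the inverse is updated with the Sherman-Morrison formula. All names are hypothetical.

```python
import math

def greedy_fisher_selection(features, probs, k, ridge=1.0):
    """Greedily pick k detections that most increase log det of the information.

    features: list of feature vectors x_i.
    probs:    current classifier P(target) for each detection.
    """
    d = len(features[0])
    # Inverse of the starting information matrix (ridge * identity).
    inv = [[(1.0 / ridge if i == j else 0.0) for j in range(d)] for i in range(d)]
    chosen, remaining = [], set(range(len(features)))
    for _ in range(min(k, len(features))):
        best = None
        for i in sorted(remaining):
            x, w = features[i], probs[i] * (1.0 - probs[i])
            u = [sum(inv[r][c] * x[c] for c in range(d)) for r in range(d)]
            q = sum(x[r] * u[r] for r in range(d))
            gain = math.log1p(w * q)  # log-determinant increase for sample i
            if best is None or gain > best[1]:
                best = (i, gain, w, u, 1.0 + w * q)
        i, _, w, u, denom = best
        # Sherman-Morrison rank-one update of the inverse information matrix.
        for r in range(d):
            for c in range(d):
                inv[r][c] -= w * u[r] * u[c] / denom
        chosen.append(i)
        remaining.remove(i)
    return chosen
```

Because each pick updates the information matrix, a detection that duplicates an already selected signature contributes less gain than one lying in a new direction, so the selected set avoids redundant queries.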
To address this operational inefficiency, we use an alternative parallel approach in which all of the detections within a collection site are assessed at the same time as a batch and ranked as to their novel information content. The highest-ranked, most novel samples are presented to the operator for labeling. This leads directly to a more optimized approach, requiring fewer labels from a human operator, while maximizing the classifier's knowledge maturation. This batch-approach matches the operational flow of our environmental characterization algorithm in which each data collection is processed as a batch to assess the seabed categories and new seabed type discovery. The active learning sample set is defined by selecting those signatures that are most representative of the measured data from the site of interest, using fundamental information-theoretic considerations. The new seabed types are more likely to contain new information for the classifier retraining governed by active learning and, thus, batch-processing the active learning samples fits in well with this workflow. For experimental purposes, we used a held-out subset of detections labeled by human analysts as a query set to evaluate active learning. Rather than set a threshold for the novel information content, we look up the labels for a predefined number (e.g., 10, 20, 30, 40, or 50) of detections from the top of a ranked list of detections based on their amount of novel content.
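The difference between the two querying strategies can be seen with a toy example. The novelty scores below are made up for illustration. In the system they would come from the information-theoretic assessment described above.

```python
def sequential_queries(scores, threshold):
    """Streaming rule: query each detection whose novelty exceeds a threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]

def batch_queries(scores, budget):
    """Batch rule: rank the whole site, then query the `budget` most novel."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]

# Hypothetical novelty scores in arrival order; the most novel arrive last.
scores = [0.6, 0.7, 0.2, 0.65, 0.95, 0.9]
streamed = sequential_queries(scores, threshold=0.5)
batched = batch_queries(scores, budget=2)
```

Here the streaming rule asks the operator for five labels before the two most informative detections are reached, whereas the batch rule spends its fixed budget of two labels on those detections directly.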
Along with options on how to select samples to use as new labeled training data, there are choices about what parts of the architecture to retrain. One approach that we evaluated, ensemble-update (EU) active learning, focuses on updating the full ensemble of Bayesian classifiers as the retraining step. In EU active learning, the feature vector used by the ensemble is evaluated for novel information using Fisher information to select new samples to use in retraining.
In the other approach that we explored, fusion-tuning (FT) active learning, only the single SMLR classifier used as the output layer of the fusion process is retrained (described as seabed-informed fusion in Sec. VI B). To assess the novel content, the full input to the fusion process is used, which includes the seabed mixture proportions within each new detection. The FT active learning accesses a wider scope for new information and is much faster to update through retraining because only one classifier is updated rather than a full ensemble of classifiers.
VII. OVERALL SYSTEM PERFORMANCE
To test the overall system performance and compare it against various models, we assessed data collected over 8 sites (see Brandes and Ballard19) with a total of 278 targets. We measure the overall system performance as the probability of classification (Pc) vs the FAR, standardized across methods. Because our analysis is not on a publicly available reference set, the actual numbers are not as important as the relative performance among the various methods. Our analysis is performed on a leave-one-site-out basis, where we train on all but one site and hold out the remaining site for testing. The aggregated results are shown for all possible combinations of leaving one site out.
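The leave-one-site-out protocol amounts to the following split generator. The record layout (a list of dictionaries with a `site` key) is a hypothetical choice made for illustration.

```python
def leave_one_site_out(detections):
    """Yield (held_out_site, train, test), holding out each site exactly once.

    detections: list of dicts with a "site" key (hypothetical record layout).
    """
    sites = sorted({d["site"] for d in detections})
    for held_out in sites:
        train = [d for d in detections if d["site"] != held_out]
        test = [d for d in detections if d["site"] == held_out]
        yield held_out, train, test
```

With eight sites this produces eight train/test splits, and the reported curves aggregate the test results across all of them.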
We considered various categories of classification approaches to compare the performance against. We start with the minimum tools and evaluate a Bayesian classifier using traditional features (here, called expert features), described in the beginning of Sec. V. We compare this directly against using only deep learning features and classification with the ResNet model. These approaches are compared with an incremental subset of our model, which uses fusion (GCF; Sec. VI) of the Bayesian classifier ensemble with a combination of the deep neural net features from our SCN and expert features (Sec. V) from our combination of detectors (Sec. IV). We then bring in our EU active learning approach, where only the Bayesian ensemble is updated with the new labeled samples. Finally, the FT active learning model is evaluated, where we include the seabed characterization (Sec. III) directly within the information used to assess the active learning samples and apply retraining only with the seabed-informed fusion step.
The system comparisons of the mine-like-object recognition performance among the various methods are shown in Fig. 10 with selected comparison points provided in Table I. The black and red curves in the plot show the best performing Bayesian classifiers (SMLR and RVM), which are trained with expert-defined features (not including DL features) and serve as a baseline for comparison. GCF (shown by the dotted-cyan curve) provides a similar performance to ResNet (shown by the brown curve) over the baseline classifiers. To encourage an increase in performance with GCF, we prune the lowest 25% performing classifier ensemble models (labeled as GCF pruned and shown as the cyan curve). This provides clear performance gains over both ResNet and baseline classifiers.
| Algorithm | FAR at 0.8Pc | Pc at 0.2FAR |
|---|---|---|
| EF + RVM (Ref. 39) | 0.40 | 0.66 |
| EF + SMLR (Ref. 38) | 0.29 | 0.76 |
| ResNet (Ref. 33) | 0.28 | 0.73 |
| GCF (Ref. 42) | 0.28 | 0.73 |
| LEAPR (EU active learning) | 0.15 | 0.86 |
| LEAPR (FT active learning) | 0.08 | 0.95 |
For the various active learning LEAPR results in Fig. 10, we allow active learning to query 30 labels from the unlabeled data in the new collection site being tested (<5%, on average, of the detections generated per new site). These labels are from the held-out set from a larger ground-truth set labeled by multiple field operators. We experimented with selecting 10, 20, 30, 40, and 50 labeled samples, and the performance gains change little beyond 30 samples with both types of active learning. The EU active learning (green curve) is the second best performer, providing a 40% reduction in the FAR at 0.8Pc and a 15% increase in Pc at 0.2FAR over the GCF pruned approach. Overall, our best performance is achieved with seabed-informed FT active learning (dark blue curve) with a further 46% reduction in FAR at 0.8Pc and a 10% increase in Pc at 0.2FAR over EU active learning. Thus, FT active learning not only provides a faster update than EU active learning but also recognizes mine-like objects on the seafloor more accurately.
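The relative gains of FT over EU active learning can be recomputed from the Table I entries as a quick consistency check. The helper name is ours.

```python
def relative_change(old, new):
    """Fractional change from `old` to `new` (negative means a reduction)."""
    return (new - old) / old

# Table I entries for LEAPR with EU and FT active learning.
far_eu, far_ft = 0.15, 0.08  # FAR at 0.8Pc
pc_eu, pc_ft = 0.86, 0.95    # Pc at 0.2FAR

far_reduction = -relative_change(far_eu, far_ft)
pc_increase = relative_change(pc_eu, pc_ft)
```

This gives a FAR reduction of about 0.47 and a Pc increase of about 0.10, close to the 46% and 10% figures quoted above given that the table values are rounded to two decimals.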
This work details a unified set of ML approaches to best measure and combine the wide range of information that is measurable from SAS imagery to automate underwater mine-like object recognition. Along with outlining this complete process, this work also demonstrates the performance gains of the individual parts within the processing chain. An overall conclusion is that the system performs best when using all of the available information to recognize mine-like objects, including seabed characterization, a fusion of detectors, expert-based features along with deep learning features, a fusion of classifiers, and active learning with fusion fine-tuning as new environments are encountered. Other conclusions at the component level are that fusing multiple detectors outperforms using individual detectors, that including seabed characterization in the classifier fusion greatly improves performance, and that active learning provides a classification performance gain, particularly at low FARs. The gains shown by bringing the seabed context into the fusion process suggest that future work would benefit from also using the seabed information within the detector step. This is particularly the case with the FRED detector and ASEG segmentation approach, in which environmental characterization would allow the foreground/background assessment to be tailored for specific seabed types. Last, this work also suggests that further improvements to automated mine-like object recognition are possible by bringing additional sensors with new information into this process. The system detailed in this work is well suited to expand and accommodate new information and additional ML components as they are developed by the research community.
This work was funded under the Office of Naval Research (ONR) Contract No. N00014-16-C-3051.