Coherent processing in synthetic aperture sonar (SAS) requires platform motion estimation and compensation with sub-wavelength accuracy for high-resolution imaging. Micronavigation, i.e., through-the-sensor platform motion estimation, is essential when positioning information from navigational instruments is unavailable or insufficiently accurate. A machine learning method based on variational Bayesian inference has been proposed for unsupervised, data-driven micronavigation. Herein, the multiple-input multiple-output arrangement of a multi-band SAS system is exploited and combined with a hierarchical variational inference scheme, which self-supervises the learning of platform motion and results in improved micronavigation accuracy.
1. Introduction
Synthetic aperture sonar (SAS) combines coherently the backscattered echoes recorded with an active sonar from a platform moving along a predefined trajectory.1 Coherent processing requires platform motion estimation and compensation with sub-wavelength accuracy to produce high-resolution acoustic imaging of the seafloor.2 Motion estimation with navigational instruments, which are commonly mounted on the SAS platform, is limited by the nominal accuracy of the sensors and, possibly, by interrupted data acquisition.3 In multi-channel systems, the relative ping-to-ping platform motion can be estimated by cross-correlating the signals of overlapping elements between successive pings due to the spatiotemporal coherence of homogeneous reverberation.4,5 This through-the-sensor platform motion estimation, referred to as micronavigation, aims at providing sub-wavelength accuracy for coherent SAS processing.6
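The principle behind correlation-based micronavigation can be illustrated with a minimal one-dimensional sketch (hypothetical numbers; a real system correlates complex basebanded signals across overlapping receiver pairs): the lag of the peak cross-correlation between the signals of overlapping elements at successive pings recovers the relative displacement.

```python
import numpy as np

# Hypothetical illustration of correlation-based displacement estimation:
# homogeneous reverberation seen by overlapping elements at successive pings
# is a shifted copy of itself, so the peak-correlation lag gives the shift.
rng = np.random.default_rng(0)
n = 512
reverb = rng.standard_normal(n)   # diffuse backscatter recorded at ping 1
shift = 7                         # true ping-to-ping shift (samples)
ping2 = np.roll(reverb, shift)    # overlapping element recording at ping 2

# Full cross-correlation; the lag of its peak recovers the displacement.
lags = np.arange(-n + 1, n)
xcorr = np.correlate(ping2, reverb, mode="full")
est = lags[np.argmax(xcorr)]
print(est)  # 7
```

In practice the peak must be located with sub-sample (sub-wavelength) accuracy, which is why interpolation, model fitting, or the learned estimators discussed below are needed.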
Contrary to traditional micronavigation methods, which are based on analytical or numerical coherence models and involve spatiotemporal interpolation and fitting,7–9 a representation learning approach based on variational inference and implemented with a variational autoencoder (VAE) offers a fully data-driven method for platform motion estimation.10 The trained VAE provides immediate ping-to-ping platform translation estimates from coherence measurements on the coarse spatiotemporal acquisition grid determined by overlapping sensors, without further processing. Compared to data-driven autofocusing methods,11 which aim to compensate for phase errors in post-processing after image reconstruction, and hence their algorithmic complexity depends on the size of the image patch, data-driven micronavigation is a pre-processing phase correction, and it is independent of the image size or the beamforming method used for image reconstruction.
This study extends the variational inference scheme for micronavigation introduced in Ref. 10, by relating the coherence data from each subsystem of a multiple-input multiple-output (MIMO) configuration with a hierarchical Bayesian model. MIMO configurations utilize the waveform diversity from multiple transmitters for multi-spectral processing12 or for improving the spatial sampling.13 Herein, a SAS system with a two-dimensional (2D) receiver array and two transmitters is considered for multi-band imaging.14,15 Such a configuration results in two subsystems of virtual monostatic phase centers,5 which can be utilized for coherence measurements. Due to each transmitter's distinctive aperture and transmitted pulse bandwidth, the shape of the coherence function on overlapping elements at successive pings differs between the two subsystems, but the location of the coherence peak is defined by the relative translation of the platform in both cases.
We show that, for such multi-static systems, micronavigation estimates can be fused for better estimation accuracy. Specifically, we introduce a variational inference scheme with two coupled but independently parameterized VAEs that uses the common latent space between the two coherence datasets to learn jointly the corresponding generative features. Such cross-domain learning has been used for data fusion from different modalities of sensory signals16 and unified learning from multi-view images.17 The hierarchical formulation of the variational inference problem transfers the knowledge between the datasets and thus self-supervises the training of coupled VAEs and improves the estimation accuracy.
2. Coherence of diffuse backscatter in multi-band SAS
Fig. 1. (a) Schematic of a multi-static SAS system with a two-dimensional (2D) array of receivers and two transmitters with different aperture sizes and frequency bands at either side of the receiver array. (b) The corresponding PCA virtual sensor configuration at two successive pings along the nominal trajectory (along the x axis). An instance of the 3D spatial coherence of diffuse backscatter, as a function of displacement relative to the transducer size D and the pulse bandwidth, is annotated for each of the virtual arrays.
The matched-filtered response in Eq. (2) assumes orthogonal waveforms, e.g., transmitted pulses that occupy distinct parts of the frequency spectrum, which is the case considered in this study for multi-band imaging; the residual term then vanishes since .
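The orthogonality assumption can be checked numerically. The sketch below (assumed band choices, not the system's actual pulses) correlates two LFM pulses occupying disjoint frequency bands and confirms that the cross-term is small compared to the matched-filter peak.

```python
import numpy as np

# Numeric check of the orthogonality assumption (hypothetical band choices):
# two LFM pulses in disjoint frequency bands are nearly uncorrelated, so the
# cross-term in the matched-filter output is negligible.
fs, T = 200e3, 5e-3                 # sampling rate (Hz), pulse duration (s)
t = np.arange(0, T, 1 / fs)

def lfm(f0, bw):
    # Linear FM sweep from f0 to f0 + bw over the pulse duration T.
    return np.cos(2 * np.pi * (f0 * t + 0.5 * (bw / T) * t ** 2))

p1 = lfm(20e3, 40e3)   # band 1: 20-60 kHz (wider bandwidth)
p2 = lfm(65e3, 20e3)   # band 2: 65-85 kHz, disjoint from band 1

auto = np.abs(np.correlate(p1, p1, mode="full")).max()
cross = np.abs(np.correlate(p1, p2, mode="full")).max()
print(cross / auto)    # ratio well below 1: cross-term is negligible
```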
3. Hierarchical variational inference
The loss function in Eq. (13) maximizes simultaneously the data likelihood for both datasets and couples the latent variables by minimizing the KL divergence between the two approximate posteriors. This coupling, introduced by the hierarchical model in Eq. (10), allows the approximate posterior of the first model to progressively supervise the training of the approximate posterior of the second model by labeling its prior, improving simultaneously the estimation accuracy of both models compared to the unsupervised case.
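The coupling term is a KL divergence between the two approximate posteriors; assuming diagonal Gaussian posteriors, it has the closed form sketched below (function and variable names are illustrative, not the paper's implementation).

```python
import numpy as np

# Sketch of the posterior-coupling term (assumed diagonal Gaussian
# posteriors; names are illustrative, not the paper's implementation).
def kl_diag_gauss(mu1, var1, mu2, var2):
    # KL( N(mu1, var1) || N(mu2, var2) ) for diagonal Gaussians,
    # summed over latent dimensions.
    return 0.5 * np.sum(np.log(var2 / var1)
                        + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# Identical posteriors incur zero coupling penalty.
mu, var = np.zeros(15), np.ones(15)
print(kl_diag_gauss(mu, var, mu, var))          # 0.0

# Mismatched means are penalized, pulling the two posteriors together.
print(kl_diag_gauss(mu + 0.5, var, mu, var))    # 1.875 (= 15 * 0.5**2 / 2)
```

Minimizing this term drives the two encoders toward agreement on the shared latent features, which is the mechanism behind the self-supervision described above.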
4. Coupled VAEs
The coupled model for hierarchical variational inference of the platform motion from coherence measurements with the multi-static system in Fig. 1 is implemented with two VAEs, depicted in Fig. 2, independently parameterized and trained with correlated datasets. Since this study builds upon the work in Ref. 10, we employ the same neural network architecture for the encoder and the decoder for each VAE.
The coherence data samples are simulated with 3D Gaussian functions, as derived in Sec. 2 for rectangular sensor apertures and LFM waveforms, on a 3D spatial grid corresponding to a 12 × 12 grid of adjacent PCA virtual transceivers with 1.5 cm spacing and a temporal window of 60 samples, which is transformed into slant-range as . To allow comparison with the unsupervised model introduced in Ref. 10, the data depend on nine generative factors: the x-, y-, and z-location, which determine the 3D position of the Gaussian function and simulate the ping-to-ping translation; the spreads sx, sy, and sz, which determine the width of the Gaussian function in each dimension and simulate the effect of transmitter aperture and pulse bandwidth; and rotation (ψ), scale (α), and noise-floor (ζ), which modulate and scale the Gaussian function and simulate the effect of anisotropic backscatter and noise, such that , where and . The choice of a Gaussian coherence function simplifies the analysis and aids reproducibility. Nevertheless, the models can be trained with other simulated or measured datasets from different array geometries,7,9 with potential impact on performance.
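A minimal generator in this spirit might look as follows (grid sizes are taken from the text; rotation is omitted and the noise modulation simplified, so this is a sketch rather than the exact simulator):

```python
import numpy as np

# Illustrative generator of one 3D Gaussian coherence sample (grid sizes from
# the text; rotation omitted and noise modulation simplified).
def coherence_sample(loc, spread, scale=1.0, noise_floor=0.0, rng=None):
    # 12 x 12 spatial grid of PCA virtual transceivers with 1.5 cm spacing,
    # plus a 60-sample temporal (slant-range) window.
    x = (np.arange(12) - 6) * 0.015
    y = (np.arange(12) - 6) * 0.015
    z = (np.arange(60) - 30) * 0.015
    X, Y, Z = np.meshgrid(x, y, z, indexing="ij")
    g = np.exp(-0.5 * (((X - loc[0]) / spread[0]) ** 2
                       + ((Y - loc[1]) / spread[1]) ** 2
                       + ((Z - loc[2]) / spread[2]) ** 2))
    sample = scale * g + noise_floor
    if rng is not None:
        sample += 0.01 * rng.standard_normal(sample.shape)  # additive noise
    return sample

s = coherence_sample(loc=(0.03, 0.0, 0.0), spread=(0.02, 0.04, 0.04))
print(s.shape)  # (12, 12, 60)
idx = np.unravel_index(np.argmax(s), s.shape)
print(idx)      # (8, 6, 30): peak at the grid point nearest the 3D location
```

The peak location of each sample encodes the ping-to-ping translation, which is the quantity the VAEs must recover from the coarse grid.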
Note that for multi-static SAS systems with several transmitters, the location of the maximum coherence is determined by the relative platform motion between pings and is common to all subsystems. We employ a SAS configuration with two transmitters operating in different frequency bands, assuming that the second transducer has double the aperture size and half the bandwidth of the first transducer (80 and 40 kHz, respectively). Hence, the variance of the Gaussian coherence instances from the second dataset is double that of the first dataset in all dimensions (see Fig. 2). The remaining generative factors are common to both datasets. Data points are generated simultaneously for the two datasets by randomly sampling each generative factor from a Gaussian distribution with mean and variance (see Table 1). The generative factors that are common between datasets are sampled once for each pair of data points.
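The paired sampling scheme can be sketched as follows (the spread statistics are assumed example values; doubling the coherence variance is implemented here as scaling the spread by √2):

```python
import numpy as np

# Illustrative paired sampling of generative factors for the two datasets
# (parameter values are examples in the spirit of Table 1, not exact).
rng = np.random.default_rng(1)

def sample_pair():
    # Common factors sampled once per pair: 3D location, scale, noise-floor.
    common = {
        "loc": rng.normal([0.03, 0.0, 0.0], [0.02, 0.04, 0.04]),
        "scale": rng.normal(0.75, 0.15),
        "noise_floor": rng.normal(0.1, 0.05),
    }
    spread1 = np.abs(rng.normal(0.02, 0.01, size=3))  # assumed spread stats
    factors1 = {**common, "spread": spread1}
    # Dataset II: coherence variance doubled, i.e., spread scaled by sqrt(2).
    factors2 = {**common, "spread": spread1 * np.sqrt(2.0)}
    return factors1, factors2

f1, f2 = sample_pair()
print(np.allclose(f1["loc"], f2["loc"]))                      # True: shared peak
print(np.allclose(f2["spread"], f1["spread"] * np.sqrt(2.0)))  # True
```

Because the peak location is shared while the shapes differ, the two datasets carry correlated but non-identical information, which is what the coupled VAEs exploit.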
Table 1. Generative factors parameterizing the Gaussian functions in the coupled coherence datasets and the parameters of the Gaussian distributions from which they are sampled.
| Generative factor | x-location | y-location | z-location | x-spread sx | y-spread sy | z-spread sz | Rotation ψ | Scale α | Noise-floor ζ |
|---|---|---|---|---|---|---|---|---|---|
| | 0.03 m | 0 m | 0 m | m | m | m | | 0.75 | 0.1 |
| | 0.02 m | 0.04 m | 0.04 m | 0.01 m | 0.01 m | 0.01 m | | 0.15 | 0.05 |
The training of the coupled VAEs by optimizing Eq. (13) consisted of iterations until convergence, i.e., until the change of the loss value between iterations became negligible. At each training iteration, a batch of 1000 data points is used to update the parameters of the encoder and decoder networks. The data likelihood is considered Gaussian with for both datasets [see Eq. (11)]. The regularization parameter is set to β = 25, which offers a good balance between data reconstruction accuracy and a disentangled latent representation; see Ref. 10 for details on tuning the regularization parameter. The latent space dimension, K = 15, is chosen larger than the number of generative factors of the simulated datasets to account for realistic cases, where the number of latent features is not known a priori. Any extra features not present in the data correspond to non-informative latents after training.
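The convergence criterion described above can be expressed as a simple check on the loss history (the tolerance and window are assumed example values):

```python
# Convergence check in the spirit of the stopping rule above: stop when the
# loss change between iterations becomes negligible (tolerance and window
# sizes here are assumed example values).
def converged(loss_history, tol=1e-6, window=10):
    # Require the mean absolute change over the last `window` iterations
    # to fall below `tol`.
    if len(loss_history) < window + 1:
        return False
    recent = loss_history[-(window + 1):]
    deltas = [abs(recent[i + 1] - recent[i]) for i in range(window)]
    return sum(deltas) / window < tol

print(converged([1.0, 0.5, 0.25]))  # False: loss is still decreasing
print(converged([0.1] * 50))        # True: loss has plateaued
```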
Figure 3 summarizes the capacity of the coupled VAEs to learn the latent features that represent their respective datasets. Specifically, the results in Fig. 3(a) refer to the VAE model fed with data points from the first dataset (corresponding to the smaller transmitter with wider bandwidth), referred to as β-VAE I, whereas the results in Fig. 3(b) are associated with the second model fed with data points from the second dataset, referred to as β-VAE II. The covariance matrix of the mean values that parameterize the approximate posterior of the latent variables, inferred by the encoder during training, shows that both VAEs have learned disentangled representations of the data generative factors, as indicated by the vanishing cross-correlations.
Fig. 3. Performance statistics, including the covariance matrix of the approximate posterior mean values and the RMSE, the correlation coefficient, and the error histogram between the actual and the inferred variables from (a) β-VAE I and (b) β-VAE II, corresponding to the independent and dependent models of the coupled architecture, respectively.
β-VAE I and II have learned to represent the data in their corresponding datasets with six and nine generative factors, respectively, as indicated by the number of non-zero diagonal elements, which relate to the variance of the corresponding inferred mean values μi. The number of informative latents for β-VAE I is smaller than that for β-VAE II because the employed spatial grid is too coarse to resolve some parameters of the corresponding dataset, namely the spreads sx and sz and the rotation ψ. Hence, the common learned features correspond to the 3D location, scale, and noise-floor. The remaining features learned by β-VAE II capture the variation in spread and rotation, albeit not very accurately due to the lack of supervision (see Ref. 10 for details). In Fig. 3, the plots of the root mean square error (RMSE) between the latent variables encoding the 3D location of the Gaussian coherence and the corresponding generative factors, as well as the square of the Pearson correlation coefficient, , associated with each pair, quantify the predictive ability of the VAE models. The histograms show the statistics of the error between the actual generative factor and the corresponding inferred latent mean value, , and , after training and provide a statistical description of the inference accuracy. Note that the error variance is smaller for β-VAE II, even though it relates to the dataset corresponding to the transmitter with the larger aperture and narrower bandwidth, because its approximate posterior is supervised by the approximate posterior of β-VAE I in the hierarchical formulation.
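The evaluation metrics of Fig. 3 can be computed as follows (synthetic stand-in data; in the actual evaluation the inferred values are the latent means from the trained encoders):

```python
import numpy as np

# Evaluation metrics in the spirit of Fig. 3 (synthetic stand-in data; the
# real inferred values are the latent means from the trained encoders).
def rmse(actual, inferred):
    return float(np.sqrt(np.mean((actual - inferred) ** 2)))

def r_squared(actual, inferred):
    # Square of the Pearson correlation coefficient between the two series.
    r = np.corrcoef(actual, inferred)[0, 1]
    return float(r ** 2)

rng = np.random.default_rng(2)
actual = rng.normal(0.0, 0.04, size=1000)              # e.g., y-location factor
inferred = actual + rng.normal(0.0, 0.002, size=1000)  # latent mean + error
print(rmse(actual, inferred) < 0.003)    # True: millimetre-scale error
print(r_squared(actual, inferred) > 0.99)  # True: strong linear agreement
```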
Finally, Fig. 4 demonstrates the predictive ability of the trained coupled VAEs on a specific test case, which is the same as in Ref. 10 to allow comparison with the unsupervised model. A predefined translation track over 100 pings is superimposed with an interval of ±1 standard deviation inferred from each of the coupled VAEs. The absolute difference between the actual and the inferred tracks from coherence measurements is less than 2 mm for all translations for both models in the coupled architecture. Coupling the training of the VAEs through a common loss reduces the micronavigation estimation error by up to 10 times compared to the unsupervised case.10
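The track comparison of Fig. 4 amounts to comparing per-ping translation estimates against the known track; a synthetic sketch (assumed error levels, not the VAEs' actual outputs) is given below.

```python
import numpy as np

# Sketch of the track evaluation in the spirit of Fig. 4 (synthetic numbers;
# the real per-ping estimates come from the trained coupled VAEs).
rng = np.random.default_rng(3)
n_pings = 100
true_steps = np.tile([0.03, 0.0, 0.0], (n_pings, 1)) \
    + rng.normal(0.0, 0.005, size=(n_pings, 3))        # true 3D translations
est_steps = true_steps + rng.normal(0.0, 0.0003, size=(n_pings, 3))

# Per-ping absolute estimation error in each dimension (metres).
abs_err = np.abs(est_steps - true_steps)
print(abs_err.max() < 0.002)   # True: below the ~2 mm level reported above
```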
Fig. 4. Ping-to-ping 3D translation trajectory (black solid line) of the platform carrying a SAS system along with the estimated values from an unsupervised (see Ref. 10) and a self-supervised coupled β-VAE (β = 25).
5. Conclusion
Coherent processing in SAS requires platform motion estimation and compensation with sub-wavelength accuracy. Micronavigation aims to infer the ping-to-ping platform displacement from the spatial coherence of diffuse backscatter on redundant recordings between pings. Variational inference offers a fully data-driven method for platform motion estimation from coherence measurements. In this study, we introduce a hierarchical variational model implemented with coupled VAEs to relate the common latent features between datasets of coherence measurements in multi-band MIMO SAS systems. Self-supervising the training of independently parameterized but coupled VAEs significantly improves the accuracy of the micronavigation estimates.
Acknowledgment
This work was performed under Project No. SAC000E04 of the STO-CMRE Programme of Work, funded by the NATO Allied Command Transformation.
Author Declarations
Conflict of Interest
The authors have no conflicts to disclose.
Data Availability
The data that support the findings of this study are available within the article.