Coherent processing in synthetic aperture sonar (SAS) requires platform motion estimation and compensation with sub-wavelength accuracy for high-resolution imaging. Micronavigation, i.e., through-the-sensor platform motion estimation, is essential when positioning information from navigational instruments is absent or insufficiently accurate. A machine learning method based on variational Bayesian inference has been proposed for unsupervised data-driven micronavigation. Herein, the multiple-input multiple-output arrangement of a multi-band SAS system is exploited and combined with a hierarchical variational inference scheme, which self-supervises the learning of platform motion and results in improved micronavigation accuracy.

## 1. Introduction

Synthetic aperture sonar (SAS) combines coherently the backscattered echoes recorded with an active sonar from a platform moving along a predefined trajectory.^{1} Coherent processing requires platform motion estimation and compensation with sub-wavelength accuracy to produce high-resolution acoustic imaging of the seafloor.^{2} Motion estimation with navigational instruments, which are commonly mounted on the SAS platform, is limited by the nominal accuracy of the sensors and, possibly, by interrupted data acquisition.^{3} In multi-channel systems, the relative ping-to-ping platform motion can be estimated by cross-correlating the signals of overlapping elements between successive pings due to the spatiotemporal coherence of homogeneous reverberation.^{4,5} This through-the-sensor platform motion estimation, referred to as micronavigation, aims at providing sub-wavelength accuracy for coherent SAS processing.^{6}

Contrary to traditional micronavigation methods, which are based on analytical or numerical coherence models and involve spatiotemporal interpolation and fitting,^{7–9} a representation learning approach based on variational inference and implemented with a variational autoencoder (VAE) offers a fully data-driven method for platform motion estimation.^{10} The trained VAE provides immediate ping-to-ping platform translation estimates from coherence measurements on the coarse spatiotemporal acquisition grid determined by overlapping sensors, without further processing. Compared to data-driven autofocusing methods,^{11} which compensate for phase errors in post-processing after image reconstruction and whose algorithmic complexity therefore depends on the size of the image patch, data-driven micronavigation is a pre-processing phase correction that is independent of the image size and of the beamforming method used for image reconstruction.

This study extends the variational inference scheme for micronavigation introduced in Ref. 10, by relating the coherence data from each subsystem of a multiple-input multiple-output (MIMO) configuration with a hierarchical Bayesian model. MIMO configurations utilize the waveform diversity from multiple transmitters for multi-spectral processing^{12} or for improving the spatial sampling.^{13} Herein, a SAS system with a two-dimensional (2D) receiver array and two transmitters is considered for multi-band imaging.^{14,15} Such a configuration results in two subsystems of virtual monostatic phase centers,^{5} which can be utilized for coherence measurements. Due to each transmitter's distinctive aperture and transmitted pulse bandwidth, the shape of the coherence function on overlapping elements at successive pings differs between the two subsystems, but the location of the coherence peak is defined by the relative translation of the platform in both cases.

We show that, for such multi-static systems, micronavigation estimates can be fused for better estimation accuracy. Specifically, we introduce a variational inference scheme with two coupled but independently parameterized VAEs that uses the common latent space between the two coherence datasets to learn jointly the corresponding generative features. Such cross-domain learning has been used for data fusion from different modalities of sensory signals^{16} and unified learning from multi-view images.^{17} The hierarchical formulation of the variational inference problem transfers the knowledge between the datasets and thus self-supervises the training of coupled VAEs and improves the estimation accuracy.

## 2. Coherence of diffuse backscatter in multi-band SAS

The phase center approximation (PCA)^{5} replaces each transmitter-receiver pair with a virtual transceiver located at the midpoint between them (see Fig. 1). In MIMO systems, each transmitter $m \in \{1, \dots, M\}$ with spatial aperture $b_{T_m}(\mathbf{r})$ insonifies the seafloor with a pulse, $q_m(f)$, as a function of frequency $f$, referred to as a ping. In any given time frame, the frequency response of the sound pressure recorded at a receiver located at $\mathbf{r}$ is the total backscattered field from all scatterers within the corresponding isochronous volume $V_s$ insonified by the superposition of the $M$ transmitted pulses,

$$p(\mathbf{r}, f) = \int_{V_s} \left[ \sum_{m=1}^{M} q_m(f)\, B_m(\mathbf{k}_s)\, \frac{e^{-j (2\pi f/c)\, 2 \|\mathbf{r}_s - \mathbf{r}_{v_m}\|_2}}{(4\pi \|\mathbf{r}_s - \mathbf{r}_{v_m}\|_2)^2} \right] s(\mathbf{r}_s, f)\, d\mathbf{r}_s, \tag{1}$$

where $s$ is the scattering strength of a scatterer located at $\mathbf{r}_s$, $\mathbf{r}_{v_m}$ is the location of the transceiver on the virtual aperture associated with the $m$th transmitter, and $B_m(\mathbf{k}_s)$ expresses the combined beampattern of each transmitter-receiver pair, where $\mathbf{k}_s = (2\pi f / c)(\mathbf{r}_s - \mathbf{r}_{v_m}) / \|\mathbf{r}_s - \mathbf{r}_{v_m}\|_2$ is the wavenumber vector for sound speed $c$. The information from the transmit diversity is exploited with matched filtering, i.e., by multiplying the recorded signal in Eq. (1) with the complex conjugate (denoted by an overline) of each transmitted pulse,

$$p_m(\mathbf{r}, f) = \bar{q}_m(f)\, p(\mathbf{r}, f) = \|q_m(f)\|_2^2 \int_{V_s} B_m(\mathbf{k}_s)\, \frac{e^{-j (2\pi f/c)\, 2 \|\mathbf{r}_s - \mathbf{r}_{v_m}\|_2}}{(4\pi \|\mathbf{r}_s - \mathbf{r}_{v_m}\|_2)^2}\, s(\mathbf{r}_s, f)\, d\mathbf{r}_s + \mathrm{res}. \tag{2}$$

The matched filtered response in Eq. (2) assumes orthogonal waveforms, e.g., when the transmitted pulses occupy distinct parts of the frequency spectrum, which is the case considered in this study for multi-band imaging, i.e., when the residual term $\mathrm{res} = \bar{q}_m(f) \int_{V_s} \big[\sum_{n=1, n\neq m}^{M} q_n(f)\, B_n(\mathbf{k}_s)\, e^{-j(2\pi f/c)\, 2\|\mathbf{r}_s - \mathbf{r}_{v_n}\|_2} / (4\pi \|\mathbf{r}_s - \mathbf{r}_{v_n}\|_2)^2 \big]\, s(\mathbf{r}_s, f)\, d\mathbf{r}_s$ vanishes since $\bar{q}_m(f)\, q_{n \neq m}(f) = 0$.
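The orthogonality argument is easy to verify numerically. The sketch below uses idealized pulse spectra with uniform magnitude over disjoint frequency bands (the band edges are illustrative assumptions, not values from the paper) and checks that the matched-filter cross term, and hence the residual, vanishes:

```python
import numpy as np

# Frequency grid and two idealized pulse spectra with uniform magnitude
# over disjoint bands (band edges are illustrative, not from the paper).
f = np.linspace(0.0, 300e3, 6001)                      # Hz, 50 Hz spacing
q1 = ((f >= 100e3) & (f <= 180e3)).astype(complex)     # 80 kHz band
q2 = ((f >= 200e3) & (f <= 240e3)).astype(complex)     # 40 kHz band

# Matched-filter cross term conj(q1) * q2 vanishes for disjoint bands,
# so the residual term "res" in Eq. (2) is zero; the auto term survives.
cross = np.conj(q1) * q2
auto = np.conj(q1) * q1

print(np.abs(cross).max(), np.abs(auto).max())  # 0.0 1.0
```

Any pair of spectrally disjoint waveforms behaves the same way; the rectangular spectra here simply make the cancellation exact sample by sample.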

The spatiotemporal coherence of the backscattered field between successive pings is obtained by taking the expectation of the cross-correlation of the matched filtered responses in Eq. (2) over the scattering strength $s$, which is the only random variable in Eq. (3), and considering that all scatterers in the isochronous volume lie on a spherical shell with mean radius $R_s \approx \|\mathbf{r}_s\|_2$ at the far field of the sonar such that $\|\mathbf{r}_s - \mathbf{r}_{v_m}\|_2 \approx \|\mathbf{r}_s\|_2 - \hat{\mathbf{r}}_s \cdot \mathbf{r}_{v_m}$.^{18}

The spatial dependence of the coherence is described by the integral in Eq. (3), which represents the spatial Fourier transform of the power spectral density of the virtual transceiver beampattern modulated by the spatial coherence of the scattering function. For unresolved scatterers, the scattering function is spectrally and spatially uncorrelated, $E_s[s(\mathbf{r}_s, f)\, \bar{s}(\mathbf{r}_s, f)] = \|s\|_2^2$. In this case, the spatial coherence depends only on the autocorrelation function of the virtual transceiver aperture,^{19} $C_{T_m}(\mathbf{r}_{v_m}) = (b_{v_m} \star b_{v_m})[\mathbf{r}]$, where $b_{v_m} = b_{T_m} * b_R$, $b_R$ is the receiver's spatial aperture, $*$ is the convolution operator, and $\star$ is the autocorrelation operator. Considering sensors with rectangular apertures, as in Fig. 1, the resulting autocorrelation function of the virtual transceiver aperture can be well-approximated by a 2D Gaussian function.^{8,20} Including the effect of spatially coherent scattering due to inhomogeneous media or anisotropic backscatter modulates the spatial coherence accordingly:^{19} $C_{T_m}(\mathbf{r}_{v_m}) = (b_{v_m} \star b_{v_m})[\mathbf{r}] * (s \star s)[\mathbf{r}]$.

The temporal dependence of the coherence is obtained through the inverse Fourier transform of Eq. (3), which depends on the power spectral density of the transmitted pulse, $C_{q_m}(\tau) = \int \|q_m(f)\|_2^4\, e^{-j 2\pi f \tau}\, df = (q_m \star q_m)[\tau] * (q_m \star q_m)[\tau]$. For linear frequency modulated (LFM) waveforms with uniform frequency response over their corresponding bandwidth, the temporal coherence is a sinc$^2$ function, which can also be well-approximated by a Gaussian function. Normalizing Eq. (3) with the average power of the corresponding signals, the spatiotemporal correlation coefficient on each virtual sub-array is the product of the normalized spatial and temporal coherence associated with the corresponding transmitter characteristics, with the 2D virtual array spanning the $x$ and $z$ axes and the temporal delay mapped to slant range.
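The Gaussian approximation invoked above can be checked numerically. The sketch below forms a 1D virtual transceiver aperture as the convolution of two rectangular apertures (the aperture sizes are illustrative assumptions, not the paper's), takes its autocorrelation, and compares it with a moment-matched Gaussian:

```python
import numpy as np

def rect(x, width):
    """Unit-height rectangular aperture of the given width."""
    return (np.abs(x) <= width / 2).astype(float)

# Illustrative aperture sizes (not from the paper): 4 cm transmitter, 2 cm receiver.
x = np.linspace(-0.1, 0.1, 2001)
dx = x[1] - x[0]
b_t = rect(x, 0.04)   # transmitter aperture b_Tm
b_r = rect(x, 0.02)   # receiver aperture b_R

# Virtual transceiver aperture: convolution of transmitter and receiver apertures.
b_v = np.convolve(b_t, b_r, mode="same") * dx
# Spatial coherence: autocorrelation of the virtual aperture, normalized to 1 at zero lag.
c = np.correlate(b_v, b_v, mode="same") * dx
c /= c.max()

# Moment-matched Gaussian: match the second moment of the coherence function.
sigma = np.sqrt(np.sum(x**2 * c) / np.sum(c))
g = np.exp(-0.5 * (x / sigma) ** 2)

# Small maximum deviation confirms the Gaussian is a good approximation.
err = np.max(np.abs(c - g))
print(f"max deviation: {err:.3f}")
```

The same argument applies per dimension of the rectangular aperture, which is why a 2D Gaussian approximates the 2D spatial coherence.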

## 3. Hierarchical variational inference

Let $\mathbf{d} \in \mathbb{R}^L$ denote a data point of coherence measurements on a spatiotemporal grid of $L$ voxels, as derived in Sec. 2, and let $\mathbf{u} \in \mathbb{R}^K$ describe the $K$-dimensional vector ($K \ll L$) of the unobserved latent features such that

$$\mathbf{d} = f_\theta(\mathbf{u}), \tag{5}$$

where $f_\theta$ is a non-linear generative model with parameters $\theta$. The non-linear generative model offers flexibility as higher moments of the data are captured in the likelihood, which, in the presence of additive Gaussian noise with zero mean and variance $\sigma^2$, is expressed as the Gaussian distribution $p_\theta(\mathbf{d}\,|\,\mathbf{u}) = \mathcal{N}(\mathbf{d}\,|\,f_\theta(\mathbf{u}), \sigma^2 \mathbf{I})$.

The posterior distribution of the latent variables is intractable for non-linear generative models, as in Eq. (5),^{21} due to the marginal likelihood, which involves the integration of the joint probability distribution $p_\theta(\mathbf{d}, \mathbf{u})$ over the latent variables. To overcome the intractability of the true posterior distribution, variational inference methods approximate the true posterior with a simpler distribution, such as a Gaussian distribution with diagonal covariance matrix, $q_\varphi(\mathbf{u}\,|\,\mathbf{d}) = \mathcal{N}(\mathbf{u}\,|\,\boldsymbol{\mu}, \operatorname{diag}(\boldsymbol{\sigma}^2))$.^{21}

The accuracy of the approximation, $q_\varphi(\mathbf{u}\,|\,\mathbf{d}) \approx p_\theta(\mathbf{u}\,|\,\mathbf{d})$, is quantified with the Kullback–Leibler (KL) divergence, $D_{KL}$, between the approximating and the exact posterior distributions and is optimized through the parameters $\varphi$ and $\theta$ by maximizing the evidence lower bound (ELBO),^{22,23}

$$\max_{\theta, \varphi}\; E_{q_\varphi(\mathbf{u}|\mathbf{d})}\!\left[\log p_\theta(\mathbf{d}\,|\,\mathbf{u})\right] - \beta\, D_{KL}\!\left[\,q_\varphi(\mathbf{u}\,|\,\mathbf{d})\,\|\,p(\mathbf{u})\,\right],$$

where the regularization parameter $\beta$ is introduced to control the balance between the data-fitting term and the KL divergence between the approximate posterior and the prior distribution of the latent variables.^{24} With a prior that promotes independent latent variables, e.g., $p(\mathbf{u}) = \mathcal{N}(\mathbf{u}\,|\,\mathbf{0}, \mathbf{I})$ due to the diagonal covariance, setting $\beta > 1$ favors disentangled representations of the latent variables at the expense of less accurate data reconstruction.^{25} VAEs solve the variational inference problem by implementing the non-linear generative $f_\theta$ and inference $g_\varphi$ models, which parameterize the data likelihood and the approximate posterior distribution, respectively, as deep neural networks with mirrored architecture.^{26}
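As a concrete sketch, the per-model objective described above (negative $\beta$-weighted ELBO with a Gaussian likelihood, a diagonal-Gaussian posterior, and a standard normal prior) can be written in a few lines of numpy; the function names and toy values are ours, not the paper's:

```python
import numpy as np

def kl_diag_gauss_to_std_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latents."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def beta_vae_loss(d, d_recon, mu, log_var, sigma=0.05, beta=25.0):
    """Negative beta-weighted ELBO: Gaussian reconstruction term plus beta * KL."""
    # Gaussian log-likelihood with fixed noise variance sigma^2 (constants dropped).
    recon = 0.5 * np.sum((d - d_recon) ** 2) / sigma**2
    return recon + beta * kl_diag_gauss_to_std_normal(mu, log_var)

# Toy check: perfect reconstruction with a posterior equal to the prior gives zero loss.
d = np.random.default_rng(0).normal(size=64)
assert beta_vae_loss(d, d, np.zeros(15), np.zeros(15)) == 0.0
```

In an actual VAE, `mu`, `log_var`, and `d_recon` would come from the encoder and decoder networks, and the loss would be averaged over a minibatch and minimized with stochastic gradients.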

The loss function in Eq. (13) maximizes simultaneously the data likelihood for both datasets and couples the latent variables by minimizing the KL divergence between the two approximate posteriors. This coupling, introduced by the hierarchical model in Eq. (10), allows the approximate posterior $q_{\varphi_1}$ to progressively supervise the training of the approximate posterior $q_{\varphi_2}$ by labeling its prior, improving simultaneously the estimation accuracy of both models compared to the unsupervised case.
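A minimal numerical sketch of such a coupled objective is given below, assuming diagonal-Gaussian posteriors. This is our illustrative reconstruction, not the paper's exact Eq. (13); in practice the supervising posterior $q_{\varphi_1}$ would be detached from the gradient of the coupling term:

```python
import numpy as np

def kl_diag_gaussians(mu_a, log_var_a, mu_b, log_var_b):
    """Closed-form KL( N(mu_a, diag(exp(log_var_a))) || N(mu_b, diag(exp(log_var_b))) )."""
    var_a, var_b = np.exp(log_var_a), np.exp(log_var_b)
    return 0.5 * np.sum(log_var_b - log_var_a + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0)

def coupled_loss(d1, r1, mu1, lv1, d2, r2, mu2, lv2, sigma=0.05, beta=25.0):
    """Sketch of a coupled objective: two Gaussian reconstruction terms, a standard
    normal prior on the first posterior, and a coupling KL that uses the first
    posterior as a prior label for the second."""
    recon1 = 0.5 * np.sum((d1 - r1) ** 2) / sigma**2
    recon2 = 0.5 * np.sum((d2 - r2) ** 2) / sigma**2
    prior1 = kl_diag_gaussians(mu1, lv1, np.zeros_like(mu1), np.zeros_like(lv1))
    couple = kl_diag_gaussians(mu2, lv2, mu1, lv1)
    return recon1 + recon2 + beta * (prior1 + couple)
```

When the two posteriors agree and both reconstructions are perfect, only the prior term remains, so a standard normal posterior yields zero loss; any disagreement between the posteriors is penalized through the coupling KL.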

## 4. Coupled VAEs

The coupled model for hierarchical variational inference of the platform motion from coherence measurements with the multi-static system in Fig. 1 is implemented with two VAEs, depicted in Fig. 2, independently parameterized and trained with correlated datasets. Since this study builds upon the work in Ref. 10, we employ the same neural network architecture for the encoder and the decoder for each VAE.

The coherence data samples are simulated with 3D Gaussian functions, as derived in Sec. 2 for rectangular sensor apertures and LFM waveforms, on a 3D spatial grid corresponding to a 12 × 12 grid of adjacent PCA virtual transceivers with spacing 1.5 cm and a temporal window of 60 samples, which is transformed into slant range as $y = c\tau/2$. To allow comparison with the unsupervised model introduced in Ref. 10, the data depend on nine generative factors: $\Delta x$, $\Delta y$, and $\Delta z$, which determine the 3D location of the Gaussian function and simulate the ping-to-ping translation; $s_x$, $s_y$, and $s_z$, which determine the spread of the Gaussian function in each dimension and simulate the effect of transmitter aperture and pulse bandwidth; and rotation ($\psi$), scale ($\alpha$), and noise-floor ($\zeta$), which modulate and scale the Gaussian function and simulate the effect of anisotropic backscatter and noise, such that $d = C(x, y, z) = \max\!\left(\alpha \exp\!\left(-\tfrac{1}{2}\left[(x'/s_x)^2 + (y'/s_y)^2 + (z'/s_z)^2\right]\right), \zeta\right)$, where $x' = \cos\psi\,(x - \Delta x) - \sin\psi\,(z - \Delta z)$, $y' = y - \Delta y$, and $z' = \sin\psi\,(x - \Delta x) + \cos\psi\,(z - \Delta z)$. The choice of a Gaussian coherence function simplifies the analysis and aids reproducibility. Nevertheless, the models can be trained with other simulated or measured datasets from different array geometries,^{7,9} with potential impact on performance.
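The generative process above is straightforward to reproduce. The sketch below implements $d = C(x, y, z)$ on the 12 × 12 × 60 grid; the slant-range sampling interval and the `abs` guards on the sampled spreads are our assumptions for illustration:

```python
import numpy as np

def coherence_sample(dx, dy, dz, sx, sy, sz, psi, alpha, zeta):
    """One simulated coherence data point: a rotated 3D Gaussian, scaled by alpha
    and clipped from below at the noise floor zeta."""
    # 12 x 12 spatial grid of virtual transceivers with 1.5 cm spacing (from the
    # paper); the 60-sample slant-range spacing below is an illustrative assumption.
    xs = (np.arange(12) - 5.5) * 0.015
    zs = (np.arange(12) - 5.5) * 0.015
    ys = (np.arange(60) - 29.5) * 0.005
    x, y, z = np.meshgrid(xs, ys, zs, indexing="ij")
    # Rotation by psi in the x-z plane about the translated center.
    xp = np.cos(psi) * (x - dx) - np.sin(psi) * (z - dz)
    yp = y - dy
    zp = np.sin(psi) * (x - dx) + np.cos(psi) * (z - dz)
    g = alpha * np.exp(-0.5 * ((xp / sx) ** 2 + (yp / sy) ** 2 + (zp / sz) ** 2))
    return np.maximum(g, zeta)

# Draw the generative factors from the Gaussian distributions of Table 1 (dataset I);
# abs() keeps the sampled spreads positive, a guard we add for illustration.
rng = np.random.default_rng(seed=1)
d = coherence_sample(dx=rng.normal(0.03, 0.02), dy=rng.normal(0.0, 0.04),
                     dz=rng.normal(0.0, 0.04), sx=abs(rng.normal(0.015, 0.01)),
                     sy=abs(rng.normal(0.01, 0.01)), sz=abs(rng.normal(0.015, 0.01)),
                     psi=np.deg2rad(rng.normal(0.0, 15.0)),
                     alpha=rng.normal(0.75, 0.15), zeta=rng.normal(0.1, 0.05))
print(d.shape)  # (12, 60, 12)
```

Flattening such samples into vectors of $L = 12 \times 60 \times 12$ entries yields the training data points $\mathbf{d}$ of Sec. 3.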

Note that for multi-static SAS systems with several transmitters, the location of the maximum coherence is determined by the relative platform motion between pings and is common to all subsystems. We employ a SAS configuration with two transmitters operating in different frequency bands, assuming that the second transducer has double the aperture size and half the bandwidth of the first transducer (80 and 40 kHz bandwidth, respectively). Hence, the spread of the Gaussian coherence instances in the second dataset is double that in the first dataset in all dimensions (see Fig. 2). The rest of the generative factors are common to both datasets. Data points are generated simultaneously for the two datasets by randomly sampling each generative factor from a Gaussian distribution with mean $\mu_{gf}$ and variance $\sigma_{gf}^2$ (see Table 1). The generative factors that are common between datasets are sampled once for each pair of data points.

Table 1. Mean $\mu_{gf}$ and standard deviation $\sigma_{gf}$ of the Gaussian distributions of the generative factors (paired values denote dataset I/dataset II).

| Generative factor | $x$-location $\Delta x$ | $y$-location $\Delta y$ | $z$-location $\Delta z$ | $x$-spread $s_x$ | $y$-spread $s_y$ | $z$-spread $s_z$ | Rotation $\psi$ | Scale $\alpha$ | Noise-floor $\zeta$ |
|---|---|---|---|---|---|---|---|---|---|
| $\mu_{gf}$ | 0.03 m | 0 m | 0 m | 0.015/0.03 m | 0.01/0.02 m | 0.015/0.03 m | 0° | 0.75 | 0.1 |
| $\sigma_{gf}$ | 0.02 m | 0.04 m | 0.04 m | 0.01 m | 0.01 m | 0.01 m | 15° | 0.15 | 0.05 |


The training of the coupled VAEs by optimizing Eq. (13) consists of $5 \times 10^3$ iterations, sufficient for convergence, i.e., a negligible change of the loss value between iterations. At each training iteration, a batch of 1000 data points is used to update the parameters of the encoder and decoder networks. The data likelihood is considered Gaussian with $\sigma = 0.05$ for both datasets [see Eq. (11)]. The regularization parameter is set to $\beta = 25$, which offers a good balance between data reconstruction accuracy and a disentangled latent representation; see Ref. 10 for details on regularization parameter tuning. The latent space dimension, $K = 15$, is chosen larger than the number of generative factors of the simulated datasets to account for realistic cases, where the number of latent features is not known *a priori*. Any extra features not present in the data correspond to non-informative latents after training.

Figure 3 summarizes the capacity of the coupled VAEs to learn the latent features that represent their respective datasets. Specifically, the results in Fig. 3(a) refer to the VAE model fed with data points from the first dataset (corresponding to the smaller transmitter with the wider bandwidth), referred to as *β*-VAE I, whereas the results in Fig. 3(b) are associated with the second model fed with data points from the second dataset, referred to as *β*-VAE II. The covariance matrix of the mean values $\mu$ that parameterize the approximate posterior of the latent variables inferred by the encoder during training indicates that both VAEs have learned disentangled representations of the data generative factors, evidenced by the diminishing cross-correlations.

*β*-VAE I and II have learned to represent the data in their corresponding datasets with six and nine generative factors, respectively, indicated by the number of diagonal elements, i.e., the variances of the corresponding inferred mean values $\mu_i$, that differ from zero. The number of informative latents for VAE I is smaller than that for VAE II because the employed spatial grid is too coarse to resolve some parameters of the corresponding dataset, namely the spreads $s_x$ and $s_z$ and the rotation $\psi$. Hence, the common features learned correspond to the 3D location, scale, and noise-floor. The remaining learned features of VAE II capture the variation in spread and rotation, but not very accurately due to the lack of supervision (see Ref. 10 for details). In Fig. 3, the plots showing the root mean square error (RMSE) between the latent variables encoding the 3D location of the Gaussian coherence and the corresponding generative factors, as well as the square of the Pearson correlation coefficient, $\rho^2$, associated with each pair, quantify the predictive ability of the VAE models. The histograms show the statistics of the error between the actual generative factor and the corresponding inferred latent mean value, $\Delta x - \mu_x$, $\Delta y - \mu_y$, and $\Delta z - \mu_z$, after training and provide a statistical description of the inference accuracy. Note that the error variance is smaller for VAE II, even though it relates to the dataset corresponding to the transmitter with the larger aperture and narrower bandwidth, as its approximate posterior is supervised by the approximate posterior of VAE I in the hierarchical formulation.

Finally, Fig. 4 demonstrates the predictive ability of the trained coupled VAEs on a specific test case, which is the same as in Ref. 10 to allow comparison with the unsupervised model. A predefined translation track over 100 pings is superimposed with an interval of ±1 standard deviation inferred from each of the coupled VAEs. The absolute difference between the actual and the inferred tracks from coherence measurements is less than 2 mm for all translations for both models in the coupled architecture. Coupling the training of the VAEs through a common loss reduces the micronavigation estimation error by up to a factor of 10 compared to the unsupervised case.^{10}

## 5. Conclusion

Coherent processing in SAS requires platform motion estimation and compensation with sub-wavelength accuracy. Micronavigation aims to infer the ping-to-ping platform displacement from the spatial coherence of diffuse backscatter on redundant recordings between pings. Variational inference offers a fully data-driven method for platform motion estimation from coherence measurements. In this study, we introduce a hierarchical variational model implemented with coupled VAEs to relate the common latent features between datasets of coherence measurements in multi-band MIMO SAS systems. Self-supervising the training process of independently parameterized but coupled VAEs significantly improves the accuracy of the micronavigation estimates.

## Acknowledgment

This work was performed under Project No. SAC000E04 of the STO-CMRE Programme of Work, funded by the NATO Allied Command Transformation.

## Author Declarations

### Conflict of Interest

The authors have no conflicts to disclose.

## Data Availability

The data that support the findings of this study are available within the article.

## REFERENCES

J. W. Goodman, *Introduction to Fourier Optics*.

K. P. Murphy, *Machine Learning: A Probabilistic Perspective* (MIT Press, Cambridge, MA, 2012).

I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "*β*-VAE: Learning basic visual concepts with a constrained variational framework," in *Proceedings of the International Conference on Learning Representations* (2017).