Conventional direction-of-arrival (DOA) estimation algorithms for shallow water environments typically suffer large errors due to the many acoustic reflective surfaces and scattering fields present. The magnitude and DOA of an acoustic signature can be estimated from data collected by a single acoustic vector sensor; as such, DOA algorithms are used to reduce the error in these estimates. Three experiments were conducted using a moving boat as an acoustic target in a waterway in Houghton, Michigan. The shallow and narrow waterway is a complex and non-linear environment for DOA estimation. This paper compares conventional and machine learning algorithms for minimizing DOA error. The conventional algorithm uses frequency-masking averaging, and the machine learning algorithms incorporate two recurrent neural network architectures, one shallow and one deep. Results show that the deep neural network models the shallow water environment better than the shallow neural network, and both networks outperform the frequency-masking average method.

## I. INTRODUCTION

Source direction-of-arrival (DOA) estimation in shallow water has seen strong advancements for applied water acoustics in the past decade with success specifically in machine learning (Niu *et al.*, 2017a; Niu *et al.*, 2017b; Wang and Peng, 2018). It is of interest to determine the location of anthropogenic sources for many applications: naval operations, merchant shipping, and environmental studies, to name a few. Using neural networks to estimate the DOA of an underwater acoustic source is of recent interest, including the use of multi-layer perceptron (MLP) networks (Ozanich *et al.*, 2020; Yangzhou *et al.*, 2019; Zou *et al.*, 2017), convolutional neural networks (CNNs) (Cao *et al.*, 2021; Ferguson *et al.*, 2019), and recurrent neural networks (RNNs) (Huang *et al.*, 2019; Qin *et al.*, 2020).

This paper discusses conventional and machine learning methods of improving surface-water angle-finding utilizing a single underwater acoustic vector sensor (AVS). Generally, multiple sensors working together are required to find the angle-of-arrival of a signal source (Huang *et al.*, 2018; Trees, 2002; Yangzhou *et al.*, 2019). A pressure-particle acceleration (*pa*) AVS is capable of determining the angle-of-arrival with a triaxial piezoelectric accelerometer in a neutrally buoyant body. The triaxial accelerometer in the AVS generates a vector quantity of the DOA of the acoustic wave (Bereketli *et al.*, 2015; Fahy, 1995; Kang *et al.*, 2004). There are different types of AVSs: pressure-particle velocity (*pu*), *pa*, pressure-pressure (*pp*), and particle velocity-particle velocity (*uu*); all have their advantages and disadvantages. This paper solely discusses angle-finding utilizing a Meggitt VS-209 (Wilcoxon Sensing Technologies, Frederick, MD) underwater *pa* AVS for its broader frequency response, though the methods described here would generalize to any AVS.

We will investigate a shallow RNN architecture and a deep RNN architecture as the machine learning algorithms in the paper. The parameters, such as the inner node lengths and depth of the network, were tested and compared for accuracy. The best models we found with our data are shown in Sec. IV.

## II. MATERIALS AND METHODS

### A. Acoustic vector sensor

The Meggitt VS-209 AVS consists of a hydrophone and a triaxial accelerometer whose *x*, *y*, and *z* axes are oriented, as shown in Fig. 1, with respect to the physical sensor's body. The underwater *pa*-type AVS records the particle acceleration in three orthogonal axes together with a scalar underwater sound pressure measurement. The particle acceleration and sound pressure are combined to produce a sound intensity vector, where the intensity vector contains the strength and angle-of-arrival of all the incident wavefronts.

### B. Acoustic post-processing

The estimation techniques in this paper require some post-processing of the AVS data. Let $a_x(t)$, $a_y(t)$, and $a_z(t)$ be the three components of the time-domain accelerometer data, and let $p(t)$ be the pressure time series from the underwater *pa* AVS. To account for sensor bandwidth and noise, the sensor measurements are first projected into the frequency domain, where $A_x(\omega) = \mathcal{F}\{a_x(t)\}$ is the Fourier transform of $a_x(t)$, and likewise for each remaining component of the sensor data. Since we are concerned with a moving acoustic source, a short-time Fourier transform (STFT) captures its time dependence. Using the STFT, we compute $A_x, A_y, A_z, P \in \mathbb{C}^{N \times T}$ for the respective three time-domain accelerometer channels and the hydrophone channel, where $N$ is the block-size of the STFT and $T$ is the number of time-series samples divided by the block-size, rounded down. Equations (1) and (2) are computed along each axis, with only the *x* axis shown for brevity. The measurements are composed into the cross power spectra via

$$G_{A_x P} = A_x^* \odot P, \tag{1}$$

where $A_x^*$ is the complex conjugate of the frequency-domain accelerometer data in the *x* axis direction, $P$ is the pressure spectrum, and $\odot$ denotes the element-wise product. With the cross power spectra, $G_{A_x P} \in \mathbb{C}^{N \times T}$, the acoustic intensity is computed as

$$I_x = \frac{1}{2\omega}\,\operatorname{Im}\{G_{A_x P}\}, \tag{2}$$

where $I_x \in \mathbb{R}^{N \times T}$ are the active intensity levels in the *x* axis direction and $\omega$ is the angular frequency of each bin. The intensities are computed for all three axes, i.e., the *x*, *y*, and *z* directions corresponding to the three-axis accelerometer. With the three AVS-relative intensity components, an intensity vector, $I_r = (I_x, I_y, I_z)^T \in \mathbb{R}^{3 \times N \times T}$, can be composed. The intensity vector is relative to the orientation of the AVS as shown in Fig. 1.
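As a concrete illustration of Eqs. (1) and (2), the following minimal sketch computes one axis of the active intensity with `scipy.signal.stft`. The function name is ours, and the $1/(2\omega)$ scaling and sign follow one common convention for *pa*-type intensity processing; they are assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.signal import stft

def acoustic_intensity(a, p, fs, nperseg=1706):
    """Active intensity along one accelerometer axis.

    a: time-domain acceleration for one axis, p: pressure time series,
    fs: sample rate in Hz, nperseg: STFT block-size in samples.
    Returns an (N, T) array of per-bin, per-block intensity values.
    """
    _f, _t, A = stft(a, fs=fs, nperseg=nperseg)
    _, _, P = stft(p, fs=fs, nperseg=nperseg)
    G = np.conj(A) * P                                 # cross power spectrum, Eq. (1)
    # Angular frequency per bin; clamp the DC bin to avoid dividing by zero.
    omega = 2 * np.pi * np.maximum(_f, _f[1])[:, None]
    return np.imag(G) / (2 * omega)                    # active intensity, Eq. (2)
```

The same call is repeated for $a_y$ and $a_z$ to build the stacked intensity vector $I_r$.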

The Meggitt VS-209 AVS has a magnetic heading sensor and a gravitational sensor to remove any relative orientation in data collection. The pitch, roll, and heading are the respective rotations about the *x*, *y*, and *z* axes in Fig. 1. A rotation matrix, $Q_{\text{fixed}}$, is calculated from the magnetic and gravitational sensors (Penhale, 2019), such that

$$I_g = Q_{\text{fixed}}\, I_r. \tag{3}$$

After the rotation, the intensity vector $Ig$ is no longer oriented with respect to the sensor's orientation; instead, it is oriented relative to magnetic north and the gravity vector. We call this a global coordinate system, and global angle measurements are now considered for localization.
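Applying the fixed rotation matrix $Q_{\text{fixed}}$ to the full $3 \times N \times T$ intensity stack reduces to a single tensor contraction. A minimal sketch (the helper name is ours):

```python
import numpy as np

def to_global(Q, Ir):
    """Rotate the sensor-frame intensity stack Ir (shape (3, N, T)) into
    the global frame with a fixed 3x3 rotation matrix Q, applied to the
    3-vector at every frequency bin and time step: Ig[:, n, t] = Q @ Ir[:, n, t].
    """
    return np.einsum('ij,jnt->int', Q, Ir)
```

For example, a 90° rotation about the vertical axis maps intensity on the sensor's *x* axis onto the global *y* axis, leaving the vertical component unchanged.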

The re-oriented intensity vector, $I_g = (I_{\text{west}}, I_{\text{north}}, I_{\text{up}})^T$, is then converted to a spherical coordinate system with

$$\Theta = \operatorname{atan2}\left(I_{\text{west}}, I_{\text{north}}\right), \tag{4}$$

where $\Theta \in \mathbb{R}^{N \times T}$ holds the azimuth angle of each frequency bin at each time step on the interval $[-180°, 180°]$.

TABLE I. Acoustic post-processing parameters.

| Parameter | Value |
|---|---|
| Sample rate | 17 067 Hz |
| STFT block-size | 1706 samples |
| STFT zero padding | 1024 samples |
| Noise gate threshold | −40 dB (re 1 pW/m²) |
| Frequency range | 100–8000 Hz |


In the experiments in this paper, all signal sources are assumed to be on the surface of the water; hence, we only need to estimate the azimuth angle Θ from the AVS signals. Also, note that this paper focuses on DOA estimation, so range is not of interest. To determine the estimated azimuth angle, $\theta^*$, of the signal source in our experiment, Θ must be processed along its frequency axis into a single angle prediction at each time step, such that

$$\theta_t^* = g\left(\Theta_{1,t}, \Theta_{2,t}, \ldots, \Theta_{N,t}\right), \tag{5}$$

where $g(\cdot)$ is the estimator that combines the $N$ per-frequency angle measurements at time step $t$.

To process Θ in a machine learning approach, a linear regression—i.e., single-layer perceptron (SLP) network—can be trained to output $\theta *$ using the input Θ. Comparatively, a conventional approach can average Θ along its frequency axis to generate a $\theta *$ angle prediction.

After processing Θ to estimate $\theta *$, time-series filtering can be performed to smooth out the effect of noise and outliers to generate more realistic results. Considering machine learning, our hypothesis is that a RNN architecture can be trained to output a better estimate of $\theta *$ than conventional averaging, enhancing the localization performance of the AVS.

### C. Weighted average

We use a weighted average with our experimental data to demonstrate a conventional approach for combining the predicted DOA of an acoustic signal from an AVS. For each frequency component in the AVS signal, there is an angle measurement Θ and an intensity measurement $|I|$. The intensity measurement is directly proportional to the SNR; hence, the intensity is used as a weight for the angle measurement. The sample-based average of the weighted angles is the estimated $\theta^*$. It follows that

$$\theta_t^* = \frac{\sum_{i=1}^{N} |I(f_i)|\, \Theta(f_i)}{\sum_{i=1}^{N} |I(f_i)|}, \tag{6}$$

with the intensities, $I$, in dB scale normalized on the interval $[0,1]$, and each $f_i$ term corresponding to a frequency bin for $i = 1, 2, \ldots, N$. This estimate gives more weight to an angle that has a stronger corresponding intensity, under the assumption that the strongest signal emanates from the direct path of the source to be localized. This approach works well with high SNR measurements (Bereketli *et al.*, 2015), though the results deteriorate appreciably with band limited, low SNR responses, as demonstrated in Sec. V. When the acoustic source generates a strong signal, the acoustic intensity, $I$, at that point dominates the weighted average, while a weak signal will vary greatly depending upon the noise. To address this degraded performance with low SNR measurements, we next explore use of a SLP as an alternative approach to estimate DOA.
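A minimal sketch of this intensity-weighted average follows; the helper name and the exact min–max normalization of the dB intensities onto $[0, 1]$ are assumptions.

```python
import numpy as np

def weighted_average_doa(theta, intensity_db):
    """Intensity-weighted mean of per-bin angle estimates.

    theta: per-frequency-bin angles in degrees, shape (N,).
    intensity_db: matching per-bin intensities in dB, shape (N,).
    """
    # Normalize dB intensities onto [0, 1] (min-max; an assumed scheme).
    w = intensity_db - intensity_db.min()
    w = w / w.max() if w.max() > 0 else np.ones_like(w)
    return np.sum(w * theta) / np.sum(w)
```

Note that a plain arithmetic mean of angles ignores the ±180° wrap; this is one reason the estimate degrades for noisy, spread-out angle measurements.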

### D. SLP network

While the weighted average is a reasonable approach for processing the AVS measurements into a predicted DOA, there are numerous sources of error that it does not take into account. The source may be a band limited signal and thus only be present in certain frequencies; there may be signal outside these bands that emanates from other sources, say marine mammals, other underwater activity, or noise. Hence, to implicitly learn the best relationship between the AVS measurements, $|I|$ and $\theta$, we employ machine learning, specifically a neural network. For this experiment, we use a SLP network regression to process the frequency domain of the signal. The SLP network processes the frequency domain angle measurements by

$$\theta_t^* = \sum_{f=1}^{N} w_f\, \Theta_{f,t} + b, \tag{7}$$

where $w_f$ is a vector of weights, one for each frequency bin of Θ at time step $t$, and $b$ is a scalar bias. In essence, if $w_f = 1/N\ \forall f$, where $N$ is the number of frequency bins, and $b = 0$, then the neural network would estimate a non-weighted average of the angle measurements across the frequency axis. To create a weighted average, the neural network learns $w_f$ and $b$ such that they minimize the error, $E$, with respect to the root-mean-square error (RMSE),

$$E = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left(\theta_t^* - \theta_t^{\text{true}}\right)^2}, \tag{8}$$

where $\theta_t^{\text{true}}$ is the true angle measurement (or label), and the neural network predicts $\theta_t^*$ at each time step $t$.

Since the AVS is the source of the angle measurements, the neural network must minimize a modified RMSE that considers the AVS's polar nature. The angle measurements for the noise source are wrapped to the $-180°$ to $180°$ range, so a circular RMSE, where the error is the smaller difference between two angles, is necessary. This is important because a prediction at $-179°$ with a true angle at $179°$ should have an angle difference of $2°$. A standard RMSE would have an angle difference of $358°$, overly penalizing this small error. The circular error that the neural network incorporates is

$$E = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \min\left(d_t,\; 360° - d_t\right)^2}, \tag{9}$$

where $d_t = \|\theta_t^* - \theta_t^{\text{true}}\|_1$ is the absolute difference of the predicted angle and the truth angle at each time step $t$. The SLP processes the AVS measurements in a linear fashion [see Eq. (7)]; hence, this algorithm may be unable to capture non-linearity present in the system. Thus, we next describe a neural network architecture that can better model non-linearities.
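The circular error of Eq. (9) can be sketched in a few lines; the helper name is ours, and the min-wrap form is the standard realization of the angle difference described above.

```python
import numpy as np

def circular_rmse(pred, true):
    """Circular RMSE in degrees: the absolute angle difference is wrapped
    so that no error exceeds 180 degrees before squaring and averaging."""
    d = np.abs(pred - true)
    d = np.minimum(d, 360.0 - d)   # take the shorter way around the circle
    return np.sqrt(np.mean(d ** 2))
```

For example, a prediction of −179° against a truth of 179° yields an error of 2°, not 358°.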

### E. RNN

The SLP network is useful in determining the frequencies at which a band limited signal is present; the learned weights $w_f$ in Eq. (7) show how the SLP weights the measurements at each frequency. On the other hand, the SLP architecture does not handle time-dependent parameters or non-linearity in the environment. Following Eq. (7), the SLP estimate at each time step, $t$, is calculated independently of the others. However, a RNN considers the current and previous samples (Connor and Atlas, 1991). Thus, a RNN is better able to handle temporal aspects of the signal, creating a time-dependence in its predictions by looking at previous samples. We use a conventional form of a RNN, a fully recurrent neural network with no gates, as the simplest neural network model. A fully recurrent neural network predicts from its current sample and $n$ previous samples via the recurrence

$$s_k = \tanh\left(\mathbf{w}\, \theta_k + \mathbf{h}\, s_{k-1}\right), \tag{10}$$

where $\mathbf{w}$ and $\mathbf{h}$ are trainable parameters and $s_k$ is the hidden state. Equation (10) is repeated $n$ times, once for each $\theta_k$ with $k = t-n, \ldots, t$, until the final prediction

$$\theta_t^* = \mathbf{w}\, s_t + \mathbf{b}, \tag{11}$$

where $\mathbf{b}$ is also a trainable bias parameter. There is an inherent issue with fully RNN architectures, where $\mathbf{w}$ is back-propagated $n$ times during training. The issue arises with values significantly greater than 1 or significantly less than 1, causing very large or near-zero gradients, respectively (Bengio *et al.*, 1994). For example, with $n = 20$ and $w = 1.4$, the gradient would grow by a factor of $1.4^{20} \approx 837$. A SLP is used to reduce the dimensionality of the RNN backbone, and a small $n$ value is used to prevent gradient descent from failing due to this issue. The weights in the RNN—$\mathbf{w}$, $\mathbf{h}$, and $\mathbf{b}$—are learned using the truncated backpropagation through time (TBPTT) algorithm (Werbos, 1990) to minimize $E$ in Eq. (9).

The output of a RNN is either multi-input, multi-output (MIMO) or multi-input, single-output (MISO), shown in Fig. 2. In this paper, the MIMO-type RNN is used for internal layers. With the output of the MIMO-type RNN having the same vector length as the input, the internal layers can be connected multiple times, permitting use of a deep neural network (DNN) architecture. The MISO-type RNN is used for the final prediction layer so that a single prediction is made, $\theta *$. The MISO-type RNN is useful for predicting a single angle measurement based on the previous *n* samples. Both the SLP network and the RNN network can be combined such that the output of one network is the input of another. Now that we have described the basis of the three main algorithms we will use for predicting DOA, we turn to our experiments.

## III. EXPERIMENTS

To record angle data, we staged collections from three events on the Keweenaw Portage Waterway in Houghton, Michigan, on July 14, July 27, and August 18, 2020. Figure 3 shows the location of the Keweenaw Portage Waterway in Michigan. The events consisted of driving a boat near the AVS while recording the boat's GPS position at a 1 Hz sample rate. The three experiments total roughly 79 min of GPS and acoustic data. A bathymetric cross section and measured sound speed profile are shown in Sec. VI.

The sensor data were recorded using a data acquisition (DAQ) unit, National Instruments (NI) cRIO-9035, which has eight slots for NI C-series modules. The C-series modules used in this setup were two NI-9234 analog-to-digital converters (ADCs) for reading the acoustic data, one NI-9467 GPS receiver for timing and location, and one NI-9344 switch module for system-related control. The NI-9234 ADC has 24-bit precision and stored each data point as a 32-bit, single-precision floating point number. The acoustic data collected on the cRIO-9035 were sampled at 17.067 kHz and chunked into 4-min intervals. These intervals are continuous, meaning that there are no missing data between each 4-min interval. The 17.067 kHz sample rate was used since it is the closest discrete rate offered by the NI-9234 module above twice the Meggitt VS-209 *pa* AVS's 3-dB frequency cutoff of 7 kHz.

The post-processing of these data, described in Table I, converts the 17.067 kHz sampled data into 1023 frequency bins at a block-size of 0.1 s using the STFT. The four AVS channels are used to generate Θ in Eq. (4). Since the GPS data were recorded at 1 Hz, we linearly interpolated between GPS measurements to match the time interval at which the AVS data were post-processed. Figure 4 shows the 1 Hz rate at which the GPS locations were mapped onto the Keweenaw Portage Waterway.
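The label alignment described above, mapping 1 Hz GPS fixes onto the 0.1 s STFT grid, can be sketched with `np.interp`; the helper name and argument layout are ours.

```python
import numpy as np

def interp_gps(t_avs, t_gps, lat, lon):
    """Linearly interpolate 1 Hz GPS fixes (t_gps, lat, lon) onto the
    denser AVS post-processing time grid t_avs (seconds)."""
    return np.interp(t_avs, t_gps, lat), np.interp(t_avs, t_gps, lon)
```

Interpolated positions are then converted to bearing-from-sensor truth angles for training.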

## IV. ARCHITECTURES

Table II shows the parameters used within the two compared RNN architectures, and Table III shows the layer structures, which are illustrated in Fig. 5. The optimizer used is stochastic gradient descent (SGD) with a learning rate of 0.01. No activation function is used on the output layer of the neural network to prevent any skewing of the angle measurement data. The experimental data are split between training and testing 20 times, so that 20 different models are generated per neural network architecture to test on every portion of the data set in a cross-fold validation setup. Within a single data split, 5% of the training data is used as validation data to identify the lowest-error model on the training set. The neural network then predicts the test data using the lowest-validation-error model along each fold of the data split. To generate the network architectures, we use the Keras open-source library for its simple modularity and ease of use. Since Keras is written in Python, the AVS post-processing in Sec. II B is also written in Python.

TABLE II. Neural network training parameters.

| Parameter | Value |
|---|---|
| SLP activation | None |
| RNN activation | tanh |
| RNN lookback | 5 steps |
| Epochs | 20 |
| Train/validation/test split | 90%/5%/5% |
| Optimizer | SGD |
| Learning rate | 0.01 |


TABLE III. Layer dimensions of the deep and shallow RNN architectures.

| Layer type | Deep RNN dimensions | Shallow RNN dimensions |
|---|---|---|
| SLP | $\mathbb{R}^{1023 \times 1}$ | $\mathbb{R}^{1023 \times 1}$ |
| RNN | $\mathbb{R}^{1 \times 32}$ | $\mathbb{R}^{1 \times 1}$ |
| RNN | $\mathbb{R}^{32 \times 32}$ | — |
| RNN | $\mathbb{R}^{32 \times 32}$ | — |
| SLP | $\mathbb{R}^{32 \times 1}$ | $\mathbb{R}^{1 \times 1}$ |

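As an illustration, a minimal Keras sketch of the deep RNN in Table III might look as follows. The `TimeDistributed` wrapper, the exact layer wiring, and the use of a standard MSE loss (as a stand-in for the circular error of Eq. (9)) are assumptions; the paper specifies only the layer dimensions, the tanh RNN activation, the 5-step lookback, the SGD optimizer, and the linear output.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_deep_rnn(lookback=5, n_bins=1023):
    """Assumed Keras realization of the deep RNN in Table III: a per-step
    SLP collapses the 1023 frequency bins to one value, three SimpleRNN
    layers model the temporal dynamics, and a final linear layer (no
    activation) emits the single angle estimate (MISO output)."""
    model = keras.Sequential([
        layers.Input(shape=(lookback, n_bins)),
        layers.TimeDistributed(layers.Dense(1)),         # SLP: 1023 -> 1 per step
        layers.SimpleRNN(32, activation='tanh', return_sequences=True),
        layers.SimpleRNN(32, activation='tanh', return_sequences=True),
        layers.SimpleRNN(32, activation='tanh'),          # last step only (MISO)
        layers.Dense(1),                                  # linear output layer
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
                  loss='mse')   # stand-in for the circular loss
    return model
```

The shallow RNN replaces the three 32-unit layers with a single one-unit SimpleRNN, per the right column of Table III.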

## V. RESULTS

All results in this section only use the test data defined per model described in Sec. IV. Once the networks have been trained on the experiment training data, the networks are compared with one another. The RMSEs of the test data follow Eq. (9) and are shown in Table IV.

TABLE IV. RMSE and standard deviation (SD) of each algorithm on the test data.

| | Weighted average | Shallow RNN | Deep RNN |
|---|---|---|---|
| RMSE | $39.4°$ | $33.5°$ | $24.8°$ |
| SD | $45.3°$ | $22.4°$ | $13.8°$ |


Each neural network has its test data folded 20 times and averaged to yield a RMSE and standard deviation (SD). The time-series predictions of the different algorithms are compared to the total testing truth data in Fig. 6, with Fig. 6(b) using a Kalman filter applied to the output of each algorithm. The covariance of the process noise ($Q = 10^{-6}$) and the covariance of the observation noise ($R = 0.025$) are chosen empirically to show the differences between each algorithm along a larger portion of the data set. It should be noted that no results other than Fig. 6(b) use these filtered data; every other figure, table, result, and discussion uses the original algorithm data.
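A scalar random-walk Kalman filter with these covariances can be sketched as follows; the helper name is ours, and angle wrap-around at ±180° is ignored here for simplicity.

```python
import numpy as np

def kalman_smooth(z, Q=1e-6, R=0.025):
    """Scalar Kalman filter for a random-walk state, applied to the
    per-step angle estimates z. Q is the process-noise covariance and
    R the observation-noise covariance, as in Sec. V."""
    x, P = z[0], 1.0                  # initialize state at first measurement
    out = np.empty_like(z, dtype=float)
    out[0] = x
    for k in range(1, len(z)):
        P = P + Q                     # predict: state unchanged, variance grows
        K = P / (P + R)               # Kalman gain
        x = x + K * (z[k] - x)        # update with measurement z[k]
        P = (1 - K) * P
        out[k] = x
    return out
```

With such a small $Q$ relative to $R$, the gain is small and the filter heavily smooths the raw predictions.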

The results show that the trained deep RNN has the lowest total error throughout the data set, but a single RMSE does not fully convey the deep RNN's results. Another representation is the average angle error with respect to the SNR of the signal. The SNR is calculated by subtracting the ambient acoustic intensity from the acoustic source intensity. A 4-min time average recorded before the acoustic experiment is used as the ambient acoustic signature. Figure 7 compares the acoustic source signal at different boat distances, each with a 30-s time average.

Figure 8 shows the error with respect to SNR. These data are presented by averaging the RMSEs within 0.5-dB SNR bins and then comparing the results of the three different estimation techniques. For example, in the discrete SNR range of 10–10.5 dB, there are 121 error points, and the mean of these errors for the deep RNN is 13.47°. The shallow RNN and weighted average in this range have errors of 30.26° and 44.22°, respectively. To prevent unreliable averages, any SNR bin containing fewer than five samples is removed. The data with high SNR correspond to a small portion of very fast crossings of the boat driving by the sensor. Due to the high vessel speed, the experimental timing errors become noticeable in these data.
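The 0.5-dB binning with the five-sample cutoff can be sketched as follows (hypothetical helper, written with `np.digitize`):

```python
import numpy as np

def binned_errors(snr, err, width=0.5, min_count=5):
    """Mean angle error per fixed-width SNR bin, discarding bins that
    contain fewer than min_count samples."""
    edges = np.arange(snr.min(), snr.max() + width, width)
    idx = np.digitize(snr, edges) - 1        # bin index of each sample
    centers, means = [], []
    for b in range(len(edges) - 1):
        mask = idx == b
        if mask.sum() >= min_count:          # drop sparsely populated bins
            centers.append(edges[b] + width / 2)
            means.append(err[mask].mean())
    return np.array(centers), np.array(means)
```

Plotting the returned bin centers against the mean errors for each estimator reproduces the style of comparison in Fig. 8.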

Of particular note is the range from 0 to 20 dB SNR. Both RNN architectures perform significantly better than the weighted average inside this range: the shallow RNN produces results slightly better than a weighted average of the angles, and the deep RNN produces results significantly better than both. The shallow RNN architecture introduces some non-linearity into the algorithm, while the large amount of training data permits the use of a deeper RNN without overfitting during training.

Each model converges in quality at a SNR of 20 dB. We see that the weighted average algorithm performs equally well as the neural network architectures at this SNR. A SNR of 20 dB is high enough for the weighted average, a linear model, to perform as well as the neural networks, a non-linear model. Our data find the neural networks unnecessary for signals above 20 dB SNR in our acoustic environment.

At some points in these data, the acoustic source's distance from the AVS is too large, and/or there is no direct acoustic path to the AVS. Using solely the weighted frequency intensity analysis, the results are poor at high angle values, above 100°, as shown in Fig. 9. The high angles correspond to the boat being west of the sensor (Fig. 4), far from the sensor and with no direct acoustic path present. These data are kept in the analysis, as the purpose of the machine learning algorithms is to work with these highly noisy signals and still map the DOA with higher accuracy than the weighted average. The results in Table IV show this is the case.

## VI. EXPERIMENTAL VALIDATION

The experimental data contain multi-path interference. To validate this claim, two simulations were created to compare the Portage Waterway acoustic channel and an open field. Figure 10 shows a comparison of two RAMGeo (Collins, 1993) simulations (one with multi-path and one without) and the corresponding experimental data. The same experimental GPS distances from a single pass in Fig. 4 are used across all panels in Fig. 10, and each simulation time step is computed independently. The Portage Waterway simulation parameters are shown in Fig. 11, drawn from recorded bathymetry and sound speed on the Portage Waterway. Note that the sound speed varies by less than 0.05 m/s about 1471.5 m/s.

The open water simulation has the same sound speed profile with infinite depth. The swept frequency patterns are a common result of acoustic interference from a moving source in a channel, while the open water simulation contains very little of this pattern. Multi-path constructive and destructive interference is present in the shallow waveguide both in the Portage Waterway simulation and in the Portage Waterway experimental data. The experimental data also show electrical power noise at harmonics of 60 Hz, common when working with alternating current (ac) power in a marine environment.

## VII. CONCLUSION AND FUTURE WORK

In this paper, we compared two types of RNNs and a weighted acoustic intensity average to predict the direction-of-arrival from acoustic vector sensor data. The RNNs helped in predicting the temporal aspect of a moving acoustic source. The weighted acoustic intensity average was a good baseline for determining the benefits of using deep learning. Our real-world experiment results suggest that DNNs are a strong candidate for direction-of-arrival estimation in high-noise scenarios. Conversely, if the signal has a relatively high SNR—our data show that in our environment the threshold is around 20 dB SNR—linear methods, such as weighted averaging or SLPs, suffice.

These results encourage further study of the use of machine learning for localization with multiple acoustic vector sensors in difficult-to-model acoustic environments. There is also an opportunity to analyze detection and estimation tasks in near-shore ice in Houghton's surrogate Arctic environment (Penhale, 2019; Penhale *et al.*, 2018) with the neural network models. Near-shore ice has been shown to be a difficult acoustic environment (Penhale, 2019; Penhale *et al.*, 2018), and we anticipate that machine learning will prove to be a good candidate for increased performance in detection and estimation tasks in this scenario. We are currently carrying out experiments to test this hypothesis. Future work will also examine advanced machine learning methods, such as other deep network architectures—long short-term memory networks (Hochreiter and Schmidhuber, 1997), transformers (Vaswani *et al.*, 2017), etc.—enabled by ongoing data collection.

## ACKNOWLEDGMENT

This work was funded by the United States Naval Undersea Warfare Center and Naval Engineering Education Consortium (NEEC) (Grant No. N00174-19-1-0004) and the Office of Naval Research (ONR) (Grant No. N00014-20-1-2793). This is Contribution No. 76 of the Great Lakes Research Center at Michigan Technological University.