Predicting acoustic transmission loss in the SOFAR channel is challenging for classical methods, which often involve excessively complex algorithms and computationally intensive calculations. To address these challenges, a deep learning-based underwater acoustic transmission loss prediction method is proposed. By properly training a U-net-type convolutional neural network, the method provides an accurate mapping between ray trajectories and the transmission loss over the problem domain. Verifications are performed in a SOFAR channel with Munk's sound speed profile. The results suggest that the method has the potential to serve as a fast prediction model without sacrificing accuracy.

## 1. Introduction

Predicting low-frequency acoustic transmission loss (TL) in the SOFAR channel is an important research field in acoustics. Low-frequency TL plays a crucial role in various applications, such as early warning of undersea earthquakes,^{1} underwater sound source localization,^{2} and monitoring of marine mammals.^{3} As sound travels through the complex environment of the deep ocean, it encounters various factors, such as a varying seafloor and a stratified medium, that affect its propagation and intensity. Therefore, the prediction of low-frequency underwater acoustic TL has long been a challenge.

Ray-based models are commonly used to calculate TL in the SOFAR channel by providing a simplified representation of sound waves traveling through water. Based on the traced rays, the sound field can be calculated by solving the *eikonal* equation and the *transport* equations. Ray-based methods can handle range-dependent environments and are well adapted for long-range propagation. However, owing to the high-frequency approximation, classical ray methods are usually considered unsuitable for low-frequency problems.^{4} Here, “low frequency” does not refer to a fixed value or range but varies with the environment. For example, the user manual of BELLHOP, which uses ray theory as its core algorithm, presents a calculation example at 50 Hz in a SOFAR channel with a range of 100 km and a depth of 5 km; it notes that 50 Hz is usually considered a low frequency in such an environment and that ray methods cannot give accurate results for this problem because of errors in the shadow zone.^{5}

Wave-based models are another important class of methods for calculating TL in the SOFAR channel. Normal mode (NM) methods are a basic type of wave-based method. In the NM method, the sound pressure is expressed by summing a set of modal functions weighted according to the source depth.^{6} The NM method calculates the sound field with high accuracy; however, it is ineffective for range-dependent ocean environments. The parabolic equation (PE) method is a suitable and popular wave-theory technique for solving range-dependent propagation problems.^{7} Early PEs usually suffer from inherent phase errors, which limit their applicability to a certain range of angles around the main propagation direction. However, a very-wide-angle PE implementation based on Padé approximants, proposed in subsequent research, has nearly eliminated the small-angle limitation.^{8} This high-angle capability comes at additional computational cost.

To improve the performance of calculating TL in the complex ocean environment, many extensions to classical methods, such as Gaussian beam tracing method,^{9} coupled mode method,^{10} and hybrid method,^{11} have been proposed. These methods have improved the accuracy of underwater sound field simulation in various aspects. However, they have also raised issues, such as long computation time.

In recent years, deep learning techniques have achieved remarkable progress in various scientific research fields.^{12,13} Deep learning architectures based on neural networks are capable of extracting valuable patterns and insights that would be challenging or time-consuming to obtain with traditional methods. Deep learning has been successfully applied in underwater acoustics, for example, to source localization,^{14} source depth estimation,^{15} and dim frequency line detection.^{16} It has also been increasingly used to model ocean acoustic propagation: a deep convolutional recurrent autoencoder network has been presented for data-driven learning of complex underwater wave scattering and interference,^{17} and deep learning methods have been used to predict modal horizontal wavenumbers and group velocities^{18} and to predict far-field acoustic propagation from near-field data.^{19}

To rapidly and accurately predict the acoustic TL in SOFAR channels, we develop a convolutional neural network-based method for predicting low-frequency underwater acoustic TL maps from ray trajectories. In this method, a U-net type of neural network is trained with ray trajectories as input to predict the TL at low frequencies. Compared with the conventional ray-based method, the solving of the transport equation is replaced by the deep learning model, which avoids the high-frequency approximation made in the construction of the transport equation. In addition, since ray trajectories can be easily determined even in complex environments, the proposed method can predict the TL conveniently and accurately.

## 2. Method

### 2.1 Problem description

Consider a two-dimensional SOFAR channel whose maximum range and maximum depth are denoted by *R* and *Z*, respectively. For a simple harmonic point source located at range position *r* = 0 and depth *z*_{s}, the Helmholtz equation for calculating the sound pressure $p$ at a field point $(r, z)$ in the channel can be expressed as follows:^{20}

$$ \frac{1}{r}\frac{\partial}{\partial r}\!\left(r\frac{\partial p}{\partial r}\right) + \rho\,\frac{\partial}{\partial z}\!\left(\frac{1}{\rho}\frac{\partial p}{\partial z}\right) + \frac{\omega^{2}}{c^{2}(r,z)}\,p = -\frac{\delta(r)\,\delta(z - z_{s})}{2\pi r}, $$

where $\rho$ is the density, $c(r,z)$ is the sound speed, and $\omega = 2\pi f$ is the angular frequency of the source.

Solving this equation accurately at low frequencies in range-dependent environments typically requires complex and computationally intensive algorithms.^{10,21} Regarding this problem, a ray-trajectory-guided underwater acoustic TL prediction method based on a deep neural network is proposed. We transform the problem of calculating TL at individual points into solving for a TL map on a grid $D$ defined on the SOFAR channel. By successfully training a neural network $g_{\Theta}$ with parameter set $\Theta$, the TL map on $D$ at a frequency *f* can be obtained from a ray-traces-related input $U^{D}$ as follows:

$$ T_{\mathrm{pred}}^{D,f} = g_{\Theta}\!\left(U^{D}\right). $$

### 2.2 Calculation of ray trajectories and downsampling

To calculate ray trajectories in a given SOFAR channel, a grid $D_0: \{m_z \times m_r\}$ is defined on the 2D plane illustrated in Fig. 1(a). Assume *N*_{r} rays are emitted from the sound source within an angular range $\boldsymbol{\theta} = [-\theta_0, \theta_0]$ with equal angular intervals.

For each ray, its trajectory on grid $D_0$ can be calculated according to Snell's law.^{4} After calculating all rays, a set of ray trajectories $U^{D_0}$ is obtained, as shown in Fig. 1(b). For the cell corresponding to depth index *i* ($0 < i \le m_z$) and range index *j* ($0 < j \le m_r$), the *ij*th component of $U^{D_0}$ is the number of rays that pass through the cell.
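The per-cell ray counting can be sketched in a few lines. The trajectory format (arrays of depth and range samples per ray) and the name `ray_count_map` are assumptions for illustration; the paper computes the trajectories themselves with BELLHOP.

```python
import numpy as np

def ray_count_map(rays, m_z, m_r, Z, R):
    """Accumulate U^{D0}: for each grid cell, the number of rays crossing it.

    rays: list of (z_samples, r_samples) arrays, one pair per traced ray
          (a hypothetical format); Z, R: channel depth and range;
          m_z, m_r: grid dimensions.
    """
    U = np.zeros((m_z, m_r), dtype=int)
    for z, r in rays:
        # map each trajectory sample to a cell index on the m_z x m_r grid
        i = np.clip((np.asarray(z) / Z * m_z).astype(int), 0, m_z - 1)
        j = np.clip((np.asarray(r) / R * m_r).astype(int), 0, m_r - 1)
        # each ray contributes at most 1 to any cell it passes through
        hit = np.zeros((m_z, m_r), dtype=bool)
        hit[i, j] = True
        U += hit
    return U
```

Because a ray is marked at most once per cell, each component of $U^{D_0}$ is bounded by the number of emitted rays.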

To reduce the input size, $U^{D_0}$ is downsampled to a coarser grid $D: \{n_z \times n_r\}$. For the cell on $D$ corresponding to depth index *k* and range index *l*, $U_{k,l}^{D}$ is the sum over the components of $U^{D_0}$ for the cells that fit inside the *kl*th cell on $D$. After processing all cells on $D$, a downsampled set of ray trajectories $U^{D}$ is obtained, as illustrated in Fig. 1(d). This collection of downsampled ray trajectories, after the scale processing that will be introduced in Sec. 2.4, is used as the input to the neural network.
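The cell-summing downsampling step admits a compact sketch. For simplicity, this assumes the fine grid dimensions are integer multiples of the coarse ones; the paper's grids (200 × 4000 downsampled to 128 × 256) involve fractional overlaps, which would need area weighting handled analogously.

```python
import numpy as np

def downsample_sum(U0, n_z, n_r):
    """Downsample U^{D0} to U^{D} by summing the fine cells that fit inside
    each coarse cell (integer downsampling factors assumed)."""
    m_z, m_r = U0.shape
    fz, fr = m_z // n_z, m_r // n_r
    # group fine cells into (n_z x n_r) blocks of size (fz x fr) and sum
    return U0[: n_z * fz, : n_r * fr].reshape(n_z, fz, n_r, fr).sum(axis=(1, 3))
```

With integer factors the block sums conserve the total ray-crossing count, so no trajectory information is lost globally.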

### 2.3 Neural network architecture

A U-net is used as the neural network architecture in this research. The U-net is a type of convolutional neural network architecture commonly used for image segmentation tasks. Proposed in 2015, the U-net has become widely adopted in the field of image analysis.^{22} In recent years, it has been introduced into sound field prediction^{23} and has achieved promising results.

The architecture of the U-net used in this research is illustrated in Fig. 2(a). It consists of an encoder path and a decoder path, which are connected through skip connections. The encoder path gradually downsamples the input ray trajectory, extracting high-level features. In each convolutional layer, convolution is performed as illustrated in Fig. 2(b). The “same” mode of padding is used after convolution, and the Rectified Linear Unit is used as the activation function after the padding to avoid the vanishing gradient problem. Following each convolutional layer, max pooling with 2 × 2 filters and stride (2, 2) is performed, which reduces the spatial dimensions of the feature maps. The decoder path performs upsampling operations to progressively recover the spatial resolution determined by $D$ and finally generates the scaled predicted TL map.
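The size bookkeeping implied by this architecture ("same" convolutions preserve spatial size, each 2 × 2/stride-(2, 2) max pool halves it, and each 2 × 2 upsampling doubles it) can be traced with simple arithmetic. The helper below, `unet_shapes`, is a hypothetical name introduced for illustration:

```python
def unet_shapes(h, w, depth=4):
    """Trace the feature-map spatial sizes through the U-net of Table 1.

    'Same'-padded convolutions preserve (h, w); each 2x2 max pool with
    stride (2, 2) halves both dimensions; each 2x2 upsampling doubles them.
    """
    enc = [(h, w)]
    for _ in range(depth):
        h, w = h // 2, w // 2   # conv (same size) followed by 2x2 max pool
        enc.append((h, w))
    dec = []
    for _ in range(depth):
        h, w = h * 2, w * 2     # 2x2 upsampling followed by conv (same size)
        dec.append((h, w))
    return enc, dec
```

Starting from the 128 × 256 input, this reproduces the encoder sizes 64 × 128, 32 × 64, 16 × 32, and 8 × 16 and the symmetric decoder path back to 128 × 256.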

The skip connections between the encoder and decoder paths allow the network to preserve and integrate both local and global information. They help in recovering fine details by bypassing the low-level feature maps directly to the decoder path.^{24}

### 2.4 Data scaling and loss function

The reference data for the neural network consist of TL maps produced in the same environment as the ray trajectory calculation. The data can be calculated by selecting an appropriate method for the specified environment. The TL in dB at frequency *f* is referred to as the ground truth data $T_{\mathrm{GT}}^{D,f}$ in the training.

Before training, the input and the ground truth data are scaled as follows:

$$ \tilde{U}^{D} = \frac{U^{D}}{N_r}, \qquad \tilde{T}_{\mathrm{GT}}^{D,f} = \frac{T_{\mathrm{GT}}^{D,f}}{\beta}, $$

where $N_r$ is the number of rays in the ray trajectory calculation, and $\beta$ is a parameter to scale the ground truth data. In this research, $\beta$ is set to 200, which is usually larger than the maximum of the TL magnitudes in a general environment. In this way, all data are scaled to the range of [0, 1] while maintaining the global data structure. As shown in Fig. 2(a), the network outputs the scaled result $\tilde{T}_{\mathrm{pred}}^{D,f}$. Finally, results $T_{\mathrm{pred}}^{D,f}$ on their real scale are obtained through an inverse scaling of the network output.
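The scaling and its inverse can be sketched as follows; the linear forms below are our reading of Sec. 2.4, not code taken from the paper.

```python
import numpy as np

N_R, BETA = 30, 200.0  # rays per source and TL scale parameter (Sec. 2.4)

def scale_input(U_D, n_rays=N_R):
    # normalize ray counts by the number of emitted rays -> [0, 1]
    return U_D / n_rays

def scale_tl(T_db, beta=BETA):
    # map TL magnitudes (dB) into [0, 1]; beta exceeds typical TL maxima
    return T_db / beta

def unscale_tl(T_scaled, beta=BETA):
    # inverse scaling applied to the network output
    return T_scaled * beta
```

Because the scaling is linear, the inverse operation recovers the TL map on its physical scale exactly.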

The structural similarity index measure (SSIM)^{25} from the image processing field is used to construct the loss function. SSIM compares the local patterns of luminance, contrast, and structure in two images A and B as follows:

$$ \mathrm{SSIM}(\mathrm{A},\mathrm{B}) = \frac{(2\mu_{\mathrm{A}}\mu_{\mathrm{B}} + c_1)(2\sigma_{\mathrm{AB}} + c_2)}{(\mu_{\mathrm{A}}^{2} + \mu_{\mathrm{B}}^{2} + c_1)(\sigma_{\mathrm{A}}^{2} + \sigma_{\mathrm{B}}^{2} + c_2)}, $$

where *μ* is the mean of the corresponding matrix entries, *σ*^{2} is the estimate of the variance of the entries, *σ*_{AB} is the covariance estimate between the entries of A and B, and *c*_{1} and *c*_{2} are two constants used to stabilize the division.
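A single-window version of SSIM is easy to state in code. The paper evaluates SSIM over local windows; the global window and the constant values below are simplifying assumptions made for this sketch.

```python
import numpy as np

def ssim(A, B, c1=1e-4, c2=9e-4):
    """Single-window SSIM between two equally sized maps.

    c1, c2 stabilize the division; the values here assume data scaled
    to [0, 1]. A windowed implementation would average this quantity
    over local patches.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    mu_a, mu_b = A.mean(), B.mean()
    var_a, var_b = A.var(), B.var()
    cov_ab = ((A - mu_a) * (B - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)
    den = (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)
    return num / den
```

By construction, SSIM of a map with itself is exactly 1, and it decreases as the local structure of the two maps diverges.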

The image gradient (IG)^{26} is also used to construct the loss function. IG refers to the spatial rate of change of an image, describing the variation trend of local pixels in the image. The loss function based on the IG is defined as follows:

$$ L_{\mathrm{IG}} = \mathrm{mean}\!\left[\left(G_{z\_\mathrm{pred}} - G_{z\_\mathrm{GT}}\right)^{2}\right] + \mathrm{mean}\!\left[\left(G_{r\_\mathrm{pred}} - G_{r\_\mathrm{GT}}\right)^{2}\right], $$

where *G*_{z_pred}, *G*_{z_GT}, *G*_{r_pred}, and *G*_{r_GT} denote the IG matrices of the predicted and ground truth TL maps in the *z* and *r* directions, and mean[·] is the average operation on the tensor. The total loss combines the SSIM and IG terms as

$$ L = \alpha\left[1 - \mathrm{SSIM}\!\left(\tilde{T}_{\mathrm{pred}}^{D,f}, \tilde{T}_{\mathrm{GT}}^{D,f}\right)\right] + (1 - \alpha)\,L_{\mathrm{IG}}, $$

where *α* is a hyper-parameter, which is set to 0.8 in our study.
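A sketch of this hybrid loss under the definitions above, using `numpy.gradient` to form the IG matrices; the single-window SSIM, the gradient operator, and the exact combination are our assumptions, not the paper's implementation.

```python
import numpy as np

def _ssim(A, B, c1=1e-4, c2=9e-4):
    # single-window SSIM (Sec. 2.4); constants assume [0, 1]-scaled data
    mu_a, mu_b = A.mean(), B.mean()
    cov_ab = ((A - mu_a) * (B - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)) / (
        (mu_a**2 + mu_b**2 + c1) * (A.var() + B.var() + c2))

def ig_loss(T_pred, T_gt):
    # image gradients along depth (axis 0) and range (axis 1)
    gz_p, gr_p = np.gradient(T_pred)
    gz_g, gr_g = np.gradient(T_gt)
    return np.mean((gz_p - gz_g) ** 2) + np.mean((gr_p - gr_g) ** 2)

def hybrid_loss(T_pred, T_gt, alpha=0.8):
    # alpha-weighted combination of the SSIM and IG terms (alpha = 0.8)
    return alpha * (1.0 - _ssim(T_pred, T_gt)) + (1.0 - alpha) * ig_loss(T_pred, T_gt)
```

The loss vanishes for a perfect prediction and penalizes both structural dissimilarity and mismatched spatial gradients.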

## 3. Training and test

### 3.1 Training

To evaluate the performance of the proposed method on predicting the TL map, we train the network using the data generated from different sound source depths and then test the network using the data from new source depths that were never used in the training. The training and test data are simulated in a SOFAR channel with a continental slope as illustrated in Fig. 3(a). In such a range-dependent environment, ray-based methods lack sufficient accuracy at low frequencies, while wave-based methods usually require complex and time-consuming algorithms to perform the calculations.

The maximum range *R* is 100 km, and the maximum depth *Z* is 5 km. The maximum range and height of the slope are 100 km and 1 km, respectively. The ocean surface is assumed to be a pressure-release boundary, and the seafloor is an acousto-elastic half-space with a sound speed of 1550 m/s and a density of 1 g/cm^{3}. We consider an environment with a depth-dependent sound speed following Munk's sound speed profile,^{27} as shown in Fig. 3(b). The ray trajectories are calculated using the BELLHOP code,^{28} and the ground truth TL maps are calculated using the RAM code,^{29} which obtains results by the PE method.

In this paper, the networks are trained independently at individual frequencies from 10 to 50 Hz with an interval of 5 Hz; thus, nine network models are obtained. This strategy increases the repetitive work in training; however, for tasks with a specific frequency of interest, it can effectively reduce the complexity of the network.

Based on the aforementioned training strategy, nine training sets are built at the specified frequencies. Each training set consists of two parts of data, namely, the ray trajectories and the ground truth TL maps. Original ray trajectories $U^{D_0}$ are computed on a grid of size $D_0: \{m_z \times m_r\} = \{200 \times 4000\}$. Then, $U^{D_0}$ is downsampled to $U^{D}$ on a grid of size $D: \{n_z \times n_r\} = \{128 \times 256\}$. Note that the grids are generated in the rectangular plane shown in Fig. 3(a), which covers the slope area. Calculations were performed for 541 source depths ranging from 300 to 3000 m with a constant interval of 5 m. For each source depth, $N_r = 30$ rays are emitted from the source within an angular range of $\boldsymbol{\theta} = [-30°, 30°]$. Ground truth TL maps are also computed on grid $D$ at the same source depths to calculate the loss. Since ray trajectories are frequency-independent, they are the same in all training sets.

In the training, the dataset $\Omega_{\mathrm{train}}^{f}$ is randomly divided into two subsets with a ratio of 3:1, used as the training set and the validation set, respectively. The training is performed via the ADAM optimizer. The batch size is set to 2 and the learning rate to 0.0001. The hyper-parameters of the neural networks in this paper are listed in Table 1.
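The 3:1 random split might look like the following; `split_train_val` is a hypothetical helper, and the seeding and shuffling details are not specified in the paper.

```python
import numpy as np

def split_train_val(n_samples, ratio=0.75, seed=0):
    """Randomly split sample indices into training (ratio) and validation
    (1 - ratio) sets, i.e., 3:1 for ratio = 0.75."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(round(n_samples * ratio))
    return idx[:n_train], idx[n_train:]
```

For the 541 source depths used here, this yields roughly 406 training and 135 validation samples per frequency.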

Table 1. Hyper-parameters of the neural network. Notation — C: filter size, (stride), filter number; M: filter size, (stride); D: filter size.

| Layer name | Input size | Hyper-parameters | | Output size |
|---|---|---|---|---|
| C_{1}–M_{1} | 128 × 256 × 1 | C_{1}: 3 × 3 × 1, (1, 1), 64 | M_{1}: 2 × 2, (2, 2) | 64 × 128 × 64 |
| C_{2}–M_{2} | 64 × 128 × 64 | C_{2}: 3 × 3 × 64, (1, 1), 128 | M_{2}: 2 × 2, (2, 2) | 32 × 64 × 128 |
| C_{3}–M_{3} | 32 × 64 × 128 | C_{3}: 3 × 3 × 128, (1, 1), 256 | M_{3}: 2 × 2, (2, 2) | 16 × 32 × 256 |
| C_{4}–M_{4} | 16 × 32 × 256 | C_{4}: 3 × 3 × 256, (1, 1), 512 | M_{4}: 2 × 2, (2, 2) | 8 × 16 × 512 |
| D_{1}–C_{5} | 8 × 16 × 512 | D_{1}: 2 × 2 | C_{5}: 3 × 3 × 1024, (1, 1), 512 | 16 × 32 × 512 |
| D_{2}–C_{6} | 16 × 32 × 512 | D_{2}: 2 × 2 | C_{6}: 3 × 3 × 768, (1, 1), 256 | 32 × 64 × 256 |
| D_{3}–C_{7} | 32 × 64 × 256 | D_{3}: 2 × 2 | C_{7}: 3 × 3 × 384, (1, 1), 128 | 64 × 128 × 128 |
| D_{4}–C_{8} | 64 × 128 × 128 | D_{4}: 2 × 2 | C_{8}: 3 × 3 × 192, (1, 1), 64 | 128 × 256 × 64 |
| C_{9} | 128 × 256 × 64 | — | C_{9}: 1 × 1 × 64, (1, 1), 1 | 128 × 256 × 1 |


### 3.2 Test

Test data are calculated in the same environment as the training data. At frequency *f*, ray trajectories at 100 random source depths ranging from 300 to 3000 m are computed as the test inputs. TL maps calculated by RAM under the same conditions are considered as the ground truth data. The proposed method is also compared with the classical ray method, which also uses the ray trajectories to calculate the TL; the results of the ray method are obtained by BELLHOP. Two examples of TL maps obtained from the different methods are shown in Fig. 4(a).

The mean absolute error (MAE) and the mean SSIM (MSSIM)^{23} are used to measure the performance of the proposed method. The MAE and MSSIM at each frequency are defined as follows:

$$ \mathrm{MAE}(f) = \frac{1}{N_{\mathrm{sd}}}\sum_{s=1}^{N_{\mathrm{sd}}} \frac{\left\| T_{\mathrm{pred}}^{D,f} - T_{\mathrm{GT}}^{D,f} \right\|_{1}}{N_{D}}, \qquad \mathrm{MSSIM}(f) = \frac{1}{N_{\mathrm{sd}}}\sum_{s=1}^{N_{\mathrm{sd}}} \mathrm{SSIM}\!\left(T_{\mathrm{pred}}^{D,f}, T_{\mathrm{GT}}^{D,f}\right), $$

where $\|\cdot\|_1$ denotes the *L*_{1} norm; $N_D$ equals 128 × 256, which is the number of points on grid $D$; and *N*_{sd} equals 100, which is the number of testing source depths at each frequency.
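The MAE defined above can be computed directly; `mae` below is a hypothetical helper operating on stacked test maps of shape (N_sd, n_z, n_r).

```python
import numpy as np

def mae(preds, gts):
    """MAE over N_sd test maps at one frequency: the L1 difference averaged
    over the N_sd maps and the N_D grid points (Sec. 3.2)."""
    preds = np.asarray(preds, dtype=float)
    gts = np.asarray(gts, dtype=float)
    n_sd, n_d = preds.shape[0], preds[0].size
    return float(np.abs(preds - gts).sum() / (n_sd * n_d))
```

The companion MSSIM metric would average a windowed SSIM over the same stack of test maps.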

The MAE, MSSIM, and their 95% confidence intervals from 10 to 50 Hz with an interval of 5 Hz are illustrated in Fig. 4(b).

From Fig. 4(a), good similarity is observed between the predicted and true data. This shows that the proposed method is capable of predicting the TL maps for a given source depth, frequency, and sound speed profile from the corresponding ray trajectories. Figure 4(b) illustrates that the proposed method predicts the TL maps at low error levels. The error tends to increase slightly with frequency because the distribution of TL at higher frequencies exhibits higher complexity. In addition, the 95% confidence intervals of both MAE and MSSIM are similar across frequencies, which demonstrates that the network predicts the TL maps stably for different source depths.

Figure 4 also demonstrates that the ray method has obviously larger errors than the proposed method. The ray method lacks sufficient accuracy in the shadow zones that rays do not travel through, which shows that the proposed method performs well in the low-frequency range.

## 4. Conclusion

To efficiently predict the TL in SOFAR channels, a deep learning-based method is proposed and examined in this research. The method provides an accurate mapping between ray trajectories and the TL using convolutional neural networks in an image-processing-like framework. Ray trajectories contain rich information about the wave propagation and are usually easy to obtain; they thus provide a solid data foundation for predicting the TL. A U-net-type network is used, and a hybrid loss function combining SSIM and IG is designed. Once successfully trained, the model achieves generalized learning of the underlying physics of underwater acoustic transmission from ray trajectories and then effectively and efficiently predicts the low-frequency TL. The tests in a SOFAR channel with a continental slope show that training can converge quickly on a small amount of training data. The method also offers promising prospects for use in more complex environments, where its computational efficiency can be further exploited.

## Acknowledgments

This work was supported by the National Natural Science Foundation of China (12074317).

## Author Declarations

### Conflict of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

## Data Availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

## REFERENCES

*Computational Ocean Acoustics*

*The Parabolic Approximation Method*