Underwater acoustic target recognition based on ship-radiated noise is difficult owing to the complex marine environment and interference from multiple targets. Deep learning, an important technology for target recognition, achieves high accuracy but offers poor interpretability. In this study, an attention-based neural network (ABNN) is proposed for target recognition in pressure spectrograms with multi-source interference, using an attention module to inspect the inner workings of the network. On data obtained during a September 2020 sea trial, the ABNN gradually focused on the frequency-domain features of the target ship while suppressing environmental noise and interference from other vessels, which led to high accuracy in target detection and recognition.

Underwater acoustic target recognition based on ship-radiated noise is a major component of sonar systems. It focuses on two main aspects:1 the extraction of typical features from original signals containing target information, and the design of shallow classifiers with strong generalization and robustness. However, the performance of traditional feature extraction and classification methods is often limited by unfavorable factors, such as environmental complexity and multi-target interference.2

Deep neural network (DNN) technology has advanced pattern recognition; in particular, this method was recently applied to underwater acoustic target recognition.2,3 The complex model structures of deep-learning methods and their ability to process massive amounts of data enable stronger feature representations, with good performance in underwater acoustic target recognition4–13 and localization.14–19 However, the high prediction accuracy of such methods is difficult to interpret because they are black-box models,20 which preclude analysis of their inner workings and behavior. Since DNNs can focus too heavily on noise irrelevant to the target and allow such interference to affect the model's decision-making, the inner workings of DNNs in recognizing underwater acoustic targets need to be elucidated.

The attention mechanism is an important method that allows visual analysis of the inner workings of neural models.21 This mechanism, first proposed in natural language processing,22 imitates the human visual and auditory attention mechanisms that focus on the main aspects of a scene.23 Although the attention mechanism has been widely applied to examine the inner workings of DNNs in machine translation,22 image recognition,24 speech recognition,25 and other fields, its application to underwater acoustics has not been discussed.

In the present study, an attention-based neural network (ABNN) is proposed for underwater acoustic target recognition in an attempt to interpret the classification principle of the DNN. The ABNN architecture places an attention module before a traditional DNN composed of fully connected layers. The attention module uses a trainable attention vector layer, a Gaussian layer, and a merge layer to mine effective information from the input spectrum, and it outputs attention maps in real time to visualize the frequency regions of concern. The fully connected layers serve as a classifier for specific tasks, such as target detection and recognition.

In September 2020, a ship-radiated noise measurement experiment was conducted in a shallow area of the South China Sea. Figure 1(a) shows the seabed topography of the experimental area. Two experimental ships, A and B, sailed along tracks 1 and 2 at 3 m/s or drifted at either end point with main and auxiliary engines turned off. The bathymetry of the two tracks was relatively flat. A hydrophone [star in Fig. 1(a)] was deployed on a submersible buoy 80 m below the surface. The source distance to the hydrophone was about 1–6 km, and the water depth was 130–135 m. Radar and the automatic identification system (AIS), with a detection range of 11 km, detected more than 17 interfering vessels in the experimental area, which can be observed in the spectrograms in Fig. 2. This study performs target detection and recognition from a pressure spectrogram in the presence of multi-source interference using the ABNN.

For part of the signal data collected in the near field, sampled at 12 kHz, 2-s windows were selected, zero-padded, and short-time Fourier transformed with a size of 2^15 = 32 768. Thus, the signal data were transformed into a spectrum dataset consisting of N = 1254 frames. Owing to the significance of low-frequency ship features, only S = 248 frequencies from 10 to 100 Hz were chosen as the input features for the DNN.
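As an illustration, the preprocessing above can be sketched with SciPy's short-time Fourier transform; the exact window function, padding strategy, and band-edge handling of the original processing are assumptions here, and the input is a random placeholder for the recorded pressure data.

```python
import numpy as np
from scipy.signal import stft

fs = 12_000           # sampling rate (Hz), as stated in the text
nfft = 2 ** 15        # FFT size 32 768
win = 2 * fs          # 2-s window (24 000 samples), zero-padded to nfft by stft

x = np.random.default_rng(0).standard_normal(60 * fs)  # placeholder signal

f, t, Z = stft(x, fs=fs, nperseg=win, nfft=nfft, noverlap=0)
spec = np.abs(Z)                   # magnitude spectrogram, one frame per column

band = (f >= 10) & (f <= 100)      # low-frequency band of interest
features = spec[band, :].T         # one band-limited spectrum per frame
print(features.shape)              # (frames, S) with S = number of kept bins
```

The number of kept bins depends on the exact band edges chosen; the text's S = 248 presumably reflects the authors' specific bin selection.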

To prepare the dataset, the data samples comprised the individual spectra of all N frames, each of size S × 1, divided into the four classes shown in Table 1. Spectrograms of the signals for the four categories are shown in Fig. 2. The targets were situated in a complex environment containing multi-source interference and ambient noise, and the original signal contained line spectra of electrical noise caused by equipment problems of the hydrophone. All line spectra are listed in Table 1. Notably, ships A and B each had a low-frequency line spectrum, at fA = 54.57 Hz and fB = 29.30 Hz, respectively.

In the target recognition, the discrete 10–100 Hz frequency spectrum is fed to the input layer of the DNN. Through a series of nonlinear mappings, the DNN outputs the classification results.

The architecture of the ABNN is shown in Fig. 1(b), where N is the total number of samples, NBatch is the batch size of network training determined by GPU performance, S is the size of the input vector, and c is the number of classes determined by the specific task; here, N = 1254 and S = 248. An attention module precedes the hidden layers. In the model, the input features are weighted by a Softmax-activated dense layer of length S = 248 to control the sensitivity to frequency components. The attention layer is then multiplied by a Gaussian kernel,26,27 also of dimension S = 248, to prevent overconcentration. This is necessary because the attention module can become too focused on certain points in the feature space; for marine vessels, the radiated noise consists of discrete narrow-band line spectra in addition to a continuous broadband spectrum. The Gaussian kernel also reduces the impact of Doppler shift, measurement error, and other factors that cause random variation in the features.26 The attention module then masks the input sequence with the output sequence of the Gaussian layer through a merge layer of length S = 248; the attention layers thereby act as a mask that retains only features relevant to the target. The attention module is connected to a fully connected network with two dense layers of lengths 128 and 64 for classification, and the attention weights are output for visualization as an N × S = 1254 × 248 matrix showing the attention over all frequency points for all N samples.
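The module chain just described (softmax-weighted dense layer, Gaussian smoothing, element-wise mask, then a 128/64 classifier head) can be sketched in Keras, the framework the text later names. Treating the Gaussian smoothing as a fixed S × S kernel, setting c = 2, and the layer names are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np
import tensorflow as tf

S, c = 248, 2                      # input length per the text; c = 2 assumed
sigma = 1.0                        # Gaussian std, as in the training setup

# Fixed Gaussian smoothing kernel over frequency index: entry (j, k) is
# exp(-(k - j)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi)).
j, k = np.meshgrid(np.arange(S), np.arange(S), indexing="ij")
K = np.exp(-((k - j) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
K = tf.constant(K, dtype=tf.float32)

inputs = tf.keras.Input(shape=(S,))
# Attention module: softmax-activated dense layer -> Gaussian layer -> mask
alpha = tf.keras.layers.Dense(S, activation="softmax", name="attention")(inputs)
G = tf.keras.layers.Lambda(lambda a: tf.matmul(a, K), name="gaussian")(alpha)
o = tf.keras.layers.Multiply(name="mask")([G, inputs])   # element-wise mask

# Fully connected classifier head
x = tf.keras.layers.Dense(128, activation="relu")(o)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(c, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
```

Reading out the activations of the Gaussian layer for every sample would give the N × S attention map described in the text.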

To build an attention module using the mask mechanism of soft attention,22,28 the most basic soft mask layer can be realized through a Softmax-activated dense layer. Each input data sample, I, is a vector of length S = 248. The attention weight vector α of size S is expressed as

α = softmax(W I),
(1)

where W is the attention score matrix of size S × S, learned by the DNN; together with the softmax function,28 it represents the relevance between each input feature in I and the events of concern. The stronger the relevance, the greater the score.
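Numerically, Eq. (1) amounts to a softmax over a learned linear scoring of the input spectrum. A minimal NumPy sketch, with a random matrix standing in for the trained W:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 248
W = rng.normal(size=(S, S))   # placeholder for the learned score matrix
I = rng.random(S)             # one input spectrum sample

scores = W @ I
alpha = np.exp(scores - scores.max())   # shift for numerical stability
alpha /= alpha.sum()                    # softmax: weights sum to 1
```

The resulting α is non-negative and sums to one, so it can be read directly as a distribution of attention over the S frequency bins.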

Then, to define a Gaussian layer, the attention vector is multiplied by a Gaussian kernel function, represented by the matrix operations

H = [h_jk²], an S × S matrix with entries h_jk² where h_jk = k − j, j, k ∈ [0, S],
G = (1/(σ√(2π))) exp(−H/(2σ²)) α,
(2)

where G is the output vector of size S of the Gaussian layer, which holds the attention weights, and σ is the standard deviation of the Gaussian function.27 A larger σ spreads the attention more, which keeps it from focusing too heavily on a single frequency point and improves robustness to frequency fluctuations. Finally, the output vector o of the attention module, a vector of size S, is obtained by a Hadamard (element-wise) multiplication, denoted ⊙,

o = G ⊙ I.
(3)
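Eqs. (2) and (3) can be checked numerically: build the squared-index-difference matrix, apply the Gaussian weighting to α, then mask the input spectrum element-wise. σ = 1 follows the training setup described later; α and I are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
S, sigma = 248, 1.0

alpha = rng.random(S)
alpha /= alpha.sum()          # attention weights from Eq. (1)
I = rng.random(S)             # input spectrum sample

idx = np.arange(S)
H = (idx[None, :] - idx[:, None]) ** 2                     # h_jk^2 = (k - j)^2
kernel = np.exp(-H / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
G = kernel @ alpha            # smoothed attention weights, Eq. (2)
o = G * I                     # Hadamard mask, Eq. (3)
```

Because the kernel is strictly positive, G spreads each attention peak over neighboring frequency bins before the mask is applied.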

The Adam optimizer29 is adopted to adjust the learning rate dynamically, preventing overlearning and helping avoid local minima: the cost function of target classification may be non-convex, and optimization can get stuck at local minima owing to ambient noise, reverberation, and multipath interference in underwater acoustic channels. In addition, DNN methods often overfit owing to their sensitivity to environmental changes and signal distortion; therefore, dropout regularization30 was applied to randomly deactivate neurons before the dense layer.

In this section, tests are performed on four ABNNs. Section 4.1 describes the general setup of the ABNNs. In Sec. 4.2, ABNN-1 and ABNN-2 are trained to carry out target detection for ships A and B, respectively. In Sec. 4.3, ABNN-3 is applied for ship classification, and ABNN-4 is used in a multi-target recognition test. Finally, a comparison of the proposed model with traditional DNN is presented in Sec. 4.4.

The ABNN was trained with the Adam optimizer using an initial learning rate of 0.001 and exponential decay rates of 0.9 and 0.999 for the first and second moment estimates, which are good default parameters.29 The standard deviation of the Gaussian layer was set to 1 after several tests to prevent overconcentration. The loss function for network training is the cross-entropy function. Dropout with a neuron-activation probability of 90% was added to the 128-node dense layer. The initial weights of the network were generated from a truncated Gaussian distribution on the range (−0.2, 0.2) with standard deviation 0.1. In the dense layers other than the attention module, ReLU was used as the activation function to prevent the vanishing gradient problem. The maximum number of epochs was 20 000. The attention map of all samples was output in each training epoch. All neural networks were implemented in Python 3.6 using TensorFlow 2.1. As a supervised learning model, each data sample was labeled 1 for a class if it belonged to that class and 0 otherwise. ABNN-1 was trained with data from Classes I and IV, ABNN-2 with Classes II and IV, and ABNN-3 with Classes I and II. For these ABNNs, 80% of the samples in each class were selected randomly as the training set; the remaining 20% served as the test set. For ABNN-4, all samples of Classes I, II, and IV were used as the training set, while all samples of Class III were used as the test set. The details of the dataset and label definitions for each class and ABNN are shown in Table 2. The data were randomly shuffled before training.
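The optimizer, initializer, and dropout settings above map directly onto Keras. A condensed sketch of just the training configuration (classifier head only; the attention module is omitted for brevity, and c = 2 is assumed):

```python
import tensorflow as tf

# Truncated Gaussian init: stddev 0.1, truncated at two stddevs, i.e., (-0.2, 0.2)
init = tf.keras.initializers.TruncatedNormal(mean=0.0, stddev=0.1)

inp = tf.keras.Input(shape=(248,))
x = tf.keras.layers.Dense(128, activation="relu", kernel_initializer=init)(inp)
x = tf.keras.layers.Dropout(0.1)(x)    # 90% neuron-activation probability
x = tf.keras.layers.Dense(64, activation="relu", kernel_initializer=init)(x)
out = tf.keras.layers.Dense(2, activation="softmax")(x)
model = tf.keras.Model(inp, out)

# Adam with the stated settings: lr = 0.001, beta_1 = 0.9, beta_2 = 0.999
opt = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
model.compile(optimizer=opt, loss="categorical_crossentropy",
              metrics=["accuracy"])
```

With one-hot labels as described, `model.fit` would then run the cross-entropy training loop.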

ABNN-1 and ABNN-2 were used to determine the presence of ship A and ship B, respectively. For each ABNN, the output labels took the form of a sparse label in which each element represented the probability of a certain class, with values in [0, 1]; the maximum element determined the class. After 20 000 epochs, the detection accuracy on the test datasets was 98.0% and 97.4%, respectively.

During the network training, the output of the attention weight layer, the vector G, of all samples was recorded every ten epochs, as shown in Figs. 3(a) and 3(b). As the epochs increased, the DNN gradually focused its attention on specific line-spectrum features of ship A (fA = 54.57 Hz) and ship B (fB = 29.30 Hz), whereas other spectral lines in Table 1 were given lower weights.

In Figs. 3(a) and 3(b), the attention maps reveal the frequency components used by the algorithm to make decisions. This suggests that the attention module can act as a feature extractor.

In this section, a binary-classification ABNN-3 for ships A and B is trained. The samples with index numbers less (greater) than 343 are those of ship A (B), divided by the red dotted line in Fig. 3(c). After 20 000 epochs, the test accuracy of binary classification was 96.6%. The test-loss curve in Fig. 3(d) shows that the DNN was not overfitted.

The network attention maps after several epochs are shown in Fig. 3(c). The attention maps displayed in real time show that the attention was mainly focused on specific line-spectrum features of ship A (fA = 54.57 Hz) and ship B (fB = 29.30 Hz), whereas the other frequencies given in Table 1 had negligible influence on the decision-making. These frequencies are also consistent with those extracted by ABNN-1 and ABNN-2 in Part B, as shown in Fig. 3. This result shows that in only 100 epochs, the model quickly converged and focused its attention on the target features and suppressed most disturbances. This process was synchronized with the reduction of network loss and the increase in accuracy, as shown in Figs. 3(d) and 3(e).

For the Class III data (containing the noise of both ship A and ship B), we used only single-target and ambient-noise data to train ABNN-4. The test results for the Class III data showed that in 100% of the samples, the third element of the output nodes (corresponding to the probability of the ambient-noise class) was less than 1 × 10−5, indicating that every evaluated sample contained a target. In addition, the first two elements of the output nodes (corresponding to the probabilities of the ship A and ship B classes) were both greater than 0.8 in 74.3% of the samples, suggesting that the two targets coexisted. The distribution of results for all samples is shown in Fig. 4(a); all samples are concentrated in the upper triangle of the figure, and the proportion of samples increases as the first two elements of the output nodes approach (1, 1). The output of the attention module shown in Fig. 4(b) indicates that at 20 000 epochs the attention of the DNN was focused on the union of the Class I and Class II features of concern. This result indicates that although the network was trained using single-target data, it was effectively used for multi-target resolution on the current dataset.
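The decision rule quoted above (third output below 1 × 10−5 implies a target is present; first two outputs both above 0.8 implies both ships are present) can be written compactly. The output rows here are invented for illustration, not measured values:

```python
import numpy as np

# Columns: P(ship A), P(ship B), P(ambient noise) -- hypothetical outputs
outputs = np.array([[0.91, 0.88, 1e-7],    # both ships present
                    [0.95, 0.30, 1e-6],    # only ship A confident
                    [0.05, 0.04, 0.99]])   # ambient noise only

target_present = outputs[:, 2] < 1e-5              # noise prob. negligible
both_ships = (outputs[:, 0] > 0.8) & (outputs[:, 1] > 0.8)
print(target_present.tolist())   # [True, True, False]
print(both_ships.tolist())       # [True, False, False]
```

Note that the first two elements can exceed 0.8 simultaneously only if the output nodes are not jointly normalized, consistent with the per-class probabilities described in the text.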

To summarize the ABNN results above: the accuracy of detecting ship A and ship B was 98.0% and 97.4%, respectively; the accuracy of the binary classification of ships A and B was 96.6%; and, when trained with single-target data, the proposed model achieved 74.3% accuracy in multi-target resolution. For comparison with a traditional DNN, we removed the attention module from the ABNN architecture in Fig. 1(b) and re-tested the model. As shown in Table 3 and Fig. 4(c), the accuracy of the proposed model was slightly higher than that of the traditional model in ship A detection and two-ship classification (by about 0.9% and 0.7%), indicating that the accuracy improvement from the attention module was very limited in those tasks. However, the accuracy of the ABNN in multi-target resolution, at 74.3%, was 16.0% higher than that of the traditional DNN, at 58.4%. Therefore, the ABNN has better potential for multi-target resolution when only small amounts of single-target training samples are available.

More importantly, the attention maps in Figs. 3 and 4 identify the frequency regions that the ABNN might use to make decisions during evaluation on the test dataset, which contributes to the reliability of the analysis. Because a well-trained ABNN should focus its attention on the features it learned previously, users can judge whether an estimate is credible from the highlighted regions of the attention map; this capability is not available in traditional DNNs.

A target classification method based on the ABNN is proposed and tested in the presence of multi-source interference. An attention module is added to the DNN to focus on the features of the target, to suppress interfering features, and to visualize its attention area while the network runs. The ABNN is characterized by the following features:

  • focus on the target characteristics and suppression of multi-source interference,

  • visualization of features of concern during target detection or recognition,

  • multi-target resolution using only single-target data, and

  • ability to be used as a dedicated feature extraction model.

During training, the ABNN gradually focuses its learning on the features closely correlated with the training goals as the training loss decreases. The ABNN shows good performance in target detection and recognition as well as in multi-target resolution. In addition, the attention maps visualize the frequency regions of focused attention, which improves accuracy and interpretability compared with traditional DNNs.

1. D. Yu-Wei, "Review on passive sonar target recognition," Tech. Acoust. 23, 253–257 (2004).
2. S. Kamal, S. K. Mohammed, P. R. S. Pillai, and M. H. Supriya, "Deep learning architectures for underwater target recognition," in Proceedings of the 2013 Ocean Electronics Symposium, Kochi, India (October 23–25, 2013), pp. 48–54.
3. X. Cao, X. Zhang, Y. Yu, and L. Niu, "Deep learning-based recognition of underwater target," in Proceedings of the 2016 IEEE International Conference on Digital Signal Processing (DSP), Beijing, China (October 16–18, 2016), pp. 89–93.
4. X. Wang, A. Liu, Y. Zhang, and F. Xue, "Underwater acoustic target recognition: A combination of multi-dimensional fusion features and modified deep neural network," Remote Sens. 11, 1888 (2019).
5. A. K. Ibrahim, L. M. Chérubin, H. Zhuang, M. T. Schärer Umpierre, F. Dalgleish, N. Erdol, B. Ouyang, and A. Dalgleish, "An approach for automatic classification of grouper vocalizations with passive acoustic monitoring," J. Acoust. Soc. Am. 143(2), 666–676 (2018).
6. T. Oikarinen, K. Srinivasan, O. Meisner, J. B. Hyman, S. Parmar, A. Fanucci-Kiss, R. Desimone, R. Landman, and G. Feng, "Deep convolutional network for animal sound classification and source attribution using dual audio recordings," J. Acoust. Soc. Am. 145(2), 654–662 (2019).
7. V. E. Premus, M. R. Evans, and P. A. Abbot, "Machine learning-based classification of recreational fishing vessel kinematics from broadband striation patterns," J. Acoust. Soc. Am. 147(2), EL184–EL188 (2020).
8. C. Li, Z. Liu, J. Ren, W. Wang, and J. Xu, "A feature optimization approach based on inter-class and intra-class distance for ship type classification," Sensors 20, 5429 (2020).
9. X. Cao, R. Togneri, X. Zhang, and Y. Yu, "Convolutional neural network with second-order pooling for underwater target classification," IEEE Sens. J. 19, 3058–3066 (2019).
10. N. Wang, M. He, J. Sun, H. Wang, L. Zhou, C. Chu, and L. Chen, "IA-PNCC: Noise processing method for underwater target recognition convolutional neural network," Comput. Mater. Contin. 58, 169–181 (2019).
11. C. Li, Z. Huang, J. Xu, and Y. Yan, "Underwater target classification using deep learning," in Proceedings of OCEANS 2018 MTS/IEEE Charleston, Charleston, SC (October 22–25, 2018).
12. S. Shen, H. Yang, X. Yao, J. Li, G. Xu, and M. Sheng, "Ship type classification by convolutional neural networks with auditory-like mechanisms," Sensors 20, 253 (2020).
13. H. Yang, J. Li, S. Shen, and G. Xu, "A deep convolutional neural network inspired by auditory perception for underwater acoustic target recognition," Sensors 19, 1104 (2019).
14. Y. Liu, H. Niu, and Z. Li, "A multi-task learning convolutional neural network for source localization in deep ocean," J. Acoust. Soc. Am. 148(2), 873–883 (2020).
15. H. Niu, Z. Gong, E. Reeves, P. Gerstoft, H. Wang, and Z. Li, "Deep-learning source localization using multi-frequency magnitude-only data," J. Acoust. Soc. Am. 146(1), 211–222 (2019).
16. H. Niu, E. Reeves, and P. Gerstoft, "Source localization in an ocean waveguide using supervised machine learning," J. Acoust. Soc. Am. 142(3), 1176–1188 (2017).
17. H. Niu, E. Reeves, and P. Gerstoft, "Ship localization in Santa Barbara Channel using machine learning classifiers," J. Acoust. Soc. Am. 142(5), EL455–EL460 (2017).
18. E. Ozanich, P. Gerstoft, and H. Niu, "A feedforward neural network for direction-of-arrival estimation," J. Acoust. Soc. Am. 147(3), 2035–2048 (2020).
19. H. Cao, W. Wang, L. Su, H. Ni, P. Gerstoft, Q. Ren, and L. Ma, "Deep transfer learning for underwater direction of arrival using one vector sensor," J. Acoust. Soc. Am. 149(3), 1699–1711 (2021).
20. R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi, "A survey of methods for explaining black box models," ACM Comput. Surv. 51, 1–42 (2019).
21. S. Chaudhari, G. Polatkan, R. Ramanath, and V. Mithal, "An attentive survey of attention models," arXiv:1904.02874 (2019).
22. D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv:1409.0473 (2014).
23. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proceedings of the International Conference on Machine Learning, Lille, France (July 6–11, 2015).
24. R. Poplin, A. V. Varadarajan, K. Blumer, Y. Liu, M. V. McConnell, G. S. Corrado, L. Peng, and D. R. Webster, "Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning," Nat. Biomed. Eng. 2, 158–164 (2018).
25. W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proceedings of the 2016 ICASSP, Shanghai, China (March 20–25, 2016), pp. 4960–4964.
26. W. Wang, H. Ni, L. Su, T. Hu, Q. Ren, P. Gerstoft, and L. Ma, "Deep transfer learning for source ranging: Deep-sea experiment results," J. Acoust. Soc. Am. 146, EL317–EL322 (2019).
27. X. Xiao, W. Wang, L. Su, X. Guo, L. Ma, and Q. Ren, "Localization of immersed sources by modified convolutional neural network: Application to a deep-sea experiment," Sensors 21(9), 3109 (2021).
28. F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, and X. Tang, "Residual attention network for image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI (July 21–26, 2017), pp. 3156–3164.
29. D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980 (2014).
30. N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res. 15, 1929–1958 (2014).