Convolutional neural networks (CNNs) have proven highly effective in automatically identifying and classifying underwater sound sources, enabling efficient analysis of marine environments. This work examines two key design choices for a CNN classifier, input representation and network architecture, analyzing how their importance changes as training data size varies and how well they generalize between sites. Passive acoustic data from three offshore sites in Western Scotland were used for hierarchical classification, categorizing sounds into one of four classes: delphinid tonal, delphinid clicks, vessels, and ambient noise. Three different input representations of the acoustic signals were investigated along with four CNN architectures, including three pre-trained for image classification tasks. Experiments show that a custom-built shallow CNN can outperform more complex architectures if the input representation is chosen appropriately. For example, a shallow CNN using a Mel-spectrogram normalized with per-channel energy normalization (MS-PCEN) achieved a 12.5% accuracy improvement over a ResNet model when only small amounts of training data were available. Studying model performance across the three sites demonstrates that input representation is an important factor for achieving robust results between sites, with MS-PCEN achieving the best performance. However, the importance of the choice of input representation decreases as the training dataset size increases.
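As a rough illustration of the normalization step named in the abstract, the sketch below applies per-channel energy normalization to a Mel-band energy array using the commonly published PCEN formulation: each band is divided by a smoothed version of its own recent energy, then compressed with a root. This is not the authors' implementation. The function name `pcen`, the parameter values, and the synthetic input are all assumptions chosen for illustration.

```python
import numpy as np


def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a (n_bands, n_frames) energy array.

    Parameter values are illustrative assumptions, not taken from the article.
    """
    energy = np.asarray(energy, dtype=float)
    smoothed = np.empty_like(energy)
    # Initialise the first-order smoother with the first frame of each band.
    smoothed[:, 0] = energy[:, 0]
    for t in range(1, energy.shape[1]):
        # Exponential moving average along time, computed separately per band.
        smoothed[:, t] = (1.0 - s) * smoothed[:, t - 1] + s * energy[:, t]
    # Adaptive gain control: divide each band by its smoothed energy.
    gain = (eps + smoothed) ** (-alpha)
    # Root compression with an offset so that silence maps to zero.
    return (energy * gain + delta) ** r - delta ** r


# Synthetic stand-in for a Mel-spectrogram (64 bands, 200 frames).
rng = np.random.default_rng(0)
mel_energy = rng.random((64, 200))
mel_energy[:, 100:] *= 100.0  # simulate a sudden change in recording level
normalized = pcen(mel_energy)
```

Because the gain term tracks the recent level in each band, a steady change in overall level has only a small effect on the output, which is one plausible reason such a representation might transfer between recording sites.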
April 2025
April 18 2025
Sounds of the deep: How input representation, model choice, and dataset size influence underwater sound classification performance
Abdullah Olcay (1, a), Paul R. White (1), Jonathan M. Bull (2), Denise Risch (3), Benedict Dell (1), Ellen L. White (2)
1 Institute of Sound and Vibration Research, University of Southampton, Southampton, SO17 1BJ, United Kingdom
2 School of Ocean and Earth Science, University of Southampton, Southampton, SO14 3ZH, United Kingdom
3 Marine Science Department, Scottish Association of Marine Science, Oban, PA37 1QA, United Kingdom
a) Email: [email protected]
J. Acoust. Soc. Am. 157, 3017–3032 (2025)
Article history
Received: September 20 2024
Accepted: April 04 2025
Citation
Abdullah Olcay, Paul R. White, Jonathan M. Bull, Denise Risch, Benedict Dell, Ellen L. White; Sounds of the deep: How input representation, model choice, and dataset size influence underwater sound classification performance. J. Acoust. Soc. Am. 1 April 2025; 157 (4): 3017–3032. https://doi.org/10.1121/10.0036498