The increasing need to control high noise levels motivates the development of automatic sound event detection and classification methods. Little work deals with automatic train pass-by detection despite the high degree of annoyance caused by railway noise. To address this, an innovative approach is proposed in this paper. A generic classifier identifies vehicle noise in the raw audio signal. Then, combined short-term sound level analysis and mel-spectrogram-based classification refine this outcome to discard anything but train pass-bys. On various long-term signals, a 90% temporal overlap with the reference demarcation is observed. This high detection rate allows a proper estimation of the railway noise contribution in different soundscapes.

With the continued expansion of highly dense urban areas, noise pollution has become an increasing concern. For several decades, reports have highlighted noise-related health issues.1,2 Long-term noise overexposure is a risk factor for hearing loss, hypertension, and other cardiovascular conditions.3,4

Among existing sound sources, industrial and transport noise have been identified as the most harmful and unpleasant due to their omnipresence in everyday life.5,6 Recommendations have been outlined by health institutes,6 along with several independent studies.7 Transport noise is now regulated by national or international standards.8 Most of these standards require detecting the activity of the targeted source: noise levels during its active periods are used to compute its noise contribution. Thus, an effective acoustic-based train pass-by detector enables railway traffic monitoring as well as railway noise contribution estimation.

Event detection on acoustic measurements is known in the literature as Sound Event Detection (SED). With the rise of machine learning, SED has become an efficient way to identify specific sound sources within a signal.9,10 Depending on the soundscape and the studied source, specific features can be extracted from the raw audio data.11 Time-frequency representations, mainly mel-spectrograms,12 are favored in most SED systems. Spectrograms efficiently highlight the most energetic spectral patterns over a sliding window applied to the signal. Such representations also benefit from the popularity and accuracy of Convolutional Neural Networks (CNNs).13 Thanks to layers of optimized filters, CNNs efficiently extract meaningful information from images, facilitating their classification. CNNs and their variants are a baseline for diverse SED tasks, especially in environmental contexts.14,15 In recent years, other complex networks (e.g., Transformers16,17) have emerged, but they have seen little application to long-term acoustic recordings. The smaller size and relatively low computational cost of CNNs18 remain a considerable asset, especially for rare SED.19

Contrary to most environmental SED systems that target various sources or source sets,10,14 our work focuses on train pass-by events. In this regard, Molina-Moreno et al.20 recently presented a two-step procedure. First, they design an event detector consisting of a 66 dB(A) threshold to be sustained for 4 s on the 1 s A-weighted sound pressure level (LAeq,1s). Then, a classifier separates train pass-bys from other events. Two classifiers are proposed in Ref. 20: either a logistic regression classifier or a recurrent neural network-based autoencoder. Both only use features extracted from LAeq,1s as inputs. Compared to mel-spectrograms, LAeq,1s provides signal energy information with a coarser temporal resolution and without frequency bands. This choice lowers computational costs, which matters in long-term acoustic monitoring.20–22 However, additional steps or complementary features are often necessary to ensure generalization to new data.

Our proposed acoustic-based train pass-by detection method relies on both mel-spectrograms and LAeq-based features. Two acoustic classifiers use mel-spectrograms to detect vehicle noise and then to specify the vehicle type. Meanwhile, an LAeq analysis unit enhances the demarcation of detected events through dedicated signal processing techniques. Applied to raw acoustic recordings, our method automatically detects train pass-bys and computes the railway noise contribution.

In order to introduce our method, a quick overview is presented in Fig. 1. It displays the major stages, their inputs, and the global outcome: the detected train pass-bys.

Fig. 1. Block diagram of the proposed automatic train pass-by detection method.

The generic acoustic classifier YAMNet is used for preclassification (Sec. 2.1): it identifies frames associated with transport noise, called frames of interest (FoI) in Fig. 1. Then, an LAeq-based algorithm (Sec. 2.2) extends these frames into complete continuous events and discards those with inconsistent properties. This second stage is called event boundary determination and filtering. Finally, over each event, an internal complementary classifier (Sec. 2.3) is applied to distinguish train pass-bys from other vehicle sound events.

Detecting a specific sound source often requires different processing layers: first detect events of interest, then narrow down to the desired source. The main objective of the first step is to detect periods of interest without missing any train pass-by. Instead of thresholds on the sound pressure level, using a broad classifier working on raw signals, or on transformations of these signals, permits the detection of low-noise events. However, detection accuracy may suffer from overlapping sounds in polyphonic environments (multiple sources active at the same time).

To select the first classifier, we compared three different "state-of-the-art" (SOA) acoustic classifiers: YAMNet,23 E-PANNS,24 and Audio Spectrogram Transformer (AST).16 They were all trained on the AudioSet corpus,25 a collection of audio clips drawn from YouTube videos, whose ontology contains 521 labels covering environmental acoustic sources and events. Among these labels, we identified three major ones that can relate to the sounds of a train pass-by: Crail = {Vehicle, Train, Rail transport}. Each time the classifier assigns one of these labels to a frame, we consider that the model predicted railway noise on this particular frame. Preliminary assessments, later confirmed by a complete evaluation (Sec. 3.3), show that YAMNet provides a high recall on railway noise frame classification. A high recall is crucial for a preclassifier, as it indicates that few true frames are missed by the model. This performance justifies our selection of YAMNet as the preclassifier in our context.

YAMNet is a CNN developed by Google, based on the initial MobileNet (version 1) architecture.23 The acoustic signal is processed within a sliding window (960 ms with 50% overlap) to form a series of frames. A mel-spectrogram is computed for each frame and used as input to the CNN for frame-level classification. The 64 mel filters span 125 Hz to 7500 Hz, covering most of the relevant frequency range in environmental acoustics.
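For illustration, this frontend can be approximated with standard tools. The following sketch is our approximation, not the exact YAMNet implementation; it assumes 16 kHz mono input and the 25 ms/10 ms internal STFT parameters documented for YAMNet, and builds 64-band log-mel patches on the 960 ms/50% grid described above:

```python
import numpy as np
import librosa

def logmel_frames(y, sr=16000):
    """Approximate YAMNet-style input: one 64-band log-mel patch per frame."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160,  # 25 ms STFT window, 10 ms hop
        n_mels=64, fmin=125.0, fmax=7500.0)     # filter bank range from the text
    logmel = np.log(mel + 1e-3).T               # (time, 64), stabilized log
    n, hop = 96, 48                             # 960 ms patches, 480 ms hop
    return np.stack([logmel[i:i + n]
                     for i in range(0, logmel.shape[0] - n + 1, hop)])
```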

Along the whole signal, we define the FoI [Eq. (1)] as the frames whose highest likelihood (given by the YAMNet prediction) corresponds to a Crail label,

\mathrm{FoI} = \left\{ t \;\middle|\; \operatorname*{argmax}_{c}\, p_t(c) \in \mathcal{C}_\mathrm{rail} \right\}, \quad (1)

where p_t(c) is the likelihood YAMNet assigns to class c on the frame at time t.

FoI become the input for the following stage (Fig. 1): event boundary determination and filtering. By definition, these are timestamps that are multiples of the sliding window hop size δt = 480 ms (window length × overlap). To maintain temporal consistency, this duration δt becomes the reference time step throughout the method.
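As a minimal sketch of this preclassification stage, the pretrained YAMNet model published on TensorFlow Hub can be queried directly; frame indices whose top-scoring label falls in Crail become FoI. The label spellings follow the AudioSet ontology, and the waveform is assumed to be 16 kHz mono float32 in [-1, 1]:

```python
import csv
import numpy as np
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")
with open(yamnet.class_map_path().numpy()) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]

C_RAIL = {"Vehicle", "Train", "Rail transport"}
DELTA_T = 0.48  # s, frame hop (960 ms window, 50% overlap)

def frames_of_interest(waveform):
    """Frame indices whose most likely YAMNet label belongs to C_rail, Eq. (1)."""
    scores, _, _ = yamnet(waveform)          # (n_frames, 521) class scores
    top = np.argmax(scores.numpy(), axis=1)  # highest-likelihood label per frame
    return [i for i, c in enumerate(top) if class_names[c] in C_RAIL]
    # the timestamp of frame i is i * DELTA_T
```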

Figure 2 provides a detailed layout of the proposed method. The three stages appearing in Fig. 1 are shown in the form of units within the dashed lines. The colored areas correspond to the intermediate outputs after each stage. Thus, in the YAMNet preclassification unit, FoI of a 20-minute audio signal are represented by colored areas.

Fig. 2. Description of the proposed automatic train pass-by detection method on a 20-minute-long example.

In the example from Fig. 2, FoI do not directly provide well-demarcated events. YAMNet frame classification lacks continuity, as each frame is processed individually, without using information from its neighborhood. Thus, we need to improve event demarcation in order to obtain complete and continuous pass-by events. To address this issue, we draw inspiration from the train pass-by detection carried out by human experts.

For detecting train pass-bys on long-term acoustic signals, human experts usually rely on the evolution of the A-weighted sound pressure level (LAeq,δt). On the LAeq,δt curve, a moving linear sound source emerging from the background noise during several δt periods follows a specific pattern: ascending and descending phases of similar duration, sometimes separated by a sound level plateau for low source speeds or long linear sources. This pattern, common to road and rail vehicles, can be observed multiple times on the LAeq,δt evolution of Fig. 2. Acoustics experts use this property, in addition to other signal processing criteria, to recognize train pass-bys on the LAeq,1s evolution. To adapt and automate this process, two successive algorithms are implemented. First, event demarcation is addressed by extending FoI to the real event boundaries. Then, LAeq,δt profiles are analyzed to reject events whose profile differs from the expected pattern.

2.2.1 Event boundary determination

The typical train pass-by pattern on the sound level curve has been established above: ascending and descending phases of similar duration, with a potential sound level plateau. Thus, a proper train pass-by demarcation can be obtained by extending preselected FoI to the closest significant sound level local minima.

To evaluate whether a sound level local minimum is significant, it should be compared to its neighborhood. We introduce the concept of inverse prominence (IP), inspired by the peak prominence concept.26 Equation (2) defines IP, while Fig. 3 details a graphical method to find the IP of a given local minimum,

\mathrm{IP}(t_i) = \min\!\left( \max_{t \in (t_L,\, t_i)} f_{\Delta t}(t),\ \max_{t \in (t_i,\, t_R)} f_{\Delta t}(t) \right) - f_{\Delta t}(t_i), \quad (2)

where f_{\Delta t} denotes the sound level curve, t_i the local minimum under test, and t_L (respectively, t_R) the nearest instant on the left (right) where f_{\Delta t} drops below f_{\Delta t}(t_i), or the signal boundary if no such instant exists.
Fig. 3. Description of the graphical method to measure IP [Eq. (2)] for each local minimum of the function f_Δt.

Our selection of significant sound level local minima keeps those whose IP exceeds a threshold IPmin, empirically fixed at 1 dB(A). In Fig. 3, two local minima (at t0 and t2) would be considered significant thanks to their large IP. On the other hand, the local minima at t3, t4, and t5 show a small IP and would likely be considered non-significant.
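In practice, the IP of a local minimum of LAeq,δt equals the ordinary peak prominence of the corresponding maximum of −LAeq,δt, so the selection can lean on scipy. A minimal sketch with IPmin = 1 dB(A) as in the text:

```python
import numpy as np
from scipy.signal import find_peaks

def significant_minima(laeq, ip_min=1.0):
    """Indices of local minima of `laeq` whose inverse prominence exceeds ip_min."""
    idx, _ = find_peaks(-np.asarray(laeq), prominence=ip_min)
    return idx
```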

Initial FoI are extended to the closest significant sound level local minimum in each direction. This operation improves event demarcation, as disjointed frames turn into continuous events. A typical outcome is displayed in Fig. 2, where the colored areas of the event boundary determination and filtering unit represent well-demarcated events.
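A possible implementation of this extension follows, assuming frame indices on the δt grid: each FoI grows to the nearest significant minimum on each side, and overlapping intervals merge into single events.

```python
def extend_to_events(foi, minima, n_frames):
    """Extend each frame of interest to its nearest significant minima."""
    events = []
    for t in sorted(foi):
        left = max((m for m in minima if m <= t), default=0)
        right = min((m for m in minima if m >= t), default=n_frames - 1)
        if events and left <= events[-1][1]:   # overlapping: merge events
            events[-1][1] = max(events[-1][1], right)
        else:
            events.append([left, right])
    return events  # list of [start, end] frame indices
```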

2.2.2 LAeq,δt-based event rejection

The improvement from initially discontinuous FoI to accurately demarcated events allows a deeper analysis and a first separation between railway and non-railway events. Indeed, we keep drawing inspiration from the detection techniques of acoustics experts: short or non-emerging events can easily be identified and discarded. Both properties are evaluated via two metrics, the event duration D_k and the relative amplitude \Delta L_k, presented in Eqs. (3) and (4),

D_k = t_k^\mathrm{end} - t_k^\mathrm{start}, \quad (3)

\Delta L_k = \max_{t \in [t_k^\mathrm{start},\, t_k^\mathrm{end}]} L_{Aeq,\delta t}(t) - \frac{1}{2}\left[ L_{Aeq,\delta t}(t_k^\mathrm{start}) + L_{Aeq,\delta t}(t_k^\mathrm{end}) \right], \quad (4)

where t_k^\mathrm{start} and t_k^\mathrm{end} are the boundaries of event k, so that \Delta L_k measures the emergence of the event maximum above the sound level at its boundaries.

Thresholds are applied to both metrics to discard events whose properties largely differ from a typical train pass-by. On our evaluation dataset, the optimal thresholds are 5 s for the event duration and 10 dB(A) for the relative amplitude. These thresholds do not wrongfully reject any real train pass-by on the evaluation dataset. However, some false positives also pass these thresholds, because truck or motorcycle LAeq,1s profiles do resemble train pass-by profiles. Therefore, a complementary classifier is built to handle the remaining misclassifications.
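A sketch of this filtering follows, with the thresholds quoted above; the boundary-average background reference in the emergence term reflects our reconstruction of Eq. (4) and is an assumption:

```python
DT = 0.48                  # s, reference time step
D_MIN, DL_MIN = 5.0, 10.0  # duration (s) and relative amplitude (dB(A)) thresholds

def keep_event(laeq, start, end):
    """Reject events that are too short or do not emerge enough, Eqs. (3)-(4)."""
    duration = (end - start) * DT                        # Eq. (3)
    background = 0.5 * (laeq[start] + laeq[end])         # assumed boundary reference
    emergence = max(laeq[start:end + 1]) - background    # Eq. (4)
    return duration >= D_MIN and emergence >= DL_MIN
```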

To complement the previous stages and obtain accurate results, we aim at improving the distinction between different types of vehicles. For this purpose, a complementary classifier has been developed internally. In various soundscapes, we collected and labeled data across five distinct classes: railway vehicles, roadway vehicles, construction machines, aircraft noise, and other surrounding noise. Around 14 hours of verified data thus form the training dataset of our complementary classifier. Apart from the last class, these sources emit similar noise when engine sounds prevail. Therefore, precise features are required to optimize the classification of these sound sources. To this end, the mel-spectrogram representation is often preferred to LAeq,δt.

We choose to fine-tune the YAMNet model with the aforementioned training dataset. This way, we benefit from the YAMNet pretraining and reduce the computational cost, as mel-spectrograms are already generated. Moreover, the sliding window over the signal brings multiple options to process the model outcome. To cope with poorly demarcated events and favor the loudest periods, a weighting scheme is implemented [Eq. (5)]. This equation describes how the series of frames during an event k is aggregated into a decision for the whole event,

P_k(c) = \frac{\sum_{t \in k} L_{Aeq,\delta t}(t)\, p_t(c)}{\sum_{t \in k} L_{Aeq,\delta t}(t)}, \quad (5)

where p_t(c) is the probability of class c predicted on frame t.

Thanks to this process, a single probability vector represents the LAeq,δt-weighted class likelihood for each event. A final threshold on the railway vehicle probability provides the decision to keep the event as a train pass-by or discard it. In the complementary classifier unit of Fig. 2, the railway vehicle probability is displayed for each event, and the color represents the final decision. Unless otherwise specified, the threshold value is constant for all recording sites. However, depending on measurement conditions, a site-by-site threshold adjustment increases overall performance.
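A sketch of this event-level decision follows; the weights match the reconstruction of Eq. (5) above, and the class index and 0.5 threshold are hypothetical placeholders, since the constant threshold value is not published in the text:

```python
import numpy as np

RAIL_CLASS = 0   # hypothetical index of the railway vehicle class
P_MIN = 0.5      # hypothetical constant decision threshold

def is_train_passby(frame_probs, laeq_frames):
    """Eq. (5): L_Aeq-weighted average of frame probabilities, then threshold."""
    w = np.asarray(laeq_frames, dtype=float)               # dB(A) levels as weights
    p_event = (w[:, None] * frame_probs).sum(axis=0) / w.sum()
    return p_event[RAIL_CLASS] >= P_MIN
```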

This section describes the evaluation process, from the labeled dataset to the computation of evaluation metrics. Two main aspects should be carefully assessed:

  • Method robustness to new and complex environments.

  • Validity of the railway noise contribution resulting from the automatic detection.

Both aspects shape the definition of our evaluation dataset, method, and metrics.

The evaluation dataset is crucial to estimate the value of our proposed automatic train pass-by detection method. For this purpose, two elements are prioritized when building our dataset: a large quantity and a wide variety of acoustic recordings. As a reminder, railway noise is evaluated over long representative periods, and most of our acoustic measurements last more than 24 h. The recording device (a class 1 sound level meter) stands between 20 and 50 m away from the railway line. With a daily average of around 80 train pass-bys (including high-speed "TGV" trains, freight, and passenger trains), 16 measurements are selected to bring the total number of expected events well over 1000. Above all, specific attention is paid to the soundscape around each recording site, split into three categories:

  • Urban: High level of background noise due to various sound sources. High probability of overlapping sources.

  • Rural (with road traffic): High level of background noise due to high-speed roads near the recording area.

  • Rural (without road traffic): Low level of background noise. Low probability of overlapping sound sources.

Eight urban and eight rural soundscapes (four with and four without road traffic) form the evaluation dataset. This context identification enables a deeper analysis of the strengths and weaknesses of the process, via the method described below.

To assess SED systems, frame-based and event-based approaches can be considered.12 Event-based evaluation is easier to interpret but does not provide any information about event demarcation. Thus, the frame-based approach is favored for assessing intermediate results. It consists of a frame-by-frame comparison between the system output and the ground truth. Ultimately, the final detection prediction is also evaluated with the event-based approach.

First, precision and recall respectively evaluate the method's tendency to include false positives and false negatives. The F1-score, harmonic mean of precision and recall, is also computed.
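On the common δt grid, these frame-based metrics reduce to counting agreements between two boolean activity masks; a minimal sketch:

```python
import numpy as np

def frame_metrics(pred, truth):
    """Frame-based precision, recall, and F1-score of two boolean masks."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    tp = np.sum(pred & truth)    # correctly detected frames
    fp = np.sum(pred & ~truth)   # false alarms
    fn = np.sum(~pred & truth)   # missed frames
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)
```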

Finally, we also compare the railway noise contribution resulting from our method to the one obtained with the reference annotation. This verifies that the most annoying pass-bys are properly classified and demarcated. The Lden,railway indicator (see its definition in the supplementary material) is chosen for this comparison. In substance, it provides an equivalent noise level over train pass-bys, with penalties during the evening and night. The absolute difference between the Lden,railway obtained through the reference annotation and through the proposed system is noted |ΔLden,railway|.
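The exact Lden,railway definition is deferred to the supplementary material; as an illustration only, the sketch below applies the standard Lden penalties (+5 dB(A) in the evening, 19–23 h; +10 dB(A) at night, 23–7 h) to the per-frame levels of detected pass-bys, a simplification of the actual indicator:

```python
import numpy as np

def lden_railway(levels_db, hours):
    """Penalized energetic average of per-frame levels (simplified sketch)."""
    h = np.asarray(hours)
    penalty = np.where((h >= 19) & (h < 23), 5.0,            # evening penalty
               np.where((h >= 23) | (h < 7), 10.0, 0.0))     # night penalty
    energy = 10.0 ** ((np.asarray(levels_db) + penalty) / 10.0)
    return 10.0 * np.log10(energy.mean())
```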

First, we evaluate the progress through the three method stages (Figs. 1 and 2) and compare it to existing models. As established earlier (Sec. 3.2), it is more convenient to assess intermediate results using the frame-based evaluation. Table 1 shows the frame-based precision, recall, and F1-score for major SOA classifiers trained on AudioSet.25 Table 2 shows the same information after each stage of the proposed method.

Table 1. Frame-based precision, recall, and F1-score on the evaluation dataset for SOA classifiers trained on AudioSet (Ref. 25).

Model               No. of parameters (M)   Precision (%)   Recall (%)   F1-score (%)
YAMNet (Ref. 23)    3.7                     6.1             84.1         10.9
E-PANNS (Ref. 24)   24.3                    9.5             80.7         15.1
AST (Ref. 16)       88.1                    21.1            69.1         27.1
Table 2. Evolution of the frame-based precision, recall, and F1-score on the evaluation dataset through the three stages of our method.

Method stage                                 Precision (%)   Recall (%)   F1-score (%)
Preclassification (YAMNet)                   6.1             84.1         10.9
Event boundary determination and filtering   54.1            92.3         67.5
Complementary classification (final)         77.2            90.0         82.8
Tables 1 and 2 clarify the purpose and relevance of each step. First, Table 1 shows that, despite a significantly lower number of parameters, YAMNet provides an equivalent or higher recall than E-PANNS or AST. Recall is the most crucial metric when assessing a preclassifier: we want to avoid missing events and mostly rely on the following steps to discard false positives. Using YAMNet preclassified frames as input, our second stage improves the frame-based recall from 84% to 92% and the frame-based precision from 6% to 54%. Finally, after complementary classification, a 77% precision and 90% recall are reached. These results represent a substantial improvement compared to the direct application of pretrained classifiers.

To complement these first observations, Table 3 provides an event-based evaluation differentiating the three formerly defined measurement contexts. Only the final prediction is assessed in Table 3. The F1-score with optimal threshold refers to the average F1-score obtained when the railway vehicle probability threshold is adjusted site-by-site to maximize accuracy. All other metrics are computed with a constant railway vehicle probability threshold for all recording sites.

Table 3. Event-based precision, recall, F1-score (with constant and with site-optimal thresholds), and |ΔLden,railway| of the proposed train pass-by detection method. Results are provided per measurement context; global results refer to the whole evaluation dataset.

Measurement context          Precision (%)   Recall (%)   F1-score (%)   F1-score, optimal threshold (%)   |ΔLden,railway| avg [dB(A)]   |ΔLden,railway| max [dB(A)]
Rural without road traffic   92.1            92.5         92.3           95.2                              0.125                         0.3
Rural with road traffic      52.4            83.5         64.4           78.3                              0.200                         0.6
Urban                        45.8            95.7         62.0           80.9                              0.700                         1.2
Global results               59.0            91.9         70.2           83.9                              0.431                         1.2

These results underscore the challenge of detecting train pass-bys in the presence of road traffic noise, with a significant precision gap between soundscapes with road traffic (48% on average) and those without (92%). However, recall remains high in every context, meaning that the expected train pass-bys are detected. Also, adjusting the threshold site-by-site largely improves the F1-score in complex soundscapes (+19 points in the urban context). In terms of railway noise contribution, the average |ΔLden,railway| is low, meaning that our automatic detection closely matches the reference. The maximum observed gap, 1.2 dB(A), is relatively low for noise contribution estimations.

In this paper, a new acoustic-based automatic train pass-by detection method is proposed. The open-source audio classifier YAMNet is used to identify vehicle noise within long-term audio signals. Despite a low precision (6%) on our evaluation dataset, it provides a higher recall (84%) than E-PANNS (81%) or AST (69%) at a lower computational cost. After the YAMNet preclassification, two successive stages are introduced to improve detection performance, especially precision.

First, an event boundary determination and filtering algorithm is applied. It relies on the A-weighted sound level evolution to extend preclassified frames into complete and continuous pass-by events. Two criteria on sound level properties are also introduced to discard events that largely differ from train pass-bys. Overall, this stage brings a large precision improvement (+48 points) compared to the preclassification. Second, a complementary classifier is introduced to identify train pass-bys among the remaining events. Trained on internally collected and labeled data, this classifier distinguishes railway vehicles from other sources, including road vehicles, construction machines, and aircraft noise. Applying this classifier results in a further large precision improvement (+23 points) while maintaining a high recall.

On our evaluation dataset, the complete method provides a frame-based precision of 77.2% and a recall of 90%. Above all, the estimated railway noise contribution is very close to the one obtained manually: the average difference on the Lden,railway indicator is around 0.43 dB(A). This demonstrates the reliability of the method for railway noise contribution estimation. Nonetheless, further improvements could focus on refining event boundaries and reducing computational cost.

See the supplementary material for a brief introduction to the Lden,railway indicator and its use.

This work was supported by ACOUSTB and GIPSA-Lab, as part of a collaborative Ph.D. thesis within an ANRT framework.

The authors have no conflicts to disclose.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

1. S. Stansfeld, M. Haines, and B. Brown, "Noise and health in the urban environment," Rev. Environ. Health 15(1), 43–82 (2000).
2. E. Daniel, "Noise and hearing loss: A review," J. School Health 77(5), 225–231 (2007).
3. X. Li, Q. Dong, B. Wang, H. Song, S. Wang, and B. Zhu, "The influence of occupational noise exposure on cardiovascular and hearing conditions among industrial workers," Sci. Rep. 9(1), 11524 (2019).
4. S. A. Stansfeld and M. P. Matheson, "Noise pollution: Non-auditory effects on health," Br. Med. Bull. 68(1), 243–257 (2003).
5. J.-P. Faulkner and E. Murphy, "Estimating the harmful effects of environmental transport noise: An EU study," Sci. Total Environ. 811, 152313 (2022).
6. World Health Organization, "Environmental noise guidelines for the European region" (2018).
7. T. A. Gilani and M. S. Mir, "A study on the assessment of traffic noise induced annoyance and awareness levels about the potential health effects among residents living around a noise-sensitive area," Environ. Sci. Pollut. Res. 28(44), 63045–63064 (2021).
8. E. Cuadra, W. Sperry, and W. Roper, "Regulation of transportation noise in the United States," J. Sound Vib. 43(2), 449–458 (1975).
9. D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events," IEEE Trans. Multimedia 17(10), 1733–1746 (2015).
10. T. K. Chan and C. S. Chin, "A comprehensive review of polyphonic sound event detection," IEEE Access 8, 103339–103373 (2020).
11. S. U. Rahman, A. Khan, S. Abbas, F. Alam, and N. Rashid, "Hybrid system for automatic detection of gunshots in indoor environment," Multimed. Tools Appl. 80(3), 4143–4153 (2021).
12. A. Mesaros, T. Heittola, T. Virtanen, and M. D. Plumbley, "Sound event detection: A tutorial," IEEE Signal Process. Mag. 38(5), 67–83 (2021).
13. H. Purwins, B. Li, T. Virtanen, J. Schlüter, S.-Y. Chang, and T. Sainath, "Deep learning for audio signal processing," IEEE J. Sel. Top. Signal Process. 13(2), 206–219 (2019).
14. E. Cakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 1291–1303 (2017).
15. Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, "Sound event detection of weakly labelled data with CNN-transformer and automatic threshold optimization," IEEE/ACM Trans. Audio Speech Lang. Process. 28, 2450–2460 (2020).
16. Y. Gong, Y.-A. Chung, and J. Glass, "AST: Audio Spectrogram Transformer," in Proceedings of Interspeech (2021), pp. 571–575.
17. P. Verma and J. Berger, "Audio transformers: Transformer architectures for large scale audio understanding. Adieu convolutions," arXiv:2105.00335 (2021).
18. L. Lhoest, M. Lamrini, J. Vandendriessche, N. Wouters, B. da Silva, M. Y. Chkouri, and A. Touhafi, "Mosaic: A classical machine learning multi-classifier based approach against deep learning classifiers for embedded sound classification," Appl. Sci. 11(18), 8394 (2021).
19. E. Cakır and T. Virtanen, "Convolutional recurrent neural networks for rare sound event detection," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017) (2017), pp. 27–31.
20. M. Molina-Moreno, D. de la Prida, L. A. Azpicueta-Ruiz, and A. Pedrero, "A noise monitoring system with domain adaptation based on standard parameters measured by sound analyzers," Appl. Acoust. 218, 109892 (2024).
21. E. Browning, R. Gibb, P. Glover-Kapfer, and K. E. Jones, "Passive acoustic monitoring in ecology and conservation," in Conservation Technology: Acoustic Monitoring, WWF Conservation Technology Series 1(2) (WWF-UK, Woking, UK, 2017).
22. A. O. Albaji, R. Bt. A. Rashid, and S. Z. Abdul Hamid, "Investigation on machine learning approaches for environmental noise classifications," J. Electr. Comput. Eng. 1, 3615137 (2023).
23. E. Tsalera, A. Papadakis, and M. Samarakou, "Comparison of pre-trained CNNs for audio classification using transfer learning," J. Sensor Actuator Netw. 10(4), 72 (2021).
24. A. Singh, H. Liu, and M. D. Plumbley, "E-PANNS: Sound recognition using efficient pre-trained audio neural networks," in Inter-Noise and Noise-Con Congress and Conference Proceedings (Institute of Noise Control Engineering, Wakefield, MA, 2023), Vol. 268, pp. 7220–7228.
25. J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in IEEE International Conference on Acoustics, Speech and Signal Processing (2017), pp. 776–780.
26. A. Kirmse and J. de Ferranti, "Calculating the prominence and isolation of every mountain in the world," Prog. Phys. Geogr. 41(6), 788–802 (2017).
