Today's low-cost automotive ultrasonic sensors perform distance measurements of obstacles within the close range of vehicles. For future parking assist systems and autonomous driving applications, the performance of the sensors should be further increased. This paper examines the processing of sensor data for the classification of different object classes and traversability of obstacles using a single ultrasonic sensor. The acquisition of raw time signals, transformation into time-frequency images, and classification using machine learning methods are described. Stationary and dynamic measurements at a velocity of 0.5 m/s of various objects have been carried out in a semi-anechoic chamber and on an asphalt parking space. We propose a scalogram-based signal processing chain and a convolutional neural network, which outperforms a LeNet-5-like baseline. Additionally, several methods for offline and online data augmentation are presented and evaluated. It is shown that carefully selected augmentation methods are useful to train more robust models. Accuracies of 90.1% are achieved for the classification of seven object classes in the laboratory and 66.4% in the outdoor environment. Traversability is correctly classified at an accuracy of 96.4% and 91.5%, respectively.

Recent developments in driver assist systems and automated driving have led to a growing need for advanced environmental sensing. In addition to sensors such as cameras, radar, or lidar, ultrasonic sensors are well suited because of their low production costs, robustness, and widespread use. Hence, improving the performance of the sensors is of great interest. Today's automotive ultrasonic sensors use the pulse-echo method to calculate the distance to obstacles in the near-field for parking and maneuvering applications.1 For future products, a classification of obstacles is desirable.

Generally, classification tasks using ultrasonic sensors are more common in medical imaging for human tissue analysis,2,3 non-destructive material testing (NDT) in solid bodies,4,5 or underwater sonar.6 In these applications, transducer arrays are typically used.7 Using transducer arrays allows beamforming, which provides advantages such as directional scanning of the environment and enhanced signal-to-noise ratios (SNRs).8 Due to increased hardware requirements for the processing of multiple channels and high power consumption, these sensors are costly compared to automotive ultrasonic sensors, which usually consist of only a single piezoelectric element.1 So far, the usage of low-cost automotive sensors for classification tasks has been poorly addressed. However, there has been some research on classifying terrain surfaces9,10 and the height of obstacles.11 Further, the detection of human beings using an ultrasonic sensor is successfully performed by Sonia et al.12 Promising results regarding classification tasks using an ultrasonic sensor have also been achieved by Bouhamed et al.13 in distinguishing floors and staircases.

In nature, the classification of acoustic echoes is seen in bats, who are able to detect their prey by emitting frequency-modulated ultrasonic calls.14,15 Some blind people have learned to use so-called click sonar allowing them to detect the position, size, shape, and even material substance of obstacles.16 Findings in biosonar and human echolocation can serve as an inspiration for the signal processing and feature extraction in ultrasonic sensing. Based on a biosonar model, Ming and Simmons17 have shown that time-frequency features are useful for echo classification and target geometry estimation. Inspired by bats, Riopelle et al.9 also successfully use features based on time-frequency images for the classification of ultrasonic echoes by a mobile robot. Sonia et al.12 extract time domain features, such as standard deviation and root mean square values, and frequency domain features via fast Fourier transform (FFT) from ultrasonic echoes to classify human beings. It is shown that the FFT outperforms other feature inputs in terms of classification accuracy, showing the significance of spectral features. Furthermore, the relevance of time-frequency images for classification has been shown in related fields such as underwater sonar6 or acoustic event classification.18,19

Deep neural networks (DNNs) for classification tasks have become very popular in computer vision and natural language processing.20,21 Currently, DNNs are also successfully applied in the field of acoustics.22 For the classification of acoustic echoes, Ming and Simmons17 compare a convolutional neural network (CNN) and a recurrent neural network (RNN). The CNN model slightly outperformed the RNN in terms of classification accuracy. Pöpperl et al.11 successfully use a capsule neural network for classifying the height of obstacles using an automotive ultrasonic sensor. The suitability of CNNs for acoustic echo classification is also shown in several works in the field of underwater sonar.6 For example, Castro-Correa et al.23 implement residual CNNs to perform underwater source localization and seabed classification based on simulated data as well as on at-sea measured data. To improve the model's performance, they use data augmentation methods on time-frequency images. It is shown that time stretching, flipping, time masking, frequency masking, and combinations positively impact the results on their dataset, increasing classification accuracy by 1%–4%. Another motivation for the use of data augmentation on acoustic signals and time-frequency representations can be found in the field of environmental sound classification. Salamon and Bello24 use data augmentation on audio time signals. They obtain more robust classification models and observe class-specific performance of the augmentation methods. Nanni et al.25 compare the performance of various offline augmentation methods applied on raw time signals and online augmentation methods applied on spectrograms for animal audio classification using CNNs. They conclude that augmentation methods must be carefully selected depending on application domain and the nature of the data. Several studies have shown that common augmentation methods from the field of computer vision cannot be applied to time-frequency images without hesitation.25–27 Therefore, more physics-motivated methods are applied and evaluated in this work.

This study examines the use of a low-cost automotive ultrasonic sensor for object classification in the close range of vehicles. We propose a processing chain to classify ultrasonic backscatter using bandpass filtering and the continuous wavelet transform (CWT) for feature extraction. In addition to time-frequency images, the target distance via time of flight is used as an input feature to a CNN. For training and evaluation, a dataset including ultrasonic backscatter of various objects at different distances and angles has been recorded. Measurements were taken in a semi-anechoic chamber as well as in an asphalt parking space. Moreover, several methods for offline and online data augmentation are presented and evaluated. We expect that more robust models can be obtained using physics-motivated augmentation methods such as modification of the SNR and echo delay, Doppler-motivated pitch shifting, and the superposition of echoes.

This paper is organized as follows. The principles of airborne ultrasonic sensing as well as target properties carried by acoustic echoes are discussed in Sec. II. Section III gives an overview of the experimental setup, including sensors and data acquisition procedures. In Sec. IV, the methods for signal processing and feature extraction are described. Several data augmentation techniques on both raw time signals and scalogram images are presented in Sec. V. In Sec. VI, a CNN architecture for classifying the processed signals is proposed. The classification results considering different measurement environments and data augmentation techniques are discussed in Sec. VII. Our conclusions are drawn in Sec. VIII.

In automotive ultrasonic sensing, piezoelectric sensors are used to transmit ultrasonic pulses and receive echoes from obstacles. The distance can be calculated from the time of flight and the speed of sound in air. Typically, operating frequencies between 40 and 50 kHz are used: at higher frequencies, the sound absorption in air increases strongly, while at lower frequencies, many extraneous sound sources are present in the application environment.1 Moreover, frequency-modulated transmit signals (chirps) are well established, since range resolution and SNR can be improved via pulse compression. Details on conventional echo signal processing can be found in the literature.28,29

Transmitted ultrasonic pulses propagate through the air and are scattered or reflected at obstacles. The received echo's SNR can be described by the active sonar equation

\mathrm{SNR} = \mathrm{SL} - 2\,\mathrm{TL} + \mathrm{TS} - (\mathrm{NL} - \mathrm{DI}),   (1)

with the source level of the transmitted signal SL, the transmission loss TL, the target strength TS, the noise level at the receiver NL, and the sensor's directivity index DI.30 The transmission loss describes the signal intensity loss when propagating between source and target. It is counted twice since the sound travels from the source to the target and back. The transmission loss mainly consists of geometric spreading loss and sound absorption. In contrast to the geometric spreading loss, sound absorption is frequency-dependent and becomes more significant at high frequencies.31
The target strength is the ratio of the intensity I_echo of an echo, measured at a defined distance from the target, to the intensity I_incident of the impinging sound,

\mathrm{TS} = 10 \log_{10} \left( \frac{I_{\mathrm{echo}}}{I_{\mathrm{incident}}} \right).   (2)

The target strength depends on the size of the target and the impedance difference to the propagation medium. The impedance of most solid objects is large compared to that of air. Thus, echoes consist primarily of specular reflections from surfaces. Consequently, target geometry is the predominant factor for the target strength, characterized by the size of the reflective surface facing the incident sound (acoustic cross section).32 Depending on the geometry of an object, the target strength can also vary with the direction of ensonification. To obtain an effective target strength, the wavelengths of the incident sound must be shorter than the circumference of the reflective surface.31

Interfering noise consists of extraneous sound sources, ambient noise, and clutter reflections. Relevant extraneous sound sources in automotive ultrasonic sensing are, for example, compressed air noises or metallic grating noises.1 Clutter consists of echoes from scatterers that are not of interest, especially asphalt backscatter, thus, increasing with the intensity of the transmitted signal.32 Furthermore, the amount of clutter and ambient noise is restricted depending on the sensor's directivity index. Since the incident beam widens as it propagates in air, more clutter reflections are included in the overall backscatter at greater distances.

In practice, obstacles mostly consist not of a single scattering point but of several sub-reflectors, so-called highlights. Small distances between these highlights lead to interference between the individual echoes. In the received signal, this results in a combined delay-spread echo waveform whose duration exceeds that of the transmitted waveform. Additionally, Doppler spreading occurs when the transmitter or the object is moving or when the individual highlights have different velocities.33 Delay and Doppler spreading cause interference patterns in the combined echo signal that are defined by the object's unique features. These interference patterns are evident in the temporal and spectral structure of the received signal and show up as ripples and notches in time-frequency images, depending on the spacing of the object's highlights.34

The distance and direction of a scatterer to the receiver affect the overall amplitude of the echo due to the transmission loss and directivity index of the sensor [Eq. (1)]. In addition, high frequencies are absorbed with distance and can be further attenuated by sensor directivity. Further, the backscatter is affected by the surface impedance of the object. It has been shown that human echolocators, bats, and dolphins are able to discriminate the substance of a scatterer.16,35–38

For the measurements, a typical automotive ultrasonic sensor is used as the transmitter, and a GRAS 40BE 1/4 in. condenser microphone (GRAS Sound and Vibration, Holte, Denmark) is used as the receiver. The piezoelectric element of the ultrasonic sensor can also operate as a receiver, but the sensor's hardware is limited in capturing raw time signals: within the automotive ultrasonic sensor, the transducer element and the associated electronics operate as a unit that outputs echo points, which are a very compressed representation of the transducer signal. However, for this research study, the use of raw time signals for classification is crucial, so the microphone as a receiver has been used as a straightforward, practical solution. The sensor's transfer function is applied to the acquired microphone signals to impose its frequency response. Further characteristics such as the directivity index or self-noise of the sensor are not taken into account. The sensor is mounted at a height of 0.5 m on a linear rail, which allows stationary as well as dynamic measurements at a velocity of 0.5 m/s. The microphone is located directly above the sensor. The measurement setup is shown in Fig. 1.

FIG. 1. (Color online) Measurement setup: A conventional ultrasonic sensor and a GRAS 40BE 1/4 in. microphone are mounted on a linear rail to allow static and dynamic measurements.

A frequency-modulated pulse ranging from 42.5 to 52.5 kHz with a signal length of 1.6 ms is transmitted at an interval of 50 ms. At a speed of sound of 343 m/s, wavelengths of about 0.7–0.8 cm are given, which allows even small surfaces to be detected. For data acquisition, a sample rate of 215 kilosamples/s is used to satisfy the Nyquist criterion.39 The measurements are taken both in a semi-anechoic chamber (from now on referred to as lab) and in an asphalt parking space (from now on referred to as field) to compare classification performances in a controlled environment with low disturbances and a more real-world scenario. The captured time signal of the backscattering of a tube in the lab is shown in Fig. 2.
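For illustration, a minimal Python sketch of such a transmit pulse is given below. A linear sweep law is assumed here (the actual modulation of the sensor is not specified in the text), and the variable names are hypothetical.

```python
import numpy as np
from scipy.signal import chirp

FS = 215_000                       # sample rate in Hz (215 kS/s, as in the text)
PULSE_LEN = 1.6e-3                 # pulse duration in s
F_START, F_STOP = 42_500, 52_500   # sweep band in Hz
BURST_INTERVAL = 50e-3             # pulse repetition interval in s

t = np.arange(0, PULSE_LEN, 1 / FS)
# A linear frequency sweep is assumed; only the band and duration are
# taken from the text.
tx_pulse = chirp(t, f0=F_START, t1=PULSE_LEN, f1=F_STOP)
```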

FIG. 2. Time signal of ultrasonic backscattering in the lab. The direct signal, characterized by the highest amplitude, is followed by some ground reflections and the object-specific backscattering of a tube with a circular cross section.

Thirty objects were selected that can be found in parking or maneuvering situations, including pedestrians (for the sake of readability, pedestrians are also referred to as "objects" in this paper). The objects are categorized into seven object classes and can further be aggregated into a binary classification considering whether an object is traversable by a vehicle or not (see Table I). "No object" refers to measurements where an empty scene is recorded, to be able to determine the presence of objects. For the pedestrians, three subjects participated, each positioned once frontally and once laterally to the sensor, resulting in six instances of pedestrians. The object positions and the corresponding division into train and test positions are illustrated in Fig. 3. Each object was measured at 55 positions at different distances and incidence angles to the sensor, according to the sensor's radiation pattern and the dimensions of the semi-anechoic chamber. This results in radial positions ranging between 75 and 225 cm and ±45° to the sensor. Sixty static and 91 dynamic measurements were taken at each object position, resulting in a total of 249 150 hand-labeled measurements per environment. It can be stated that static measurements at a single object position are significantly more correlated than the dynamic ones and differ primarily by external disturbances such as noise. We perform a position-based train/test split, i.e., measurements at 44 of the 55 positions of each object are used for the training dataset and 11 positions for the test dataset.

TABLE I. Categorization of the objects into seven object classes and two traversability classes. The numbers in parentheses after the class names indicate the number of corresponding objects within a class.

Object class        Traversability class
No object (4)       Traversable (16)
Bag (4)
Small object (4)
Curb (4)
Tree (4)            Not traversable (14)
Tube/pole (4)
Pedestrian (6)
FIG. 3. Object positions, divided into positions for the training set (white) and positions for the test set (dark gray).

Several preprocessing steps are applied before the signals are passed to the CNN. The proposed processing differs fundamentally from conventional signal processing in automotive ultrasonic sensing: whereas conventional processing generally determines the distance to objects using thresholding methods in the time domain,40 object classification additionally relies on the spectral patterns of the echoes.

After sampling the ultrasonic signals, a bandpass filter is used to remove unwanted noise and to limit the signals to the relevant frequency range. This prevents unwanted signal components, such as environmental sounds in the audible frequency range, from being passed to further processing. The cutoff frequencies are chosen slightly wider than the frequency range of the transmitted signal to ensure that relevant backscatter is not suppressed. Accordingly, we use a finite impulse response (FIR) bandpass filter with a lower passband frequency of 40 kHz, an upper passband frequency of 55 kHz, and a stopband attenuation of 100 dB.
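A minimal sketch of such a filter using SciPy is shown below. The filter order, the Kaiser window, and zero-phase filtering are assumptions of this sketch; the text only specifies the passband and the 100 dB stopband attenuation.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

FS = 215_000  # sample rate in Hz

def bandpass_filter(x, fs=FS, f_lo=40_000, f_hi=55_000, numtaps=301):
    """FIR bandpass with a 40-55 kHz passband.

    The tap count and Kaiser beta are illustrative choices aiming at the
    stated stopband attenuation, not values from the paper.
    """
    taps = firwin(numtaps, [f_lo, f_hi], pass_zero=False, fs=fs,
                  window=("kaiser", 8.6))
    return filtfilt(taps, [1.0], x)  # zero-phase filtering (assumption)
```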

A time window of 768 samples is then cut out containing the backscattering time periods to be classified. The length of the window is found to be sufficient to include all backscatterings of the relevant objects. In this study, the windows are chosen based on the known distances to the objects in our dataset. Hence, we ensure correct labeling and comparable class sizes, which is crucial for the training process. Additionally, it is intended to avoid the risk of losing echoes or even false positives using the standard pulse-echo method. When trained models are employed in practice, the windows can be obtained using a sliding window approach or the standard pulse-echo method.

Motivated by the findings in Sec. II about relevant features in the temporal structure as well as in the spectrum of an echo, we apply the CWT to the windowed time signals to extract time-frequency images. We expect that time-frequency resolution is crucial for the detection of relevant spectral patterns during classification. The CWT is a linear time-frequency transform that has improved time-frequency resolution compared to the standard short-time Fourier transform (STFT). Based on a chosen wavelet function ψ (with complex conjugate ψ*), the CWT of a signal x(t) is defined as

\mathrm{CWT}(\tau, a) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - \tau}{a}\right) \mathrm{d}t,   (3)

with the location of the wavelet τ and the wavelet scaling factor a. The scaling of the wavelet is wide for low frequencies and narrow for high frequencies, resulting in an improved time resolution for higher frequencies and an improved frequency resolution for lower frequencies.41–43 For the wavelet function, we choose a Morse wavelet44 with symmetry parameter γ = 30 and time-bandwidth product β = 1200. Since the CWT is complex-valued, we calculate the real-valued scalogram indicating the frequency-dependent amplitudes of the CWT,

S(\tau, a) = \left| \mathrm{CWT}(\tau, a) \right|.   (4)
Additionally, logarithmic compression of the amplitudes is applied to enhance low-energy components, ensuring that interferences of even small highlights are taken into account.
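A minimal sketch of the scalogram computation is given below. The generalized Morse wavelet used in the paper is available, e.g., in MATLAB's cwt; here a complex Morlet wavelet via PyWavelets is substituted purely for illustration, and the scale range, the log floor, and the omission of resizing to the 64 × 32 input are assumptions of this sketch.

```python
import numpy as np
import pywt

FS = 215_000  # sample rate in Hz

def scalogram(window, fs=FS, n_freqs=64):
    """Log-compressed CWT scalogram of a windowed echo signal.

    The 'cmor1.5-1.0' wavelet stands in for the Morse wavelet of the paper;
    the analyzed band (40-55 kHz) follows the bandpass range in the text.
    """
    freqs = np.linspace(40_000, 55_000, n_freqs)
    scales = pywt.central_frequency("cmor1.5-1.0") * fs / freqs
    coeffs, _ = pywt.cwt(window, scales, "cmor1.5-1.0", sampling_period=1 / fs)
    amplitude = np.abs(coeffs)           # scalogram as in Eq. (4)
    return np.log10(amplitude + 1e-6)    # logarithmic compression
```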

In Fig. 4, scalograms of static measurements of one object from each object class in the lab are shown. “No object” is not shown since it is noise only. In Fig. 4(a), the backscatter of a tube with a circular cross section of 7.6 cm diameter is illustrated. Two successive chirps can be seen, one starting at about 0.5 ms and the other at about 1.3 ms. The first reflection relates to a horizontal reflection and the latter to a diagonal reflection at the base of the tube. Likewise, the backscatter of every other object in our dataset being categorized as not traversable consists of at least two reflections since they all exceed the height of the sensor. Generally, diagonal reflections produce smaller echo amplitudes than horizontal reflections due to a greater distance to the sensor resulting in an increased transmission loss. The backscatter of a pedestrian in Fig. 4(b) is a more complex pattern with multiple, overlapping reflections due to many small highlights on the pedestrian's clothing and body. Although the pedestrian provides a larger acoustic cross section, the amplitudes are not significantly increased compared to the tube. This can be ascribed to sound absorption by the pedestrian's clothing. The scalogram of a curb in Fig. 4(c) shows a single reflection since there is only a diagonal reflection and no horizontal one due to the height of the object. Also, the amplitudes of the reflection are greater than those of the other objects. This can be explained by the beveled edge of the curb, resulting in an increased target strength in direction of the receiver compared to objects with a circular cross section. Figure 4(d) shows the backscatter of a tree, predominated by the reflection of the trunk, which is comparable to the tube in Fig. 4(a). However, the pattern is slightly more complex due to the rough surface of the tree as well as small, overlapping reflections from branches. The lowest amplitudes are obtained in Fig. 4(e), which shows the backscatter of a plastic bottle with a height of 21 cm from the class “small object.” Low amplitudes, which are more likely to be masked by noise and clutter, are also obtained in the backscatter of a plastic bag in Fig. 4(f). Compared to the plastic bottle, the duration of the bag's backscatter is increased due to many small highlights, which also leads to more interferences.

FIG. 4. (Color online) Scalograms of different objects' backscatter in the lab at 1 m on the 0° axis.

Before they are fed to the CNN, the scalogram images x_i are normalized using z-normalization,45

\tilde{x}_i = \frac{x_i - \mu}{\sigma},   (5)
where the mean μ and the standard deviation σ are calculated over the entire training dataset to preserve the amplitude ratios between training samples. In combination with the distance of an object, the amplitude ratios are relevant to draw conclusions about the acoustic cross section and, therefore, the size of an object.37 The scalograms are passed to the CNN as 64 × 32 pixel tensors. The distance to the object, which is calculated via the time offset between pulse and echo, is also fed into the CNN as a feature. The left edge of the considered time window is taken as the time of the echo. The processing chain from sampled time signal to object prediction is shown in Fig. 5.
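The normalization and the distance feature can be sketched as follows; the helper names are hypothetical, and the two-way time-of-flight conversion is an assumption consistent with the pulse-echo principle described in Sec. II.

```python
import numpy as np

def z_normalize(scalograms, mu=None, sigma=None):
    """Z-normalization per Eq. (5); mu and sigma are computed once over the
    entire training set (not per image) so that amplitude ratios between
    samples are preserved."""
    if mu is None or sigma is None:
        mu, sigma = scalograms.mean(), scalograms.std()
    return (scalograms - mu) / sigma, mu, sigma

def echo_distance(window_start_sample, fs=215_000, c=343.0):
    """Distance feature from the time of flight of the left window edge."""
    time_of_flight = window_start_sample / fs
    return time_of_flight * c / 2.0  # two-way travel
```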
FIG. 5. (Color online) Processing steps from sampled time signal to object prediction. A scalogram image and a distance scalar are fed into the CNN for each time window.

Data augmentation refers to methods in machine learning that can improve the robustness of a model by reducing overfitting and overcoming the problem of sparse training data.46 This is achieved by enriching the training dataset with modified copies of the original data. In computer vision, images are often rotated, flipped, recolored, etc.47 However, these methods can be problematic when applied to time-frequency images, since they have no physically analogous counterpart in the domain of sound propagation and backscattering; physically motivated methods are preferred. In the following, the methods applied in this work are described; they can be divided into offline and online augmentation methods. Offline augmentation refers to methods that are applied before the training process.48 This results in high disk space requirements, since a copy of the original signal with the modifications applied has to be saved. Afterward, the enriched training dataset can be loaded in smaller batches for the training process. In this work, offline augmentation is used for methods that are applied to the windowed time signals before they are transformed via the CWT. In contrast, online augmentation refers to methods that are applied during training after a small batch of training samples has been loaded temporarily into main memory.48 Hence, no additional hard disk space is required, but the training duration increases since the augmentation is processed each time a data batch is loaded. Since the CWT is applied before the training starts, online augmentation is used for methods that are applied to the scalogram images.

Offline augmentation refers to methods that we apply to the windowed time signals before they are transformed via CWT and before starting the training process. Time signals with the applied methods that are described in the following are illustrated in Fig. 6.

FIG. 6. Time signals with the applied offline data augmentation methods.

1. White noise

The injection of white Gaussian noise leads to a modified SNR, which reflects changes in the environmental noise contribution or the backscattering cross section of an object. Since the SNR depends on the distance between sensor and object, we consider this factor for the injection of the generated noise. According to the geometric spreading loss, we add bandpass-filtered white Gaussian noise n(t) to the signals depending on the distance as follows:

\tilde{x}(t) = x(t) + \left( \frac{r_1}{r_2} \right)^{2} n(t),   (6)

with the closest distance r_1 = 0.75 m and r_2 representing the distance of the considered time window. Atmospheric attenuation is not taken into account in this context, since it is more critical for larger distances.49 For the generated noise, we choose a noise power of 0.3 mW. The signal power of the bandpass-filtered noise n(t) is about 0.7 mW. For comparison, the average echo signal power of a tube with a circular cross section at 0° angle and 0.75 m distance is 4.9 mW.
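A minimal sketch of this augmentation is given below. The quadratic (round-trip) spreading exponent and the mapping of the stated noise power to a per-sample standard deviation are assumptions of this sketch, and `bandpass_filter` refers to the FIR filter sketched in Sec. IV.

```python
import numpy as np

def add_distance_scaled_noise(x, r2, r1=0.75, noise_power=0.3e-3,
                              fs=215_000, rng=np.random):
    """Offline white-noise augmentation in the spirit of Eq. (6)."""
    n = rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    n = bandpass_filter(n, fs=fs)          # restrict the noise to 40-55 kHz
    return x + (r1 / r2) ** 2 * n          # weaker noise at larger distances
```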

2. Clutter

Another method for reducing the SNR is to add ground clutter, which is performed by mixing the signal with the recording of an empty scene. This results in greater interference for the outdoor environment due to the reflective granularity of the asphalt. We add clutter c(t) from time windows with the same distance values. The new signal can be described as

\tilde{x}(t) = x(t) + A \, c(t),   (7)

where we chose a clutter factor of A = 0.1.
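In code, this mixing step is a one-liner; the function name is hypothetical.

```python
def add_clutter(x, clutter_window, A=0.1):
    """Clutter augmentation per Eq. (7): mix the signal with an empty-scene
    recording taken from a time window with the same distance value."""
    return x + A * clutter_window
```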

3. Pitch shifting

According to the Doppler effect, pitch shifting can be interpreted as a change in the velocity of the sensor. We aim to simulate an increased velocity for the static measurements and a reduced velocity for the dynamic measurements. To obtain the corresponding Doppler shifts, we perform linear interpolation of the windowed time signals to adjust their length to a target number of samples N_shifted. Since we have a static target and a moving source and receiver, we calculate the target number of samples as

N_{\text{shifted}} = N_{\text{original}} \, \frac{c \mp v}{c \pm v},   (8)

with the speed of sound c, the velocity of the sensor v, and the number of samples of the original window N_original; the upper signs correspond to simulating an increased velocity and the lower signs to a reduced velocity. The length is decreased by 3 samples for simulating the dynamic measurements with a velocity of 0.5 m/s and increased by 3 samples for simulating the static measurements using the dynamic ones as a basis. This corresponds to a frequency shift of about 139 Hz for a base frequency of 47.5 kHz, which is the center frequency of the transmitted signal. To obtain the same window length as before, zero-padding or truncation of the samples is performed. By using zero-padding instead of adding the true samples of the signal, we avoid adding signal components that are not relevant for the classification task.
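A minimal sketch of this resampling step is shown below. The rounding of the resampled length (and hence the exact sample counts) is an assumption of this sketch and may differ slightly from the values stated above.

```python
import numpy as np

def doppler_pitch_shift(x, v, c=343.0, to_faster=True):
    """Doppler-motivated pitch shift via linear interpolation, Eq. (8).

    to_faster=True compresses the window (simulating a higher sensor
    velocity); to_faster=False stretches it. Zero-padding or truncation
    restores the original window length.
    """
    n_orig = len(x)
    ratio = (c - v) / (c + v) if to_faster else (c + v) / (c - v)
    n_shift = int(round(n_orig * ratio))
    resampled = np.interp(np.linspace(0, n_orig - 1, n_shift),
                          np.arange(n_orig), x)
    if n_shift < n_orig:
        return np.pad(resampled, (0, n_orig - n_shift))  # zero-padding
    return resampled[:n_orig]                            # truncation
```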

Online augmentation methods are applied to the scalogram images during the training process. In Fig. 7, the methods are illustrated, and they are explained in the following.

FIG. 7. (Color online) Example of a scalogram with the applied online data augmentation methods. Note that time and frequency masking are applied independently and that it is also differentiated between multi-class and single-class mixup.

1. Frequency masking

SpecAugment, introduced by Park et al.,50 involves time warping, time masking, and frequency masking on time-frequency images. Since time warping is not reasonable in our case, we only perform frequency masking and time masking. This can be seen as a loss of information in single frequency or time segments and can improve the generalization of the CNN by preventing it from focusing on the same sections of the images during training. For frequency masking, f consecutive frequency bins [f_0, f_0 + f) are masked using the mean pixel value of the image. Here, f is chosen from a uniform distribution [0, F] and f_0 from [0, f_total − f], where f_total is the total number of frequency bins, and we set the hyperparameter F = 8 as the maximum number of masked frequency bins (a combined sketch of both masking methods follows the description of time masking below).

2. Time masking

Analogous to frequency masking, for time masking, t consecutive time steps [t_0, t_0 + t) are masked. Here, t is chosen from a uniform distribution [0, T] and t_0 from [0, t_total − t], where t_total is the total number of time steps, and we set the hyperparameter T = 10 as the maximum number of masked time steps.
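A minimal sketch covering both masking methods is given below. Applying both masks in a single call is a simplification of this sketch (the paper applies them independently), and the (freq, time) axis convention is an assumption.

```python
import numpy as np

def spec_mask(img, max_freq_bins=8, max_time_steps=10, rng=np.random):
    """Frequency and time masking in the spirit of SpecAugment.

    img is a (freq, time) scalogram; masked regions are filled with the
    image mean, as described in the text.
    """
    out = img.copy()
    fill = img.mean()
    f_total, t_total = img.shape

    f = rng.randint(0, max_freq_bins + 1)     # number of masked frequency bins
    f0 = rng.randint(0, f_total - f + 1)
    out[f0:f0 + f, :] = fill

    t = rng.randint(0, max_time_steps + 1)    # number of masked time steps
    t0 = rng.randint(0, t_total - t + 1)
    out[:, t0:t0 + t] = fill
    return out
```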

3. Salt and pepper noise

For salt and pepper noise, each pixel of a scalogram is masked with a probability of p_SP = 0.05 and randomly set to either the minimum or the maximum pixel value. This can be interpreted as pixelwise information loss in the scalograms (see the sketch following the white noise description below).

4. White noise

Pixelwise white Gaussian noise is added to the scalogram images. The mean μ_Gauss of the Gaussian distribution is set to the mean pixel value of the current image. Based on the image's pixel variance σ²_img and the chosen noise intensity hyperparameter I = 0.05, the Gaussian variance is set to σ²_Gauss = σ²_img · I.
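Both pixel-level perturbations (Secs. V B 3 and V B 4) can be sketched as follows; the helper names are hypothetical.

```python
import numpy as np

def salt_and_pepper(img, p=0.05, rng=np.random):
    """Each pixel is replaced by the image minimum or maximum with
    probability p (pixelwise information loss)."""
    out = img.copy()
    mask = rng.random_sample(img.shape) < p
    out[mask] = rng.choice([img.min(), img.max()], size=mask.sum())
    return out

def gaussian_pixel_noise(img, intensity=0.05, rng=np.random):
    """Pixelwise Gaussian noise with mean equal to the image mean and
    variance sigma_img^2 * intensity, as described in the text."""
    sigma = np.sqrt(img.var() * intensity)
    return img + rng.normal(img.mean(), sigma, size=img.shape)
```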

5. Time shifting

The scalogram image is shifted by s pixel steps either to the left or to the right, which corresponds to a modified time of flight of the echoes. We choose s from a uniform distribution [1, S], where we set the maximum number of steps to S = 3 to ensure that object-related backscattering is not truncated. The image is vertically cut into two slices, either at t_cut = 0 + s or at t_cut = τ − s (with τ the total number of time steps), and the new image is obtained by inverting the order of the slices. The distance value is not adjusted, since a maximum shift of s = 3 only results in a change of the object distance of about 3 cm, which lies within the variance of the position of the backscatter in the time window.
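Cutting the image into two slices and inverting their order is equivalent to a circular shift along the time axis, which can be sketched in one line; the axis convention is an assumption of this sketch.

```python
import numpy as np

def time_shift(img, max_steps=3, rng=np.random):
    """Circularly shift the scalogram along the time axis by s pixels,
    with s drawn uniformly from [1, max_steps] and a random direction."""
    s = rng.randint(1, max_steps + 1)
    if rng.random_sample() < 0.5:
        s = -s
    return np.roll(img, s, axis=1)  # axis 1 = time axis of a (freq, time) image
```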

6. Multi-class and single-class mixup

Zhang et al.51 introduce mixup as a technique to produce new training samples by combining existing ones. For multi-class mixup, two random training images x_i and x_j are mixed to a new image

\tilde{x} = \lambda x_i + (1 - \lambda) x_j,   (9)

with the mixup factor λ ~ Beta(α, α), where we set α = 0.2. The corresponding labels y_i and y_j are mixed to a new multi-class label

\tilde{y} = \lambda y_i + (1 - \lambda) y_j.   (10)

To account for the distance feature, we only mix scalograms whose associated distance values satisfy d_i = d_j. In addition to multi-class mixup, we also perform single-class mixup, where only training samples with the same class label are mixed.
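A minimal sketch of the mixing step is given below. The restriction to pairs with equal distance values (and, for single-class mixup, equal labels) is left to the caller in this sketch.

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=np.random):
    """Mixup per Eqs. (9) and (10): blend two samples and their one-hot
    labels with a Beta(alpha, alpha)-distributed factor."""
    lam = rng.beta(alpha, alpha)
    x_new = lam * x_i + (1.0 - lam) * x_j
    y_new = lam * y_i + (1.0 - lam) * y_j
    return x_new, y_new
```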

CNNs are specialized neural networks that are widely used in computer vision to process image or video data. Meanwhile, CNNs are also applied to extract features from audio data19,52 or acoustic measurements.22 A fundamental part of CNNs are their convolutional layers, in which features are extracted from the input and transformed into so-called feature maps: Kernels are slid over the input, producing an output via element-wise multiplication with learned kernel weights. This results in the advantage of shared weights, namely, a significantly smaller number of trainable parameters compared to conventional neural networks and translation invariance in time or space.21,53,54

The proposed CNN architecture is shown in Fig. 8. We apply several convolutional and pooling layers to the scalogram input before we concatenate the flattened feature maps with the distance input. Note that we use batch normalization55 and the rectified linear activation function (ReLU)56 after each convolutional layer. A stride of 1 is used for the convolutional layers and a stride of 2 for the pooling layers. We use average pooling since we observed that it slightly outperformed max pooling in our experiments. Furthermore, zero-padding is applied for the convolutional layers to get the same input and output dimensions.

FIG. 8. Proposed CNN architecture. Next to the boxes, indicating the layers, the corresponding output shapes are listed. For the convolutional and pooling layers, the kernel shapes are given in the parentheses. The seven neurons in the output layer and the softmax activation function are dedicated to the classification of the object classes. For the binary classification, the number of neurons can be adjusted to a single neuron using the sigmoid activation function.
In the first convolutional layer, a wide kernel of shape 5 × 7 is applied to extract low-level features. After a pooling layer, we use two convolutional layers with kernel shapes 1 × 5 and 5 × 1, respectively, to separately capture dependencies in the time and frequency directions. The approach of non-rectangular kernel shapes for time-frequency representations in CNNs has also been successfully applied for, e.g., music classification57 and audio scene classification.58 Another pooling layer is followed by two convolutional layers with small 3 × 3 kernels to extract high-level features, each followed by a pooling layer. The feature maps are then flattened, concatenated with the distance feature, and given to the classifier part, which consists of two fully connected layers with 256 and 7 neurons, mapping to the desired output shape of C = 7 object classes. For the fully connected layers, we use dropout46 with a rate of 0.1 to reduce overfitting. The ReLU function is applied after the first fully connected layer and softmax activation on the last layer to obtain relative probabilities for the classes. To perform binary classification considering traversability, the shape of the last fully connected layer can be modified to one neuron using the sigmoid activation function. For the training process, we use stochastic gradient descent (SGD) as an optimizer and the cross-entropy loss function59

L = -\sum_{y} p(y) \log q(y),   (11)

where p(y) is the target distribution and q(y) the estimated distribution of the labels y.
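For illustration, a PyTorch sketch of a two-input network following this description is given below. The framework choice and the channel counts (16/32/64) are assumptions of this sketch, since the latter are only given in Fig. 8; the kernel shapes, pooling, dropout rate, and dense layer sizes follow the text.

```python
import torch
import torch.nn as nn

class EchoCNN(nn.Module):
    """Sketch of the proposed two-input CNN (scalogram + distance scalar)."""
    def __init__(self, n_classes=7):
        super().__init__()
        def block(c_in, c_out, k):
            # Convolution with zero-padding, batch normalization, and ReLU.
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, k, padding="same"),
                nn.BatchNorm2d(c_out), nn.ReLU())
        self.features = nn.Sequential(
            block(1, 16, (5, 7)), nn.AvgPool2d(2),
            block(16, 32, (1, 5)), block(32, 32, (5, 1)), nn.AvgPool2d(2),
            block(32, 64, (3, 3)), nn.AvgPool2d(2),
            block(64, 64, (3, 3)), nn.AvgPool2d(2),
            nn.Flatten())
        # 64 x 32 input and four stride-2 poolings -> 64 channels of 4 x 2.
        self.classifier = nn.Sequential(
            nn.Linear(64 * 4 * 2 + 1, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, n_classes))  # softmax is applied inside the loss

    def forward(self, scalogram, distance):
        feats = self.features(scalogram)         # (B, 512)
        x = torch.cat([feats, distance], dim=1)  # append the distance scalar
        return self.classifier(x)
```

Training such a model with SGD and `nn.CrossEntropyLoss` corresponds to minimizing Eq. (11) over the softmax outputs.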

To evaluate the proposed architecture, we use a simpler CNN, adapted from LeNet-5,60 as a baseline. It differs from LeNet-5 only in the parameters of the convolution kernels (16 kernels with shape 5 × 5 in the first, 32 kernels with shape 3 × 3 in the second convolutional layer) and the number of neurons in the three fully connected layers, which are 2688, 488, and 7, respectively. In this work, deeper networks, such as ResNet or Inception nets, have not been considered as a baseline. The vast number of parameters might conflict with the hardware restrictions of automotive control units.

In the following, the baseline CNN is denoted as CNN#1 and our proposed architecture as CNN#2. Each training is performed for ten training rounds with different random seeds to deal with the CNN's stochastic behavior. The mean accuracies ± standard deviations are given in Table II. Note that balanced accuracies are calculated due to slightly different class sizes. In the table's columns, it is distinguished between lab and field data as well as between the aggregation to the object and traversability classes.
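The reported mean ± standard deviation over training rounds can be computed, e.g., with scikit-learn's balanced accuracy; the helper name below is hypothetical.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def evaluate_rounds(y_true, predictions_per_round):
    """Balanced accuracy over several training rounds with different random
    seeds, reported as mean and standard deviation as in Table II."""
    scores = [balanced_accuracy_score(y_true, y_pred)
              for y_pred in predictions_per_round]
    return np.mean(scores), np.std(scores)
```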

TABLE II. Mean accuracy ± standard deviation in percent on the test set, aggregated to object and traversability classes in the lab and field environments. In each column, the best accuracy is marked in bold. Since CNN#2 outperforms CNN#1, CNN#2 is used for the trainings with augmentation.

                              Lab environment                       Field environment
                              Object class   Traversability class   Object class   Traversability class
CNN#1: no augmentation        82.9 ± 0.4     92.2 ± 0.4             61.1 ± 0.4     87.4 ± 0.4
CNN#2: no augmentation        86.7 ± 0.4     94.9 ± 0.3             65.4 ± 0.1     90.6 ± 0.3
  Offline
    White noise               88.1 ± 0.3     95.9 ± 0.2             65.5 ± 0.1     91.4 ± 0.2
    Clutter                   86.8 ± 0.3     94.7 ± 0.2             64.9 ± 0.5     90.2 ± 0.2
    Pitch shifting            85.5 ± 0.6     93.5 ± 0.3             63.3 ± 0.4     89.1 ± 0.3
  Online
    Frequency masking         87.6 ± 0.3     95.3 ± 0.3             65.6 ± 0.2     91.0 ± 0.3
    Time masking              87.7 ± 0.3     95.6 ± 0.1             65.8 ± 0.2     91.1 ± 0.2
    Salt and pepper noise     87.6 ± 0.4     95.4 ± 0.3             65.2 ± 0.1     91.2 ± 0.2
    White noise               86.7 ± 0.2     94.6 ± 0.2             65.2 ± 0.5     90.8 ± 0.2
    Time shifting             88.6 ± 0.5     95.5 ± 0.2             66.4 ± 0.2     91.3 ± 0.2
    Single-class mixup        87.4 ± 0.5     95.3 ± 0.4             65.3 ± 0.1     90.8 ± 0.2
    Multi-class mixup         87.1 ± 0.5     95.2 ± 0.2             65.4 ± 0.3     90.7 ± 0.3
  Combination^a               90.1 ± 0.3     96.4 ± 0.1             66.2 ± 0.2     91.5 ± 0.2

^a Applied methods: white noise (offline), frequency masking, time masking, and time shifting.

It can be seen from Table II that CNN#2 outperforms CNN#1 in both environments and both class aggregations. For the object classes, CNN#2 achieves 86.7% (lab) and 65.4% (field) accuracy. The accuracy for CNN#1 drops by 3.8% and 4.3%, respectively. For traversability, CNN#2 achieves 94.9% (lab) and 90.6% (field), whereas the accuracy drops by 2.7% and 3.2% when using CNN#1. Overall, lower accuracies are achieved in the field environment since the data are affected by more disturbances such as ground clutter. Higher accuracies are achieved for the traversability classes than for the object classes, as the model only has to perform binary classification. It can also be expected that the discrimination between traversable and non-traversable objects is easier to achieve, since most non-traversable objects have a larger acoustic cross section, resulting in greater echo amplitudes.

In Fig. 9, confusion matrices for CNN#2 without data augmentation are shown. In both lab and field environment, the highest accuracy is achieved for the pedestrian class. This is beneficial for autonomous driving applications since the detection of pedestrians has a higher priority than objects that can cause physical damage but not personal injury. Compared to other objects in our dataset, the pedestrian is characterized by an overall large acoustic cross section consisting of many highlights as well as by increased sound absorption due to clothing. The interference patterns and amplitudes can be assumed to be relevant features in the time-frequency images, allowing the model to easily discriminate the pedestrian against the other object classes.

FIG. 9. (Color online) CNN#2: Confusion matrices for (a) lab data and (b) field data without data augmentation. The best accuracy is achieved for the pedestrian class in both environments. Using field data, the accuracies decrease the most for traversable object classes, including "no object."

In the lab environment, the model is even able to detect bags and small objects with accuracies above 70%. In the field data, on the other hand, it seems to be difficult to distinguish traversable objects from ground clutter. In both environments, bags are the most challenging objects to classify. In addition to a small target strength, bags consist of many sub-reflectors, making them difficult to distinguish from clutter or from small objects. We assume that the higher accuracy of curbs compared to the other traversable object classes can be ascribed to the more uniform geometry of the objects within the class and to the increased target strength, especially for curbs with a beveled edge. Comparing lab and field data, the accuracy for "no object" drops from 97.1% to 55.8% since there is more confusion with traversable objects. In the field data, "no object" mainly consists of ground clutter, masking echoes from traversable objects with small target strengths, especially at distant positions. However, the accuracies for non-traversable objects in the field data are more stable, ranging between 76.3% (tube/pole) and 93.9% (pedestrian). Better accuracies for non-traversable objects are probably achieved because of greater target strengths and, thus, better SNRs. A successful discrimination between "tree" and "tube/pole" was expected, since tubes and poles are very simple scatterers, while trees are more complex objects with rough surfaces and more highlights.

Since CNN#2 delivers better accuracies, it is used for the following experiments with the proposed augmentation methods. From the results in Table II, it is apparent that the augmentation methods vary in their ability to improve the classification results. While most of the methods positively affect the accuracy, pitch shifting reduces the accuracies in both environments. We assume that this results from the inaccurate calculation of the Doppler shifts due to the interference of backscatter from different directions: the simplification neglects that the Doppler shift should not be applied equally to clutter or to object backscatter arriving at incidence angles other than 0°. Due to this simplification, classification improvements are assumed to be limited. This effect could be reduced by directional beamforming using a sensor array to suppress echoes apart from the direction of interest. Also, clutter augmentation does not lead to significantly improved results. Consideration of the corresponding confusion matrices (not shown here) revealed that this can be ascribed to an increased confusion of traversable objects with "no object." It is important to ensure that specific augmentations do not excessively mask relevant features in the echo signals, which is crucial for small objects with low target strengths. The offline white noise augmentation increases the accuracy by 1.4% in the lab and by 0.1% in the field environment. A smaller effect for the field data might result from the fact that the signals are already naturally noisier. In general, the augmentation methods achieve greater improvements for the lab than for the field data. This could be due to the inherently more varied field data.

For the online augmentation methods, frequency masking, time masking, and time shifting improve the accuracies in both environments. We assume that frequency masking and time masking force the CNN to also consider secondary interference patterns, resulting in more robust behavior. Time shifting represents a natural variation in the signals, since the estimation of the distance to an object and the corresponding time window is mostly not perfectly accurate. Salt and pepper noise, single-class mixup, and multi-class mixup only improve the results for the lab data. Most likely, relevant patterns in the time-frequency images are frequently masked by salt and pepper noise. It may be assumed that mixup performs better when backscattering of multiple objects is also present in the test set, which is true for many realistic situations. Therefore, measurements with multiple objects in the scenery should be considered for future investigations. The online white noise augmentation does not improve the results. It can be concluded that injecting white noise directly into the time signals is preferable to adding it to the time-frequency images. If a method can be applied to both time signals and time-frequency images, we suggest applying the modifications to the raw time signals, since the physical effects are generally reproduced more naturally. Overall, time shifting produces the best accuracies compared to the other single methods (+1.9% for the object classes and +0.6% for the traversability classes in the lab and +1.0% and +0.7% in the field, respectively).

Finally, a model has been trained using all augmentation methods that resulted in accuracy improvements in both environments, namely offline white noise, frequency masking, time masking, and time shifting. Thereby, an increase of 3.4% for the lab data and 0.8% for the field data could be achieved for the object classes. For the traversability classes, the accuracies improved by 1.5% and 0.9%, respectively. It can be concluded that carefully selected augmentation methods can improve classification accuracies both for the aggregation into object classes and for traversability. Furthermore, certain offline augmentation methods applied to raw time signals as well as online methods applied to time-frequency images are suitable.

Since it has been observed that many augmentation methods do not affect the classification accuracies of all classes equally,24,61 we examined the class-specific accuracy improvements. In Fig. 10, the accuracy differences for each object class and each augmentation method are illustrated. It is apparent that for both lab data [Fig. 10(a)] and field data [Fig. 10(b)], further accuracy improvements can be achieved by applying the methods to specific object classes only. There is no method that is suitable for all classes. For example, when using lab data, white noise (offline) should not be used on the classes "tree" and "pedestrian." It is also the only method that significantly improves the performance for the class "small object." The class "tube/pole," representing the simplest objects, is the only class that is improved by all augmentation methods using lab data. It also shows the most similar improvements across all methods. However, when using field data, the accuracy for "tube/pole" is decreased by all augmentation methods. In further confusion matrices, it has been found that objects of the class "tube/pole" are then more frequently confused with the class "tree." While the results of the class "small object" in the lab are improved only by white noise (offline), in the outdoor environment, several other methods lead to improvements. However, the accuracy for the class "bag" is increased by most of the methods using lab or field data. As shown in the confusion matrices in Fig. 9, this is also the class with the lowest overall accuracy. Finally, it can be concluded that the suitability of the single augmentation methods for individual classes also differs significantly between the environments. Once the class-specific accuracies have been obtained, a new training with a class-conditional selection of augmentation methods could be performed to increase the overall accuracy.

FIG. 10. (Color online) Difference in class-specific accuracy for the augmentation methods compared to using the training set without augmentation, based on (a) lab data and (b) field data.

In this paper, we have demonstrated the feasibility of classifying obstacles that can appear in parking or maneuvering situations, not only considering traversability but also in several object classes using a single low-cost ultrasonic sensor. The categorization into object classes is desirable because it is, e.g., of special interest to detect pedestrians or other obstacles that can move or are particularly worth protecting. Improved performance of ultrasonic sensors can contribute to future parking assistance systems and autonomous driving applications.

For the discrimination of objects by acoustic echoes, target strength according to the acoustic cross section and reflectivity, as well as the number of highlights and their spacing, have been considered as relevant features. In the received signal, the target strength can be found in the amplitudes, while interferences caused by multiple discrete highlights are indicated in the temporal structure and in spectral patterns. We have used the CWT to extract temporal as well as spectral features, represented in time-frequency images. We have proposed a CNN using not only the time-frequency images but also the distance to the object as an input feature. The proposed CNN architecture has outperformed the baseline, a LeNet-5-like CNN, in all environments. It can be concluded that CNNs are capable of classifying different objects by their acoustic backscatter using CWT-generated time-frequency representations and the object's distance as input features. For automotive ultrasonic sensing, high accuracies can be achieved discriminating between traversable and non-traversable objects. However, the discrimination between small objects, whose echoes are likely to be masked by clutter reflections, can be challenging.

Furthermore, we have shown that data augmentation is useful to train a more robust model. It is important to note that augmentation methods should be selected carefully, as not all methods improved the accuracies. Conventional methods from computer vision should be viewed critically when dealing with time-frequency representations. Moreover, domain knowledge is necessary for selecting or engineering appropriate augmentation methods. We suggest applying augmentation methods to raw time signals rather than to time-frequency images whenever a method is available in both domains. Augmentation methods such as adding white noise to the time signals, frequency masking, time masking, salt and pepper noise, and time shifting have increased the accuracies on our test sets. Future work may include Doppler-motivated pitch shifting in a sensor array setup and mixup augmentation when measurements of overlapping backscatter of multiple objects are included. Overall, a combination of selected augmentation methods in the time domain as well as methods applied to the scalograms has led to the best result for our datasets. We have achieved 90.1% (lab) and 66.2% (field) accuracy for the object classes and 96.4% (lab) and 91.5% (field) accuracy for traversability. Further, it has been observed that a class-conditional selection of augmentation methods can be meaningful to further improve classification accuracy.

In this study, we have used only one pulse-echo cycle to perform classification. It can be expected that including multiple cycles for classification will lead to significant accuracy improvements. This can be achieved by a simple majority vote over CNN outputs or by specialized architectures such as 3D-CNNs62 or convolutional recurrent neural networks (CRNNs).63

For future investigations, it is planned to expand the datasets to be able to test the model not only on a position-based train/test splitting but also with entirely new objects. Also, backscattering of multiple objects in a single measurement should be included as this is often true in realistic situations. Moreover, involving multiple sensors in the classification process should be considered.

We would like to thank Professor Dr. Andreas Koch (Stuttgart Media University) for constructive discussions and support of our work. Also, we would like to thank Andrew Buchanan for proofreading.

1. M. Noll and P. Rapps, "Ultrasonic sensors for a K44DAS," in Handbook of Driver Assistance Systems, edited by H. Winner, S. Hakuli, F. Lotz, and C. Singer (Springer, Cham, Switzerland, 2016), pp. 303–323.
2. S. Liu, Y. Wang, X. Yang, B. Lei, L. Liu, S. X. Li, D. Ni, and T. Wang, "Deep learning in medical ultrasound analysis: A review," Engineering 5, 261–275 (2019).
3. G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Med. Image Anal. 42, 60–88 (2017).
4. A. Masnata and M. Sunseri, "Neural network classification of flaws detected by ultrasonic means," NDT&E Int. 29, 87–93 (1996).
5. S. Sambath, P. Nagaraj, and N. Selvakumar, "Automatic defect classification in ultrasonic NDT using artificial intelligence," J. Nondestruct. Eval. 30, 20–28 (2011).
6. D. Neupane and J. Seok, "A review on deep learning-based approaches for automatic sonar target recognition," Electronics 9, 1972 (2020).
7. H. Kuttruff, "Ultrasound," in Handbook of Engineering Acoustics, edited by G. Müller and M. Möser (Springer, Berlin, 2013), pp. 637–650.
8. D. W. Ricker, "The spatial representation," in Echo Signal Processing (Springer, Boston, 2003), pp. 407–467.
9. N. Riopelle, P. Caspers, and D. Sofge, "Terrain classification for autonomous vehicles using bat-inspired echolocation," in Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil (July 8–13, 2018), pp. 1–6.
10. A. Bystrov, E. Hoare, T.-Y. Tran, N. Clarke, M. Gashinova, and M. Cherniakov, "Road surface classification using automotive ultrasonic sensor," Procedia Eng. 168, 19–22 (2016).
11. M. Pöpperl, R. Gulagundi, S. Yogamani, and S. Milz, "Capsule neural network based height classification using low-cost automotive ultrasonic sensors," in Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France (June 9–12, 2019), pp. 661–666.
12. Sonia, A. M. Tripathi, R. D. Baruah, and S. B. Nair, "Ultrasonic sensor-based human detector using one-class classifiers," in Proceedings of the 2015 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS), Douai, France (December 1–3, 2015), pp. 1–6.
13. S. A. Bouhamed, I. K. Kallel, and D. S. Masmoudi, "Stair case detection and recognition using ultrasonic signal," in Proceedings of the 2013 36th International Conference on Telecommunications and Signal Processing (TSP), Rome, Italy (July 2–4, 2013), pp. 672–676.
14. D. R. Griffin, F. A. Webster, and C. R. Michael, "The echolocation of flying insects by bats," Anim. Behav. 8, 141–154 (1960).
15. B. Falk, T. Williams, M. Aytekin, and C. F. Moss, "Adaptive behavior for texture discrimination by the free-flying big brown bat, Eptesicus fuscus," J. Comp. Physiol. A 197, 491–503 (2011).
16. T. A. Stroffregen and J. B. Pittenger, "Human echolocation as a basic form of perception and action," Ecol. Psychol. 7, 181–216 (1995).
17. C. Ming and J. A. Simmons, "Target geometry estimation using deep neural networks in sonar sensing," arXiv:2203.15770 (2022).
18. X. Xia, R. Togneri, F. Sohel, Y. Zhao, and D. Huang, "A survey: Neural network-based deep learning for acoustic event detection," Circuits Syst. Signal Process. 38, 3433–3453 (2019).
19. H. Purwins, B. Li, T. Virtanen, J. Schluter, S.-Y. Chang, and T. Sainath, "Deep learning for audio signal processing," IEEE J. Sel. Top. Signal Process. 13, 206–219 (2019).
20. I. H. Sarker, "Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions," SN Comput. Sci. 2, 420 (2021).
21. I. Goodfellow, Y. Bengio, and A. Courville, "Applications," in Deep Learning (MIT, Cambridge, MA, 2016), pp. 443–485.
22. M. J. Bianco, P. Gerstoft, J. Traer, E. Ozanich, M. A. Roch, S. Gannot, and C.-A. Deledalle, "Machine learning in acoustics: Theory and applications," J. Acoust. Soc. Am. 146, 3590–3628 (2019).
23. J. A. Castro-Correa, M. Badiey, T. B. Neilsen, D. P. Knobles, and W. S. Hodgkiss, "Impact of data augmentation on supervised learning for a moving mid-frequency source," J. Acoust. Soc. Am. 150, 3914–3928 (2021).
24. J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Process. Lett. 24, 279–283 (2017).
25. L. Nanni, G. Maguolo, and M. Paci, "Data augmentation approaches for improving animal audio classification," Ecol. Inform. 57, 101084 (2020).
26. G. Zhou, Y. Chen, and C. Chien, "On the analysis of data augmentation methods for spectral imaged based heart sound classification using convolutional neural networks," BMC Med. Inf. Decis. Making 22, 226 (2022).
27. Z. Mushtaq, S.-F. Su, and Q.-V. Tran, "Spectral images based environmental sound classification using CNN with meaningful data augmentation," Appl. Acoust. 172, 107581 (2021).
28. D. W. Ricker, Echo Signal Processing (Springer, Boston, 2003).
29. F. Le Chevalier, Principles of Radar and Sonar Signal Processing (Artech House, Boston, 2002).
30. R. J. Urick, "Generalized form of the sonar equations," J. Acoust. Soc. Am. 34, 547–550 (1962).
31. P. T. Madsen and A. Surlykke, "Echolocation in air and water," in Biosonar, edited by A. Surlykke, P. E. Nachtigall, R. R. Fay, and A. N. Popper (Springer, New York, 2014), pp. 257–304.
32. F. Le Chevalier, "Target and background signatures," in Principles of Radar and Sonar Signal Processing (Artech House, Boston, 2002), pp. 207–281.
33. D. W. Ricker, "Spread scattering and propagation," in Echo Signal Processing (Springer, Boston, 2003), pp. 319–405.
34. J. A. Simmons, D. Houser, and L. Kloepper, "Localization and classification of targets by echolocating bats and dolphins," in Biosonar, edited by A. Surlykke, P. E. Nachtigall, R. R. Fay, and A. N. Popper (Springer, New York, 2014).
, edited by
A.
Surlykke
,
P. E.
Nachtigall
,
R. R.
Fay
, and
A. N.
Popper
(
Springer
,
New York
,
2014
), pp.
169
193
.
35.
S.
Hausfeld
,
R. P.
Power
,
A.
Gorta
, and
P.
Harris
, “
Echo perception of shape and texture by sighted subjects
,”
Percept. Mot. Skills
55
,
623
632
(
1982
).
36.
W. N.
Kellogg
, “
Sonar system of the blind: New research measures their accuracy in detecting the texture, size, and distance of objects ‘by ear
,’ ”
Science
137
,
399
404
(
1962
).
37.
J. A.
Simmons
and
L.
Chen
, “
The acoustic basis for target discrimination by FM echolocating bats
,”
J. Acoust. Soc. Am.
86
,
1333
1350
(
1989
).
38.
W. W. L.
Au
, “
Biosonar discrimination, recognition, and classification
,” in
The Sonar of Dolphins
(
Springer
,
New York
,
1993
), pp.
177
215
.
39.
S. W.
Smith
, “
ADC and DAC
,” in
The Scientist and Engineer's Guide to Digital Signal Processing
(
California Technical Publishing
,
San Diego, CA
,
1999
), pp.
35
66
.
40.
D.
Marioli
,
C.
Narduzzi
,
C.
Offelli
,
D.
Petri
,
E.
Sardini
, and
A.
Taroni
, “
Digital time-of-flight measurement for ultrasonic sensors
,”
IEEE Trans. Instrum. Meas.
41
,
93
97
(
1992
).
41.
V. C.
Chen
and
H.
Ling
, “
Time-frequency transforms
,” in
Time-Frequency Transforms for Radar Imaging and Signal Analysis
(
Artech House
,
Boston
,
2002
), pp.
25
46
.
42.
H.-G.
Stark
, “
Continuous analysis
,” in
Wavelets and Signal Processing: An Application-based Introduction
(
Springer
,
Berlin
,
2005
), pp.
13
40
.
43.
C. M.
Akujuobi
, “
Fundamental concepts
,” in
Wavelets and Wavelet Transform Systems and Their Applications
(
Springer International Publishing
,
Cham, Switzerland
,
2022
), pp.
1
10
.
44.
S. C.
Olhede
and
A. T.
Walden
, “
Generalized Morse wavelets
,”
IEEE Trans. Signal Process.
50
,
2661
2670
(
2002
).
45.
E.
Alpaydin
, “
Multivariate methods
,” in
Introduction to Machine Learning
(
MIT
,
Cambridge, MA
,
2014
), pp.
93
114
.
46.
I.
Goodfellow
,
Y.
Bengio
, and
A.
Courville
, “
Regularization for deep learning
,” in
Deep Learning
(
MIT
,
Cambridge, MA
,
2016
), pp.
228
273
.
47.
C.
Shorten
and
T. M.
Khoshgoftaar
, “
A survey on image data augmentation for deep learning
,”
J. Big Data
6
,
60
(
2019
).
48.
J.
Talukdar
,
A.
Biswas
, and
S.
Gupta
, “
Data augmentation on synthetic images for transfer learning using deep CNNs
,” in
Proceedings of the 2018 5th International Conference on Signal Processing and Integrated Networks (SPIN)
, Noida, India (February 22–23,
2018
), pp.
215
219
.
49.
W.-P.
Stilz
and
H.-U.
Schnitzler
, “
Estimation of the acoustic range of bat echolocation for extended targets
,”
J. Acoust. Soc. Am.
132
,
1765
1775
(
2012
).
50.
D. S.
Park
,
W.
Chan
,
Y.
Zhang
,
C.-C.
Chiu
,
B.
Zoph
,
E. D.
Cubuk
, and
Q. V.
Le
, “
SpecAugment: A simple data augmentation method for automatic speech recognition
,” in
Proceedings of Interspeech 2019
, Graz, Austria (September 15–19,
2019
), pp.
2613
2617
.
51.
H.
Zhang
,
M.
Cisse
,
Y. N.
Dauphin
, and
D.
Lopez-Paz
, “
mixup: Beyond empirical risk minimization
,” arXiv:1710.09412 (
2018
).
52.
G.
Peeters
and
G.
Richard
, “
Deep learning for audio and music
,” in
Multi-Faceted Deep Learning
, edited by
J.
Benois-Pineau
and
A.
Zemmari
(
Springer
,
Cham, Switzerland
,
2021
), pp.
231
266
.
53.
I.
Goodfellow
,
Y.
Bengio
, and
A.
Courville
, “
Convolutional networks
,” in
Deep Learning
(
MIT
,
Cambridge, MA
,
2016
), pp.
330
372
.
54.
H.
Habibi Aghdam
and
E.
Jahani Heravi
, “
Convolutional neural networks
,” in
Guide to Convolutional Neural Networks
(
Springer International Publishing
,
Cham, Switzerland
,
2017
), pp.
85
130
.
55.
S.
Ioffe
and
C.
Szegedy
, “
Batch normalization: Accelerating deep network training by reducing internal covariate shift
,” arXiv:1502.03167 (
2015
).
56.
I.
Goodfellow
,
Y.
Bengio
, and
A.
Courville
, “
Deep feedforward networks
,” in
Deep Learning
(
MIT
,
Cambridge, MA
,
2016
), pp.
168
227
.
57.
J.
Pons
and
X.
Serra
, “
Designing efficient architectures for modeling temporal features with convolutional neural networks
,” in
Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
, New Orleans, LA (March 5–9,
2017
), pp.
2472
2476
.
58.
D.
Stowell
,
D.
Giannoulis
,
E.
Benetos
,
M.
Lagrange
, and
M. D.
Plumbley
, “
Detection and classification of acoustic scenes and events
,”
IEEE Trans. Multimedia
17
,
1733
1746
(
2015
).
59.
C. M.
Bishop
, “
Linear models for classification
,” in
Information Science and Statistics, in Pattern Recognition and Machine Learning
(
Springer
,
New York
,
2006
), pp.
179
224
.
60.
Y.
Lecun
,
L.
Bottou
,
Y.
Bengio
, and
P.
Haffner
, “
Gradient-based learning applied to document recognition
,”
Proc. IEEE
86
,
2278
2324
(
1998
).
61.
E.
Aguilar
and
P.
Radeva
, “
Class-conditional data augmentation applied to image classification
,” in
Lecture Notes in Computer Science, in Computer Analysis of Images and Patterns
, edited by
M.
Vento
and
G.
Percannella
(
Springer International Publishing
,
Cham
,
2019
), pp.
182
192
.
62.
S.
Ji
,
W.
Xu
,
M.
Yang
, and
K.
Yu
, “
3D convolutional neural networks for human action recognition
,”
IEEE Trans. Pattern Anal. Mach. Intell.
35
,
221
231
(
2013
).
63.
X.
Shi
,
Z.
Chen
,
H.
Wang
,
D.-Y.
Yeung
,
W.
Wong
, and
W.
Woo
, “
Convolutional LSTM network: A machine learning approach for precipitation nowcasting
,” arXiv:1506:04214 (
2015
).