Today's low-cost automotive ultrasonic sensors perform distance measurements of obstacles within the close range of vehicles. For future parking assist systems and autonomous driving applications, the performance of the sensors should be further increased. This paper examines the processing of sensor data for the classification of different object classes and of the traversability of obstacles using a single ultrasonic sensor. The acquisition of raw time signals, the transformation into time-frequency images, and the classification using machine learning methods are described. Stationary measurements and dynamic measurements at a velocity of 0.5 m/s of various objects have been carried out in a semi-anechoic chamber and on an asphalt parking space. We propose a scalogram-based signal processing chain and a convolutional neural network that outperforms a LeNet-5-like baseline. Additionally, several methods for offline and online data augmentation are presented and evaluated. It is shown that carefully selected augmentation methods are useful to train more robust models. Accuracies of 90.1% are achieved for the classification of seven object classes in the laboratory and 66.4% in the outdoor environment. Traversability is correctly classified at an accuracy of 96.4% and 91.5%, respectively.
I. INTRODUCTION
Recent developments in driver assist systems and automated driving have led to a growing need for advanced environmental sensing. In addition to sensors such as cameras, radar, or lidar, ultrasonic sensors are well suited because of their low production costs, robustness, and widespread use. Hence, improving the performance of the sensors is of great interest. Today's automotive ultrasonic sensors use the pulse-echo method to calculate the distance to obstacles in the near-field for parking and maneuvering applications.1 For future products, a classification of obstacles is desirable.
Generally, classification tasks using ultrasonic sensors are more common in medical imaging for human tissue analysis,2,3 non-destructive testing (NDT) of solid bodies,4,5 or underwater sonar.6 In these applications, transducer arrays are typically used.7 Transducer arrays allow beamforming, which provides advantages such as directional scanning of the environment and enhanced signal-to-noise ratios (SNRs).8 Due to the increased hardware requirements for processing multiple channels and their high power consumption, these sensors are costly compared to automotive ultrasonic sensors, which usually consist of only a single piezoelectric element.1 So far, the use of low-cost automotive sensors for classification tasks has received little attention. However, there has been some research on classifying terrain surfaces9,10 and the height of obstacles.11 Further, Sonia et al.12 have successfully demonstrated the detection of human beings using an ultrasonic sensor. Promising results regarding classification tasks with an ultrasonic sensor have also been achieved by Bouhamed et al.13 in distinguishing floors and staircases.
In nature, the classification of acoustic echoes is observed in bats, which are able to detect their prey by emitting frequency-modulated ultrasonic calls.14,15 Some blind people have learned to use so-called click sonar, allowing them to detect the position, size, shape, and even material substance of obstacles.16 Findings in biosonar and human echolocation can serve as inspiration for signal processing and feature extraction in ultrasonic sensing. Based on a biosonar model, Ming and Simmons17 have shown that time-frequency features are useful for echo classification and target geometry estimation. Inspired by bats, Riopelle et al.9 also successfully use features based on time-frequency images for the classification of ultrasonic echoes by a mobile robot. Sonia et al.12 extract time-domain features, such as standard deviation and root mean square values, and frequency-domain features via the fast Fourier transform (FFT) from ultrasonic echoes to classify human beings. They show that the FFT outperforms the other feature inputs in terms of classification accuracy, demonstrating the significance of spectral features. Furthermore, the relevance of time-frequency images for classification has been shown in related fields such as underwater sonar6 and acoustic event classification.18,19
Deep neural networks (DNNs) for classification tasks have become very popular in computer vision and natural language processing.20,21 Currently, DNNs are also successfully applied in the field of acoustics.22 For the classification of acoustic echoes, Ming and Simmons17 compare a convolutional neural network (CNN) and a recurrent neural network (RNN); the CNN model slightly outperforms the RNN in terms of classification accuracy. Pöpperl et al.11 successfully use a capsule neural network for classifying the height of obstacles using an automotive ultrasonic sensor. The suitability of CNNs for acoustic echo classification is also shown in several works in the field of underwater sonar.6 For example, Castro-Correa et al.23 implement residual CNNs to perform underwater source localization and seabed classification based on simulated data as well as on at-sea measured data. To improve the model's performance, they use data augmentation methods on time-frequency images. It is shown that time stretching, flipping, time masking, frequency masking, and combinations thereof positively impact the results on their dataset, increasing classification accuracy by 1%–4%. Another motivation for the use of data augmentation on acoustic signals and time-frequency representations can be found in the field of environmental sound classification. Salamon and Bello24 use data augmentation on audio time signals. They obtain more robust classification models and observe class-specific performance of the augmentation methods. Nanni et al.25 compare the performance of various offline augmentation methods applied to raw time signals and online augmentation methods applied to spectrograms for animal audio classification using CNNs. They conclude that augmentation methods must be carefully selected depending on the application domain and the nature of the data. Several studies have shown that common augmentation methods from the field of computer vision cannot be applied to time-frequency images without reservation.25–27 Therefore, more physics-motivated methods are applied and evaluated in this work.
This study examines the use of a low-cost automotive ultrasonic sensor for object classification in the close range of vehicles. We propose a processing chain to classify ultrasonic backscatter using bandpass filtering and the continuous wavelet transform (CWT) for feature extraction. In addition to time-frequency images, the target distance via time of flight is used as an input feature to a CNN. For training and evaluation, a dataset including ultrasonic backscatter of various objects at different distances and angles has been recorded. Measurements were taken in a semi-anechoic chamber as well as in an asphalt parking space. Moreover, several methods for offline and online data augmentation are presented and evaluated. We expect that more robust models can be obtained using physics-motivated augmentation methods such as modification of the SNR and echo delay, Doppler-motivated pitch shifting, and the superposition of echoes.
This paper is organized as follows. The principles of airborne ultrasonic sensing as well as target properties carried by acoustic echoes are discussed in Sec. II. Section III gives an overview of the experimental setup, including sensors and data acquisition procedures. In Sec. IV, the methods for signal processing and feature extraction are described. Several data augmentation techniques on both raw time signals and scalogram images are presented in Sec. V. In Sec. VI, a CNN architecture for classifying the processed signals is proposed. The classification results considering different measurement environments and data augmentation techniques are discussed in Sec. VII. Our conclusions are drawn in Sec. VIII.
II. TARGET DISCRIMINATION IN ULTRASONIC ENVIRONMENTAL SENSING
In automotive ultrasonic sensing, piezoelectric sensors are used to transmit ultrasonic pulses and receive echoes from obstacles. The distance can be calculated from the time of flight and the speed of sound in air. Typically, operating frequencies between 40 and 50 kHz are used: at higher frequencies, the sound absorption in air increases strongly, while at lower frequencies, many extraneous sound sources are present in the application environment.1 Moreover, frequency-modulated transmit signals (chirps) are well established, since range resolution and SNR can be improved via pulse compression. Details on conventional echo signal processing can be found in the literature.28,29
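As a minimal illustration of the pulse-echo principle (a sketch with hypothetical names, not the sensor's actual firmware), the distance follows from the round-trip time of flight:

```python
C_AIR = 343.0  # speed of sound in air at ~20 °C, m/s

def echo_distance(time_of_flight_s: float) -> float:
    """Distance to an obstacle from the round-trip time of flight."""
    return C_AIR * time_of_flight_s / 2.0  # halve: sound travels out and back

# An echo arriving 5.83 ms after transmission corresponds to about 1 m.
print(f"{echo_distance(5.83e-3):.2f} m")
```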
The target strength depends on the size of the target and the impedance difference to the propagation medium. The impedance of most solid objects is large compared to that of air. Thus, echoes consist primarily of specular reflections from surfaces. Consequently, target geometry is the predominant factor for the target strength, characterized by the size of the reflective surface facing the incident sound (acoustic cross section).32 Depending on the geometry of an object, the target strength can also vary with the direction of ensonification. To obtain an effective target strength, the wavelength of the incident sound must be shorter than the circumference of the reflective surface.31
Interfering noise consists of extraneous sound sources, ambient noise, and clutter reflections. Relevant extraneous sound sources in automotive ultrasonic sensing are, for example, compressed-air noises or metallic grating noises.1 Clutter consists of echoes from scatterers that are not of interest, especially asphalt backscatter, and thus increases with the intensity of the transmitted signal.32 Furthermore, the amount of clutter and ambient noise received is limited by the sensor's directivity index. Since the incident beam widens as it propagates in air, more clutter reflections are included in the overall backscatter at greater distances.
In practice, obstacles mostly consist not of a single scattering point but of several sub-reflectors, so-called highlights. Small distances between these highlights lead to interference of the individual echoes. In the received signal, this results in a combined delay-spread echo waveform, whose duration exceeds that of the transmitted waveform. Additionally, Doppler spreading occurs when the transmitter or the object is moving, or when the individual highlights move at different velocities.33 Delay and Doppler spreading cause interference patterns in the combined echo signal that are defined by the object's unique features. These interference patterns are evident in the temporal and spectral structure of the received signal and show up as ripples and notches in time-frequency images according to the spacing of the object's highlights.34
The distance and direction of a scatterer relative to the receiver affect the overall amplitude of the echo due to the transmission loss and the directivity index of the sensor [Eq. (1)]. In addition, high frequencies are increasingly absorbed with distance and can be further attenuated by the sensor's directivity. Furthermore, the backscatter is affected by the surface impedance of the object. It has been shown that human echolocators, bats, and dolphins are able to discriminate the substance of a scatterer.16,35–38
III. EXPERIMENTAL SETUP
For the measurements, a typical automotive ultrasonic sensor is used as the transmitter, and a GRAS 40 BE condenser microphone (GRAS Sound and Vibration, Holte, Denmark) is used as the receiver. The piezoelectric element of the ultrasonic sensor can also operate as a receiver, but the sensor's hardware is limited in capturing raw time signals: within the automotive ultrasonic sensor, the transducer element and the associated electronics act as a unit that outputs echo points, a highly compressed representation of the transducer signal. For this research study, however, the use of raw time signals for classification is crucial, so the microphone was used as a straightforward, practical receiver. The sensor's transfer function is applied to the acquired microphone signals to impose its frequency response. Further characteristics such as the directivity index or self-noise of the sensor are not taken into account. The sensor is mounted at a height of 0.5 m on a linear rail, which allows stationary as well as dynamic measurements at a velocity of 0.5 m/s. The microphone is located directly above the sensor. The measurement setup is shown in Fig. 1.
(Color online) Measurement setup: A conventional ultrasonic sensor and a GRAS 40 BE microphone are mounted on a linear rail to allow static and dynamic measurements.
A frequency-modulated pulse ranging from 42.5 to 52.5 kHz with a length of 1.6 ms is transmitted at an interval of 50 ms. At a speed of sound of 343 m/s, this corresponds to wavelengths of about 0.7–0.8 cm, which allows even small surfaces to be detected. For data acquisition, a sample rate of 215 kilosamples/s is used to satisfy the Nyquist criterion.39 The measurements are taken both in a semi-anechoic chamber (from now on referred to as lab) and in an asphalt parking space (from now on referred to as field) to compare classification performance in a controlled environment with low disturbances and in a more realistic scenario. The captured time signal of the backscatter of a tube in the lab is shown in Fig. 2.
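For illustration, the transmit pulse described above can be sketched as follows; the linear modulation law is our assumption, since the text only specifies a frequency-modulated pulse:

```python
import numpy as np
from scipy.signal import chirp

FS = 215_000                  # sample rate, samples/s (215 kS/s)
T_PULSE = 1.6e-3              # pulse length, s
F0, F1 = 42_500.0, 52_500.0   # chirp band, Hz
C_AIR = 343.0                 # speed of sound, m/s

t = np.arange(0.0, T_PULSE, 1.0 / FS)
tx = chirp(t, f0=F0, t1=T_PULSE, f1=F1, method="linear")

assert FS > 2 * F1  # Nyquist criterion is satisfied
# Wavelengths at the band edges: about 0.65-0.81 cm.
print(100 * C_AIR / F1, 100 * C_AIR / F0)  # cm
```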
Time signal of ultrasonic backscattering in the lab. The direct signal, characterized by the highest amplitude, is followed by some ground reflections and the object-specific backscattering of a tube with a circular cross section.
Thirty objects were selected that can be found in parking or maneuvering situations, including pedestrians (for the sake of readability, pedestrians are also referred to as "objects" in this paper). The objects are categorized into seven object classes and can further be aggregated into a binary classification of whether an object is traversable by a vehicle or not (see Table I). "No object" refers to measurements of an empty scenery, recorded to be able to determine the presence of objects. For the pedestrians, three subjects participated, each positioned once frontally and once laterally to the sensor, resulting in six instances of pedestrians. The object positions and the corresponding division into train and test positions are illustrated in Fig. 3. Each object was measured at 55 positions at different distances and incidence angles, chosen according to the sensor's radiation pattern and the dimensions of the semi-anechoic chamber. This results in radial positions ranging between 75 and 225 cm and ±45° to the sensor. Sixty static and 91 dynamic measurements were taken at each object position, resulting in a total of 249 150 hand-labeled measurements per environment. Note that static measurements of a single object position are significantly more correlated than the dynamic ones and differ primarily due to external disturbances such as noise. We perform a position-based train/test split, i.e., measurements at 44 of the 55 positions of each object are used for the training dataset and those at the remaining 11 positions for the test dataset.
Categorization of the objects into seven object classes and two traversability classes. The numbers in parentheses after the class names indicate the number of corresponding objects within a class.
| Object class | Traversability class |
|---|---|
| No object (4) | Traversable (16) |
| Bag (4) | |
| Small object (4) | |
| Curb (4) | |
| Tree (4) | Not traversable (14) |
| Tube/pole (4) | |
| Pedestrian (6) | |
Object positions, divided into positions for the training set (white) and positions for the test set (dark gray).
IV. PREPROCESSING AND FEATURE EXTRACTION
Several preprocessing steps are applied before the signals are passed to the CNN. The proposed processing differs fundamentally from conventional signal processing in automotive ultrasonic sensing: whereas conventional processing generally determines the distance to objects using thresholding methods in the time domain,40 object classification also relies on the spectral patterns of the echoes.
After sampling the ultrasonic signals, a bandpass filter is used to remove unwanted noise and to limit the signals to the relevant frequency range. This prevents unwanted signal components, such as environmental sounds in the audible frequency range, from being passed to further processing. The cutoff frequencies are chosen slightly wider than the frequency range of the transmitted signal to ensure that relevant backscatter is not suppressed. Accordingly, we use a finite impulse response (FIR) bandpass filter with a lower passband frequency of 40 kHz, an upper passband frequency of 55 kHz, and a stopband attenuation of 100 dB.
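A possible realization of such a filter is sketched below; the tap count and window choice are our assumptions, selected to reach roughly 100 dB of stopband attenuation:

```python
import numpy as np
from scipy.signal import firwin, filtfilt

FS = 215_000  # sample rate, Hz

# FIR bandpass with a 40-55 kHz passband; a Dolph-Chebyshev window with
# 100 dB sidelobe attenuation approximates the stated stopband attenuation.
NUM_TAPS = 401  # assumed filter order
b = firwin(NUM_TAPS, [40_000, 55_000], pass_zero=False,
           window=("chebwin", 100), fs=FS)

def bandpass(x: np.ndarray) -> np.ndarray:
    """Zero-phase filtering so that echo timing is not delayed."""
    return filtfilt(b, 1.0, x)
```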
A time window of 768 samples is then cut out containing the backscattering time periods to be classified. The length of the window is found to be sufficient to include all backscatterings of the relevant objects. In this study, the windows are chosen based on the known distances to the objects in our dataset. Hence, we ensure correct labeling and comparable class sizes, which is crucial for the training process. Additionally, it is intended to avoid the risk of losing echoes or even false positives using the standard pulse-echo method. When trained models are employed in practice, the windows can be obtained using a sliding window approach or the standard pulse-echo method.
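The windowed signals are then transformed into scalogram images via the CWT (cf. Fig. 5). A minimal sketch using PyWavelets is given below; the wavelet choice and the scale grid are our assumptions, as the text does not fix them:

```python
import numpy as np
import pywt

FS = 215_000             # sample rate, Hz
WINDOW_LEN = 768         # window length in samples, as described above
WAVELET = "cmor1.5-1.0"  # assumed complex Morlet wavelet

def to_scalogram(window: np.ndarray) -> np.ndarray:
    """CWT magnitude scalogram of one windowed echo signal."""
    # Scale grid covering roughly the 40-55 kHz band of interest.
    freqs = np.linspace(40_000.0, 55_000.0, 64)
    fc = pywt.central_frequency(WAVELET)  # center frequency, cycles/sample
    scales = fc * FS / freqs
    coeffs, _ = pywt.cwt(window, scales, WAVELET, sampling_period=1.0 / FS)
    return np.abs(coeffs)  # shape: (64, WINDOW_LEN)
```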
In Fig. 4, scalograms of static measurements of one object from each object class in the lab are shown. "No object" is not shown since it contains noise only. In Fig. 4(a), the backscatter of a tube with a circular cross section of 7.6 cm diameter is illustrated. Two successive chirps can be seen, one starting at about 0.5 ms and the other at about 1.3 ms. The first reflection relates to a horizontal reflection and the latter to a diagonal reflection at the base of the tube. Likewise, the backscatter of every other object in our dataset categorized as not traversable consists of at least two reflections, since they all exceed the height of the sensor. Generally, diagonal reflections produce smaller echo amplitudes than horizontal reflections because of the greater distance to the sensor and the resulting increased transmission loss. The backscatter of a pedestrian in Fig. 4(b) is a more complex pattern with multiple overlapping reflections due to many small highlights on the pedestrian's clothing and body. Although the pedestrian provides a larger acoustic cross section, the amplitudes are not significantly increased compared to the tube. This can be ascribed to sound absorption by the pedestrian's clothing. The scalogram of a curb in Fig. 4(c) shows a single reflection, since there is only a diagonal reflection and no horizontal one due to the low height of the object. Also, the amplitudes of the reflection are greater than those of the other objects. This can be explained by the beveled edge of the curb, resulting in an increased target strength in the direction of the receiver compared to objects with a circular cross section. Figure 4(d) shows the backscatter of a tree, dominated by the reflection of the trunk, which is comparable to the tube in Fig. 4(a). However, the pattern is slightly more complex due to the rough surface of the tree as well as small, overlapping reflections from branches. The lowest amplitudes are obtained in Fig. 4(e), which shows the backscatter of a plastic bottle with a height of 21 cm from the class "small object." Low amplitudes, which are more likely to be masked by noise and clutter, are also obtained in the backscatter of a plastic bag in Fig. 4(f). Compared to the plastic bottle, the duration of the bag's backscatter is increased due to many small highlights, which also leads to more interference.
(Color online) Scalograms of different objects' backscatter in the lab at 1 m on the 0° axis.
(Color online) Processing steps from sampled time signal to object prediction. A scalogram image and a distance scalar are fed into the CNN for each time window.
V. DATA AUGMENTATION
Data augmentation refers to methods in machine learning that can improve the robustness of a model by reducing overfitting and that help overcome the problem of sparse training data.46 This is achieved by enriching the training dataset with modified copies of the original data. In computer vision, images are often rotated, flipped, recolored, etc.47 However, these methods can be problematic when applied to time-frequency images, since they have no physically analogous counterpart in the domain of sound propagation and backscattering. Physically motivated methods are therefore preferred. The methods applied in this work are described in the following and can be divided into offline and online augmentation methods. Offline augmentation refers to methods that are applied before the training process.48 This results in high disk space requirements, since a copy of the original signal with the modifications applied has to be saved. Afterward, the enriched training dataset can be loaded in smaller batches for the training process. In this work, offline augmentation is used for methods that are applied to the windowed time signals before they are transformed via the CWT. In contrast, online augmentation refers to methods that are applied during training, after a small batch of training samples has been loaded temporarily into main memory.48 Hence, no additional hard disk space is required, but the training duration increases, since the augmentation is processed each time a data batch is loaded. Since the CWT is applied before training starts, online augmentation is used for methods that are applied to the scalogram images.
A. Offline augmentation methods
Offline augmentation refers to methods that we apply to the windowed time signals before they are transformed via the CWT and before the training process starts. Figure 6 illustrates time signals after applying each of the methods described in the following.
1. White noise
2. Clutter
3. Pitch shifting
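The three offline methods above can be realized roughly as follows; this is an assumption-laden sketch of common implementations, not the exact parameterization used in this work:

```python
import numpy as np
from scipy.signal import resample

def add_white_noise(x: np.ndarray, snr_db: float) -> np.ndarray:
    """Additive white Gaussian noise at a target SNR."""
    p_noise = np.mean(x ** 2) / 10 ** (snr_db / 10)
    return x + np.random.normal(0.0, np.sqrt(p_noise), x.shape)

def add_clutter(x: np.ndarray, clutter: np.ndarray, gain: float) -> np.ndarray:
    """Superimpose a recorded clutter snippet (e.g., asphalt backscatter)."""
    return x + gain * clutter

def pitch_shift(x: np.ndarray, v_rel: float, c: float = 343.0) -> np.ndarray:
    """Doppler-motivated pitch shift via resampling; v_rel is the relative
    radial velocity in m/s (two-way Doppler, valid for v_rel << c)."""
    factor = 1.0 + 2.0 * v_rel / c
    y = resample(x, int(round(len(x) / factor)))
    # Pad or truncate back to the original window length.
    out = np.zeros_like(x)
    n = min(len(x), len(y))
    out[:n] = y[:n]
    return out
```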
B. Online augmentation methods
Online augmentation methods are applied to the scalogram images during the training process. In Fig. 7, the methods are illustrated, and they are explained in the following.
(Color online) Example of a scalogram with the applied online data augmentation methods. Note that time and frequency masking are applied independently and that it is also differentiated between multi-class and single-class mixup.
1. Frequency masking
SpecAugment, introduced by Park et al.,50 involves time warping, time masking, and frequency masking on time-frequency images. Since time warping is not reasonable in our case, we only perform frequency masking and time masking. This can be seen as a loss of information in single frequency or time segments and can improve generalization of the CNN by not focusing on the same sections of the images during training. For frequency masking, f consecutive frequency bins are masked using the mean pixel value of the image. f is chosen from a uniform distribution U(0, F) and the first masked bin f0 from [0, ν − f), where ν is the total number of frequency bins, and we set the hyperparameter F as the maximum number of masked frequency bins.
2. Time masking
Analogous to frequency masking, for time masking, t consecutive time steps are masked. Here, t is chosen from a uniform distribution U(0, T) and the first masked step t0 from [0, τ − t), where τ is the total number of time steps, and we set the hyperparameter T as the maximum number of masked time steps.
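A minimal sketch of both masking operations, following the SpecAugment convention and the sampling bounds described above:

```python
import numpy as np

def frequency_mask(img: np.ndarray, F: int) -> np.ndarray:
    """Mask f ~ U(0, F) consecutive frequency bins with the image mean."""
    out = img.copy()
    f = np.random.randint(0, F + 1)
    f0 = np.random.randint(0, img.shape[0] - f + 1)
    out[f0:f0 + f, :] = img.mean()
    return out

def time_mask(img: np.ndarray, T: int) -> np.ndarray:
    """Mask t ~ U(0, T) consecutive time steps with the image mean."""
    out = img.copy()
    t = np.random.randint(0, T + 1)
    t0 = np.random.randint(0, img.shape[1] - t + 1)
    out[:, t0:t0 + t] = img.mean()
    return out
```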
3. Salt and pepper noise
For salt and pepper noise, each pixel of a scalogram is masked with a probability p and randomly set to either the minimum or the maximum pixel value. This can be interpreted as pixelwise information loss in the scalograms.
4. White noise
Pixelwise white Gaussian noise is added to the scalogram images. The mean of the Gaussian distribution is set to the mean pixel value of the current image. Based on the variance of the image pixels σ²_img and the chosen noise intensity hyperparameter λ, the Gaussian variance is set to σ² = λσ²_img.
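Both noise-based image augmentations can be sketched as follows (our literal reading of the text; the parameters p and λ are as introduced above):

```python
import numpy as np

def salt_and_pepper(img: np.ndarray, p: float) -> np.ndarray:
    """Set each pixel with probability p to the image minimum or maximum."""
    out = img.copy()
    mask = np.random.rand(*img.shape) < p
    salt = np.random.rand(*img.shape) < 0.5
    out[mask & salt] = img.max()
    out[mask & ~salt] = img.min()
    return out

def image_white_noise(img: np.ndarray, lam: float) -> np.ndarray:
    """Add pixelwise Gaussian noise with mean = image mean and
    variance = lam * image variance."""
    noise = np.random.normal(img.mean(), np.sqrt(lam * img.var()), img.shape)
    return img + noise
```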
5. Time shifting
The scalogram image is shifted by s pixel steps either to the left or to the right, which corresponds to a modified time of flight of the echoes. We choose s from a uniform distribution U(1, S), where the maximum number of steps S is set to ensure that object-related backscattering is not truncated. The image is vertically cut into two slices, either at s or at τ − s, and the new image is obtained by inverting the order of the slices. The distance value is not adjusted, since even a maximum shift of S only results in a change of the object distance of about 3 cm, which lies within the variance of the position of the backscatter in the time window.
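The cut-and-swap operation is equivalent to a circular shift of the image columns; a minimal sketch:

```python
import numpy as np

def time_shift(img: np.ndarray, S: int) -> np.ndarray:
    """Circularly shift the scalogram by s ~ U(1, S) pixels to the left or
    right, emulating a slightly modified time of flight."""
    s = np.random.randint(1, S + 1)
    direction = np.random.choice([-1, 1])
    return np.roll(img, direction * s, axis=1)
```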
6. Multi-class and single-class mixup
Mixup creates new training samples as convex combinations of pairs of images and their labels. To account for the distance feature, we only mix scalograms whose associated distance values are approximately equal. In addition to multi-class mixup, we also perform single-class mixup, where only training samples with the same class label are mixed.
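A sketch of the mixing step; the Beta-distributed mixing coefficient follows the common mixup formulation and is our assumption:

```python
import numpy as np

def mixup(img1, lab1, img2, lab2, alpha: float = 0.2):
    """Convex combination of two samples (img, one-hot label).
    For single-class mixup, pick pairs with identical labels; in either
    case, pick pairs with (approximately) equal distance values."""
    lam = np.random.beta(alpha, alpha)
    img = lam * img1 + (1.0 - lam) * img2
    lab = lam * lab1 + (1.0 - lam) * lab2
    return img, lab
```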
VI. CNN
CNNs are specialized neural networks that are widely used in computer vision to process image or video data. Meanwhile, CNNs are also applied to extract features from audio data19,52 or acoustic measurements.22 A fundamental component of CNNs is the convolutional layer, in which features are extracted from the input and transformed into so-called feature maps: kernels are slid over the input, producing an output via element-wise multiplication with learned kernel weights. This weight sharing results in a significantly smaller number of trainable parameters compared to conventional fully connected neural networks, as well as translation invariance in time or space.21,53,54
The proposed CNN architecture is shown in Fig. 8. We apply several convolutional and pooling layers to the scalogram input before concatenating the flattened feature maps with the distance input. Note that we use batch normalization55 and the rectified linear unit (ReLU) activation function56 after each convolutional layer. A stride of 1 is used for the convolutional layers and a stride of 2 for the pooling layers. We use average pooling, since it slightly outperformed max pooling in our experiments. Furthermore, zero-padding is applied in the convolutional layers so that input and output dimensions match.
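A Keras sketch of this two-input design is given below; the number of convolutional blocks, the filter counts, and the dense-layer width are our assumptions, since the exact values appear only in Fig. 8:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(scalogram_shape=(64, 768, 1), num_classes=7):
    """Two-input CNN: scalogram image plus distance scalar."""
    scalogram = layers.Input(shape=scalogram_shape, name="scalogram")
    distance = layers.Input(shape=(1,), name="distance")

    x = scalogram
    for filters in (16, 32, 64):  # assumed filter progression
        x = layers.Conv2D(filters, 3, strides=1, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.AveragePooling2D(pool_size=2, strides=2)(x)

    x = layers.Flatten()(x)
    x = layers.Concatenate()([x, distance])      # append the distance feature
    x = layers.Dense(128, activation="relu")(x)  # assumed width
    out = layers.Dense(num_classes, activation="softmax")(x)
    # Traversability variant: layers.Dense(1, activation="sigmoid")
    return tf.keras.Model([scalogram, distance], out)
```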
Proposed CNN architecture. Next to the boxes indicating the layers, the corresponding output shapes are listed. For the convolutional and pooling layers, the kernel shapes are given in parentheses. The seven neurons in the output layer and the softmax activation function are dedicated to the classification of the object classes. For the binary classification, the output layer can be reduced to a single neuron with the sigmoid activation function.
To evaluate the proposed architecture, we use a simpler CNN, adapted from LeNet-5,60 as a baseline. It differs from LeNet-5 only in the parameters of the convolution kernels (16 kernels of shape 5 × 5 in the first and 32 kernels of shape 3 × 3 in the second convolutional layer) and in the number of neurons in the three fully connected layers, which are 2688, 488, and 7, respectively. Deeper networks, such as ResNet or Inception nets, have not been considered as baselines in this work, since their vast number of parameters might conflict with the hardware restrictions of automotive control units.
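With the kernel and layer sizes stated above, the baseline can be sketched as follows; the activation functions and input shape are our assumptions:

```python
from tensorflow.keras import layers, models

def build_baseline(input_shape=(64, 768, 1)):
    """LeNet-5-like baseline (CNN#1) with the stated layer parameters."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 5, activation="relu"),   # 16 kernels, 5 x 5
        layers.AveragePooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),   # 32 kernels, 3 x 3
        layers.AveragePooling2D(2),
        layers.Flatten(),
        layers.Dense(2688, activation="relu"),
        layers.Dense(488, activation="relu"),
        layers.Dense(7, activation="softmax"),
    ])
```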
VII. RESULTS AND DISCUSSION
In the following, the baseline CNN is denoted as CNN#1 and our proposed architecture as CNN#2. Each training is repeated for ten rounds with different random seeds to account for the stochastic behavior of CNN training. The mean accuracies ± standard deviations are given in Table II. Note that balanced accuracies are calculated due to slightly different class sizes. The table's columns distinguish between lab and field data as well as between the aggregation into object and traversability classes.
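Balanced accuracy is the macro-average of per-class recall; a toy example with scikit-learn (hypothetical labels):

```python
from sklearn.metrics import balanced_accuracy_score

y_true = [0, 0, 1, 1, 1, 2]  # toy ground-truth labels
y_pred = [0, 1, 1, 1, 1, 2]  # toy predictions
# Per-class recalls: 1/2, 3/3, 1/1 -> balanced accuracy = 0.833
print(balanced_accuracy_score(y_true, y_pred))
```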
Mean accuracy ± standard deviation in percent on the test set, aggregated into object and traversability classes in the lab and field environments. In each column, the best accuracy is marked in bold. Since CNN#2 outperforms CNN#1, CNN#2 is used for the training runs with augmentation.
| Method | Lab: Object class | Lab: Traversability class | Field: Object class | Field: Traversability class |
|---|---|---|---|---|
| CNN#1: no augmentation | 82.9 ± 0.4 | 92.2 ± 0.4 | 61.1 ± 0.4 | 87.4 ± 0.4 |
| CNN#2: no augmentation | 86.7 ± 0.4 | 94.9 ± 0.3 | 65.4 ± 0.1 | 90.6 ± 0.3 |
| *Offline* | | | | |
| White noise | 88.1 ± 0.3 | 95.9 ± 0.2 | 65.5 ± 0.1 | 91.4 ± 0.2 |
| Clutter | 86.8 ± 0.3 | 94.7 ± 0.2 | 64.9 ± 0.5 | 90.2 ± 0.2 |
| Pitch shifting | 85.5 ± 0.6 | 93.5 ± 0.3 | 63.3 ± 0.4 | 89.1 ± 0.3 |
| *Online* | | | | |
| Frequency masking | 87.6 ± 0.3 | 95.3 ± 0.3 | 65.6 ± 0.2 | 91.0 ± 0.3 |
| Time masking | 87.7 ± 0.3 | 95.6 ± 0.1 | 65.8 ± 0.2 | 91.1 ± 0.2 |
| Salt and pepper noise | 87.6 ± 0.4 | 95.4 ± 0.3 | 65.2 ± 0.1 | 91.2 ± 0.2 |
| White noise | 86.7 ± 0.2 | 94.6 ± 0.2 | 65.2 ± 0.5 | 90.8 ± 0.2 |
| Time shifting | 88.6 ± 0.5 | 95.5 ± 0.2 | **66.4 ± 0.2** | 91.3 ± 0.2 |
| Single-class mixup | 87.4 ± 0.5 | 95.3 ± 0.4 | 65.3 ± 0.1 | 90.8 ± 0.2 |
| Multi-class mixup | 87.1 ± 0.5 | 95.2 ± 0.2 | 65.4 ± 0.3 | 90.7 ± 0.3 |
| Combination^a | **90.1 ± 0.3** | **96.4 ± 0.1** | 66.2 ± 0.2 | **91.5 ± 0.2** |
^a Applied methods: white noise (offline), frequency masking, time masking, and time shifting.
A. Object discrimination
It can be seen from Table II that CNN#2 outperforms CNN#1 in both environments and for both class aggregations. For the object classes, CNN#2 achieves 86.7% (lab) and 65.4% (field) accuracy; with CNN#1, the accuracy drops by 3.8 and 4.3 percentage points, respectively. For traversability, CNN#2 achieves 94.9% (lab) and 90.6% (field), whereas the accuracy drops by 2.7 and 3.2 percentage points with CNN#1. Overall, lower accuracies are achieved in the field environment, since the data are affected by more disturbances such as ground clutter. Higher accuracies are achieved for the traversability classes than for the object classes, as the model only has to perform a binary classification. It can also be expected that the discrimination between traversable and non-traversable objects is easier, since most non-traversable objects have a larger acoustic cross section, resulting in greater echo amplitudes.
In Fig. 9, confusion matrices for CNN#2 without data augmentation are shown. In both the lab and field environments, the highest accuracy is achieved for the pedestrian class. This is beneficial for autonomous driving applications, since the detection of pedestrians has a higher priority than that of objects which can cause physical damage but not personal injury. Compared to other objects in our dataset, the pedestrian is characterized by an overall large acoustic cross section consisting of many highlights, as well as by increased sound absorption due to clothing. The interference patterns and amplitudes can be assumed to be relevant features in the time-frequency images, allowing the model to easily discriminate the pedestrian from the other object classes.
(Color online) CNN#2: Confusion matrices for (a) lab data and (b) field data without data augmentation. The best accuracy is achieved for the pedestrian class in both environments. Using field data, the accuracies decrease the most for traversable object classes, including “no object.”
In the lab environment, the model is even able to detect bags and small objects with accuracies above 70%. In the field data, on the other hand, it appears difficult to distinguish traversable objects from ground clutter. In both environments, bags are the most challenging objects to classify. In addition to a small target strength, bags consist of many sub-reflectors, making them difficult to distinguish from clutter or from small objects. We assume that the higher accuracy for curbs compared to the other traversable object classes can be ascribed to the more uniform geometry of the objects within the class and to the increased target strength, especially for curbs with a beveled edge. Comparing lab and field data, the accuracy for "no object" drops from 97.1% to 55.8%, since there is more confusion with traversable objects. In the field data, "no object" mainly consists of ground clutter, which masks echoes from traversable objects with small target strengths, especially at distant positions. However, the accuracies for non-traversable objects in the field data are more stable, ranging between 76.3% (tube/pole) and 93.9% (pedestrian). The better accuracies for non-traversable objects are probably achieved because of greater target strengths and, thus, better SNRs. The successful discrimination between "tree" and "tube/pole" was expected, since tubes and poles are very simple scatterers, while trees are more complex objects with rough surfaces and more highlights.
B. Data augmentation
Since CNN#2 delivers better accuracies, it is used for the following experiments with the proposed augmentation methods. From the results in Table II, it is apparent that the augmentation methods vary in their ability to improve the classification results. While most of the methods positively affect the accuracy, pitch shifting reduces the accuracies in both environments. We assume that this results from the inaccurate calculation of the Doppler shifts due to the interference of backscatter from different directions: the augmentation neglects that the Doppler effect should not be applied equally to clutter or to object backscatter arriving at incidence angles ≠ 0°. Due to this simplification, the achievable improvement is assumed to be limited. This effect could be reduced by directional beamforming using a sensor array to suppress echoes from directions other than the direction of interest. Clutter augmentation also does not lead to significantly improved results. Consideration of the corresponding confusion matrices (not shown here) revealed that this can be ascribed to an increased confusion of traversable objects with "no object." It is important to ensure that specific augmentations do not excessively mask relevant features in the echo signals, which is crucial for small objects with low target strengths. The offline white noise augmentation increases the accuracy by 1.4% in the lab and by 0.1% in the field environment. The smaller effect for the field data might result from the fact that these signals are already naturally noisier. In general, the augmentation methods achieve greater improvements for the lab than for the field data. This could be due to the inherently more varied field data.
Among the online augmentation methods, frequency masking, time masking, and time shifting improve the accuracies in both environments. We assume that frequency masking and time masking force the CNN to also consider secondary interference patterns, resulting in more robust behavior. Time shifting represents a natural variation in the signals, since the estimation of the distance to an object and the corresponding time window is mostly not perfectly accurate. Salt and pepper noise, single-class mixup, and multi-class mixup only improve the results for the lab data. Most likely, relevant patterns in the time-frequency images are frequently masked when using salt and pepper noise. It may be assumed that mixup performs better when backscatter from multiple objects is also present in the test set, which is the case in many realistic situations. Therefore, measurements with multiple objects in the scenery should be included in future investigations. The online white noise augmentation does not improve the results. It can be concluded that injecting white noise directly into the time signals is preferable to adding it to the time-frequency images. If a method can be applied to both time signals and time-frequency images, we suggest applying the modification to the raw time signals, since the physical effects are generally reproduced more naturally. Overall, time shifting produces the best accuracies of the single methods (+1.9% for the object classes and +0.6% for the traversability classes in the lab, and +1.0% and +0.7% in the field, respectively).
Finally, a model has been trained using all augmentation methods that resulted in accuracy improvements in both environments, namely, offline white noise, frequency masking, time masking, and time shifting. Thereby, an increase of 3.4% for the lab data and 0.8% for the field data could be achieved for the object classes. For the traversability classes, the accuracies improved by 1.5% and 0.9%, respectively. It can be concluded that carefully selected augmentation methods can improve classification accuracies both for the aggregation into object classes and for traversability. Furthermore, both offline augmentation methods applied to raw time signals and online methods applied to time-frequency images are suitable.
Since it has been observed that many augmentation methods do not impact the classification accuracies of all classes equally,24,61 we examined the class-specific accuracy improvements. In Fig. 10, the accuracy differences for each object class and each augmentation method are illustrated. It is apparent that for both (a) lab data and (b) field data, further accuracy improvements can be achieved by applying the methods to specific object classes only. No method is suitable for all classes. For example, on lab data, white noise (offline) should not be used for the classes "tree" and "pedestrian," yet it is the only method that significantly improves the performance for the class "small object." The class "tube/pole," representing the simplest objects, is the only class improved by all augmentation methods on lab data; it also shows the most similar improvements across all methods. On field data, however, the accuracy for "tube/pole" is decreased by all augmentation methods. Further confusion matrices showed that objects of the class "tube/pole" are then more frequently confused with the class "tree." While the results for the class "small object" in the lab are improved only by white noise (offline), several other methods lead to improvements in the outdoor environment. The accuracy for the class "bag" is increased by most of the methods on both lab and field data; as shown in the confusion matrices in Fig. 9, this is also the class with the lowest overall accuracy. Finally, it can be concluded that the suitability of the single augmentation methods for individual classes also differs significantly between the environments. Once the class-specific accuracies have been obtained, a new training with a class-conditional selection of augmentation methods could be performed to increase the overall accuracy.
(Color online) Difference in class-specific accuracy for the augmentation methods compared to using the training set without augmentation, based on (a) lab data and (b) field data.
VIII. CONCLUSION
In this paper, we have demonstrated the feasibility of classifying obstacles that can appear in parking or maneuvering situations, not only with regard to traversability but also into several object classes, using a single low-cost ultrasonic sensor. The categorization into object classes is desirable because it is, e.g., of special interest to detect pedestrians or other obstacles that can move or are particularly worth protecting. Improved performance of ultrasonic sensors can contribute to future parking assistance systems and autonomous driving applications.
For the discrimination of objects by acoustic echoes, target strength according to the acoustic cross section and reflectivity, as well as the number of highlights and their spacing, have been considered as relevant features. In the received signal, the target strength can be found in the amplitudes, while interferences caused by multiple discrete highlights are indicated in the temporal structure and in spectral patterns. We have used the CWT to extract temporal as well as spectral features, represented in time-frequency images. We have proposed a CNN using not only the time-frequency images but also the distance to the object as an input feature. The proposed CNN architecture has outperformed the baseline, a LeNet-5-like CNN, in all environments. It can be concluded that CNNs are capable of classifying different objects by their acoustic backscatter using CWT-generated time-frequency representations and the object's distance as input features. For automotive ultrasonic sensing, high accuracies can be achieved discriminating between traversable and non-traversable objects. However, the discrimination between small objects, whose echoes are likely to be masked by clutter reflections, can be challenging.
Furthermore, we have shown that data augmentation is useful to train a more robust model. It is important to note that augmentation methods should be selected carefully, as not all methods improved the accuracies. Conventional methods from computer vision should be viewed critically when dealing with time-frequency representations. Moreover, domain knowledge is necessary for selecting or engineering appropriate augmentation methods. We suggest applying augmentation methods to raw time signals rather than to time-frequency images whenever a method is available in both domains. Augmentation methods such as adding white noise to the time signals, frequency masking, time masking, salt and pepper noise, and time shifting have increased the accuracies on our test sets. Future work may include Doppler-motivated pitch shifting in a sensor array setup and mixup augmentation with measurements of overlapping backscatter from multiple objects. Overall, a combination of selected augmentation methods in the time domain as well as methods applied to the scalograms has led to the best result on our datasets. We have achieved 90.1% (lab) and 66.2% (field) accuracy for the object classes and 96.4% (lab) and 91.5% (field) accuracy for traversability. Further, it has been observed that a class-conditional selection of augmentation methods can be meaningful to further improve classification accuracy.
In this study, we have used only a single pulse-echo iteration to perform classification. It can be expected that including multiple iterations will lead to significant accuracy improvements. This can be achieved by a simple majority vote over CNN outputs or by specialized architectures such as 3D-CNNs62 or convolutional recurrent neural networks (CRNNs).63
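A minimal sketch of the majority-vote idea over several pulse-echo iterations:

```python
import numpy as np

def majority_vote(probs: np.ndarray) -> int:
    """Combine per-echo CNN outputs; probs has shape
    (num_iterations, num_classes)."""
    votes = probs.argmax(axis=1)  # predicted class per iteration
    return int(np.bincount(votes).argmax())
```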
For future investigations, it is planned to expand the datasets to be able to test the model not only on a position-based train/test split but also on entirely new objects. Backscatter from multiple objects in a single measurement should also be included, as this is common in realistic situations. Moreover, involving multiple sensors in the classification process should be considered.
ACKNOWLEDGMENTS
We would like to thank Professor Dr. Andreas Koch (Stuttgart Media University) for constructive discussions and support of our work. Also, we would like to thank Andrew Buchanan for proofreading.