Automotive ultrasonic sensors are used for close-range surround sensing in parking and maneuvering situations. In addition to ultrasonic ranging, classifying obstacles based on ultrasonic echoes to improve environmental perception for advanced driver-assistance systems is an ongoing research topic. Related studies consider only magnitude-based features for classification. However, the phase of an echo signal contains relevant information for target discrimination. This study discusses and evaluates the relevance of the target phase in echo signals for object classification in automotive ultrasonic sensing based on lab and field measurements. Several phase-aware features in the time domain and time-frequency features based on the continuous wavelet transform are proposed and processed using a convolutional neural network. Indeed, phase features are found to contain relevant information, producing only 4% less classification accuracy than magnitude features when the phase is appropriately processed. The investigation reveals high redundancy when magnitude and phase features are jointly fed into the neural network, especially when dealing with time-frequency features. However, incorporating the target phase information improves identification quality in high clutter environments, increasing the model's robustness to signals with low signal-to-noise ratios. Ultimately, the presented work takes one further step toward enhanced object discrimination in advanced driver-assistance systems.
I. INTRODUCTION
Automotive ultrasonic sensors are used in parking and maneuvering to calculate the distance to obstacles via the pulse-echo method.1 In addition to ultrasonic ranging, the classification of obstacles in the vehicle environment is desirable for driver-assistance systems and automated driving applications. Currently, there are many studies dealing with object classification in radar, lidar, and camera-based sensing.2 However, only a few studies have been published considering classification tasks in automotive ultrasonic sensing. Related works include the classification of simple shapes,3 more complex obstacles,4 object heights,5 and road surface conditions.6–8 Aside from automotive sensing, classifying acoustic echoes of in-air targets has also been addressed in more general studies.9–15 It is striking that most studies consider only magnitude-based features, e.g., envelopes, power spectra, or magnitude spectrograms. To the best of our knowledge, evaluating phase information for classifying ultrasonic echoes of in-air targets remains an open question and is missing in the literature. In underwater sonar applications, however, the phase information has already been found to carry relevant features for target discrimination.16–20 Furthermore, it has been found that bats and dolphins process phase information for echolocation and target discrimination.21–23 In other acoustic applications, such as speech processing, pitch detection, and transient detection, phase information is also being used as an important feature.24–29 In speech processing, a number of researchers report the effectiveness of phase information for signals with a low signal-to-noise ratio (SNR).30–32 In automotive ultrasonic sensing, low SNRs must often be dealt with in high clutter environments. Therefore, the potential to increase the robustness of classification models by adding phase information to the input features should be clarified. However, the benefits might be limited since redundancies are included in the magnitude and phase of the signals, especially when dealing with time-frequency features.33
This work examines the relevance of phase information for object classification in automotive ultrasonic sensing using convolutional neural networks (CNNs). Based on time domain features and time-frequency images, the classification performance of the raw and processed phase of echo signals is evaluated and compared to magnitude features. Furthermore, features including both magnitude and phase information are presented and processed in a CNN. More concretely, time domain features are transformed into two-dimensional (2D) feature maps using a specialized one-dimensional (1D) convolution head. The same 2D CNN architecture is then used for processing the 2D feature maps and the time-frequency images. The impact of jointly feeding magnitude and phase information to the CNN is discussed and quantified regarding classification accuracy.
This paper is organized as follows. Section II discusses the characteristics of the target phase in acoustic echoes, followed by the description of the data set in Sec. III. The preprocessing of the captured signals is the primary content of Sec. IV. The processing of the raw phase and several feature extraction methods are presented in Sec. V. In Sec. VI, we propose a CNN architecture for classifying the 1D time domain and 2D time-frequency features. The classification results of magnitude-only and phase-only features, as well as features including both magnitude and phase information, are compared and discussed in Sec. VII. Finally, Sec. VIII concludes this work and provides suggestions for future work.
II. TARGET PHASE IN ACOUSTIC ECHOES
In addition to the effects of sub-echo interference, there are other minor influences on the target phase. From the physics of sound backscattering, it is known that the target phase also depends on the characteristic impedance of the scatterer.37 In the context of this work, most objects have high impedance differences with respect to air. However, sound is partially absorbed by porous textures, such as the clothing of pedestrians, affecting the target phase. Phase shifts may also be caused by interfering creeping waves, which particularly emerge on cylindrical scatterers.38 Due to the Doppler effect, the phase of an echo signal is also affected when an object is moving. Thus, static object classes may be excluded in the classification process when Doppler shifts are recognized in the target phase. Overall, the target phase may be considered a meaningful feature for classifying acoustic scatterers, as it encodes information about the size, shape, and texture of an object.
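As a rough, illustrative calculation (only the 0.5 m/s approach speed of the dynamic scenes in Sec. III is taken from the paper): for a carrier near the middle of the 43.5–52.5 kHz chirp band and a speed of sound of about 343 m/s, the two-way Doppler shift is

$$ f_\mathrm{d} = \frac{2v}{c}\, f_\mathrm{c} \approx \frac{2 \cdot 0.5\ \mathrm{m/s}}{343\ \mathrm{m/s}} \cdot 48\ \mathrm{kHz} \approx 140\ \mathrm{Hz}, $$

which manifests as an additional linear trend in the target phase over the duration of the echo.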
In Fig. 1, the real-valued time signal, the amplitude envelope, and the IF of a measured reflection of a pole are shown. The reflection of a pole mainly consists of two overlapping sub-echoes, the first emerging horizontally and the second diagonally at the base of the pole. The swelling amplitude, which can be seen in the time signal in Fig. 1(a) and in the envelope in Fig. 1(b), is induced by the transducer's resonance. The transient response of the transducer can be seen in the brief increase in the amplitude in Fig. 1(b) and the decrease in the IF in Fig. 1(c) at the beginning of the reflection (0.25–0.5 ms). In the time signal and the envelope, one can recognize the lower amplitude of the second sub-echo due to its longer time of flight. The origin of the second sub-echo can be estimated by finding the beginning of the interference patterns (ca. 1.2 ms) in the overlap of the sub-echoes. These interference patterns appear as ripples in the envelope and in the IF in Fig. 1(c). In the IF, the increasing frequency over time of the transmitted chirp can be seen in both sub-echoes. Analogous to the envelope, the echo overlap appears as ripples in the IF. However, the end of the first echo (ca. 1.9 ms) and the second echo (ca. 2.7 ms) can be recognized more clearly in the IF than in the time signal or the envelope.
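To make the interference mechanism concrete, the following minimal Python sketch superimposes two delayed, attenuated copies of a chirp in the paper's 43.5–52.5 kHz band and extracts the envelope and IF via the analytic signal. This is a synthetic illustration, not the measured data; the sample rate, delay, and attenuation are arbitrary demo values.

```python
# Synthetic illustration: two delayed, scaled copies of a linear chirp
# interfere, producing ripples in both the envelope and the IF.
import numpy as np
from scipy.signal import chirp, hilbert

fs = 250_000                                       # demo sample rate in Hz (assumption)
t = np.arange(0, 2e-3, 1 / fs)                     # 2 ms pulse
pulse = chirp(t, f0=43_500, t1=t[-1], f1=52_500)   # chirp band from the paper

echo = np.zeros(int(4e-3 * fs))
echo[:pulse.size] += pulse                         # first sub-echo
d = int(0.9e-3 * fs)                               # second sub-echo: delayed, attenuated
echo[d:d + pulse.size] += 0.5 * pulse

z = hilbert(echo)                                  # analytic signal
envelope = np.abs(z)                               # amplitude envelope
inst_phase = np.unwrap(np.angle(z))                # continuous phase
inst_freq = np.gradient(inst_phase) * fs / (2 * np.pi)  # IF in Hz
```

Plotting `envelope` and `inst_freq` over the overlap region reproduces the ripple patterns described above for the measured pole reflection.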
(Color online) Time signal, envelope, and IF of the overall reflection of a pole, consisting of two overlapping sub-echoes.
III. DATA SET
We use the data set described in Eisele et al.,4 where a typical automotive ultrasonic sensor has been used to transmit frequency-modulated chirps ranging from 43.5 to 52.5 kHz. The captured echo signals are sampled at a rate satisfying the Nyquist criterion. The data set contains the backscattering of 30 objects, including pedestrians, measured in different orientations at 55 positions in the sensor's field of view. The measurements have been performed in two environments: a semi-anechoic chamber with low clutter (lab data) and an asphalt parking space with high clutter (field data). Overall, there are 498 300 labeled measurements, including stationary and dynamic scenes where the sensor approaches the object with a velocity of 0.5 m/s. The objects are aggregated into seven classes: no object, bag, small object, curb, tree, tube/pole, and pedestrian. The data set is split into training and test data based on different object positions, which are evenly distributed in the sensor's field of view.
IV. PREPROCESSING
In the following, the preprocessing steps that are applied to the digitized raw time signals are described. We divide the preprocessing into (a) steps to be applied on the sensor's application-specific integrated circuit (ASIC) to transmit the data at a lower data rate to a central electronic control unit (ECU) and (b) steps to be applied on the ECU to condition the signals for feature extraction. Typically, only the steps necessary for data reduction are performed on the sensor's ASIC, while the ECU offers more hardware resources for further calculations. The preprocessing steps are illustrated in Fig. 2.
Preprocessing of the digitized signals before feature extraction with (a) processing intended on the sensor's ASIC to transmit the data at a lower data rate and (b) processing on a central ECU to condition the signals for feature extraction.
A. Data reduction
B. Data conditioning
In the following, the preprocessing steps in Fig. 2(b) are described. First, we cut out the object-related backscatter from the preprocessed signals based on the known object distances in our data set. At the sample rate of 25 kHz, a window length of 178 samples has been found to include the entire backscatter of even broad scatterers in the data set. To keep the relation between the echo amplitudes and the object distance when cutting out the windows, the object distance, which defines the position of the window's origin, is stored as an additional scalar feature input (cf. Sec. VI). In practice, the origin of the windows could be determined based on a sliding window approach or on conventionally calculated echo points.
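A minimal sketch of this windowing step, assuming a speed of sound of 343 m/s and a simple round-trip time-of-flight mapping from object distance to sample index (the paper does not specify this mapping):

```python
import numpy as np

FS = 25_000      # IQ sample rate from the paper, in Hz
WINDOW = 178     # window length in samples (cf. Table I)
C_AIR = 343.0    # assumed speed of sound in m/s

def cut_window(iq_signal: np.ndarray, distance_m: float):
    """Cut the object-related backscatter out of an IQ record.

    The window origin is placed at the round-trip time of flight for the
    known object distance; the distance itself is kept as a scalar feature.
    """
    start = int(round(2 * distance_m / C_AIR * FS))
    window = iq_signal[start:start + WINDOW]
    return window, distance_m   # scalar distance accompanies the window
```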
V. FEATURE EXTRACTION
In the following, several feature extraction methods are presented, aiming to include target phase information in the classification process. Based on the real-valued time signals and the complex-valued IQ signals, we calculate time domain features on the one hand and time-frequency images on the other. The effectiveness of the features and their combinations as a classifier input is quantified afterward. An overview of the feature extraction is illustrated in Fig. 3.
Feature extraction steps applied to the preprocessed signals to calculate time domain features and time-frequency features using the continuous wavelet transform (CWT).
The first approach is to use the time signals directly as a 1D input to the CNN, since the entire information, including the phase, is encoded. However, the CNN is not provided any prior knowledge about which frequencies are relevant for echo discrimination; an appropriate representation of the signals has to be learned during training, making this approach prone to sparse or imbalanced data. Another time domain approach is to use the envelope and the IF as a dual-channel 1D input, which is quite convenient since amplitude and phase can be calculated directly from the complex-valued IQ signals without reconstructing a real-valued time signal in the preprocessing.
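As a sketch of this dual-channel feature, assuming the standard definitions of the envelope as the magnitude of the complex signal and the IF as the scaled gradient of the unwrapped phase (the paper's Eqs. (1) and (2) are not reproduced in this excerpt):

```python
import numpy as np

def envelope_and_if(z: np.ndarray, fs: float = 25_000.0) -> np.ndarray:
    """Dual-channel 1D feature (E, IF) from a complex IQ window.

    Assumes E = |z| and IF = (fs / 2*pi) * d(phi)/dn with phi the unwrapped
    phase. Note: for baseband IQ data, the IF is relative to the
    demodulation carrier.
    """
    env = np.abs(z)                                    # envelope
    phase = np.unwrap(np.angle(z))                     # continuous phase
    inst_freq = np.gradient(phase) * fs / (2 * np.pi)  # IF in Hz
    return np.stack([env, inst_freq])                  # tensor shape (2, 178)
```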
In Fig. 4, the extracted feature images of the CWT of the echo signal of a pole are shown. The first sub-echo of the pole relates to a horizontal reflection, and the second to a diagonal reflection at the base of the pole. In between, characteristic interference patterns emerge, depending on the object's geometric features. The sub-echoes of the reflected chirp and the interference patterns can be clearly seen in the magnitude scalogram SM in Fig. 4(a). Compared to the time domain features shown in Fig. 1, the individual sub-echoes can be distinguished more clearly in the magnitude scalogram. While it is hard to identify relevant patterns in the wrapped phase scalogram SP and the continuous phase scalogram SCP in Figs. 4(b) and 4(c), respectively, object-related patterns become evident in the phase gradients of the CIF scalogram SCIF in Fig. 4(d). Highlights in SCIF appear where the magnitude in SM is close to zero, emphasizing the exact positions of destructive interference in time and frequency. In regions of noise, the notches in SCIF are randomly distributed. The real scalogram and the imaginary scalogram in Figs. 4(e) and 4(f) are dominated by a striped appearance relating to the trigonometric properties of the complex quantities. In these representations, object-related patterns recede into the background.
(Color online) Scalogram feature images of a pole's backscatter. The magnitude scalogram in (a) shows two sub-echoes of the chirp and characteristic interference patterns. The wrapped phase scalogram in (b) is hard to interpret due to the phase discontinuities. Also, in the continuous phase scalogram in (c), the target phase is not evident. In the CIF scalogram in (d), object-related phase patterns are revealed. In the real and imaginary scalograms in (e) and (f), object-related patterns are not clearly visible.
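The following minimal sketch derives the scalogram features from a complex Morlet CWT using PyWavelets. The wavelet choice, the 64 logarithmically spaced scales, and the omitted resampling of the time axis (the paper's images are 64 × 64 pixels, while the window has 178 samples) are assumptions, not taken from the paper:

```python
import numpy as np
import pywt

def scalogram_features(x: np.ndarray, fs: float = 25_000.0) -> dict:
    """Compute the CWT-based feature images of Table I for one echo window."""
    scales = np.geomspace(2, 64, num=64)     # 64 frequency channels (assumption)
    coeffs, _ = pywt.cwt(x, scales, "cmor1.5-1.0", sampling_period=1 / fs)

    s_m = np.abs(coeffs)                     # SM: magnitude scalogram
    s_p = np.angle(coeffs)                   # SP: wrapped phase scalogram
    s_cp = np.unwrap(s_p, axis=1)            # SCP: continuous phase (along time)
    s_cif = np.gradient(s_cp, axis=1) * fs / (2 * np.pi)  # SCIF: phase gradient
    s_ri = np.stack([coeffs.real, coeffs.imag])  # SRI: 2-channel real/imaginary
    s_mcif = np.stack([s_m, s_cif])              # SMCIF: joint magnitude + CIF
    # The paper additionally resizes the images to 64 x 64 pixels (omitted here).
    return {"SM": s_m, "SP": s_p, "SCP": s_cp, "SCIF": s_cif,
            "SRI": s_ri, "SMCIF": s_mcif}
```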
An overview of the described feature representations and the combinations that will be fed into the CNN is given in Table I. Each row represents a set of features that is quantified in terms of classification accuracy. The first dimension of the tensor shape is defined by the number of channels, relating to the number of input features that are considered. The 2D scalogram features have an image size of 64 × 64 pixels, and the 1D time domain features have a length of 178 samples.
Naming and tensor shape of feature representations and combinations.
| Name | Features | Tensor shape |
|---|---|---|
| SM | Magnitude scalogram [Eq. (8)] | (1, 64, 64) |
| SP | Phase scalogram [Eq. (9)] | (1, 64, 64) |
| SCP | Continuous phase scalogram | (1, 64, 64) |
| SCIF | CIF scalogram [Eq. (10)] | (1, 64, 64) |
| SMCIF | Magnitude scalogram, CIF scalogram | (2, 64, 64) |
| SRI | Real-part scalogram, imaginary-part scalogram | (2, 64, 64) |
| TS | Time signal | (1, 178) |
| E | Envelope [Eq. (1)] | (1, 178) |
| IF | Instantaneous frequency [Eq. (2)] | (1, 178) |
| EIF | Envelope, IF | (2, 178) |
VI. CNNS
Based on the data set, which contains a ground truth label with the object class and the object distance for each sample, we perform supervised learning54 and use a CNN for the classification task. Compared with classical neural networks, CNNs have the advantage of shared weights in the convolutional layers. This results in a reduced number of trainable parameters, allowing efficient processing of high-dimensional input data, such as time signals or images. Further, CNNs provide translation invariance due to the combination of convolutional and pooling layers. CNNs have been successfully applied in many applications dealing with acoustic signals or time-frequency images.46,47,55,56
We use real-valued CNNs to perform the classification task. Studies also describe complex-valued neural networks (CVNNs) capable of directly processing complex-valued inputs.57 However, CVNNs are still in an early research phase and are not included in common deep learning libraries. Issues with CVNNs involve non-differentiable activation functions, weight initialization, and regularization.57,58 For that reason, and because we do not aim to address the evaluation of CVNNs, we stick to real-valued CNNs in this study.
We have adapted the CNN architecture proposed in a previous work4 for processing time-frequency image inputs of shape 64 × 64. The architecture is shown in Table II and sketched in code below. For each convolutional layer, batch normalization59 and the rectified linear activation function (ReLU)60 are used. The input layer is adapted to N input channels, where N is the number of input feature images that are stacked before being fed into the network. In the first convolutional layer, 16 kernels of shape 7 × 7 are applied to extract low-level features. Average pooling is then used to reduce the dimensionality. The kernels in the subsequent two convolutional layers are of shape 1 × 5 and 5 × 1, respectively, to extract temporal and spectral features separately. Subsequently, two stacks of a convolutional layer with 64 kernels of shape 3 × 3 and average pooling are employed to extract high-level features. The extracted feature maps are flattened and concatenated with the scalar distance feature. Finally, a fully connected layer of 256 neurons with the ReLU activation function and a fully connected layer of seven neurons with the softmax activation function61 are used to map the flattened features to class probabilities.
2D CNN architecture.
| Layer | Output dimension |
|---|---|
| Input layer | N × 64 × 64 |
| 2D convolutional layer (7 × 7) | 16 × 64 × 32 |
| Average pooling (2 × 2) | 16 × 32 × 16 |
| 2D convolutional layer (1 × 5) | 32 × 32 × 16 |
| 2D convolutional layer (5 × 1) | 32 × 32 × 16 |
| Average pooling (2 × 2) | 32 × 16 × 8 |
| 2D convolutional layer (3 × 3) | 64 × 16 × 8 |
| Average pooling (2 × 2) | 64 × 8 × 4 |
| 2D convolutional layer (3 × 3) | 64 × 8 × 4 |
| Average pooling (2 × 2) | 64 × 4 × 2 |
| Flatten | 512 |
| Concatenate (+ distance) | 513 |
| Fully connected layer | 256 |
| Fully connected layer | 7 |
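A PyTorch sketch of the 2D CNN in Table II follows. The (1, 2) stride of the first convolution is inferred from the 64 → 32 width reduction in the table, and the padding values are likewise chosen to reproduce the listed output shapes; neither is stated explicitly in this excerpt.

```python
import torch
import torch.nn as nn

class EchoCNN2D(nn.Module):
    def __init__(self, in_channels: int, n_classes: int = 7):
        super().__init__()

        def block(conv):
            # Each convolution is followed by batch normalization and ReLU.
            return nn.Sequential(conv, nn.BatchNorm2d(conv.out_channels), nn.ReLU())

        self.features = nn.Sequential(
            block(nn.Conv2d(in_channels, 16, 7, stride=(1, 2), padding=3)),  # 16x64x32
            nn.AvgPool2d(2),                                                 # 16x32x16
            block(nn.Conv2d(16, 32, (1, 5), padding=(0, 2))),                # 32x32x16
            block(nn.Conv2d(32, 32, (5, 1), padding=(2, 0))),                # 32x32x16
            nn.AvgPool2d(2),                                                 # 32x16x8
            block(nn.Conv2d(32, 64, 3, padding=1)),                          # 64x16x8
            nn.AvgPool2d(2),                                                 # 64x8x4
            block(nn.Conv2d(64, 64, 3, padding=1)),                          # 64x8x4
            nn.AvgPool2d(2),                                                 # 64x4x2
            nn.Flatten(),                                                    # 512
        )
        self.classifier = nn.Sequential(
            nn.Linear(512 + 1, 256), nn.ReLU(),   # + scalar distance feature
            nn.Linear(256, n_classes),            # softmax applied in the loss
        )

    def forward(self, scalograms: torch.Tensor, distance: torch.Tensor):
        h = self.features(scalograms)                     # (batch, 512)
        h = torch.cat([h, distance.unsqueeze(1)], dim=1)  # (batch, 513)
        return self.classifier(h)                         # class logits
```

For SMCIF or SRI inputs, `in_channels=2`; for single-scalogram inputs, `in_channels=1`. Training with `nn.CrossEntropyLoss` applies the softmax of Table II implicitly.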
1D CNN head with a 1D convolutional layer and dimension insertion.
| Layer | Output dimensions |
|---|---|
| Input layer | N × 178 |
| 1D convolutional layer (1 × 7) | 64 × 60 |
| Add dimension | 1 × 64 × 60 |
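A sketch of the 1D CNN head in Table III, again in PyTorch. A stride of 3 with padding 3 maps the 178-sample input to the 60 time steps listed in the table; these values are inferred, not given.

```python
import torch
import torch.nn as nn

class EchoHead1D(nn.Module):
    """Transforms 1D time domain features into a 2D feature map."""

    def __init__(self, in_channels: int):
        super().__init__()
        # (178 + 2*3 - 7) // 3 + 1 = 60 output time steps
        self.conv = nn.Conv1d(in_channels, 64, kernel_size=7, stride=3, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x)       # (batch, 64, 60): a 2D map of 64 learned channels
        return h.unsqueeze(1)  # add image dimension -> (batch, 1, 64, 60)
```

The resulting single-channel image is then passed on to the 2D CNN, with its input layer adapted to the 1 × 64 × 60 shape.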
(Color online) Proposed CNN architecture for (a) time domain input or (b) scalogram input with a 1D convolution layer (blue), 2D convolution layers (red), and average pooling (green).
VII. RESULTS AND DISCUSSION
The accuracies on the test sets for the features defined in Table I, using lab and field data, are shown in Table IV. In Figs. 6 and 7, boxplots are shown for the lab and field data results, respectively. Generally, lower accuracies are achieved for the field data than for the lab data due to lower SNRs in the field data, mainly caused by asphalt clutter. The discussion of the results is structured as follows: magnitude-only and phase-only features are evaluated in Sec. VII A, and features including both magnitude and phase information in Sec. VII B.
Mean accuracy ± standard deviation in percent on the test set of the lab and field data. The feature type indicates whether magnitude-only (M), phase-only (P), or both magnitude and phase information (B) is encoded. The number of input channels refers to the number of feature representations that are stacked. The highest accuracy per environment is marked in bold.
| Input feature | Type | Channels | Architecture | Lab data | Field data |
|---|---|---|---|---|---|
| SM | M | 1 | 2D CNN | 85.87 ± 0.48 | 65.48 ± 0.52 |
| SP | P | 1 | 2D CNN | 75.61 ± 0.56 | 56.66 ± 0.98 |
| SCP | P | 1 | 2D CNN | 80.42 ± 0.47 | 57.99 ± 0.76 |
| SCIF | P | 1 | 2D CNN | 82.16 ± 0.49 | 61.23 ± 0.62 |
| SMCIF | B | 2 | 2D CNN | **86.64 ± 0.37** | **66.91 ± 0.42** |
| SRI | B | 2 | 2D CNN | 85.11 ± 0.56 | 66.07 ± 0.52 |
| TS | B | 1 | 1D CNN | 85.43 ± 0.79 | 66.78 ± 0.56 |
| E | M | 1 | 1D CNN | 78.65 ± 0.66 | 57.66 ± 0.70 |
| IF | P | 1 | 1D CNN | 74.93 ± 0.50 | 52.16 ± 0.60 |
| EIF | B | 2 | 1D CNN | 84.06 ± 0.68 | 63.98 ± 0.46 |
(Color online) Boxplots for the accuracies of magnitude-only features (purple), phase-only features (green), and magnitude and phase features (blue) for the lab data. The diamond symbols indicate outliers.
(Color online) Boxplots for the accuracies of magnitude-only features (purple), phase-only features (green), and magnitude and phase features (blue) for the field data. The diamond symbols indicate outliers.
A. Magnitude-only and phase-only features
If the features in Figs. 6 and 7 were sorted from lowest to highest accuracy, the order would be the same for lab and field data, so many conclusions can be drawn equally for both environments. The best accuracy is achieved using SM. Comparing the phase scalograms, SP leads to the lowest accuracy, while the best accuracy is achieved using the CIF scalogram SCIF. Thus, it can be confirmed that unwrapping the phase discontinuities and calculating the phase gradients effectively highlights the target phase and increases classification accuracy. As expected, the phase shifts revealed in SCIF can be considered a meaningful feature for object discrimination, producing only about 4% lower accuracy than the magnitudes in SM. For the time domain features, IF produces about 4% lower accuracy than E using lab data and about 6% lower accuracy using field data.
It is concluded that, for both time domain and time-frequency features, the magnitudes of the echo signals contain more information relevant for object discrimination. However, the phase also yields considerable classification accuracy when preprocessed properly, confirming that the target phase contains relevant properties of the scatterer. As the best accuracy among magnitude-only and phase-only features is achieved with SM, we use SM as a baseline in the following, where the effectiveness of using both magnitude and phase information as a feature input is evaluated.
B. Combining magnitude and phase features
The feature inputs including both magnitude and phase information (SMCIF, SRI, EIF, and TS) are marked in blue in Figs. 6 and 7. Comparing these features, the best accuracies are achieved with SMCIF. It is concluded that calculating the phase gradients over the frequency channels of the CWT extracts target phase information beneficially, while SRI, TS, and EIF produce lower accuracies. Relevant patterns are less evident in the striped appearance of SRI, resulting in lower accuracies than for SMCIF and TS. When using time-frequency features, the magnitudes and phase derivatives should therefore be preferred to the real and imaginary parts. The lowest accuracy among features including magnitude and phase is achieved with EIF. Thus, reconstructing a real-valued time signal from the IQ signals, as required for TS, is recommended despite the additional processing. Using TS as a direct input to the CNN yields about 1% less accuracy than SMCIF for the lab data. For the field data, almost the same accuracy is achieved for TS and SMCIF. The 1D CNN head seems capable of learning features that are robust against noise, benefiting from the phase information encoded in the time signal.
Overall, the highest accuracies are achieved using SMCIF, increasing the accuracy by about 0.8% (lab) and 1.4% (field) compared to SM. A high amount of redundant information in the magnitude and phase scalograms is revealed, as reasonable results have already been achieved by SM and SCIF individually, and the combination only slightly improves the accuracies. We would argue that the improvements stem from adding focus to the nulls in the interference patterns, which relate to the geometric properties of the scatterers.
When comparing the accuracies of TS with E (–6.8% and –9.1% for lab and field data, respectively) and IF (–10.5% and –14.6%), it can be seen that including both magnitude and phase information is more crucial for time domain features than for scalogram images. Generally, adding phase information is more effective for the field data than for the lab data, highlighting the importance of phase information for noisy signals and high clutter environments. Except for EIF, all features combining magnitude and phase information outperform SM in the field environment.
VIII. CONCLUSION
In this study, we have examined the relevance of the target phase for classifying in-air targets using an automotive ultrasonic sensor. The distribution of sub-echoes within an overall backscatter and the interference patterns caused by overlapping sub-echoes have been identified as relevant features for target discrimination. In the phase of an echo signal, phase shifts emerge at the origin and the end of sub-echoes. Further, phase shifts are caused by interference where sub-echoes overlap. The target phase, relating to the geometric properties of a scatterer, is revealed by calculating the phase gradient of the unwrapped phase, which is defined as the IF of the echo signal.
We used a data set of 498 300 measurements including lab and field data as well as stationary and dynamic scenes. For the training of a CNN, time signals or time-frequency images can be used as an input. The CWT has been applied to extract time-frequency images of the echo signals. Based on the complex-valued CWT, magnitude and phase features have been calculated. The real and imaginary parts of the CWT have also been considered as a phase-encoded input feature. For the time domain features, the time signal as a direct CNN input, the amplitude envelope, and the IF have been considered. A 1D CNN head has been used to transform the time domain inputs to 2D feature images, allowing similar CNN architectures for the 1D and 2D features. The same 2D CNN architecture has been used to process the time-frequency images and the outputs of the 1D CNN head.
SCIF led to the best classification accuracies among the phase-only features, producing only about 4% less accuracy than SM. Unwrapping and calculating the phase gradients should therefore be performed to preprocess the raw phase. A high amount of redundancy between the magnitude and phase scalograms has been observed, as SMCIF led only to small accuracy improvements compared to SM. When dealing with time domain features, including phase information is more crucial, as the phase-encoded time signal led to significantly higher classification accuracies than the amplitude envelope. Overall, jointly feeding magnitude and phase features to the classifier adds robustness, especially when dealing with low SNRs, e.g., in high clutter environments.
When using CWT-based input features, calculating the magnitude and phase of the CWT should be preferred to using the real and imaginary parts. For the field data, comparable results have been achieved when using SMCIF and TS, showing that the 1D CNN head is capable of extracting noise-robust features. When using TS as a direct input, calculating a time-frequency transform is not needed. EIF, which can be calculated directly from the IQ signal, produced significantly lower accuracies than TS or phase-aware scalogram features. Thus, reconstructing a real-valued time signal from the IQ signals is recommended.
Future work could include evaluating CVNNs to directly feed the complex-valued CWT features or even the IQ signals into the classifier. Regarding the potential of using raw time signals as a CNN input, the classification accuracies should be compared to time-frequency images using larger data sets and optimized hyperparameters. Further, consideration could be given to using both the raw time signal and time-frequency features as inputs to the classifier. The phase of echo signals could also be considered as a feature for classifying ground types, as our results showed different benefits of adding phase information in the lab and field environments, where the main difference is the amount of ground clutter.
ACKNOWLEDGMENTS
We would like to thank Professor Dr. Andreas Koch (Stuttgart Media University) for supporting our work. Further, we would like to thank the team of the Institute for Applied Artificial Intelligence at Stuttgart Media University for the inspiring discussions in the “Journal Club.” We would also like to express our thanks to the group for computational methods at the Chair of Vibroacoustics of Vehicles and Machines (TUM) for the valuable exchange at the regular meetings.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
DATA AVAILABILITY
The data that support the findings of this study are available from Robert Bosch GmbH. Restrictions apply to the availability of these data, which were used under license for this study. Data are available from the authors upon reasonable request and with the permission of Robert Bosch GmbH.