Automotive ultrasonic sensors come into play for close-range surround sensing in parking and maneuvering situations. In addition to ultrasonic ranging, classifying obstacles based on ultrasonic echoes to improve environmental perception for advanced driver-assistance systems is an ongoing research topic. Related studies consider only magnitude-based features for classification. However, the phase of an echo signal contains relevant information for target discrimination. This study discusses and evaluates the relevance of the target phase in echo signals for object classification in automotive ultrasonic sensing based on lab and field measurements. Several phase-aware features in the time domain and time-frequency features based on the continuous wavelet transform are proposed and processed using a convolutional neural network. Indeed, phase features are found to contain relevant information, producing only about 4% lower classification accuracy than magnitude features when the phase is appropriately processed. The investigation reveals high redundancy when magnitude and phase features are jointly fed into the neural network, especially when dealing with time-frequency features. However, incorporating target phase information improves identification quality in high clutter environments, increasing the model's robustness against signals with low signal-to-noise ratios. Ultimately, the presented work takes one further step toward enhanced object discrimination in advanced driver-assistance systems.

Automotive ultrasonic sensors are used in parking and maneuvering to calculate the distance to obstacles via the pulse-echo method.1 In addition to ultrasonic ranging, the classification of obstacles in the vehicle environment is desirable for driver-assistance systems and automated driving applications. Currently, there are many studies dealing with object classification in radar, lidar, and camera-based sensing.2 However, only a few studies have been published considering classification tasks in automotive ultrasonic sensing. Related works include the classification of simple shapes,3 more complex obstacles,4 object heights,5 and road surface conditions.6–8 Aside from automotive sensing, classifying acoustic echoes of in-air targets has also been addressed in more general studies.9–15 It is striking that most studies consider only magnitude-based features, e.g., envelopes, power spectra, or magnitude spectrograms. To the best of our knowledge, evaluating phase information for classifying ultrasonic echoes of in-air targets remains an open question in the literature. In underwater sonar applications, however, the phase information has already been found to carry relevant features for target discrimination.16–20 Furthermore, it has been found that bats and dolphins process phase information for echolocation and target discrimination.21–23 In other acoustic applications, such as speech processing, pitch detection, and transient detection, phase information is also used as an important feature.24–29 In speech processing, a number of researchers report the effectiveness of phase information for signals with a low signal-to-noise ratio (SNR).30–32 In automotive ultrasonic sensing, low SNRs must often be dealt with in high clutter environments. Therefore, the potential to increase the robustness of classification models by adding phase information to the input features should be clarified. However, the benefits might be limited since redundancies are present in the magnitude and phase of the signals, especially when dealing with time-frequency features.33

This work examines the relevance of phase information for object classification in automotive ultrasonic sensing using convolutional neural networks (CNNs). Based on time domain features and time-frequency images, the classification performance of the raw and processed phase of echo signals is evaluated and compared to magnitude features. Furthermore, features including both magnitude and phase information are presented and processed in a CNN. More concretely, time domain features are transformed into two-dimensional (2D) feature maps using a specialized one-dimensional (1D) convolution head. The same 2D CNN architecture is then used for processing the 2D feature maps and the time-frequency images. The impact of jointly feeding magnitude and phase information to the CNN is discussed and quantified regarding classification accuracy.

This paper is organized as follows. Section II discusses the characteristics of the target phase in acoustic echoes, followed by the description of the data sets in Sec. III. The preprocessing of the captured signals is the primary content of Sec. IV. The processing of the raw phase and several feature extraction methods are presented in Sec. V. In Sec. VI, the authors propose a CNN architecture for classifying the 1D time domain and 2D time-frequency features. The classification results of magnitude-only and phase-only features, as well as features including both magnitude and phase information, are compared and discussed in Sec. VII. Finally, Sec. VIII concludes this work and provides suggestions for future work.

An acoustic signal reflected from an object may differ from the incident signal in magnitude (target strength) and phase (target phase). Considering a complex-valued nonstationary signal
(1) $x(t) = A(t)\,e^{j\varphi(t)}$
with the amplitude envelope $A(t) = |x(t)|$ and the instantaneous phase $\varphi(t) = \arg x(t)$, the change of the signal's waveform and the timing information of the signal components are encoded in $\varphi(t)$. In acoustic scattering, the timing of the sub-echoes in a combined delay-spread echo waveform depends on the scatterer's unique geometric features.34,35 Based on the length of the transmit signal and the spacing of the scatterer's facets, the sub-echoes often overlap in the overall backscattering. To increase the timing resolution via pulse compression, frequency-modulated pulses (chirps) are usually used as the transmit signal.36 When a transmitted chirp is reflected from a multi-facet scatterer, phase shifts are induced at the origin and the end of each of the sub-echoes. The length of the combined reflection indicates the size of an object, depending on the spacing of the object's sub-reflectors. Moreover, characteristic interference patterns result from overlapping sub-echoes, which also appear as phase shifts. The distribution of the sub-echoes is considered an essential feature for object discrimination, indicating the shape of an object.
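To make the sub-echo overlap described above concrete, the following Python sketch superimposes a linear chirp with a delayed, attenuated copy of itself and derives the envelope and phase of the combined reflection via the analytic signal. The pulse duration, delay, and attenuation are hypothetical values chosen for illustration, not the sensor's actual transmit signal.

```python
import numpy as np
from scipy.signal import chirp, hilbert

fs = 200e3                                         # sample rate in Hz (assumed)
t = np.arange(0, 2e-3, 1 / fs)                     # 2 ms pulse duration (hypothetical)
pulse = chirp(t, f0=43.5e3, t1=t[-1], f1=52.5e3)   # linear up-chirp, 43.5-52.5 kHz

# Two overlapping sub-echoes: the second is delayed and attenuated (hypothetical values)
delay = int(0.4e-3 * fs)
echo = np.zeros(len(t) + delay)
echo[:len(t)] += pulse
echo[delay:delay + len(t)] += 0.6 * pulse

# Envelope and unwrapped phase of the combined reflection via the analytic signal
analytic = hilbert(echo)
envelope = np.abs(analytic)                        # interference ripples appear in the overlap region
phase = np.unwrap(np.angle(analytic))              # phase shifts mark the sub-echo boundaries
```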

In addition to the effects of sub-echo interference, there are other minor influences on the target phase. From the physics of sound backscattering, it is known that the target phase also depends on the characteristic impedance of the scatterer.37 In the context of this work, most objects have high impedance differences with respect to air. However, sound is partially absorbed by porous textures, such as the clothing of pedestrians, affecting the target phase. Phase shifts may also be caused by interfering creeping waves, particularly emerging on cylindrical scatterers.38 Due to the Doppler effect, the phase of an echo signal is also affected when an object is moving. Thus, static object classes may be excluded in the classification process when Doppler shifts are recognized in the target phase. Overall, target phase may be considered a meaningful feature for classifying acoustic scatterers, including information about the size, shape, and texture of an object.

The raw phase of an echo signal, $\varphi(t)$, is hard to interpret since the phase is typically defined in $(-\pi, \pi]$. Therefore, phase unwrapping39 can be applied to eliminate the discontinuities, yielding the unwrapped phase $\varphi_u(t)$. To focus on phase shifts that can be attributed to the properties of the scatterer, it is convenient to calculate the gradient of the unwrapped phase. This relates to the instantaneous frequency (IF)
(2) $f(t) = \frac{1}{2\pi}\,\frac{\mathrm{d}\varphi_u(t)}{\mathrm{d}t}.$
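A minimal sketch of Eq. (2), assuming the echo is available as a complex-valued (e.g., IQ) signal; the unwrapping and gradient steps are exactly the ones described above.

```python
import numpy as np

def envelope_and_instantaneous_frequency(x, fs):
    """Amplitude envelope A(t) and instantaneous frequency f(t) of a
    complex-valued signal x sampled at fs, following Eqs. (1) and (2)."""
    envelope = np.abs(x)                                   # A(t) = |x(t)|
    phase_u = np.unwrap(np.angle(x))                       # unwrapped phase
    inst_freq = np.gradient(phase_u) * fs / (2 * np.pi)    # f(t) in Hz
    return envelope, inst_freq
```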

In Fig. 1, the real-valued time signal $x(t)$, the amplitude envelope $A(t)$, and the IF $f(t)$ of a measured reflection of a pole are shown. The reflection of a pole mainly consists of two overlapping sub-echoes, the first emerging horizontally and the second diagonally at the base of the pole. The swelling amplitude, which can be seen in the time signal in Fig. 1(a) and in the envelope in Fig. 1(b), is induced by the transducer's resonance. The transient response of the transducer can be seen in the brief increase in the amplitude in Fig. 1(b) and the decrease in the IF in Fig. 1(c) at the beginning of the reflection (0.25–0.5 ms). In the time signal and the envelope, one can recognize the lower amplitude of the second sub-echo due to its longer time of flight. The origin of the second sub-echo can be estimated by finding the beginning of the interference patterns (ca. 1.2 ms) in the overlap of the sub-echoes. These interference patterns appear as ripples in the envelope and in the IF in Fig. 1(c). In the IF, the increasing frequency of the transmitted chirp over time can be seen in both sub-echoes. Analogous to the envelope, the echo overlap appears as ripples in the IF. However, the end of the first echo (ca. 1.9 ms) and the second echo (ca. 2.7 ms) can be recognized more clearly in the IF than in the time signal or the envelope.

FIG. 1. (Color online) Time signal, envelope, and IF of the overall reflection of a pole, consisting of two overlapping sub-echoes.

We use the data set described in Eisele et al.,4 where a typical automotive ultrasonic sensor has been used to transmit frequency-modulated chirps ranging from 43.5 to 52.5 kHz. The captured echo signals are sampled at $f_s = 200$ kHz, satisfying the Nyquist theorem. The data set contains the backscattering of 30 objects, including pedestrians, measured in different orientations at 55 positions in the sensor's field of view. The measurements have been performed in two environments: a semi-anechoic chamber with low clutter (lab data) and an asphalt parking space with high clutter (field data). Overall, there are 498 300 labeled measurements, including stationary and dynamic scenes where the sensor approaches the object with a velocity of 0.5 m/s. The objects are aggregated into $C = 7$ classes: no object, bag, small object, curb, tree, tube/pole, and pedestrian. The data set is split into training and test data based on different object positions, which are evenly distributed in the sensor's field of view.

In the following, the preprocessing steps that are applied to the digitized raw time signals $x_{\mathrm{raw}}(t)$ are described. We divide the preprocessing into (a) steps to be applied on the sensor's application-specific integrated circuit (ASIC) to transmit the data at a lower data rate to a central electronic control unit (ECU) and (b) steps to be applied on the ECU to condition the signals for feature extraction. Typically, only the steps necessary for data reduction are intended to be applied on the sensor's ASIC, while more hardware resources for further calculations are available on the ECU. The preprocessing steps are illustrated in Fig. 2.

FIG. 2. Preprocessing of the digitized signals before feature extraction with (a) processing intended on the sensor's ASIC to transmit the data at a lower data rate and (b) processing on a central ECU to condition the signals for feature extraction.
The following describes the preprocessing steps in Fig. 2(a). Analogous to standard processing for transferring the digitized data at lower data rates, an IQ-mixer40,41 is applied, providing the complex-valued IQ signals
(3) $x_{IQ}(t) = x_I(t) + j\,x_Q(t) = x_{\mathrm{raw}}(t)\,e^{-j 2\pi f_c t}$
with the carrier frequency $f_c$. Considering the relation between complex exponential functions and the sine and cosine functions, $x_{\mathrm{raw}}(t)$ can be mixed with two sinusoidal signals with a 90° phase difference, yielding the in-phase component $x_I(t)$ and the quadrature component $x_Q(t)$ of $x_{IQ}(t)$ as follows:
(4) $x_I(t) = x_{\mathrm{raw}}(t)\,\cos(2\pi f_c t)$
and
(5) $x_Q(t) = -x_{\mathrm{raw}}(t)\,\sin(2\pi f_c t).$
The signals are transformed into the complex baseband using a carrier frequency of $f_c = 50$ kHz. Image frequencies are suppressed using a low-pass filter. Afterward, the signals are downsampled to 25 kHz without losing information in the considered bandwidth.
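A sketch of this IQ demodulation chain, assuming NumPy/SciPy; the low-pass cutoff and filter order are illustrative choices rather than the parameters of the actual sensor ASIC.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def iq_demodulate(x_raw, fs=200_000, fc=50_000, fs_iq=25_000):
    """Mix the raw signal into the complex baseband (Eqs. (3)-(5)),
    suppress the image frequencies, and downsample to fs_iq."""
    t = np.arange(len(x_raw)) / fs
    x_iq = x_raw * np.exp(-2j * np.pi * fc * t)      # x_I(t) + j x_Q(t)
    b, a = butter(4, 10_000, btype="low", fs=fs)     # illustrative low-pass filter
    x_iq = filtfilt(b, a, x_iq.real) + 1j * filtfilt(b, a, x_iq.imag)
    return x_iq[:: fs // fs_iq]                      # decimate by 8 to 25 kHz
```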

In the following, the preprocessing steps in Fig. 2(b) are described. First, we cut out the object-related backscatter in $x_{IQ}(t)$ based on the known object distances in our data set. At the sample rate of 25 kHz, a window length of $n_w = 89$ samples has been found to include the entire backscatter of even broad scatterers in the data set. To keep the relation between the echo amplitudes and the object distance when cutting out the windows, the object distance is stored as an additional scalar feature input $d$, defined as the position of the window's origin (cf. Sec. VI). In practice, the origin of the windows could be determined based on a sliding window approach or on conventionally calculated echo points.
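As a small illustration, the window could be cut at the sample index corresponding to the known two-way travel time; the mapping via the speed of sound and the variable names are assumptions made for this sketch, not the authors' exact implementation.

```python
def cut_echo_window(x_iq, distance_m, fs_iq=25_000, n_w=89, c=343.0):
    """Cut the object-related backscatter from x_IQ at a known object
    distance and keep the distance as an additional scalar feature d."""
    start = int(round(2 * distance_m / c * fs_iq))   # window origin from the two-way travel time
    window = x_iq[start:start + n_w]                 # n_w = 89 samples at 25 kHz
    return window, distance_m                        # complex window and scalar feature d
```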

For conventional echo detection and distance calculation, the amplitude envelope $A(t) = |x_{IQ}(t)|$ is usually calculated, discarding the phase information.42 For the classification task, however, we intend to calculate features based on full real-valued time signals, as proposed in a previous study.4 To reconstruct a real-valued time signal with a symmetric frequency spectrum from the complex IQ signals, we first resample $x_{IQ}(t)$ to $f_s = 50$ kHz. The spectrum of the complex baseband is then shifted back relative to the new sample rate by multiplying it with a carrier of $f_r = 12.5$ kHz, and the real-valued time signal is obtained by extracting the real part of the IQ signals:
(6) $x_r(t) = \mathrm{Re}\{x_{IQ}(t)\,e^{j 2\pi f_r t}\}.$
Finally, $x_r(t)$, $x_I(t)$, and $x_Q(t)$ are passed to the feature extraction stage, where different phase- and magnitude-based features are calculated.
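A sketch of this reconstruction step, assuming SciPy's polyphase resampler; it follows Eq. (6) with the stated rates but is not necessarily the authors' implementation.

```python
import numpy as np
from scipy.signal import resample_poly

def reconstruct_real_signal(x_iq, fs_iq=25_000, fs_r=50_000, f_r=12_500):
    """Resample x_IQ to 50 kHz, shift the baseband spectrum by f_r, and take
    the real part to obtain the real-valued time signal x_r(t), cf. Eq. (6)."""
    x_up = resample_poly(x_iq, fs_r // fs_iq, 1)            # complex resampling to 50 kHz
    t = np.arange(len(x_up)) / fs_r
    return np.real(x_up * np.exp(2j * np.pi * f_r * t))     # x_r(t)
```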

In the following, several feature extraction methods are presented, aiming to include target phase information in the classification process. Based on the real-valued time signals $x_r(t)$ and the complex-valued IQ signals $x_{IQ}(t)$, we calculate time domain features on the one hand and time-frequency images on the other. The effectiveness of the features and their combinations as a classifier input is quantified afterward. An overview of the feature extraction is illustrated in Fig. 3.

FIG. 3. Feature extraction steps applied to the preprocessed signals $x_r(t)$, $x_I(t)$, and $x_Q(t)$ to calculate time domain features and time-frequency features using the continuous wavelet transform (CWT).

The first approach is to use the time signals $x_r(t)$ directly as a 1D input to the CNN since the entire information, including the phase, is encoded. However, the CNN is not provided with any prior knowledge about which frequencies are relevant for echo discrimination. An appropriate representation of the signals has to be learned during training, making this approach prone to issues with sparse or imbalanced data. Another time domain approach is to use the envelope $A(t)$ and the IF $f(t)$ as a dual-channel 1D input, which is quite convenient since amplitude and phase can be calculated directly from the complex-valued IQ signals $x_{IQ}(t)$ without reconstructing a real-valued time signal in the preprocessing.

In contrast to the time domain 1D CNN inputs, we calculate time-frequency images using the CWT, as proposed in a previous work.4 Time-frequency images can be considered meaningful features since the importance of spectral information for acoustic echo discrimination has been shown in several studies.10,35,43,44 Time-frequency images are also commonly used as input for CNNs in various other fields of acoustics.45–48 First, the CWT of $x_r(t)$ is calculated as
(7) $W(\tau, a) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x_r(t)\, \psi^{*}\!\left(\frac{t - \tau}{a}\right) \mathrm{d}t$
with the complex-conjugated wavelet function $\psi^{*}$, the wavelet location $\tau$, and the frequency-related wavelet scaling $a$.49,50 Most commonly, only the magnitude of the CWT,
(8) $S_M(\tau, a) = |W(\tau, a)|,$
is considered for signal analysis or further processing in neural networks. In the literature, $S_M(\tau, a)$ is usually called a scalogram.49,50 In this paper, we use the term magnitude scalogram, as opposed to the other scalogram features that are discussed in the following.
The instantaneous phase of the CWT,
(9) $S_{\varphi}(\tau, a) = \arg W(\tau, a),$
will be referred to as the wrapped phase scalogram and is typically defined in $(-\pi, \pi]$. We can expect that $S_{\varphi}(\tau, a)$ performs poorly in classification accuracy because of the discontinuities of the wrapped phase. To eliminate the discontinuities, phase unwrapping39 is performed, yielding the continuous phase scalogram $S_{\varphi,u}(\tau, a)$. As a next step, calculating the temporal phase gradient of $S_{\varphi,u}(\tau, a)$ is beneficial to highlight the target phase since $S_{\varphi,u}(\tau, a)$ is dominated by the constant phase progression of the transmit signal. This gradient can be described as the temporal rate of change of the phase, revealing phase shifts caused by the properties of the scatterer. The gradient is calculated for each of the wavelet scaling channels $a$ of the CWT, yielding the channelized instantaneous frequencies (CIF) as follows:
(10) $S_{\mathrm{CIF}}(\tau, a) = \frac{\partial}{\partial \tau}\, S_{\varphi,u}(\tau, a).$
The concept of the CIF is already applied, e.g., in speech processing, reassignment methods, and phase-aware audio processing.25,30,51,52 In $S_{\mathrm{CIF}}(\tau, a)$, phase shifts in the wavelet scaling channels over time are indicated. A constant CIF is given for scaling bins of sinusoidal content, where the transmit signal is reflected without phase distortion. Highlights in the CIF are caused by phase shifts due to interferences and impulsive signal components. Typically, the highlights in a CIF image appear as adjacent minima and maxima at positions where the magnitude in $S_M(\tau, a)$ is close to zero.53
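A minimal sketch of the scalogram features of Eqs. (7)–(10), using PyWavelets with a complex Morlet wavelet; the wavelet, the scale grid, and the omitted resizing to 64 × 64 pixels are assumptions made for illustration, not necessarily the authors' choices.

```python
import numpy as np
import pywt

def scalogram_features(x_r, fs=50_000, n_scales=64):
    """Magnitude, wrapped phase, continuous phase, and CIF scalograms of x_r."""
    scales = np.geomspace(2, 64, n_scales)                   # frequency-related scales (illustrative)
    coeffs, _ = pywt.cwt(x_r, scales, "cmor1.5-1.0", sampling_period=1 / fs)

    s_m = np.abs(coeffs)                                     # magnitude scalogram S_M, Eq. (8)
    s_phi = np.angle(coeffs)                                 # wrapped phase scalogram S_phi, Eq. (9)
    s_phi_u = np.unwrap(s_phi, axis=1)                       # continuous phase scalogram S_phi,u
    s_cif = np.gradient(s_phi_u, axis=1)                     # CIF scalogram S_CIF, Eq. (10)
    return s_m, s_phi, s_phi_u, s_cif
```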

In Fig. 4, the extracted feature images of the CWT of the echo signal of a pole are shown. The first sub-echo of the pole relates to a horizontal reflection, and the second to a diagonal reflection at the base of the pole. In between, characteristic interference patterns emerge, depending on the object's geometric features. The sub-echoes of the reflected chirp and the interference patterns can be clearly seen in the magnitude scalogram $S_M(\tau, a)$ in Fig. 4(a). Compared to the time domain features, as shown in Fig. 1, the individual sub-echoes can be distinguished more clearly in the magnitude scalogram. While it is hard to identify relevant patterns in the wrapped phase scalogram $S_{\varphi}(\tau, a)$ and the continuous phase scalogram $S_{\varphi,u}(\tau, a)$ in Figs. 4(b) and 4(c), respectively, object-related patterns become evident in the phase gradients of the CIF scalogram $S_{\mathrm{CIF}}(\tau, a)$ in Fig. 4(d). Highlights in $S_{\mathrm{CIF}}(\tau, a)$ appear where the magnitude in $S_M(\tau, a)$ is close to zero, emphasizing the exact positions of destructive interference in time and frequency. In regions of noise, the notches in $S_{\mathrm{CIF}}(\tau, a)$ appear randomly distributed. The real scalogram $S_{\mathrm{Re}}(\tau, a)$ and the imaginary scalogram $S_{\mathrm{Im}}(\tau, a)$ in Figs. 4(e) and 4(f) are dominated by a striped appearance relating to the trigonometric properties of the complex quantities. In these representations, object-related patterns are largely obscured.

FIG. 4. (Color online) Scalogram feature images of a pole's backscatter. The magnitude scalogram in (a) shows two sub-echoes of the chirp and characteristic interference patterns. The wrapped phase scalogram in (b) is hard to interpret due to the phase discontinuities. Also, in the continuous phase scalogram in (c), the target phase is not evident. In the CIF scalogram in (d), object-related phase patterns are revealed. In the real and imaginary scalograms in (e) and (f), object-related patterns are not clearly visible.

An overview of the described feature representations and the combinations that will be fed into the CNN is given in Table I. Each row represents a set of features that is quantified in terms of classification accuracy. The first dimension of the tensor shape is defined by the number of channels, relating to the number of input features that are considered. The 2D scalogram features have an image size of 64 × 64 pixels, and the 1D time domain features have a length of 178 samples.

TABLE I. Naming and tensor shape of feature representations and combinations.

Name | Features | Tensor shape
SM | Magnitude scalogram $S_M(\tau, a)$ [Eq. (8)] | (1, 64, 64)
SP | Phase scalogram $S_{\varphi}(\tau, a)$ [Eq. (9)] | (1, 64, 64)
SCP | Continuous phase scalogram $S_{\varphi,u}(\tau, a)$ | (1, 64, 64)
SCIF | CIF scalogram $S_{\mathrm{CIF}}(\tau, a)$ [Eq. (10)] | (1, 64, 64)
SMCIF | Magnitude scalogram $S_M(\tau, a)$, CIF scalogram $S_{\mathrm{CIF}}(\tau, a)$ | (2, 64, 64)
SRI | Real-part scalogram $S_{\mathrm{Re}}(\tau, a)$, imaginary-part scalogram $S_{\mathrm{Im}}(\tau, a)$ | (2, 64, 64)
TS | Time signal $x_r(t)$ | (1, 178)
E | Envelope $A(t)$ [Eq. (1)] | (1, 178)
IF | Instantaneous frequency $f(t)$ [Eq. (2)] | (1, 178)
EIF | Envelope $A(t)$, IF $f(t)$ | (2, 178)

Based on the data set, which contains a ground truth label with the object class and the object distance for each sample, we perform supervised learning54 and use a CNN for the classification task. Compared with classical neural networks, CNNs have the advantage of shared weights in the convolutional layers. This results in a reduced number of trainable parameters, allowing efficient processing of high-dimensional input data, such as time signals or images. Further, CNNs provide translation invariance due to the combination of convolutional and pooling layers. CNNs are successfully applied in many acoustic applications dealing with acoustic signals or time-frequency images.46,47,55,56

We use real-valued CNNs to perform the classification task. Studies also describe complex-valued neural networks (CVNNs) capable of directly processing complex-valued inputs.57 However, CVNNs are still in an early research phase and are not included in common deep learning libraries. Issues with CVNNs involve non-differentiable activation functions, weight initialization, and regularization.57,58 For that reason, and because we do not aim to address the evaluation of CVNNs, we stick to real-valued CNNs in this study.

We have adapted the CNN architecture that has been proposed in a previous work4 for processing time-frequency image inputs of shape 64 × 64. The architecture is shown in Table II. For each convolutional layer, batch normalization59 and the rectified linear activation function (ReLU)60 are used. The input layer is adapted to $N$ input channels, where $N$ is the number of input feature images that are stacked before being fed into the network. In the first convolutional layer, 16 kernels of shape 7 × 7 are applied to extract low-level features. Average pooling is then used to reduce the dimensionality. The kernels in the subsequent two convolutional layers are of shape 1 × 5 and 5 × 1, respectively, to extract temporal and spectral features separately. Subsequently, two stacks of a convolutional layer with 64 kernels of shape 3 × 3 and average pooling are employed to extract high-level features. The extracted feature maps are flattened and concatenated with the scalar distance feature $d$. Finally, a fully connected layer of 256 neurons with the ReLU activation function and a fully connected layer of seven neurons with the softmax activation function61 are used to map the flattened features to class probabilities.

TABLE II. 2D CNN architecture.

Layer | Output dimension
Input layer | N × 64 × 64
2D convolutional layer (7 × 7) | 16 × 64 × 32
Average pooling (2 × 2) | 16 × 32 × 16
2D convolutional layer (1 × 5) | 32 × 32 × 16
2D convolutional layer (5 × 1) | 32 × 32 × 16
Average pooling (2 × 2) | 32 × 16 × 8
2D convolutional layer (3 × 3) | 64 × 16 × 8
Average pooling (2 × 2) | 64 × 8 × 4
2D convolutional layer (3 × 3) | 64 × 8 × 4
Average pooling (2 × 2) | 64 × 4 × 2
Flatten | 512
Concatenate (+ distance) | 513
Fully connected layer | 256
Fully connected layer | 7
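The following PyTorch sketch reflects our reading of Table II; the first convolution uses a stride of 2 along the width to reproduce the listed output dimensions, and the softmax is left to the loss function. It is a sketch of the described architecture, not the authors' original code.

```python
import torch
import torch.nn as nn

class UltrasonicCNN(nn.Module):
    """2D CNN of Table II: convolution + batch normalization + ReLU blocks,
    average pooling, and two fully connected layers with a distance input."""

    def __init__(self, in_channels=1, n_classes=7):
        super().__init__()

        def block(c_in, c_out, kernel, stride=1, padding=0):
            return nn.Sequential(nn.Conv2d(c_in, c_out, kernel, stride, padding),
                                 nn.BatchNorm2d(c_out), nn.ReLU())

        self.features = nn.Sequential(
            block(in_channels, 16, 7, stride=(1, 2), padding=3),  # N x 64 x 64 -> 16 x 64 x 32
            nn.AvgPool2d(2),                                      # -> 16 x 32 x 16
            block(16, 32, (1, 5), padding=(0, 2)),                # -> 32 x 32 x 16
            block(32, 32, (5, 1), padding=(2, 0)),                # -> 32 x 32 x 16
            nn.AvgPool2d(2),                                      # -> 32 x 16 x 8
            block(32, 64, 3, padding=1),                          # -> 64 x 16 x 8
            nn.AvgPool2d(2),                                      # -> 64 x 8 x 4
            block(64, 64, 3, padding=1),                          # -> 64 x 8 x 4
            nn.AvgPool2d(2),                                      # -> 64 x 4 x 2
        )
        self.classifier = nn.Sequential(nn.Linear(512 + 1, 256), nn.ReLU(),
                                        nn.Linear(256, n_classes))

    def forward(self, x, distance):
        z = self.features(x).flatten(1)                    # (batch, 512)
        z = torch.cat([z, distance.unsqueeze(1)], dim=1)   # concatenate the scalar distance d
        return self.classifier(z)                          # class logits
```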
The described CNN architecture is designed to process 2D inputs by applying 2D convolutional kernels. To process the 1D input features TS, E, IF, and EIF, another CNN architecture is needed. However, we want the CNN architectures to be as similar as possible so that we can focus on the performance of the different input feature types instead of evaluating different architectures. Therefore, we add a 1D head that processes the 1D input and converts it to a format that can be forwarded to the 2D CNN architecture. The 1D head is shown in Table III. A 1D convolutional layer with 64 kernels is used as a first layer to extract low-level features. We use a kernel size of $k = 7$ and zero padding of $p = \lfloor k/2 \rfloor = 3$. To approach the width of the scalogram images, $w_S = 64$, a stride of $s = \lceil 178 / w_S \rceil = 3$ is used, producing a feature vector width of
(11) $w = \left\lfloor \frac{178 + 2p - k}{s} \right\rfloor + 1 = 60.$
As a next step, the feature vectors are reshaped by inserting a dimension, so the output can be interpreted as a single-channel 2D image feature map. Then we use the 2D CNN architecture for further processing. In the pooling layers, one-sided padding is applied for odd input lengths, so the feature map shapes of the scalogram inputs and the 1D inputs are identical once the third pooling layer is passed. When using EIF of shape 2 × 178 as input, two 1D CNN heads are used in parallel to obtain one feature image each for the envelope and the IF representation. An overview of the CNN architecture with 1D or 2D input features is given in Fig. 5. For convenience, the combination of the 1D head and the 2D CNN is referred to as the 1D CNN in the following.
TABLE III. 1D CNN head with a 1D convolutional layer and dimension insertion.

Layer | Output dimensions
Input layer | N × 178
1D convolutional layer (1 × 7) | 64 × 60
Add dimension | 1 × 64 × 60
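A matching sketch of the 1D head in Table III; whether batch normalization and ReLU are applied in the head is an assumption carried over from the 2D layers.

```python
import torch
import torch.nn as nn

class TimeDomainHead(nn.Module):
    """1D head of Table III: one 1D convolution per time domain channel,
    each reshaped into a single-channel 2D feature map of size 64 x 60."""

    def __init__(self, n_inputs=1):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, 64, kernel_size=7, stride=3, padding=3),
                          nn.BatchNorm1d(64), nn.ReLU())
            for _ in range(n_inputs)])

    def forward(self, x):                              # x: (batch, n_inputs, 178)
        maps = [head(x[:, i:i + 1, :]).unsqueeze(1)    # (batch, 1, 64, 60) each
                for i, head in enumerate(self.heads)]
        return torch.cat(maps, dim=1)                  # (batch, n_inputs, 64, 60)
```

With the EIF input of shape (2, 178), two heads run in parallel and the resulting (2, 64, 60) map would be passed to the 2D CNN; the one-sided pooling padding mentioned above, which aligns the 60-wide maps with the scalogram path, is omitted in this sketch.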
FIG. 5. (Color online) Proposed CNN architecture for (a) time domain input or (b) scalogram input with a 1D convolution layer (blue), 2D convolution layers (red), and average pooling (green).
Before feeding the features into the CNN, the z-normalization62
(12) $\tilde{x} = \frac{x - \mu}{\sigma}$
is applied, with the mean $\mu$ and standard deviation $\sigma$ calculated over the entire training set.
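A one-function sketch of this normalization, using statistics computed on the training set only, as stated above; the array names are placeholders.

```python
import numpy as np

def z_normalize(train, test):
    """z-normalization of Eq. (12) with mean and standard deviation of the training set."""
    mu, sigma = train.mean(), train.std()
    return (train - mu) / sigma, (test - mu) / sigma
```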
For the training of the CNN, we use stochastic gradient descent (SGD) as the optimizer and the cross-entropy loss function63
(13) $H(p, q) = -\sum_{y} p(y)\,\log q(y)$
with the labels $y$, the target distribution $p(y)$, and the estimated distribution $q(y)$. We train ten models for each training configuration with different random seeds and calculate the mean accuracy to deal with the stochastic behavior of the CNN training. For the evaluation of the classification results, it should be considered that the data set consists of slightly different class sizes, which would result in biased accuracies. To compensate for the different class sizes, we calculate the balanced accuracy
(14) $\mathrm{ACC}_{\mathrm{bal}} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c}$
with the true positive predictions $TP_c$ and the false negative predictions $FN_c$ for each of the $C = 7$ classes.
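The balanced accuracy of Eq. (14) amounts to the mean per-class recall; a small NumPy sketch is given below. The training itself would use SGD with the cross-entropy loss (e.g., torch.optim.SGD and torch.nn.CrossEntropyLoss); the learning rate and other hyperparameters are not specified in the text and are therefore not fixed here.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred, n_classes=7):
    """Balanced accuracy of Eq. (14): mean of TP_c / (TP_c + FN_c) over all classes."""
    recalls = []
    for c in range(n_classes):
        mask = (y_true == c)
        if mask.any():                                 # skip classes absent from y_true
            recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))
```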

The accuracies on the test sets for the features defined in Table I, using lab and field data, are shown in Table IV. In Figs. 6 and 7, boxplots are shown for the lab and field data results, respectively. Generally, lower accuracies are achieved for the field data than for the lab data due to lower SNRs in the field data, mainly caused by asphalt clutter. The discussion of the results is structured as follows: magnitude-only and phase-only features are evaluated in Sec. VII A, and features including both magnitude and phase information are evaluated in Sec. VII B.

TABLE IV. Mean accuracy ± standard deviation in percent on the test set of the lab and field data. The feature type indicates whether magnitude-only (M), phase-only (P), or both magnitude and phase information (B) is encoded. The number of input channels refers to the number of feature representations that are stacked. The highest accuracy per environment is marked in bold.

Input feature | Type | Channels | Architecture | Lab data | Field data
SM | M | 1 | 2D CNN | 85.87 ± 0.48 | 65.48 ± 0.52
SP | P | 1 | 2D CNN | 75.61 ± 0.56 | 56.66 ± 0.98
SCP | P | 1 | 2D CNN | 80.42 ± 0.47 | 57.99 ± 0.76
SCIF | P | 1 | 2D CNN | 82.16 ± 0.49 | 61.23 ± 0.62
SMCIF | B | 2 | 2D CNN | **86.64 ± 0.37** | **66.91 ± 0.42**
SRI | B | 2 | 2D CNN | 85.11 ± 0.56 | 66.07 ± 0.52
TS | B | 1 | 1D CNN | 85.43 ± 0.79 | 66.78 ± 0.56
E | M | 1 | 1D CNN | 78.65 ± 0.66 | 57.66 ± 0.70
IF | P | 1 | 1D CNN | 74.93 ± 0.50 | 52.16 ± 0.60
EIF | B | 2 | 1D CNN | 84.06 ± 0.68 | 63.98 ± 0.46
FIG. 6. (Color online) Boxplots for the accuracies of magnitude-only features (purple), phase-only features (green), and magnitude and phase features (blue) for the lab data. The diamond symbols indicate outliers.
FIG. 7. (Color online) Boxplots for the accuracies of magnitude-only features (purple), phase-only features (green), and magnitude and phase features (blue) for the field data. The diamond symbols indicate outliers.

If the features in Figs. 6 and 7 were sorted from lowest to highest accuracy, the order would be the same for lab and field data, so many conclusions can be drawn equally for both environments. The best accuracy is achieved using SM. Comparing the phase scalograms, SP leads to the lowest accuracy, while the best accuracy is achieved using the CIF scalogram. Thus, it can be confirmed that unwrapping the phase discontinuities and calculating the phase gradients is effective in highlighting the target phase and increases classification accuracy. As expected, the revealed phase shifts in SCIF can be considered a meaningful feature for object discrimination, producing only about 4% lower accuracy than the magnitudes in SM. For the time domain features, IF produces about 4% lower accuracy than E using lab data and about 6% lower accuracy using field data.

It is concluded that, for both time domain and time-frequency features, more of the information relevant for object discrimination is included in the magnitudes of the echo signals. However, phase-only features also produce considerable classification accuracy when the phase is preprocessed properly. It is confirmed that the target phase contains relevant properties of the scatterer. As the best accuracy among the magnitude-only and phase-only features is achieved with SM, we use SM as the baseline in the following, where the effectiveness of using both magnitude and phase information as a feature input is evaluated.

The feature inputs including both magnitude and phase information (SMCIF, SRI, EIF, and TS) in Figs. 6 and 7 are marked in blue. Comparing these features, the best accuracies are achieved with SMCIF. It is concluded that calculating the phase gradients over the frequency channels of the CWT beneficially extracts target phase information, while SRI, TS, and EIF produce lower accuracies. Relevant patterns are less evident in the striped appearance of SRI, resulting in lower accuracies than for SMCIF and TS. When using time-frequency features, the magnitudes and phase derivatives should be preferred to the real and imaginary parts since higher accuracies are achieved with SMCIF than SRI. The lowest accuracy of features including magnitude and phase is achieved with EIF. Thus, reconstructing a real-valued time signal from the IQ signals is recommended. Using TS as a direct input to the CNN yields about 1% less accuracy than SMCIF for the lab data. For the field data, almost the same accuracy is achieved for TS and SMCIF. The 1D CNN head seems to be capable of learning features that are robust against noise, benefiting from encoded phase information in the time signal.

Overall, the highest accuracies are achieved using SMCIF. Using SMCIF, the accuracy is increased by about 0.8% (lab) and 1.4% (field) compared to SM. A high amount of redundant information in the magnitude and phase scalograms is revealed, as reasonable results have also been achieved with SM and SCIF alone, and the combination only slightly improves the accuracies. We would argue that the improvements are achieved by adding focus to the nulls in the interference patterns, which relate to the geometric properties of the scatterers.

When comparing the accuracies of TS with E (–6.8% and –9.1% for lab and field data, respectively) and IF (–10.5% and –14.6%), it can be seen that including both magnitude and phase information is more crucial for time domain features than for scalogram images. Generally, adding phase information is more effective for the field data than for the lab data, highlighting the importance of phase information for noisy signals and high clutter environments. Except for EIF, all phase-aware features outperform SM in the field environment.

In this study, we have examined the relevance of target phase for classifying in-air targets using an automotive ultrasonic sensor. The distribution of sub-echoes in an overall backscattering and interference patterns caused by overlapping sub-echoes have been considered relevant features for target discrimination. In the phase of an echo signal, phase shifts emerge at the origin and the end of sub-echoes. Further, phase shifts are caused by interference when sub-echoes are overlapping. The target phase, relating to the geometric properties of a scatterer, is revealed by calculating the phase gradient of the unwrapped phase, which is defined as the IF of the echo signal.

We used a data set of 498 300 measurements including lab and field data as well as stationary and dynamic scenes. For the training of a CNN, time signals or time-frequency images can be used as an input. The CWT has been applied to extract time-frequency images of the echo signals. Based on the complex-valued CWT, magnitude and phase features have been calculated. The real and imaginary parts of the CWT have also been considered as a phase-encoded input feature. For the time domain features, the time signal as a direct CNN input, the amplitude envelope, and the IF have been considered. A 1D CNN head has been used to transform the time domain inputs to 2D feature images, allowing similar CNN architectures for the 1D and 2D features. The same 2D CNN architecture has been used to process the time-frequency images and the outputs of the 1D CNN head.

SCIF led to the best classification accuracies among the phase-only features, producing only about 4% less accuracy than SM. Unwrapping and calculating the phase gradients should be performed to preprocess the raw phase. A high amount of redundancy in the magnitude and phase scalograms has been noticed, as SMCIF has only led to small accuracy improvements compared to SM. When dealing with time domain features, it is more crucial to include phase information, as the phase-encoded time signal led to significantly higher classification accuracies than the amplitude envelope. Overall, jointly feeding magnitude and phase features to the classifier helps to add robustness, especially when dealing with low SNRs, e.g., in high clutter environments.

When using CWT-based input features, calculating the magnitude and phase of the CWT should be preferred to using the real and imaginary parts. For the field data, comparable results have been achieved when using SMCIF and TS, showing that the 1D CNN head is capable of extracting noise-robust features. When using TS as a direct input, calculating a time-frequency transform is not needed. EIF, which can be calculated directly from the IQ signal, produced significantly lower accuracies than TS or phase-aware scalogram features. Thus, reconstructing a real-valued time signal from the IQ signals is recommended.

Future work could include evaluating CVNNs to directly feed the complex-valued CWT features or even the IQ signals into the classifier. Regarding the potential of using raw time signals as a CNN input, the classification accuracies should be compared to time-frequency images using larger data sets and optimized hyperparameters. Further, consideration could be given to using both the raw time signal and time-frequency features as an input to the classifier. The phase of echo signals could also be considered as a feature for classifying ground types, as our results showed different performances of adding phase information in the lab and field environment, where the main difference is the amount of ground clutter.

We would like to thank Professor Dr. Andreas Koch (Stuttgart Media University) for supporting our work. Further, we would like to thank the team of the Institute for Applied Artificial Intelligence at Stuttgart Media University for the inspiring discussions in the “Journal Club.” We would also like to express our thanks to the group for computational methods at the Chair of Vibroacoustics of Vehicles and Machines (TUM) for the valuable exchange at the regular meetings.

The authors have no conflicts to disclose.

The data that support the findings of this study are available from Robert Bosch GmbH. Restrictions apply to the availability of these data, which were used under license for this study. Data are available from the authors upon reasonable request and with the permission of Robert Bosch GmbH.

1. M. Noll and P. Rapps, "Ultrasonic sensors for a K44DAS," in Handbook of Driver Assistance Systems, edited by H. Winner, S. Hakuli, F. Lotz, and C. Singer (Springer, Cham, Switzerland, 2016), pp. 303–323.
2. A. Pandharipande, C.-H. Cheng, J. Dauwels, S. Z. Gurbuz, J. Ibanez-Guzman, G. Li, A. Piazzoni, P. Wang, and A. Santra, "Sensing and machine learning for automotive perception: A review," IEEE Sens. J. 23, 11097–11115 (2023).
3. P. Kroh, R. Simon, and S. Rupitsch, "Classification of sonar targets in air: A neural network approach," Sensors 19, 1176–1193 (2019).
4. J. Eisele, A. Gerlach, M. Maeder, and S. Marburg, "Convolutional neural network with data augmentation for object classification in automotive ultrasonic sensing," J. Acoust. Soc. Am. 153, 2447–2459 (2023).
5. M. Pöpperl, R. Gulagundi, S. Yogamani, and S. Milz, "Capsule neural network based height classification using low-cost automotive ultrasonic sensors," in Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France (June 9–12, 2019) (IEEE, New York, 2019), pp. 661–666.
6. A. Bystrov, E. Hoare, T.-Y. Tran, N. Clarke, M. Gashinova, and M. Cherniakov, "Road surface classification using automotive ultrasonic sensor," Procedia Eng. 168, 19–22 (2016).
7. N. Riopelle, P. Caspers, and D. Sofge, "Terrain classification for autonomous vehicles using bat-inspired echolocation," in Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil (July 8–13, 2018) (IEEE, New York, 2018), pp. 1–6.
8. M. Kalliris, S. Kanarachos, R. Kotsakis, O. Haas, and M. Blundell, "Machine learning algorithms for wet road surface detection using acoustic measurements," in Proceedings of the 2019 IEEE International Conference on Mechatronics (ICM), Ilmenau, Germany (March 18–20, 2019) (IEEE, New York, 2019), pp. 265–270.
9. C. Barat and N. Ait Oufroukh, "Classification of indoor environment using only one ultrasonic sensor," in IMTC 2001: Proceedings of the 18th IEEE Instrumentation and Measurement Technology Conference: Rediscovering Measurement in the Age of Informatics (Cat. No. 01CH37188), Budapest, Hungary (May 21–23, 2001) (IEEE, New York, 2001), pp. 1750–1755.
10. Sonia, A. M. Tripathi, R. D. Baruah, and S. B. Nair, "Ultrasonic sensor-based human detector using one-class classifiers," in Proceedings of the 2015 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS), Douai, France (December 1–3, 2015) (IEEE, New York, 2015), pp. 1–6.
11. A. M. Sabatini, "A digital-signal-processing technique for ultrasonic signal modeling and classification," IEEE Trans. Instrum. Meas. 50, 15–21 (2001).
12. I. E. Dror, M. Zagaeski, and C. F. Moss, "Three-dimensional target recognition via sonar: A neural network model," Neural Netw. 8, 149–160 (1995).
13. M. I. Ecemis and P. Gaudiano, "Object recognition with ultrasonic sensors," in Proceedings of the 1999 IEEE International Symposium on Computational Intelligence in Robotics and Automation: CIRA'99 (Cat. No. 99EX375), Monterey, CA (November 8–9, 1999) (IEEE, New York, 1999), pp. 250–255.
14. S. A. Bouhamed, I. K. Kallel, and D. S. Masmoudi, "Stair case detection and recognition using ultrasonic signal," in Proceedings of the 2013 36th International Conference on Telecommunications and Signal Processing (TSP), Rome, Italy (July 2–4, 2013) (IEEE, New York, 2013), pp. 672–676.
15. B. Zhu, T. Geng, G. Jiang, Z. Guan, Y. Li, and X. Yun, "Surrounding object material detection and identification method for robots based on ultrasonic echo signals," Appl. Bionics Biomech. 2023, 1998218 (2023).
16. J. Yong, Y. Chen, Y. Zhang, B. Jia, and G. Li, "Envelope phase shift feature extraction of underwater target echo," J. Phys. Conf. Ser. 1438, 012004 (2020).
17. J. Yong, Y. Chen, B. Jia, and Y. Zhang, "Simulation of phase characteristics of underwater target acoustic scattering," in Proceedings of the 2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), Chongqing, China (December 11–13, 2019) (IEEE, New York, 2019), pp. 1–5.
18. P. Atkins, A. Islas, and K. G. Foote, "Sonar target-phase measurement and effects of transducer-matching," J. Acoust. Soc. Am. 123, 3949 (2008).
19. F. G. Mitri, J. F. Greenleaf, Z. E. A. Fellah, and M. Fatemi, "Investigating the absolute phase information in acoustic wave resonance scattering," Ultrasonics 48, 209–219 (2008).
20. Z. Lian and T. Wu, "Feature extraction of underwater acoustic target signals using Gammatone filterbank and subband instantaneous frequency," in Proceedings of the 2022 IEEE 6th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Beijing, China (October 3–5, 2022) (IEEE, New York, 2022), pp. 944–949.
21. J. A. Simmons, "Evidence for perception of fine echo delay and phase by the FM bat, Eptesicus fuscus," J. Comp. Physiol. A 172, 533–547 (1993).
22. S. Schörnich and L. Wiegrebe, "Phase sensitivity in bat sonar revisited," J. Comp. Physiol. A 194, 61–67 (2008).
23. J. J. Finneran, R. Jones, R. A. Guazzo, M. G. Strahan, J. Mulsow, D. S. Houser, B. K. Branstetter, and P. W. Moore, "Dolphin echo-delay resolution measured with a jittered-echo paradigm," J. Acoust. Soc. Am. 148, 374–388 (2020).
24. V. Gnann and M. Spiertz, "Transient detection with absolute discrete group delay," in Proceedings of the 2009 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Kanazawa, Japan (January 7–9, 2009) (IEEE, New York, 2009), pp. 311–314.
25. S. Hidaka, K. Wakamiya, and T. Kaburagi, "An investigation of the effectiveness of phase for audio classification," in ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore (May 23–27, 2022) (IEEE, New York, 2022), pp. 3708–3712.
26. I. Paraskevas and E. Chilton, "Combination of magnitude and phase statistical features for audio classification," Acoust. Res. Lett. Online 5, 111–117 (2004).
27. L. Navarro, G. Courbebaisse, and J.-C. Pinoli, "Continuous frequency and phase spectrograms: A study of their 2D and 3D capabilities and application to musical signal analysis," J. Zhejiang Univ. Sci. A 9, 199–206 (2008).
28. I. Paraskevas and M. Rangoussi, "Feature extraction for audio classification of gunshots using the Hartley transform," Open J. Acoust. 2, 131–142 (2012).
29. L. Guo, L. Wang, J. Dang, L. Zhang, H. Guan, and X. Li, "Speech emotion recognition by combining amplitude and phase information using convolutional neural network," in Proceedings of INTERSPEECH 2018, Hyderabad, India (September 2–6, 2018) (International Speech Communication Association, Baixas, France, 2018), pp. 1611–1615.
30. N. Zheng and X.-L. Zhang, "Phase-aware speech enhancement based on deep neural networks," IEEE/ACM Trans. Audio Speech Lang. Process. 27, 63–76 (2019).
31. H.-S. Choi, J.-H. Kim, J. Huh, A. Kim, J.-W. Ha, and K. Lee, "Phase-aware speech enhancement with deep complex U-Net," in Proceedings of the International Conference on Learning Representations, New Orleans, LA (May 6–9, 2019) (International Conference on Learning Representations, Appleton, WI, 2019), pp. 20–40.
32. G. Shi, M. M. Shanechi, and P. Aarabi, "On the importance of phase in human speech recognition," IEEE Trans. Audio Speech Lang. Process. 14, 1867–1874 (2006).
33. N. Sturmel and L. Daudet, "Signal reconstruction from STFT magnitude: A state of the art," in Proceedings of the 14th International Conference on Digital Audio Effects (DAFx-11), Paris, France (September 19–23, 2011) (IRCAM, Institut de Recherche et Coordination Acoustique/Musique, Paris, 2011), pp. 375–386.
34. D. W. Ricker, "Spread scattering and propagation," in Echo Signal Processing (Springer, Boston, 2003), pp. 319–405.
35. J. A. Simmons, D. Houser, and L. Kloepper, "Localization and classification of targets by echolocating bats and dolphins," in Springer Handbook of Auditory Research, edited by A. Surlykke, P. E. Nachtigall, R. R. Fay, and A. N. Popper (Springer, New York, 2014), pp. 169–193.
36. D. W. Ricker, "Waveforms," in Echo Signal Processing (Springer, Boston, 2003), pp. 225–317.
37. U. Ingård and R. H. Bolt, "A free field method of measuring the absorption coefficient of acoustic materials," J. Acoust. Soc. Am. 23, 509–516 (1951).
38. W. G. Neubauer, "Theory and demonstration of creeping waves," in Acoustic Reflection of Surfaces and Shapes (Naval Research Laboratory, Washington, DC, 1986), pp. 35–52.
39. A. V. Oppenheim, R. W. Schafer, and J. R. Buck, "Transform analysis of linear time-invariant systems," in Discrete-Time Signal Processing (Prentice Hall, Upper Saddle River, NJ, 1999), pp. 240–339.
40. J. Kirkhorn, Introduction to IQ-Demodulation of RF-Data (Norwegian University of Science and Technology, Trondheim, Norway, 1999).
41. J. Speidel, "Transmission system with quadrature amplitude modulation," in Introduction to Digital Communications (Springer, Cham, Switzerland, 2021), pp. 3–15.
42. B. Barshan, "Fast processing techniques for accurate ultrasonic range measurements," Meas. Sci. Technol. 11, 45–50 (2000).
43. D. Neupane and J. Seok, "A review on deep learning-based approaches for automatic sonar target recognition," Electronics 9, 1972 (2020).
44. C. Ming and J. A. Simmons, "Target geometry estimation using deep neural networks in sonar sensing," arXiv:2203.15770v1 (2022).
45. X. Xia, R. Togneri, F. Sohel, Y. Zhao, and D. Huang, "A survey: Neural network-based deep learning for acoustic event detection," Circuits Syst. Signal Process. 38, 3433–3453 (2019).
46. H. Purwins, B. Li, T. Virtanen, J. Schluter, S.-Y. Chang, and T. Sainath, "Deep learning for audio signal processing," IEEE J. Sel. Top. Signal Process. 13, 206–219 (2019).
47. M. J. Bianco, P. Gerstoft, J. Traer, E. Ozanich, M. A. Roch, S. Gannot, and C.-A. Deledalle, "Machine learning in acoustics: Theory and applications," J. Acoust. Soc. Am. 146, 3590–3628 (2019).
48. K. B. T. Shaikh, N. P. Jawarkar, and V. Ahmed, "Machine diagnosis using acoustic analysis: A review," in Proceedings of the 2021 IEEE Conference on Norbert Wiener in the 21st Century (21CW), Chennai, India (July 22–25, 2021) (IEEE, New York, 2021), pp. 1–6.
49. V. C. Chen and H. Ling, "Time-frequency transforms," in Time-Frequency Transforms for Radar Imaging and Signal Analysis (Artech House, Boston, 2002), pp. 25–46.
50. B. Boashash, "Advanced time-frequency signal and system analysis," in Time-Frequency Signal Analysis and Processing (Elsevier, Amsterdam, 2015), pp. 141–236.
51. D. J. Nelson, "Cross-spectral methods for processing speech," J. Acoust. Soc. Am. 110, 2575–2592 (2001).
52. D. Haider, P. Balazs, and N. Holighaus, "Phase-based signal representations for scattering," in Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland (August 23–27, 2021) (IEEE, New York, 2021), pp. 6–10.
53. P. Balazs, D. Bayer, F. Jaillet, and P. Søndergaard, "The pole behavior of the phase derivative of the short-time Fourier transform," Appl. Comput. Harmon. Anal. 40, 610–621 (2016).
54. K. P. Murphy, "Supervised learning," in Machine Learning: A Probabilistic Perspective (MIT, Cambridge, MA, 2012), pp. 3–9.
55. G. Peeters and G. Richard, "Deep learning for audio and music," in Multi-Faceted Deep Learning, edited by J. Benois-Pineau and A. Zemmari (Springer, Cham, Switzerland, 2021), pp. 231–266.
56. I. Goodfellow, Y. Bengio, and A. Courville, "Convolutional networks," in Deep Learning (MIT, Cambridge, MA, 2016), pp. 330–372.
57. A. Hirose, Complex-Valued Neural Networks (Springer, Berlin, 2012).
58. J. Bassey, L. Qian, and X. Li, "A survey of complex-valued neural networks," arXiv:2101.12249 (2021).
59. S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167 (2015).
60. I. Goodfellow, Y. Bengio, and A. Courville, "Deep feedforward networks," in Deep Learning (MIT, Cambridge, MA, 2016), pp. 168–227.
61. I. Goodfellow, Y. Bengio, and A. Courville, "Numerical computation," in Deep Learning (MIT, Cambridge, MA, 2016), pp. 80–97.
62. E. Alpaydin, "Multivariate methods," in Introduction to Machine Learning (MIT, Cambridge, MA, 2014), pp. 93–114.
63. C. M. Bishop, "Linear models for classification," in Pattern Recognition and Machine Learning (Springer, New York, 2006), pp. 179–224.