A paper by the current authors, Paul and Nelson [JASA Express Lett. **3**(9), 094802 (2023)], showed how the singular value decomposition (SVD) of the matrix of real weights in a neural network could be used to prune the network during training. The paper presented here shows that a similar approach can be used to reduce the training time and increase the implementation efficiency of complex-valued neural networks. Such networks have potential advantages compared to their real-valued counterparts, especially when the complex representation of the data is important, which is often the case in acoustic signal processing. In comparing the performance of networks having both real and complex elements, it is demonstrated that there are some advantages to the use of complex networks in the cases considered. The paper includes a derivation of the backpropagation algorithm, in matrix form, for training a complex-valued multilayer perceptron with an arbitrary number of layers. The matrix-based analysis enables the application of the SVD to the complex weight matrices in the network. The SVD-based pruning technique is applied to the problem of the classification of transient acoustic signals. It is shown how training times can be reduced, and implementation efficiency increased, while ensuring that such signals can be classified with remarkable accuracy.

## I. INTRODUCTION

The purpose of this paper is to show how complex-valued neural networks (CVNNs) can be efficiently designed by using a novel algorithm based on the singular value decomposition (SVD) of the weight matrices in the network. There has been growing interest in CVNNs, with recent reviews (Bassey et al., 2021; Lee et al., 2022) showing the variety of tasks to which CVNNs have already been applied. Although the theory underlying the backpropagation algorithm using complex-valued data was first discussed in the early 1990s (Benvenuto and Piazza, 1992; Georgiou and Koutsougeras, 1992; Leung and Haykin, 1991), CVNNs have recently received increasing interest, with some of the most popular applications being MRI fingerprinting (Virtue et al., 2017), wireless communications (Marseet and Sahin, 2017), and image processing (Cao et al., 2019; Popa, 2017). Some work has been undertaken in the audio signal processing field, typically in the context of speech processing (Hayakawa et al., 2018; Lee et al., 2017) and source localization (Paul and Nelson, 2022; Tsuzuki et al., 2013).

One of the main advantages of CVNNs, as noted by Hirose (2009), is that they can treat the real and imaginary parts of a number as a single component, thus keeping the relation between the magnitude and phase at a given frequency. This property is especially useful for applications involving the processing of acoustic signals, where frequency-domain data are often used as input features and where the real and imaginary parts are statistically dependent upon one another. This topic has been discussed in detail in Hirose (2011), where the author presented an analysis showing that a real-valued network, in which the real and imaginary parts are concatenated into a single vector, is not equivalent to using the complex-valued number directly. While adding the real and imaginary parts separately is the same as adding the complex numbers, multiplication is different, since the multiplication of complex numbers introduces a phase rotation and an amplitude attenuation/amplification. More about the convergence and merits of CVNNs can be found in Hirose (2009), Nitta (2003), and Zhang (2014).

In addition to the properties of CVNNs described above, complex-valued activation functions and learning rates can be used to enhance training, as discussed in Bassey et al. (2021), Scardapane et al. (2020), and Zhang and Mandic (2015). Despite their potential benefits, CVNNs have not been extensively used in acoustic signal processing, as shown in Bassey et al. (2021). One possible reason for this lack of popularity could be the increased computational cost incurred during training, since, as discussed below, the gradients of complex-valued functions require the computation of more terms than those of real-valued functions.

The work in this paper presents a technique to remove some of the computational cost during training, ideally without losing any performance. This is known as network pruning and has been intensively researched (Augasta and Kathirvalavakumar, 2013; Blalock et al., 2020; Choudhary et al., 2020), given that computational power has become a concern when employing high dimensional networks with a substantial number of neurons. The use of the SVD as a low-rank approximation technique for removing training parameters is only one of the available pruning techniques, and Bermeitinger et al. (2019) discussed in detail its use on machine learning models. The use of the SVD in network models can be linked to an early investigation by Psichogios and Ungar (1994), where the SVD was employed to reduce overfitting and enhance generalization. After the training was completed, Psichogios and Ungar (1994) discarded redundant singular values along with their corresponding hidden layer nodes.

With a focus on acoustics, Cai et al. (2014) and Xue et al. (2013) used SVD-based approaches to reduce the training parameters of feed-forward networks. In Xue et al. (2013), the authors replaced the matrix of weights **W** by two smaller matrices, computed once during training from the SVD of **W**. They evaluated the pruning technique on a large-vocabulary continuous speech recognition (LVCSR) task and showed a reduction of model size by 73% with less than 1% relative accuracy loss. In the work by Cai et al. (2014), the authors argued against the direct use of the SVD on the randomly initialized weight matrices, showing that the pruning performance can be improved if the SVD is applied only after the model has trained for a few iterations. In both of these examples, only real-valued networks (RVNNs) were considered. In a recent study, Singh and Plumbley (2022) investigated the use of a different pruning technique on a convolutional neural network trained for acoustic scene classification. The authors remove filters with similar content, assuming that such filters yield similar responses and are thus redundant for the overall training process.

The technique presented here follows that presented previously (Paul and Nelson, 2023) for efficiently designing real-valued multilayer perceptrons (MLPs). In this work, it is shown that such an approach can be successfully applied to complex-valued networks, even though here all operations are in the complex domain. Compared to other SVD-based pruning techniques (e.g., Psichogios and Ungar, 1994; Yang et al., 2020), this approach discards weights as the learning progresses and does not need a full training of the model before reducing its dimensions. Other approaches (e.g., Cai et al., 2014; Xue et al., 2013) apply the SVD once at the beginning of or during the training and do not change the dimension of the hidden layer when singular values are discarded. In the technique presented here, it is shown that removing singular values at several consecutive points during the training is beneficial. Furthermore, the resizing of the hidden layer becomes an adaptive process that depends on the task and network structure, as will be discussed later.

The example used here consists of an MLP with two layers (a hidden layer and an output layer), and through the iterative discarding of singular values during training, it is shown that a network can be designed such that it can be implemented with high computational efficiency. The problem of classifying the complex spectra associated with some model transient signals is used to illustrate the method. The transient signals used are the impulse responses of bandpass filters of the type used in the authors' previous work (Paul and Nelson, 2021b) on the classification of acoustic power spectra. By analogy with the previous work, the use of such models also enables a good estimate of the accuracy with which very similar transient signals can be classified. It is demonstrated that, using an appropriately trained network, two transient signals that can barely be distinguished using conventional spectral analysis can be classified accurately from single time history samples. The work presented also establishes the limits to classification accuracy determined by the level of noise added to the signals. The theoretical background will first be presented and the equations governing the behavior of the network will be derived from first principles.

## II. THE COMPLEX-VALUED MLP (cMLP)

### A. Background

The derivation of the complex-valued backpropagation algorithm requires an understanding of the theoretical basis for the use of complex numbers and their derivatives. The analysis presented here is based on the work of Kreutz-Delgado (2009) and Amin et al. (2011), both of which draw on the fundamental work of Wirtinger (1927). A detailed discussion of complex numbers and their use in signal processing applications can be found in Adali et al. (2011). That paper also helpfully summarizes Wirtinger calculus and its derivative identities, and deals with other issues, such as the treatment of proper and circular complex numbers and how these can enhance the performance of algorithms such as independent component analysis for source separation.

For a complex scalar function $h(\mathbf{g})$, where $\mathbf{g}$ is a complex vector function of $\mathbf{z}$, the partial derivatives with respect to the complex vector $\mathbf{z}$ and its complex conjugate $\mathbf{z}^{*}$ are given by the chain rules

$$\frac{\partial h}{\partial \mathbf{z}} = \frac{\partial h}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{z}} + \frac{\partial h}{\partial \mathbf{g}^{*}}\frac{\partial \mathbf{g}^{*}}{\partial \mathbf{z}}, \qquad \frac{\partial h}{\partial \mathbf{z}^{*}} = \frac{\partial h}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{z}^{*}} + \frac{\partial h}{\partial \mathbf{g}^{*}}\frac{\partial \mathbf{g}^{*}}{\partial \mathbf{z}^{*}}.$$

For a real-valued $h$, the direction of steepest ascent, and hence the gradient of $h(\mathbf{g})$, is given by the following:

$$\nabla_{\mathbf{z}} h = 2\,\frac{\partial h}{\partial \mathbf{z}^{*}}.$$
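These chain rules are straightforward to check numerically. The sketch below (plain NumPy, illustrative and not part of the original MATLAB implementation) computes the Wirtinger derivatives by central finite differences for $h(z) = zz^{*}$, for which $\partial h/\partial z = z^{*}$ and $\partial h/\partial z^{*} = z$.

```python
import numpy as np

def wirtinger_derivs(h, z, eps=1e-6):
    """Central-difference Wirtinger derivatives of a scalar function h at z:
    d/dz = 0.5*(d/dx - j*d/dy) and d/dz* = 0.5*(d/dx + j*d/dy)."""
    dhdx = (h(z + eps) - h(z - eps)) / (2 * eps)
    dhdy = (h(z + 1j * eps) - h(z - 1j * eps)) / (2 * eps)
    return 0.5 * (dhdx - 1j * dhdy), 0.5 * (dhdx + 1j * dhdy)

z0 = 1.0 + 2.0j
h = lambda z: (z * np.conj(z)).real            # h(z) = |z|^2, real valued
d_dz, d_dzc = wirtinger_derivs(h, z0)
assert np.allclose(d_dz, np.conj(z0), atol=1e-5)   # dh/dz  = z*
assert np.allclose(d_dzc, z0, atol=1e-5)           # dh/dz* = z
```

The same finite-difference construction is used again later as an independent check on the analytic gradients.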

### B. Forward propagation in the cMLP

The neurons in the $l$th layer (where $l = 1, 2$ in this case) produce a vector of complex outputs $\mathbf{z}^{(l)}$ that are related to the vector of complex inputs $\mathbf{a}^{(l)}$ by a complex activation function $h(\mathbf{a}^{(l)})$, the forward propagation being given by

$$\mathbf{a}^{(l)} = \mathbf{W}^{(l)}\mathbf{z}^{(l+1)} + \mathbf{b}^{(l)}, \qquad \mathbf{z}^{(l)} = h\big(\mathbf{a}^{(l)}\big),$$

where $\mathbf{W}^{(l)}$ is the matrix of complex weights in the $l$th layer. Note that the layers are counted from the output backward, such that $l = 1$ defines the output layer. The vector $\mathbf{b}^{(l)}$ is the complex bias associated with the neurons in the $l$th layer. Since the task to be solved here involves a classification problem, the cross-entropy function extended for complex numbers can be defined (Cao et al., 2019) as follows:

$$E = -\sum_{k=1}^{K}\Big[\Re(y_k)\ln\Re(\hat{y}_k) + \Im(y_k)\ln\Im(\hat{y}_k)\Big],$$

where the sum is over the $K$ classes, with $k$th estimated output $\hat{y}_k$ and target output $y_k$. Note that $\hat{y}_k = z_k^{(1)}$. This expression reduces to the cross-entropy function discussed in Paul and Nelson (2023) for the real-valued MLP case. For a classification task using complex-valued outputs, the target outputs $y_k$ were defined as one-hot encoded vectors, where for the correct class, the target output was defined as $1 + 1j$, which is a straightforward transformation from the real case. This way, a phase term is enforced when estimating the output, which could improve the convergence due to the additional constraint. It should be noted that, in dealing with classification tasks, the activation function in the output layer is usually a softmax function. One of the existing approaches that extend the softmax function to complex-valued data can be defined (Cao et al., 2019) as follows:

$$\hat{y}_k = \operatorname{softmax}\!\big(\Re\big(a_k^{(1)}\big)\big) + j\,\operatorname{softmax}\!\big(\Im\big(a_k^{(1)}\big)\big),$$

where the softmax function is applied separately to the real and imaginary parts of the output activations and $j$ is the imaginary unit.
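As an illustration of the forward propagation, the split softmax, and the complex cross-entropy described above, the following NumPy sketch builds a toy two-layer cMLP. All dimensions and the placeholder activation are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_softmax(a):
    """Softmax applied separately to the real and imaginary parts (Cao et al., 2019)."""
    sm = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()
    return sm(a.real) + 1j * sm(a.imag)

def complex_cross_entropy(y_hat, y):
    """Cross-entropy extended to complex outputs, summing real- and imaginary-part terms."""
    return -np.sum(y.real * np.log(y_hat.real) + y.imag * np.log(y_hat.imag))

# Toy two-layer cMLP (layer l = 2 is the hidden layer, l = 1 the output layer).
n_in, n_hidden, n_out = 8, 4, 3
cx = lambda *s: rng.standard_normal(s) + 1j * rng.standard_normal(s)
x, W2, b2 = cx(n_in), cx(n_hidden, n_in), np.zeros(n_hidden, complex)
W1, b1 = cx(n_out, n_hidden), np.zeros(n_out, complex)

h = lambda a: np.where(a.real > 0, a, 0)   # placeholder activation (stand-in for the cardioid)
z2 = h(W2 @ x + b2)                        # z^(2) = h(a^(2))
y_hat = split_softmax(W1 @ z2 + b1)        # z^(1), the complex softmax output

y = np.zeros(n_out, complex)
y[0] = 1 + 1j                              # one-hot target for the correct class
loss = complex_cross_entropy(y_hat, y)
assert np.isclose(y_hat.real.sum(), 1.0) and np.isclose(y_hat.imag.sum(), 1.0)
assert loss > 0
```

Note that the real and imaginary parts of the output each sum to one, so the $1 + 1j$ target enforces the phase constraint discussed above.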

### C. Backpropagation in the cMLP

Although the work presented here will focus on the use of a two-layer MLP (where $l = 1, 2$), it is simple enough to present the derivation of the backpropagation algorithm for the general case of a number of layers (where $l = 1, 2, 3, \ldots, l_{\max}$). Thus, assume that the MLP consists of a number of layers of neurons where the variables are designated starting with the output layer, the layers being numbered from $l = 1$ at the output to $l = l_{\max}$ at the input.

The gradients with respect to each weight matrix $\mathbf{W}^{(l)}$ can be evaluated by using the vec operator that sequentially orders the columns of a matrix into a single vector. This gives composite vectors of weights $\mathbf{w}^{(l)} = \operatorname{vec}(\mathbf{W}^{(l)})$. First, the gradient of the network outputs with respect to the weight vector $\mathbf{w}^{(l)}$ is computed, and then one can work sequentially through the other layers. When dealing with complex-valued networks, it follows from the Wirtinger calculus (Amin et al., 2011) that the gradient of the loss function with respect to the weights $\mathbf{w}^{(l)}$ can be written as the product

$$\nabla_{\mathbf{w}^{(l)}} E = \left(\frac{\partial \mathbf{z}^{(1)}}{\partial \mathbf{w}^{(l)}}\right)^{\mathrm{H}}\mathbf{d},$$

where the exact form of the error vector $\mathbf{d}$ will depend on whether the network is aimed at either a regression or a classification task. The composite matrix on the right side of the above equation can be written by using the identity in Eq. (4) above such that

$$\frac{\partial \mathbf{z}^{(1)}}{\partial \mathbf{w}^{(l)}} = \frac{\partial \mathbf{z}^{(1)}}{\partial \mathbf{a}^{(1)}}\,\mathbf{W}^{(1)}\,\frac{\partial \mathbf{z}^{(2)}}{\partial \mathbf{w}^{(l)}},$$

and one can then expand the gradient of $\mathbf{z}^{(2)}$ with respect to $\mathbf{w}^{(l)}$ by using identical reasoning. It then follows that the chain of derivatives can be continued through the layers down to the layer $l_{\max}$, whose input $\mathbf{z}^{(l_{\max}+1)} = \mathbf{x}$ comprises the input to the network. The gradient of the activations at any single layer with respect to its own weight vector can be written using the composite matrix notation as

$$\frac{\partial \mathbf{a}^{(l)}}{\partial \mathbf{w}^{(l)}} = \mathbf{z}^{(l+1)\mathrm{T}} \otimes \mathbf{I}^{(l)},$$

where $\mathbf{I}^{(l)}$ is the identity matrix of dimensions equal to the length of the vector $\mathbf{a}^{(l)}$. It therefore follows that the gradient of the network output with respect to the weights in the $l$th layer is given by the product of composite matrices

$$\frac{\partial \mathbf{z}^{(1)}}{\partial \mathbf{w}^{(l)}} = \left[\prod_{i=1}^{l-1} \frac{\partial \mathbf{z}^{(i)}}{\partial \mathbf{a}^{(i)}}\,\mathbf{W}^{(i)}\right] \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{a}^{(l)}}\left(\mathbf{z}^{(l+1)\mathrm{T}} \otimes \mathbf{I}^{(l)}\right).$$

For a non-holomorphic activation function, the derivatives $\partial \mathbf{z}^{(i)}/\partial \mathbf{a}^{(i)}$ must be interpreted in the Wirtinger sense, the conjugate terms following from the chain rules given in Sec. II A. Defining the error propagated back to the $l$th layer recursively as $\boldsymbol{\delta}^{(1)} = (\partial \mathbf{z}^{(1)}/\partial \mathbf{a}^{(1)})^{\mathrm{H}}\mathbf{d}$ and $\boldsymbol{\delta}^{(l)} = (\partial \mathbf{z}^{(l)}/\partial \mathbf{a}^{(l)})^{\mathrm{H}}\,\mathbf{W}^{(l-1)\mathrm{H}}\boldsymbol{\delta}^{(l-1)}$ for $l > 1$, the gradient of the loss with respect to the $l$th weight vector $\mathbf{w}^{(l)}$ can be written as

$$\nabla_{\mathbf{w}^{(l)}} E = \left(\mathbf{z}^{(l+1)*} \otimes \mathbf{I}^{(l)}\right)\boldsymbol{\delta}^{(l)},$$

which can be reshaped to give the gradient with respect to the weight matrix $\mathbf{W}^{(l)}$ as

$$\nabla_{\mathbf{W}^{(l)}} E = \boldsymbol{\delta}^{(l)}\,\mathbf{z}^{(l+1)\mathrm{H}}. \tag{20}$$

The gradient with respect to the bias $\mathbf{b}^{(l)}$ is identical to Eq. (20), but with the term $\mathbf{z}^{(l+1)\mathrm{H}}$ omitted. Note that if $l = l_{\max}$, then $\mathbf{z}^{(l+1)\mathrm{H}} = \mathbf{x}^{\mathrm{H}}$. In the work that follows, the above equations were used as the basis of code written in MATLAB, the matrix formulation enabling a clear understanding of the performance of the algorithms.
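The $\boldsymbol{\delta}^{(l)}\mathbf{z}^{(l+1)\mathrm{H}}$ form of the weight gradient can be verified numerically for a single linear layer. The sketch below assumes a squared-error loss (an illustrative choice, not the paper's cross-entropy) and compares the analytic gradient with a finite-difference Wirtinger gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5
W = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
t = rng.standard_normal(m) + 1j * rng.standard_normal(m)

def E(Wm):
    """Real-valued squared-error loss of a single linear layer a = W z."""
    r = Wm @ z - t
    return float(np.sum(r * np.conj(r)).real)

# Analytic gradient in the form of Eq. (20): grad_W E = delta z^H, delta = a - t.
delta = W @ z - t
G = np.outer(delta, np.conj(z))

# Numerical Wirtinger gradient dE/dW* = 0.5 * (dE/dRe(W) + j * dE/dIm(W)).
eps = 1e-6
G_num = np.zeros((m, n), complex)
for i in range(m):
    for j in range(n):
        P = np.zeros((m, n)); P[i, j] = eps
        dE_re = (E(W + P) - E(W - P)) / (2 * eps)
        dE_im = (E(W + 1j * P) - E(W - 1j * P)) / (2 * eps)
        G_num[i, j] = 0.5 * (dE_re + 1j * dE_im)

assert np.allclose(G, G_num, atol=1e-4)
```

Such a finite-difference check is a useful sanity test when implementing the full backpropagation recursion.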

### D. SVD-based pruning with reduction in hidden layer dimensions

The iterations at which singular values are discarded during the training are spaced logarithmically according to

$$d(n) = 10^{\,a + (n-1)(b-a)/(N-1)}, \qquad n = 1, 2, \ldots, N, \tag{21}$$

where $N$ is the total number of discarding points, $n$ is the iteration index, and $a$ and $b$ define, respectively, the lower and higher bounds of the sequence of discarding points. If the lower bound is defined to be 3, for example, the value of $a$ in the equation above would be $a = \log_{10}(3)$. The value of the index $\tau$ at which discarding takes place during training is given by the nearest integer value of $d(n)$.
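Assuming Eq. (21) describes logarithmically spaced points rounded to the nearest iteration, the schedule can be computed as in the following NumPy sketch (the function name is illustrative):

```python
import numpy as np

def discard_points(N, lower, upper):
    """Logarithmically spaced discarding iterations,
    d(n) = 10**(a + (n - 1) * (b - a) / (N - 1)) with a = log10(lower),
    b = log10(upper), rounded to the nearest integer."""
    a, b = np.log10(lower), np.log10(upper)
    return np.rint(10 ** (a + np.arange(N) * (b - a) / (N - 1))).astype(int)

# Example from Sec. III: three points, from iteration 3 to 150 // 4 = 37.
tau = discard_points(3, 3, 37)
assert tau[0] == 3 and tau[-1] == 37 and np.all(np.diff(tau) > 0)
```

The logarithmic spacing concentrates the discarding points early in the training, when the weight matrices change most rapidly.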

Assume that at the $n$th discarding point, before removing any singular values, the SVD of $\mathbf{W}_{n-1}^{(2)}$ is given by $\mathbf{W}_{n-1}^{(2)} = \mathbf{U}_{n-1}\boldsymbol{\Sigma}_{n-1}\mathbf{V}_{n-1}^{\mathrm{H}}$. After small singular values have been removed, the truncated SVD matrices are denoted by $\mathbf{U}_{n}$, $\boldsymbol{\Sigma}_{n}$, and $\mathbf{V}_{n}^{\mathrm{H}}$, respectively. Following the same notation, at the $n$th discarding point the hidden layer $\mathbf{a}_{n-1}^{(2)}$ can be multiplied by $\mathbf{U}_{n}^{\mathrm{H}}$ every time singular values are discarded, and the new hidden layer with fewer neurons is denoted as $\mathbf{a}_{n}^{(2)}$. The forward propagation for the hidden layer then becomes $\mathbf{U}_{n}^{\mathrm{H}}\mathbf{a}_{n-1}^{(2)} = (\mathbf{U}_{n}^{\mathrm{H}}\mathbf{U}_{n}\boldsymbol{\Sigma}_{n}\mathbf{V}_{n}^{\mathrm{H}})\mathbf{x} + \mathbf{U}_{n}^{\mathrm{H}}\mathbf{b}_{n-1}^{(2)}$. Since $\mathbf{U}_{n}^{\mathrm{H}}\mathbf{U}_{n} = \mathbf{I}$, the identity matrix, and writing the product of the remaining matrices as $\boldsymbol{\Sigma}_{n}\mathbf{V}_{n}^{\mathrm{H}} = \mathbf{W}_{n}^{(2)}$, the forward propagation for the cMLP-SVD is given by the following:

$$\mathbf{a}_{n}^{(2)} = \mathbf{W}_{n}^{(2)}\mathbf{x} + \mathbf{b}_{n}^{(2)},$$

where $\mathbf{b}_{n}^{(2)} = \mathbf{U}_{n}^{\mathrm{H}}\mathbf{b}_{n-1}^{(2)}$ and the number of neurons in the resized hidden layer is equal to the number of singular values retained in $\boldsymbol{\Sigma}_{n}$. Note that since the number of neurons in $\mathbf{z}_{n}^{(2)} = h(\mathbf{a}_{n}^{(2)})$ can change after every discarded singular value, the dimensions of $\mathbf{W}_{n}^{(1)}$ also have to be adapted. This can be done by removing from $\mathbf{W}_{n}^{(1)}$ as many trailing columns as singular values were removed at that stage. It has been found that this process yields good results; however, it is possible that other approaches might prove to be more effective. Finally, since the forward propagation of the cMLP-SVD is identical to that of the cMLP, the gradient equations are also identical to Eq. (20), with the observation that the dimensions of the gradients will depend on the number of singular values discarded.
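The discarding step itself can be sketched as follows (NumPy, illustrative names; a synthetic low-rank matrix stands in for a partially trained $\mathbf{W}^{(2)}$). The final assertion checks the equivalence derived above, namely that the resized layer reproduces $\mathbf{U}_{n}^{\mathrm{H}}$ times the truncated original propagation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_hidden, n_in, n_out = 50, 257, 5

# A roughly rank-5 matrix stands in for a partially trained W^(2).
cx = lambda *s: rng.standard_normal(s) + 1j * rng.standard_normal(s)
W2 = cx(n_hidden, 5) @ cx(5, n_in) + 0.01 * cx(n_hidden, n_in)
b2, W1, x = cx(n_hidden), cx(n_out, n_hidden), cx(n_in)

def svd_discard(W, b, threshold=0.2):
    """One discarding step: drop singular values below threshold * s_max,
    returning the resized W_n = Sigma_n V_n^H, bias b_n = U_n^H b, and U_n."""
    U, s, Vh = np.linalg.svd(W, full_matrices=False)
    keep = s >= threshold * s[0]          # s is sorted in descending order
    U, s, Vh = U[:, keep], s[keep], Vh[keep, :]
    return np.diag(s) @ Vh, np.conj(U.T) @ b, U

W2_new, b2_new, U = svd_discard(W2, b2)
r = W2_new.shape[0]        # retained singular values = new hidden layer size
W1_new = W1[:, :r]         # trim trailing columns of the next layer's weights

# The resized layer reproduces U^H times the truncated original propagation.
assert r < n_hidden
assert np.allclose(W2_new @ x + b2_new, np.conj(U.T) @ ((U @ W2_new) @ x + b2))
```

In the adaptive scheme described above, this step is simply repeated at each discarding iteration $\tau$.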

## III. RESULTS

### A. Description of classification task

Previous work by the current authors (Paul and Nelson, 2021b, 2023) explored the ability of real-valued MLPs to discern acoustic spectra that may be challenging to differentiate using traditional power spectral analysis. Here, however, the objective is to evaluate the extent to which complex MLPs, once trained, can identify small differences between transient acoustic signals. The model used here is based on that used in the previous paper (Paul and Nelson, 2021b). In this case, the transient signals considered are based on the impulse responses of the bandpass filters used in the previous work. A unit impulse signal is passed through bandpass filters having different center frequencies and bandwidths in order to generate impulse responses of very similar bandpass filters. Once the impulse responses had been generated, different white noise signals having a Gaussian distribution were added to the signals at a given signal-to-noise ratio (SNR) in order to investigate the limits of the ability of the cMLP to discriminate between small changes in the complex spectra associated with the model transient acoustic signals.

The impulse responses were generated using a filter bandwidth of 200 Hz and a difference in center frequencies of 20 Hz. The network models were trained for two different classification tasks. For the first task, five different classes of impulse responses were generated using center frequencies between 900 and 980 Hz. For the second task, the number of classes was increased from five to ten, the center frequencies being between 800 and 980 Hz. The reason for generating two datasets is that the structure of the network changes, the number of output neurons increasing from five to ten. The change of structure is expected to influence the number of discarded singular values during training. The length of the impulse responses was defined to be 512 samples. The signal was long enough to capture the ringing of the response for these particular bandpass filters, which decays by more than 60 dB over this duration. The time histories, around 20 ms long at a sampling frequency of $f_s = 24$ kHz, were then transformed into the frequency domain. For a fast Fourier transform (FFT) length ($N$) of 512 samples, the spectral resolution is given by the fraction $f_s/N$, which for the cases presented here was around 47 Hz. In other words, signals generated from bandpass filters with a difference in center frequency smaller than 47 Hz will have closely related spectra, and their differences will be difficult to detect by inspection only, especially if white noise is added. Figure 2 shows a comparison between the moduli of the complex spectra of two impulse responses with 10 dB SNR added white noise, where in the upper plot the bandpass filters have a difference in center frequencies of $\Delta f = 20$ Hz, while in the lower plot $\Delta f = 50$ Hz. The training dataset consisted of 500 signals in each class, where each class contained the impulse response of one bandpass filter with added white noise snippets having the same duration as the impulse response. The input into the network was the complex FFT of the noisy impulse response, as shown in Fig. 2. The size of the input layer was 257 samples, corresponding to the positive frequencies in the FFT spectrum.
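A minimal sketch of this data generation is given below, using SciPy Butterworth bandpass filters as a stand-in for the paper's filters (the filter order and the exact noise-scaling convention are assumptions):

```python
import numpy as np
from scipy.signal import butter, lfilter

fs, n_samp = 24000, 512
rng = np.random.default_rng(3)

def noisy_impulse_response(fc, bw=200.0, snr_db=10.0, order=4):
    """Impulse response of a Butterworth bandpass filter centred at fc (Hz),
    with Gaussian white noise added at the given SNR; returns the
    positive-frequency FFT bins used as the network input."""
    b, a = butter(order, [fc - bw / 2, fc + bw / 2], btype='bandpass', fs=fs)
    impulse = np.zeros(n_samp)
    impulse[0] = 1.0
    h = lfilter(b, a, impulse)
    noise = rng.standard_normal(n_samp)
    noise *= np.sqrt(np.mean(h ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return np.fft.rfft(h + noise)      # complex spectrum, 512 // 2 + 1 = 257 bins

# Five classes with centre frequencies between 900 and 980 Hz, 20 Hz apart.
X = np.stack([noisy_impulse_response(fc) for fc in range(900, 1000, 20)])
assert X.shape == (5, 257) and np.iscomplexobj(X)
```

Each row of `X` is one complex 257-bin input vector of the kind fed to the cMLP.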

### B. Network parameters

The 500 samples in each class were split into 60% training, 20% validation, and 20% test datasets, and a batch size of 32 was used during the training. Both network architectures were trained using the Adam optimizer (Kingma and Ba, 2014), where the only difference from the real-valued approach is that here the gradients are complex. A real-valued learning rate of 0.002 was chosen, since with a complex-valued learning rate, the training was observed to be less smooth. The use of a complex-valued learning rate is worthy of further investigation, as discussed by Zhang and Mandic (2015). The network architectures had one hidden layer with 50 neurons and an output layer with five or ten neurons corresponding to the different classes. The activation function used in the hidden layer was the complex cardioid function (Virtue et al., 2017), which reduces to the rectified linear unit (ReLU) when its argument is real. The activation function in the output layer was the softmax function applied individually to the real and imaginary values of the estimated output. Since the estimated output is complex valued, the classification accuracy is computed by comparing the moduli of the $K$ elements in the estimated vector $\hat{\mathbf{y}}$ with those in the target vector **y**. If, for example, the maximum of the moduli of $\hat{\mathbf{y}}$ is at index $k = 1$ and the correct label $1 + 1j$ in the target output vector is also at $k = 1$, the classification is considered to be correct.
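The complex cardioid attenuates each input according to its phase while leaving the phase itself unchanged, scaling $z$ by $\tfrac{1}{2}(1 + \cos\angle z)$. The short NumPy sketch below checks the reduction to the ReLU for real inputs:

```python
import numpy as np

def cardioid(z):
    """Complex cardioid activation (Virtue et al., 2017): scales each input
    by 0.5 * (1 + cos(angle(z))), preserving its phase."""
    return 0.5 * (1 + np.cos(np.angle(z))) * z

# For real-valued inputs the cardioid reduces to the ReLU...
x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
assert np.allclose(cardioid(x), np.maximum(x, 0.0))

# ...while a purely imaginary input is attenuated by half.
assert np.allclose(cardioid(np.array([2.0j])), [1.0j])
```
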

The training was stopped after 150 iterations. The results shown below were computed by averaging the performance of ten trials. The results were averaged, since network weights are initialized with random numbers and therefore the performance can differ slightly between trials. For the cMLP-SVD approach, a discarding threshold of 0.2 was chosen empirically, which means that all singular values with magnitude smaller than 20% of the largest singular value were removed at each discarding point. Based on Eq. (21), three discarding points were defined starting with iteration 3 and stopping at one-fourth of the total number of iterations.

The performance of the cMLP and cMLP-SVD models is compared to that of their real-valued counterparts proposed by Paul and Nelson (2023), denoted here as rMLP and rMLP-SVD. For the RVNNs, the real and imaginary parts of the FFT spectrum are concatenated into a one-dimensional vector. The hidden layer of the RVNNs is therefore also doubled to account for the concatenation and the output layers are kept the same for both network types. The discarding threshold for the rMLP-SVD was set to 0.2, keeping the threshold the same for both RVNNs and CVNNs.

The pruning approaches are further compared with a benchmark method adapted from Han et al. (2015) (denoted here as cMLP pruned), which is one of the most popular magnitude-based pruning techniques. The method prunes weights in an unstructured manner by replacing them with zeros if their magnitude is smaller than a threshold. For the complex weights used here, the modulus of each weight was compared with the threshold. Using this approach, the structure of the network is not changed; however, the weight matrices become sparse and need less storage and fewer computations. By setting weights to zero, the model ignores certain connections during training and can focus on the more important connections. To enable a fair comparison with the SVD approaches, the percentage of weights set to zero is the same as the percentage of singular values removed by the end of training. The pruning is performed once, at the same iteration as the third (final) discarding point in the SVD method. This way, the pruned models have a significant number of iterations to fine-tune the weights and to converge. All the network models were trained on a MacBook Pro with an M3 Max chip with a 14-core CPU and a 30-core GPU.
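A minimal version of this magnitude-based pruning step might look as follows (illustrative NumPy sketch; Han et al. (2015) describe a training procedure built around such a step, not this exact function):

```python
import numpy as np

def magnitude_prune(W, frac):
    """Unstructured pruning in the style of Han et al. (2015): zero the
    fraction `frac` of weights with the smallest modulus; the matrix
    dimensions (the network structure) are left unchanged."""
    k = int(frac * W.size)
    if k == 0:
        return W.copy()
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    Wp = W.copy()
    Wp[np.abs(Wp) <= thresh] = 0
    return Wp

rng = np.random.default_rng(4)
W = rng.standard_normal((50, 257)) + 1j * rng.standard_normal((50, 257))
Wp = magnitude_prune(W, 0.9)
assert np.mean(Wp == 0) >= 0.9    # at least 90% of connections removed
assert Wp.shape == W.shape        # structure unchanged, matrix merely sparse
```

In contrast to the SVD approach, the pruned matrix keeps its original dimensions, so savings in time and FLOPs require sparse matrix arithmetic, as noted in Sec. IV.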

### C. Comparison of performance between the cMLP and cMLP-SVD

Table I shows the test accuracy for the first dataset, together with the averaged training time for the full training and the number of floating point operations (FLOPs) needed to compute a forward propagation after the network models finished training. The number of FLOPs can be computed for multiplications and additions of matrices and vectors by evaluating their dimensions (Golub and Van Loan, 2013). A complex-valued operation requires more FLOPs than a real operation. For example, a multiplication of two complex numbers needs four real multiplications and two additions. Note that the number of FLOPs is a rough estimate of the number of operations needed by the network model, since the operations required, for example by the activation functions, are not included.
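Under the convention stated above (four real multiplications and two real additions per complex multiplication, two real additions per complex addition), a rough count for the two matrix-vector products of a forward pass can be sketched as follows; this is an estimate only, and is not intended to reproduce the exact totals in the tables:

```python
def complex_matvec_flops(m, n):
    """Rough FLOP count for a complex matrix-vector product with an m x n
    matrix: each complex multiplication costs 4 real multiplications and
    2 real additions, and each complex addition costs 2 real additions."""
    return 6 * m * n + 2 * m * (n - 1)

def cmlp_forward_flops(n_in, n_hidden, n_out):
    """Forward pass through the two weight matrices of a two-layer cMLP
    (bias additions and activation functions are not counted)."""
    return complex_matvec_flops(n_hidden, n_in) + complex_matvec_flops(n_out, n_hidden)

full = cmlp_forward_flops(257, 50, 5)      # hidden layer of 50 neurons
pruned = cmlp_forward_flops(257, 5, 5)     # hidden layer pruned to 5 neurons
assert pruned < full / 9                   # an order-of-magnitude reduction
```

Even this crude count shows why shrinking the hidden layer from 50 to 5 neurons reduces the forward-pass cost by roughly a factor of ten.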

|  | Accuracy (%) |  |  | Training time (s) |  |  | Remaining neurons |  |  | Number of FLOPs |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SNR (dB) | 1 | 5 | 10 | 1 | 5 | 10 | 1 | 5 | 10 | 1 | 5 | 10 |
| cMLP | 71 | 93 | 100 | 25 | 25 | 25 | 50 | 50 | 50 | 104 910 | 104 910 | 104 910 |
| cMLP-SVD | 75 | 94 | 100 | 7 | 7 | 5 | 5 | 5 | 4 | 10 500 | 10 500 | 8402 |
| cMLP pruned | 74 | 84 | 92 | 26 | 25 | 25 | 50 | 50 | 50 | 10 800 | 10 800 | 8664 |
| rMLP | 66 | 88 | 100 | 18 | 19 | 17 | 100 | 100 | 100 | 103 905 | 103 905 | 103 905 |
| rMLP-SVD | 67 | 85 | 88 | 11 | 10 | 9 | 14 | 5 | 5 | 14 551 | 5200 | 5200 |


All network architectures are able to classify a single time history corresponding to a noisy impulse response with an accuracy of 66% or above, depending on the SNR value. The CVNNs perform better on average than their real-valued counterparts, and the reasons for this behavior will be discussed in Sec. IV. The training time of the cMLP models is slightly higher than that of the rMLPs due to the complex multiplications in the forward and backward propagation. The SVD-based networks finish training with a similar number of neurons in the hidden layer, even though the rMLP-SVD model starts with twice the number of neurons in the hidden layer.

The proposed pruning technique outperforms the benchmark method adapted from Han et al. (2015) in all three cases. The main reason for this is that the proposed SVD-based pruning technique is more robust to the different datasets and training procedures. This will be discussed further in Sec. IV. Interestingly, as the SNR value becomes higher and the classification task becomes easier to solve, the SVD models discard more neurons in the hidden layer by the end of training. This automatically leads to a reduced training time. Due to the removal of neurons in the hidden layer, the number of FLOPs for a forward propagation of the trained model is also drastically reduced compared to the basic MLP networks. The rMLP-SVD approach needs the smallest number of FLOPs overall; however, its performance is lower on average and the rMLP-SVD models are less robust to the pruning of neurons, as will be discussed in Sec. IV.

Figure 3 shows the behavior of the validation accuracy of the five models for the middle task (SNR 5 dB). The performance is also compared to a cMLP network that starts training with five neurons in the hidden layer, in order to investigate the need for the pruning algorithm (see the curve labeled cMLP5 in Fig. 3). The discarding points where the network changes its structure can be seen at the iterations where the validation accuracy drops significantly. Note that every time singular values are discarded, a new weight matrix with different values is computed.

With the new weight matrices and hidden layer dimensions, the networks learn quickly and the validation accuracy recovers within a few iterations for the tasks investigated here. The benchmark method is not able to fully recover after a large number of weights are set to zero; therefore, on average over ten trials, it achieves a lower accuracy than the cMLP-SVD method. The validation accuracy shows a couple of interesting behaviors. First, overfitting occurs for both rMLP and cMLP models. This is shown by the validation accuracy, which for the rMLP achieves its maximum at an early stage and starts decreasing as training progresses. The cMLP model is able to reduce this effect and can generalize better. A similar observation was made, for example, by Grinstein and Naylor (2022). A second important observation is that, due to the pruning of the models, both rMLP-SVD and cMLP-SVD models can enhance performance and reduce overfitting. This observation has been made in other work, such as Shmalo (2023), and has been discussed in detail in Hirose (2009). This behavior is discussed briefly in Sec. IV. A final important factor to note is that training the cMLP directly with five neurons in the hidden layer leads to a slightly lower validation and test accuracy on average compared to the cMLP-SVD. While the difference in performance is not significant, the advantage of using the SVD-based pruning method is that it offers a more robust training behavior on average and gives the user an estimate of the number of neurons that are needed by the model during training. Averaged over the three scenarios discussed in Table I, the cMLP-SVD had a 2%–3% better accuracy than the network models that started with only four or five neurons in the hidden layer.

Moving on to the second scenario, where the number of output classes is increased to ten, Table II shows the performance of the five network models. In this case, the CVNNs have a similar performance to the RVNNs, but do not outperform them as clearly as in the previous scenario. However, the same pattern occurs, with the SVD-based approaches outperforming the regular cMLP and rMLP models. For the SNR = 1 dB case, the benchmark method outperforms the proposed SVD-based approach, suggesting that the cMLP-SVD model was not able to learn the right patterns after singular values had been discarded. For the other two datasets, the benchmark method shows less robustness to the pruning of weight connections and achieves on average a lower performance.

|  | Accuracy (%) |  |  | Training time (s) |  |  | Remaining neurons |  |  | Number of FLOPs |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SNR (dB) | 1 | 5 | 10 | 1 | 5 | 10 | 1 | 5 | 10 | 1 | 5 | 10 |
| cMLP | 66 | 89 | 100 | 57 | 57 | 56 | 50 | 50 | 50 | 106 920 | 106 920 | 106 920 |
| cMLP-SVD | 71 | 94 | 100 | 23 | 21 | 13 | 9 | 7 | 3 | 19 262 | 14 986 | 6434 |
| cMLP pruned | 77 | 91 | 92 | 57 | 55 | 58 | 50 | 50 | 50 | 19 344 | 15 072 | 6528 |
| rMLP | 65 | 91 | 100 | 36 | 36 | 36 | 100 | 100 | 100 | 104 910 | 104 910 | 104 910 |
| rMLP-SVD | 74 | 93 | 86 | 21 | 20 | 18 | 11 | 8 | 7 | 11 549 | 8402 | 7353 |


The training time is drastically reduced due to the discarding of singular values. This is shown in Fig. 4, where the number of singular values remaining after every discarding point is compared for the two network models on the same task of classifying five or ten impulse responses with a 1 dB SNR. Increasing the number of output classes leads to a faster reduction of neurons in the hidden layer as training progresses but does not necessarily lead to a bigger reduction in training time. As shown in Fig. 4, the SVD models with ten output classes discard more neurons at the first and, in some cases, the second discarding point, but fewer at the third discarding point. This behavior can be observed for all scenarios with different SNR values investigated here. These results suggest that the chosen discarding threshold of 0.2 was large enough that the SVD models finished training with a similar number of neurons in the hidden layer, regardless of the structure of the network. This observation is particularly valuable as it describes the varying behavior of the SVD approach across different cMLP models. Note that the reduction in the number of network parameters resulting from the pruning may not reveal the "effective dimensionality" of the network, since, as shown by Maddox et al. (2020), "simple parameter counting can be a misleading proxy for model complexity and generalization performance."

## IV. DISCUSSION

### A. Performance improvements

The pruning method proposed here for cMLPs shows excellent performance, similar to that of the work presented in Paul and Nelson (2023) for the real-valued models. The detailed investigation of multiple scenarios with different datasets demonstrates the robustness of the cMLP-SVD approach. The method can achieve the same or even better accuracy than the regular cMLP in less training time and needs fewer FLOPs. The pruning approach can adapt to different network structures and removes neurons in the hidden layer based on an empirically defined threshold. The logarithmic spacing of the discarding points shows the importance of discarding neurons several times during the training.

Compared to the benchmark method adapted from Han et al. (2015), the proposed SVD-based technique shows a higher classification accuracy in most of the cases. The benchmark method is not always able to fully recover after a large number of weights are set to zero and therefore, on average over ten trials, it achieves a lower accuracy than the cMLP-SVD method. Since the benchmark method sets weights to zero without changing the size of the model, sparse matrix multiplications would be needed in order to save training time and FLOPs. The advantage of the proposed SVD technique is that it changes the network structure as training progresses and finishes training with a smaller number of neurons in the hidden layer. At least for the scenarios investigated here, evaluating the singular values and removing those that are small seems to be a more robust way to eliminate neurons and weight connections. Of course, both the cMLP-SVD and the benchmark methods can be developed further, and improvements have already been proposed for unstructured weight pruning based on a threshold (Cheng et al., 2023).

Note that the benchmark technique was used with prior information from the cMLP-SVD approach. For example, if the cMLP-SVD model finished training with five neurons in the hidden layer, the same percentage of weight connections in the cMLP was set to zero when implementing the benchmark method. If, however, one were to select a pruning percentage blindly, the classification accuracy would be slightly different. For example, for SNR = 1 dB and ten output classes, a pruning percentage of 50% leads to a classification accuracy of 70% on average, which is slightly lower than that of the cMLP-SVD. Similarly, if the pruning percentage is increased to 70%, the classification accuracy increases to 75%. One reason for the better performance when more weight connections are removed in the benchmark technique is that the model complexity is reduced and the chance of overfitting is smaller. However, if too many weight connections are set to zero, there is an increased chance that the models fail to learn the right patterns after the pruning. This behavior is observed especially for the easier tasks (SNR = 5 dB and SNR = 10 dB), where the benchmark method performs worse than the cMLP-SVD. Interestingly, if the discarding threshold of the cMLP-SVD is increased to 0.3, the performance for SNR = 1 dB and ten output classes increases from 71% to 81%. However, if the benchmark technique (cMLP pruned) is implemented by eliminating as many weight connections as the cMLP-SVD (94%), the accuracy decreases from 77% to 65%.
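The magnitude-based benchmark pruning discussed above might be sketched as follows. This is a minimal sketch under stated assumptions: the ranking of complex weights by their modulus and the helper name `magnitude_prune` are illustrative choices, not the exact implementation of Han et al. (2015).

```python
import numpy as np

def magnitude_prune(W, fraction):
    """Zero out the given fraction of weights with smallest magnitude.

    For complex weights the modulus |w| is used as the ranking criterion
    (an assumption made for this sketch).
    """
    k = int(fraction * W.size)
    if k == 0:
        return W.copy()
    # Magnitude of the kth-smallest weight, used as the pruning cutoff.
    cutoff = np.sort(np.abs(W), axis=None)[k - 1]
    pruned = W.copy()
    pruned[np.abs(pruned) <= cutoff] = 0.0
    return pruned

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
Wp = magnitude_prune(W, 0.70)  # e.g. the 70% pruning level discussed above
print(np.count_nonzero(Wp == 0) / W.size)
```

Note that the pruned matrix keeps its original shape, which is why sparse arithmetic would be needed to turn the zeroed connections into actual savings.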

### B. Reduction of overfitting effect

While the main purpose of the presented technique is to reduce the size of the model in an efficient way, the fact that pruning the network models can reduce overfitting and enhance performance is worth discussing further. The proposed pruning method has a regularizing effect on the model in the sense that it discards small singular values, ignoring certain neurons during training and thus reducing overfitting in the case investigated here. Compared to a classical approach to reducing overfitting, such as the dropout method (Hinton et al., 2012), the SVD-based pruning technique permanently changes the network structure as training progresses, and the final trained model has fewer parameters and reduced complexity. During dropout, the training weights are kept the same (apart from those that are masked), while in the proposed pruning technique, all the values in the weight matrix are changed at every discarding point. The potential of the pruning method to reduce overfitting, and thus improve performance, varies with the task and depends strongly on the training process. If certain neurons that are considered redundant learn irrelevant information from the training dataset, discarding them will automatically reduce overfitting. This is indeed the case here, where, due to the small dataset and large SNR values, the original network models were prone to overfitting the patterns in the training dataset. To investigate this behavior further, it would be helpful to undertake a systematic ablation study (see, e.g., Meyes et al., 2019), which aims to better understand the inner representations of network models. Following such an approach, one can determine the importance of specific parts of the network model and which of these representations are redundant.

### C. CVNNs outperform RVNNs

When compared to their real counterparts, the CVNNs outperform the RVNNs in most of the cases considered, reducing overfitting and enhancing performance. Both types of network need similar training times and numbers of FLOPs, although, because they use only real operations, the rMLP and rMLP-SVD models need slightly fewer FLOPs if enough neurons have been removed during pruning. One of the main reasons for the performance enhancement is that complex multiplication reduces the degrees of freedom in the CVNNs compared to multiplying the real and imaginary parts independently. As discussed in detail by Hirose (2011), reducing a “possibly harmful” part of this freedom during training can result in better generalization, since the arbitrariness of the solution is reduced. Moreover, as noted above, the use of complex numbers during training imposes additional constraints on the network, which are beneficial in this case. For example, the use of a phase constraint in the target output (1 + 1j) also leads to improved convergence.
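The constraint imposed by complex multiplication can be made explicit with a small numerical check (illustrative values): a single complex weight acts on the real and imaginary parts of its input as a coupled rotation-and-scaling with two real parameters, whereas an unconstrained real 2x2 matrix acting on the same pair would have four.

```python
import numpy as np

w = 1.5 * np.exp(1j * 0.3)          # magnitude 1.5, phase 0.3 rad: 2 DoF
x = 0.7 - 0.2j

# Equivalent real 2x2 matrix for multiplication by w: a scaled rotation,
# with only 2 free parameters instead of the 4 of a general 2x2 matrix.
M = np.array([[w.real, -w.imag],
              [w.imag,  w.real]])

y = w * x
y_vec = M @ np.array([x.real, x.imag])
print(np.allclose([y.real, y.imag], y_vec))  # True: the two act identically
```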

### D. Removal of neurons using the SVD

Compared to other pruning techniques that remove neurons based on a magnitude threshold [see Cheng et al. (2023) for a comprehensive recent review], the SVD-based method creates a new matrix of weights using the information from the low-rank approximation. The technique therefore does not merely remove redundant neurons while keeping all other weights the same, but rather rebuilds the matrix of weights and the hidden layer using the reduced set of information considered most important for the training process. This is the main reason for the large drops in accuracy as training progresses and, depending on the task to be solved, the time needed to recover the accuracy will vary. During the simulations, it was found that the appropriate discarding threshold depends strongly on the training process. Depending on the distribution of singular values of the matrix of weights at each discarding point, a larger or smaller number of singular values will be discarded. The main influential factors include the training data, the random weight initialization, the activation functions, the learning rates, and the network structure. This substantial dependence emphasizes the potential of an adaptive process for the discarding threshold, which would enable the network to autonomously determine the optimal threshold based on the distribution of singular values at any given discarding point.
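One way such a rebuild can be realized is sketched below, under the assumption that the rank-r factors of the hidden-layer weight matrix are absorbed into the adjacent layers; the layer sizes and the particular split between the two factors are illustrative choices for this sketch, not necessarily the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hidden, n_out, r = 32, 16, 10, 5

# Hypothetical complex weight matrices of a one-hidden-layer cMLP.
W1 = rng.standard_normal((n_hidden, n_in)) + 1j * rng.standard_normal((n_hidden, n_in))
W2 = rng.standard_normal((n_out, n_hidden)) + 1j * rng.standard_normal((n_out, n_hidden))

# Rank-r truncation of the hidden-layer weight matrix W1 = U S V^H.
U, s, Vh = np.linalg.svd(W1, full_matrices=False)
W1_new = s[:r, None] * Vh[:r]      # r x n_in: new input-to-hidden weights
W2_new = W2 @ U[:, :r]             # n_out x r: new hidden-to-output weights

# The hidden layer now has r neurons instead of n_hidden. For a linear layer
# the cascade W2_new @ W1_new equals W2 applied to the rank-r approximation
# of W1; with a nonlinearity in between, it is an approximation.
print(W1_new.shape, W2_new.shape)
```

Every entry of the two new matrices generally differs from the original weights, which is consistent with the temporary accuracy drops observed after each discarding point.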

Finally, this work focused on the use of three discarding points during training, although this number can be varied, possibly allowing the removal of more singular values or enhancing the performance. However, the computational time may increase with each additional discarding point, especially for large weight matrices, for which the SVD can become computationally expensive. Future work could explore the scalability of SVD-based pruning for larger networks containing multiple hidden layers or large weight matrices. Initial simulations on cMLP models with several hidden layers showed great potential, but further investigation is required before more general conclusions can be reached. In general, there will be a trade-off between the number of SVDs computed during training and the number of discarded singular values. If too few singular values are discarded during training, the training time may increase. Even in this case, however, the advantage of ending training with a pruned network model that requires a smaller number of FLOPs when deployed on a device can be worthwhile.

## V. CONCLUSIONS

This paper has analyzed from first principles a cMLP with any number of hidden layers and non-holomorphic activation functions. The analysis presented enables the use of the SVD to observe the behavior, during training, of the singular values of the weight matrices in the network. It has been shown how the removal of small singular values during training enables a reduction in the number of neurons in the hidden layer. The discarding of singular values is undertaken sequentially during training, and the time saved depends mostly on the size and shape of the MLP network. The proposed cMLP-SVD approach has been successfully applied to a classification task using transient model signals, where the network was trained to distinguish between very small changes in the signals. The effect of SNR on classification accuracy was also established. The performance was compared to that of the regular cMLP, and it has been shown that the cMLP-SVD can achieve the same or higher accuracy while requiring less training time and fewer FLOPs to implement. When compared to the real-valued network models, both the cMLP and the cMLP-SVD outperform their counterparts in most of the cases and are less prone to overfitting. The method proposed here can likely be extended to other network architectures and to multiple hidden layers, although additional investigation will be necessary.

## ACKNOWLEDGMENTS

This work was supported by the Engineering and Physical Sciences Research Council (EPSRC, UKRI) EP/R513325/1.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts of interest to disclose.

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

### APPENDIX A

The loss associated with the *k*th neuron is given by $L_k = \tfrac{1}{4}\, d_k d_k^{*} = \tfrac{1}{4}\big[(d_k^{R})^{2} + (d_k^{I})^{2}\big]$, where the superscripts $R$ and $I$ denote the real and imaginary parts of the complex number, respectively. Using the identities from Adali et al. (2011), it follows that, for the derivative of $L$ with respect to the vector of outputs $\mathbf{z}^{(1)}$, the expression becomes $\partial L/\partial \mathbf{z}^{(1)} = (-1/4)\,[d_1^{*}, d_2^{*}, \ldots, d_K^{*}] = (-1/4)\,\mathbf{d}^{\mathrm{H}}$. Similarly, $\partial L/\partial \mathbf{z}^{(1)*} = (-1/4)\,[d_1, d_2, \ldots, d_K] = (-1/4)\,\mathbf{d}^{\mathrm{T}}$.

### APPENDIX B

For the derivative of the *m*th output of the softmax function, $z_m^{(1)}$, due to the *n*th input into the function, $a_n^{(1)}$, the derivative is given for each of the $K$ output neurons. Since the derivative of the imaginary part with respect to $a_n^{R(1)}$ is zero, applying the quotient rule gives an expression for the two cases $m = n$ and $m \neq n$ that is very similar to the real-valued softmax derivative. The results can be arranged in a matrix with all $m = n$ derivatives on the diagonal and all $m \neq n$ derivatives in the off-diagonal positions.

### APPENDIX C

The derivative of the *m*th term, $a_m^{(1)}$, with respect to $z_m^{(2)}$ is given by

### APPENDIX D

The *m*th element of the derivative can be written as

## REFERENCES

*Lecture Notes in Computer Science*

*Matrix Computations*

*Proceedings of the 28th International Conference on Neural Information Processing Systems*

*Proceedings of the International Joint Conference on Neural Networks*

*2017 International Joint Conference on Neural Networks (IJCNN), Anchorage*

*Proceedings of the IEEE International Conference on Image Processing*