A previous paper by Paul and Nelson [(2021). J. Acoust. Soc. Am. 149(6), 4119–4133] presented the application of the singular value decomposition (SVD) to the weight matrices of multilayer perceptron (MLP) networks as a pruning strategy to remove weight parameters. This work builds on the previous technique and presents a method of reducing the size of a hidden layer by applying a similar SVD algorithm. Results show that by reducing the number of neurons in the hidden layer, a significant amount of training time is saved compared to the algorithm presented in the previous paper, with little or no loss in accuracy compared to the original MLP model.
1. Introduction
The trend of applying machine learning techniques to solve problems in acoustics has increased rapidly in recent years. A number of different applications are discussed by Bianco et al. (2019), where the authors present a detailed review of machine learning models and their use in acoustics. Similarly, Purwins et al. (2019) present a summary of deep learning approaches applied to audio signal processing. More recently, Grumiaux et al. (2021) presented a review of deep learning approaches focused on audio source localisation applications. Drawing from these reviews, among other sources, it becomes clear that there has been a consistent trend of building network models of increasing complexity to handle more difficult tasks. Notably, as network models grow in size, they demand more computational power and, therefore, longer training times.
The so-called “pruning” of neural networks is a very active research topic (Blalock et al., 2020; Choudhary et al., 2020) because computational power has become a limiting factor in the use of high dimensional networks with large numbers of neurons. It has been shown by Denil et al. (2013) that a large number of the weight parameters in typical network architectures are redundant, suggesting that network layers are usually over-parameterized. This redundancy implies that numerous weight values encode similar or indistinguishable patterns, thereby offering opportunities for optimization and compression of the network without compromising its overall performance. Various techniques have been proposed to implement network pruning (e.g., Augasta and Kathirvalavakumar, 2013; Blalock et al., 2020; Suzuki et al., 2018), and some of these methods are based on a low-rank approximation of the weight matrices, where, for example, a weight matrix is approximated by two or more low-rank matrices (e.g., Jaderberg et al., 2014; Shmalo et al., 2023). In Yang et al. (2020), for instance, the authors propose the use of the singular value decomposition (SVD) on the weight matrix at the beginning of the training, decomposing the matrix into two smaller matrices and performing a full-rank SVD training before removing small singular values from the trained model. Similarly, Shmalo et al. (2023) use the same low-rank approximation of the weight matrix using two smaller matrices with the aim of reducing the overfitting of networks and improving the accuracy. In acoustics, some previous work (Cai et al., 2014; Xue et al., 2013) used SVD-based approaches to reduce the training parameters of feed-forward networks. In Xue et al. (2013), the authors computed two small matrices from the SVD matrices of the weights at one point during training and showed that, for a speech recognition task, a large proportion of the network can be removed without losing any accuracy. In Cai et al. (2014), the authors applied a similar technique to a speech recognition task, arguing that the use of the SVD directly on the randomly initialised weight matrices is not beneficial, so they applied the pruning technique to a model that had already been trained for a few iterations. More recently, Singh and Plumbley (2022) applied a different pruning technique (not based on the SVD) to a convolutional neural network (CNN) for acoustic scene classification. The authors remove convolutional filters by assuming that similar filters produce similar responses and are, therefore, redundant for the overall training.
The technique presented previously by Paul and Nelson (2021) is based on decomposing the weight matrices into their component SVD matrices. This differs from the methods used previously by Cai et al. (2014) and Xue et al. (2013) in that it enables the user to reduce the training parameters iteratively during training and uses all three component matrices of the SVD during training with backpropagation. It has been shown by Paul and Nelson (2021) that by using the SVD approach, significant training time can be saved (up to 2/3) with little or no loss in generalization accuracy. This is achieved by iteratively removing training parameters between the input and hidden layer without changing the shape of the network. In this paper, the authors present an extension of the SVD technique that decreases the size of the hidden layer. The SVD approach applied to network models can be traced back to an early paper by Psichogios and Ungar (1994), in which the authors use the SVD to reduce overfitting and improve the generalization error. These authors removed the redundant singular values and corresponding hidden layer nodes after the training was completed, which is in contrast to the current approach.
In the work presented here, the focus is on the pruning capabilities of the SVD when it is used to iteratively remove neurons in the hidden layer during training, such that the network size can be gradually reduced over time. The technique is applied here to multilayer perceptrons (MLPs) with only one hidden layer, but the method can be easily extended to MLPs with multiple layers. Compared to other previously proposed pruning techniques that use the SVD, the work presented here does not require a full training of the model before discarding training weights (e.g., Psichogios and Ungar, 1994; Yang et al., 2020); it discards neurons in the hidden layer progressively, rather than only once at the beginning of or at some point during the training (e.g., Cai et al., 2014; Xue et al., 2013); and, overall, it is able to adapt to the training data and the task to be solved, as will be discussed later. In addition, the authors believe that the use of the SVD approximation on the matrix between the input and hidden layer could potentially offer useful information about which audio content from the training samples (for example, frequency content) is found by the network model to be more important during the training. In this way, one could potentially better understand the patterns found by the MLP in the input data while discarding redundant information.
2. Reduction of network dimensions using the SVD
Beginning with a simple MLP network with one hidden layer, the forward and backward propagations can be written using vector and matrix notation as discussed by Paul and Nelson (2021). The matrices $\mathbf{W}_2$ and $\mathbf{W}_1$ denote the matrices of weights relating the input to the hidden layer and the hidden layer to the output layer, respectively, the vectors $\mathbf{b}_2$ and $\mathbf{b}_1$ are the vectors of bias weights at each layer, and $\mathbf{x}_2$ and $\mathbf{x}_1$, respectively, denote the inputs and outputs of the hidden layer. Note that the layers are counted from the output layer backward and, thus, $\mathbf{x}_1$ denotes the input into the output layer, and $\mathbf{y}$ is the output of the network. The network is trained by using the method of steepest descent (or one of its variants) such that the weight matrices are updated at every iteration. The general equation for the steepest descent is given by $\mathbf{W}(\tau+1) = \mathbf{W}(\tau) - \eta \, \partial L/\partial \mathbf{W}(\tau)$, where $L$ denotes the loss function to be minimised, $\eta$ is the learning rate, $\tau$ denotes the index defining the update of the matrices during backpropagation, and $\mathbf{W}$ is replaced by either $\mathbf{W}_2$ or $\mathbf{W}_1$ when training the above model. The equations for the relevant matrix derivatives are given in the previous paper (Paul and Nelson, 2021).
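To make the notation above concrete, a minimal numpy sketch of the forward pass and of the steepest-descent update for a one-hidden-layer MLP is given below. The layer sizes match those of the illustrative problem in the following section, but the tanh activation, the random initialisation, and the function names are illustrative assumptions rather than details taken from the original implementation.

```python
import numpy as np

# Minimal sketch of a one-hidden-layer MLP in the notation above:
# W2, b2 relate the input to the hidden layer; W1, b1 relate the hidden
# layer to the output layer (layers are counted from the output backward).
rng = np.random.default_rng(0)

n_in, n_hidden, n_out = 129, 20, 3               # sizes used in the example problem
W2 = 0.01 * rng.standard_normal((n_hidden, n_in))
b2 = np.zeros(n_hidden)
W1 = 0.01 * rng.standard_normal((n_out, n_hidden))
b1 = np.zeros(n_out)

def forward(x):
    """Forward propagation for one input spectrum x of length n_in."""
    x1 = np.tanh(W2 @ x + b2)   # x1: output of the hidden layer / input to the output layer
    y = W1 @ x1 + b1            # y: network output (left here as raw scores)
    return x1, y

# Steepest-descent update W(tau + 1) = W(tau) - eta * dL/dW(tau),
# applied to either W1 or W2 with a gradient obtained by backpropagation.
eta = 0.001                     # learning rate used in the simulations

def steepest_descent_step(W, dL_dW):
    return W - eta * dL_dW
```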
3. Illustrative application of the method
The same problem of acoustic spectral classification that was used in the previous paper (Paul and Nelson, 2021) can be employed to illustrate the application of this method. The training data were synthesised from white noise signals that were passed through bandpass filters having different centre frequencies and bandwidths. Different datasets were generated by changing three main parameters. First, the bandwidth of the bandpass filter was changed between 10 and 100 Hz, then the difference between the centre frequencies was changed from 30 to 60 Hz, and the number of output classes was changed between three and nine classes. Using a sampling frequency of 16 kHz and a fast Fourier transform (FFT) length of 256 samples when transforming the signals into the frequency domain, the spectral resolution available is around 62 Hz, which suggests that a difference between centre frequencies of 30 Hz might be more difficult to resolve than one of 60 Hz. When each parameter was changed, all the others were kept the same, such that there were a total of 12 comparisons between the three different network implementations. The training database was generated with 1000 signals from each class of bandpass-filtered white noise transformed into the frequency domain, and both the training and validation datasets were shuffled before starting the training. All three networks had a single hidden layer, and the number of output nodes in each network corresponded to the number of classes of bandpass-filtered white noise. The number of neurons in the hidden layer was varied to test the robustness of the time-saving algorithms. The input layer contained 129 neurons corresponding to the magnitude of the FFT of the signals. The learning rate used for the following simulations was 0.001, and the networks were trained using the Adam optimizer, which is a variant of the method of steepest descent.
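As an illustration of how such a dataset could be generated, the sketch below filters white noise with a bandpass filter and takes the magnitude of a 256-point FFT, giving the 129 input values per sample mentioned above. The filter order, the signal length, and the particular centre frequencies are assumptions chosen for this example and are not specified in the text.

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 16000          # sampling frequency (Hz)
n_fft = 256         # FFT length -> 129 magnitude bins from rfft
bandwidth = 100.0   # bandpass filter bandwidth (Hz)

# Illustrative centre frequencies for three classes spaced 30 Hz apart.
centre_freqs = [1000.0, 1030.0, 1060.0]

def make_example(fc, n_samples=4096, rng=None):
    """One training sample: bandpass-filtered white noise -> FFT magnitude."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.standard_normal(n_samples)
    b, a = butter(4, [fc - bandwidth / 2, fc + bandwidth / 2],
                  btype="bandpass", fs=fs)
    filtered = lfilter(b, a, noise)
    segment = filtered[-n_fft:]            # take a segment after the filter transient
    return np.abs(np.fft.rfft(segment))    # 129 input values

# One (input, class label) pair per class; in practice, 1000 signals per class.
dataset = [(make_example(fc), label) for label, fc in enumerate(centre_freqs)]
```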
MLP-SVD approach to reduce the number of neurons in the hidden layer.
1: Initialise the weight matrices $\mathbf{W}_2$, $\mathbf{W}_1$ and the bias vectors $\mathbf{b}_2$, $\mathbf{b}_1$
2: Define the singular value threshold
3: $\mathbf{d} = [d_1, d_2, \ldots, d_D]$ {Define discarding points vector}
4: for each training iteration $\tau$ do
5:     if $\tau < d_1$ then
6:         Update $\mathbf{W}_2$ and $\mathbf{W}_1$ by backpropagation {Train regular MLP for first $d_1$ iterations}
7:     else
8:         if $\tau = d_1$ OR $\ldots$ OR $\tau = d_D$ then {Check whether $\tau$ is a discarding point}
9:             $\mathbf{W}_2 = \mathbf{U}\mathbf{S}\mathbf{V}^{\mathrm{T}}$ {SVD of matrix of weights $\mathbf{W}_2$}
10:            Discard singular values below the threshold {Remove singular values}
11:            Reduce the hidden layer to the number of retained singular values
12:            Continue training with the reduced architecture {Train MLP with new $\mathbf{W}_2$}
13:        else
14:            Update the current weights by backpropagation {If no discarding point, continue to train}
15:        end if
16:    end if
17: end for
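The following sketch shows one possible realisation of the discarding step in the algorithm above: the matrix $\mathbf{W}_2$ is factored by the SVD, the singular values below a predefined threshold are removed, and the retained factors define a smaller hidden layer. The particular way in which the retained left singular vectors are folded into $\mathbf{W}_1$ and the bias is projected is an assumption about one possible implementation (and is only approximate through a nonlinear activation), not necessarily the authors' exact procedure.

```python
import numpy as np

def reduce_hidden_layer(W2, W1, b2, threshold):
    """Shrink the hidden layer by discarding small singular values of W2.

    W2: (n_hidden, n_in) input-to-hidden weights
    W1: (n_out, n_hidden) hidden-to-output weights
    b2: (n_hidden,) hidden-layer bias
    """
    U, s, Vt = np.linalg.svd(W2, full_matrices=False)
    r = int(np.count_nonzero(s > threshold))   # number of retained singular values
    U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

    W2_new = s_r[:, None] * Vt_r               # new (r, n_in) input-to-hidden weights
    W1_new = W1 @ U_r                          # new (n_out, r) hidden-to-output weights
    b2_new = U_r.T @ b2                        # approximate projection of the hidden bias
    return W2_new, W1_new, b2_new, r

# At each discarding point, the current weights would be replaced by the
# reduced ones, e.g.:  W2, W1, b2, n_hidden = reduce_hidden_layer(W2, W1, b2, threshold)
```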
4. Results
The algorithm presented above is compared, in terms of time reduction, to the technique presented in Paul and Nelson (2021) and to the regular MLP using the same model problem of classifying closely related spectra. The time reduction is expressed directly as the time needed to finish training but also in terms of FLOPs (floating point operations) performed when the network model is implemented in its reduced form. It should be noted that the FLOP value is a rough estimate of the number of multiplications and additions needed to classify one test sample. The comparison between the three different networks is made based on the validation accuracy at the end of training (how well the networks can generalize) but also on the training time saved using the SVD methods. For simplicity, the technique presented in the previous paper is denoted MLP-SVD1 and the new technique MLP-SVD2. The results shown below are averaged over ten trials.
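The FLOP figures quoted below can be thought of as counting roughly one multiplication and one addition per weight for a single forward pass; the short sketch below is an assumption about how such an estimate could be computed and reproduces the order of magnitude of the values reported in Table 1.

```python
def mlp_flops(n_in, n_hidden, n_out):
    """Rough FLOP count for classifying one sample with a one-hidden-layer MLP.

    One multiplication and one addition per weight; bias additions and
    activation costs are neglected.
    """
    return 2 * (n_in * n_hidden + n_hidden * n_out)

print(mlp_flops(129, 20, 3))   # 5280, comparable to the ~5200 reported for the full MLP
```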
Figures 1(a) and 1(b) show a comparison between the three techniques for centre-frequency differences of 30 and 60 Hz, respectively, using three and nine output classes. It can be observed in both plots that all three network architectures have a similar performance in terms of accuracy. As expected, all networks perform worse if the difference between centre frequencies is only 30 Hz [Fig. 1(a)], and if nine output classes are used, the networks achieve a lower accuracy than if only three output classes are used. For both SVD approaches, the drop in accuracy can be clearly observed whenever singular values are discarded. A similar trend can be observed when increasing the hidden layer to 100 neurons: the performance of the networks remains the same; however, the training time increases and, therefore, the MLP-SVD2 approach saves more time. Table 1 shows a comparison between the different techniques for two hidden layer dimensions (20 and 100 neurons) using various dataset parameters.
Fig. 1. Comparison of accuracy performances between the three techniques for (a) a centre-frequency difference of 30 Hz using (i) three output classes and (ii) nine output classes and (b) a centre-frequency difference of 60 Hz using (i) three output classes and (ii) nine output classes, with a filter bandwidth of 100 Hz. All three networks had 20 neurons in the hidden layer.
Table 1. Comparison of training times and performances between the MLP, MLP-SVD1, and MLP-SVD2 using networks with 20 and 100 neurons in the hidden layer, a fixed difference between centre frequencies, and a filter bandwidth of 10 Hz. Results are averaged over ten trials. The numbers marked with an asterisk represent the best performance in terms of accuracy, training time, and FLOPs.
                  Accuracy (%)                      Training time (s)                 FLOPs
Hidden layer      20 neurons      100 neurons       20 neurons      100 neurons       20 neurons      100 neurons
Output classes    3       9       3       9         3       9       3       9         3       9       3       9
MLP               80.33   72.64   80.20   71.62     11.79   39.51   65.37   165.66    5200    5400    27 000  28 000
MLP-SVD1          80.50*  72.59   80.73*  71.38     9.09    30.18   23.69   65.51     5000    4200*   2600    5600
MLP-SVD2          80.47   73.06*  79.27   71.81*    7.63*   23.02*  5.34*   16.85*    4100*   4300    1000*   2200*
It can be observed that, for all the different datasets, the three networks perform very similarly, on average, in terms of accuracy. When it comes to training time, the newly proposed technique (MLP-SVD2) needs the least training time in all cases. Especially when the hidden layer has 100 neurons at the start of training, both MLP-SVD techniques save more than 50% of the total training time. The MLP-SVD2 approach trains in around 1/4 of the time needed by the MLP-SVD1 network. In terms of singular values discarded, it is interesting to observe that both SVD techniques discard a similar number of singular values during training; however, because MLP-SVD2 changes the size of the hidden layer, more training time is saved. In terms of FLOPs, both SVD techniques require fewer operations than the original MLP, and the MLP-SVD2 technique ends up with the fewest FLOPs in most cases. On average, over all ten trials, the MLP-SVD1 technique ends up with between 4 and 15 singular values at the end of training, whereas the MLP-SVD2 method has between 4 and 16 remaining singular values. Another interesting observation is that when the network has only 20 neurons in the hidden layer, both SVD approaches end up with more singular values at the end of training (12–16 singular values) than when the dimension of the hidden layer is increased to 100 neurons (4–8 singular values). Interestingly, this suggests that starting with a larger number of neurons and allowing the MLP-SVD2 algorithm to prune the network may result in superior network designs. This will be investigated further by the authors.
5. Discussion and limitations
The results presented above confirm the potential of the proposed pruning technique. By discarding singular values progressively at logarithmically spaced points, the network model has enough time between the discarding points to adapt to the new, reduced architecture. The main limitation of this technique is that the predefined singular value threshold determines the number of discarded parameters. A solution would be to introduce an adaptive threshold, as proposed, for example, very recently in Ke (2023). In addition, the SVD approach has thus far been tested only on small MLP models with a single hidden layer. Further work will investigate the use of multiple hidden layers, where the SVD technique could be applied to each matrix relating two layers or only to the matrix relating the input to the first hidden layer. When it comes to larger matrices, where the SVD is more time-consuming, the authors believe that the proposed SVD approach could still be useful, as it can be adapted to any model architecture, with the option to determine the number of discarding points required during the training. Fewer discarding points result in fewer SVD computations on the weight matrices but, potentially, also in fewer discarded parameters. On the other hand, the choice of a higher threshold might result in reducing the hidden layer dimensions only once, thus minimising the use of the SVD on a large matrix. Finally, it should also be emphasised that the same technique could be applied to any fully connected layer that forms part of a more advanced network model, many of which contain one or more such fully connected layers.
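As an example of the logarithmic placement of discarding points mentioned above, the short sketch below spaces a chosen number of discarding points geometrically over the training run, so that the reduced network has progressively more iterations to adapt between successive discarding points; the function name and the parameter values are illustrative assumptions.

```python
import numpy as np

def discarding_points(n_epochs, n_points, first=5):
    """Logarithmically (geometrically) spaced discarding points over training."""
    return np.unique(np.geomspace(first, n_epochs, n_points).astype(int))

print(discarding_points(200, 6))   # e.g. [  5  10  21  45  95 200]
```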
6. Conclusion
This work presented an extension of the approach described by Paul and Nelson (2021) that reduces the training time of an MLP with one hidden layer by iteratively discarding singular values during training. The novelty compared to the technique in Paul and Nelson (2021) is that, here, the dimensions of the hidden layer are reduced according to the number of singular values discarded. In this way, the MLP network is able to progressively reduce its dimensions until the training stops. The presented technique could be extended to more layers or to other network architectures provided that the SVD can be applied to the weight matrices; however, the time reduction will be task dependent, and the discarding parameters will have to be adjusted correspondingly.
Acknowledgment
This work was supported by the Engineering and Physical Sciences Research Council (EPSRC, UKRI) under Grant No. EP/R513325/1.