We propose a novel, general-purpose framework for cavitation detection in a wide variety of hydraulic machinery by analyzing acoustic emissions with convolutional neural networks. The main advantage of our system is that it is trained exclusively with data from model turbines operated in laboratories and can be applied directly to different prototype turbines in hydro-power plants. The challenge is that the measurements used to train and test the neural network stem from machines with various turbine designs. This results in train and test data with different data distributions, a so-called multi-source, multi-target domain setting. To handle these domain shifts, two core methods are provided. First, an advanced pre-processing pipeline is used to narrow the domain shift between data from different machines. Second, a domain-alignment method for training neural networks under domain shifts is used, resulting in a classifier that generalizes well to a wide range of prototypes. The outcome of this work is a generic framework capable of detecting cavitation in a wide range of applications. We explicitly do not try to obtain the highest accuracy on a single machine, but rather to achieve the highest possible accuracy across many machines.
Solving the recognition of cavitation in hydraulic machinery without optical access will have a large impact on the sustainability of hydraulic machinery. As renewable energies from wind and sun are massively expanded, energy production is subject to strong fluctuations due to environmental influences. Hydro-power as compensation must, therefore, also become more flexible. This leads to increased start/stop operations and shorter transition times and, thus, longer operating times at operating points for which the machine is not designed. Operation at off-design conditions leads to an increased dynamic loading of the machine. The occurrence of cavitation, in particular, can lead to problems, such as power reduction, vibration, and erosion, on the surfaces of turbine components.1 In order to reduce the influence of cavitation on machines, some methods, such as air injection, are already known in the literature.2 However, this is not always possible or the cavitation cannot be eliminated completely. Therefore, the detection of cavitation is important for a long-lasting and smooth operation of the machines as well as the identification of possible cavitation erosion damage. These results are also valuable to perform predictive maintenance and avoid unnecessary machine downtime.
There are several concepts3 to access the cavitation state by analyzing the net positive suction head (NPSH), visualizing the flow through a transparent model machine casing, or by measurements of the static pressure within the flow. However, these concepts are often not feasible on prototypes. In addition to this, common techniques for cavitation detection are measuring the structural vibrations of hydraulic machines and analyzing those signals. Relevant are, e.g., low frequency vibration signals collected by accelerometers or high frequency signals, which are called acoustic emissions (AEs) measured by AE sensors. Analyzing these signals includes analytical approaches to extract characteristics from the vibration signals, such as signal demodulation, evaluation of the frequency spectral content, and other time–frequency features to serve as indicators for the occurrence of cavitation.3–10 Nevertheless, they are often tailored to a specific machine or problem setup or cannot be evaluated fully automatically. Instead of analytical approaches, data-driven models are used to process vibration data. These cavitation monitoring systems are built on support vector machines,11–14 decision trees,15 or different neural network architectures.13,15–24 They mainly differ in the way the signals are represented as input for the neural network, i.e., different manually extracted features or spectral representations of the vibration signals. So far, most data-driven approaches are fitted and evaluated on the same machine and, thus, do not generalize across multiple machines. Also, they are often only using model machines tested in the laboratory environment or test rigs to develop their approaches and do not test them on prototypes during real power plant operation conditions.
In this work, we also employ a data-driven approach, whose basic concept is shown in Fig. 1. In particular, convolutional neural networks (CNNs) are employed, which have achieved great success in computer vision tasks in recent decades. An advanced pre-processing pipeline is used to transform the raw acoustic emission signals into spectrograms, which serve as input for the CNN. This data representation is necessary so that the CNN is able to extract the important information from the data without being disturbed by noise and other side effects in the data. The core element of this work is the handling of cavitation data with respect to the domain shift contained therein. The cavitation data stem from various machines with different designs, which causes their acoustic emissions to differ as well. Thus, the data set of each machine has an individual distribution, a so-called domain. In order to achieve generalization despite the shift between the domains, a key property of this method is the pre-processing chain of the raw data. The second key property is domain alignment during training, with the goal of learning domain-invariant features that increase the transferability between the domains.
The cavitation detection in this work is set up as a binary classification problem with the two classes: cavitation-free and cavitation. Instead of a binary classification problem, an out-of-domain (OOD) detection would be a possible approach. The OOD detection is trained only with a single class and, thus, detects all other samples as anomalies but without predicting their class label. This is especially useful if only labels from one class are available. In our case, both classes are equally represented in the train data set, and hence, a classification approach is better suited than an OOD detection.
Our method is superior to other presented approaches in several aspects. To our knowledge, all other approaches so far have to be trained on a specific machine and subsequently are only useful for this machine. Our approach extends further and generalizes not only to the data of the training machine or training domain. We are able to create a representation of the data and combine it with the domain-aligned training to generalize the network to a variety of hydraulic machines (Kaplan, Francis, and Pump-turbines). Furthermore, our method offers the advantage that the neural network is trained exclusively with data from physically small model machines and is still transferable to prototypes, which are running under real power plant operating conditions. Last, our approach is able to evaluate the reliability of the predictions. For this purpose, an ensemble of eight CNNs is used for the classification. The variance of the predictions is used to determine the uncertainty of the classification result. This can enhance the expressiveness of the network by evaluating both the confidence and the uncertainty of the predictions.
The main contributions are as follows:
Advanced pre-processing pipeline to generate suitable data representations;
Data analyses with respect to their domain shifts;
Domain-alignment training to get strong generalization;
In-depth evaluation of the cavitation classification results.
To train a successful deep learning model, an extensive data set is necessary. This chapter first describes how the cavitation data are collected and constructed. Then, the data subsets for training, validation, and testing the network are explained. Next, the main issue regarding the domain shift within the data distributions is discussed. Last, the pre-processing pipeline is described, which is used to transform the raw signals into spectrograms.
A. Cavitation data
In this work, the acoustic emissions of hydraulic machinery are analyzed to detect the occurrence of cavitation. Thus, the acoustic emissions of several turbines have to be measured to build a suitable database. The data set is composed of six model machines for training the neural network and five prototypes for evaluation. To enable an extensive investigation and to gain expressiveness about cavitation, the acoustic emissions are measured by several sensors at different positions. The models are mainly horizontal shaft Kaplan turbines together with one Pump-turbine. As shown in Fig. 2(a), the sensors are placed upstream and downstream of the impeller. Additionally, signals are directly recorded at the impeller and on the draft tube (DT). However, they are not considered further, as they do not lead to an improvement of the training process, because of their low signal quality. In contrast, the prototypes are vertical Kaplan and Francis turbines with sensors placed on the head cover (HC), inner head cover, or draft tube, as shown in Figs. 2(b) and 2(c), respectively. For the prototypes, the draft tube sensor position is well suited to analyze cavitation. Due to the mechanical design, these positions are not accessible on every machine meaning that no uniform positioning is possible on the prototypes. The signals measured close to the impeller of the prototypes overdrive so that no diagnosis is possible there.
The general measurement setup is almost the same for all machines. Minor changes in the measurement series are due to the long period of time needed to collect the data. The collapse of the cavitation bubbles causes broadband vibrations, which are mainly pronounced in the high frequency range.25 Often the low frequency range is excluded to avoid influences from machine noise and other environmental noises. Thus, all signals are measured with acoustic emission (AE) sensors with a bandwidth of 100 kHz to 1 MHz. In general, a sampling rate of 1–2 MHz is used to capture a frequency range of at least 500 kHz. With an extension to higher sampling rates for some measurement series, higher frequency ranges are also investigated but show no significant amplitudes there. The exact sampling rates are stated in Table I. In addition to the varying sampling rates, the measurement duration may vary. A major part of the data are measured for 2 s per sample, but also shorter durations are present such that only a few impeller rotations are captured by the measurement.
|Train data|
|ID|Design|Orientation|Size|Sampling rate (MHz)|#Samples|Ratio|
The measurement series of the models are all collected on test rigs. By varying flow rate and pressure level, the cavitation-free operation points are changed stepwise until the cavitation limit is reached. Afterwards, the operation conditions are chosen to gradually increase the cavitation intensity as well. Since the cavitation detection is implemented as binary classification problem, labels are necessary for both classes, also referred to as ground truth. The cavitation states of the model machines are determined via optical accesses to provide the true labels for training the network. Thus, all samples below the cavitation limit are defined to be cavitation-free and labeled with the class free or class 0. Samples above the cavitation limit are then cavitating and labeled with the class cavitation or class 1.
In contrast to the laboratory measurements, the prototypes are running under real power plant operating conditions to collect the AE data. The signals are measured automatically over a defined period of time at intervals of 15 min. The selected samples are shown in the normalized H–Q operating range chart in Figs. 3(a)–3(c), indicating the individual operation points. The flow rate Q and the head H, which is the total energy difference at the hydraulic machine, are normalized to their respective maximum values. The operation of the first two machines is monitored over more than one year, while the third one is monitored for nine months. Under daily operation conditions, many measurements contain redundant information, so only several hundred samples from the entire measurement period are randomly selected for evaluation. Such a long measurement period does not exist for the other two machines. Only shorter sections of the daily operation are available, which nevertheless cover most parts of the operating range chart. The corresponding samples of those machines are shown in the charts in Figs. 3(d) and 3(e).
Also the samples of the prototypes have to be labeled to use them for the evaluation of the classification. However, true labels as for the model machines are not available, since optical access is not possible for prototypes. Therefore, manufacturer information and knowledge about the operating range charts with the cavitation limits are used to infer the labels. Thus, they are just estimated labels with a blurry cavitation limit. The allocation of the labels is encoded by colors in Fig. 3, where cavitation-free samples are marked in blue and cavitation samples are marked in red. The cavitation boundaries are drawn as black lines. In the case of the Kaplan turbines in (a), (c), and (d), the cavitation in the overload region is beginning slightly below the cavitation limit, as well as for (a) in the part load region. In the case of the Francis turbine in (b), the ground truth just outside of the cavitation boundaries is labeled with class 1. In the case of the Pump-turbine in (e), an exact cavitation limit in the operating range charts is not given. The Pump-turbine exhibits a cavitation limit at low power, which is approximately sketched. Some samples below and above the limit can, therefore, belong to the opposite class.
In addition to the sensors at the five prototypes described in this section, data are available from additional sensors on those machines. They were excluded in this work, since their signal quality is worse, i.e., an overdriven signal or signals with a low signal-to-noise ratio (SNR). From both types of signals, it is difficult to extract useful features. During the inference, such test data lead to bad classifications, since these signal characteristics are not used for the training and thus are unknown for the neural network.
An important note on the test setup concerns the runner speed, which is varying among the different machines used to build the data set. Especially, the difference between the faster running models and the slower running prototypes leads to strong differences in the signals and thus to difficulties in the generalization of the trained networks. The variable speed and sampling rates lead to challenges that are compensated to a certain level by the signal pre-processing.
B. Data sets
An overview of the available data samples for training and evaluation of the deep learning model is presented in Table I. The table is split into train and test data showing the machine design, the orientation of the machine, whether it is a model machine or prototype, the sampling rate given in MHz, the number of measurement samples, and the respective ratio of cavitation-free samples. For further comparison, IDs (M1 to M11) are associated with the machines. We want to mention again that the train data only include samples from model machines, while the test data stem from prototypes. Moreover, the machines used for collecting the train data are mostly horizontal shaft oriented machines, compared to the vertical shaft oriented prototypes, which reveals the strong generalization capability of our method.
In the training of machine learning models, it is common to further split the train data into a train set and a validation set. The train set has a larger ratio and is directly used in the training process to optimize the parameters or weights of the model. The validation set is only used indirectly to monitor the progress of the training to find the best generalizing model and to prevent overfitting.
In this work, the data from all model machines are split into a train set (70%) and a validation set (30%). For this purpose, the samples are randomly selected from the whole training data set, which includes samples from all available models. In every training run of an individual ensemble member, the random selection is repeated to reduce the influence of a badly selected train–validation split.
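The repeated random split can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_train_val`, the pooled sample ids, and the use of the member index as random seed are assumptions made for the example.

```python
import random

def split_train_val(sample_ids, train_fraction=0.7, seed=None):
    """Randomly split sample ids into a train and a validation subset."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical ids of all samples pooled over the model machines.
all_samples = list(range(1000))

# One fresh split per ensemble member (an ensemble of eight CNNs is used in this work).
splits = [split_train_val(all_samples, seed=member) for member in range(8)]
```

Drawing a new split for every ensemble member means that no single unlucky partition dominates the validation-based model selection.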
C. Domain shift in data distributions
Handling the cavitation data is very challenging due to the domain shift within the data from different hydraulic machines. To analyze and verify the suitability of the data and to visualize the domain shift within the data, it is common to apply cluster algorithms. The goal of the clustering is to find natural clusters within the data without using information about their class membership, i.e., the data are treated as unlabeled. The data are processed by a cluster algorithm and divided into groups based on the features contained in the data. If it is possible to find clusters according to the class labels, the data set and its data representation are suitable to train a classifier. Ideally, the entire cavitation data set should form two clusters corresponding to the classes: cavitation-free and cavitation.
The AE signals are converted to spectrograms for use as input to the CNN. Thus, this data representation is also used for clustering. Spectrograms are two-dimensional representations, which precludes the direct use of well-known algorithms such as principal component analysis (PCA) or t-distributed stochastic neighbor embedding26 (t-SNE). Instead, an autoencoder27 network is used in this work, which is based on CNNs and is used to cluster 2D image data. Autoencoders are composed of two CNNs, where the first one compresses the input data into a latent space, while the second one tries to reconstruct the original input from the latent space. The latent space in this work is a vector with 32 entries, which is further compressed by a PCA into a 2D representation. Then, the latent space representation of each sample is plotted and analyzed for the desired clustering. A detailed explanation of autoencoders is found in Ref. 27.
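The last step of this analysis, the projection of the 32-dimensional latent vectors onto two dimensions, can be sketched as below. The encoder itself is not shown; the random `latent` array is only a stand-in for its outputs, and the helper name `pca_2d` is an assumption of this example.

```python
import numpy as np

def pca_2d(latent):
    """Project latent vectors (n_samples, n_features) onto their first two principal components."""
    centered = latent - latent.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 32))  # stand-in for encoder outputs (hypothetical data)
points = pca_2d(latent)              # one 2D point per sample, ready for a scatter plot
```

Coloring the resulting points once by class label and once by machine ID gives the two views discussed for Fig. 4.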
The data distributions of the cluster analysis for the whole cavitation data set are plotted dimensionless in Fig. 4. The color coding of the samples in a) indicates the class membership, i.e., cavitation-free in blue and cavitation in red. The cluster analysis does not show clusters according to the class labels, preventing a clear separation of the classes.
Instead of that, the individual machines are much more distinguished from each other as visible in Fig. 4(b). The color coding specifies the machine assignment. Some of the samples overlap, but nevertheless a clear separation of the individual machines is recognizable. Clusters according to the machines instead of the class labels demonstrate that the difference between each machine is more dominant than the difference due to the cavitation conditions. This shows that the data set of each machine has its individual data distributions, and thus, every machine can be considered as a separate domain. This is called a multi-source multi-target problem. A more detailed explanation of this problem setup can be found in Appendix B. To handle this multi-source multi-target setup within the cavitation data, a special class of machine learning methods has to be used, which can deal with multiple domain shifts. In this work, a domain-alignment method for training deep neural networks, called domain-adversarial training (DAT),28 is exploited.
One reason for the machine-specific data distributions is the different mechanical designs and layouts of the machines. The design impacts the development of the cavitation bubbles and affects the signal propagation inside the fluid and the machine structure. Thus, the acoustic emissions vary among the machines. Even the sensors on a single machine can cause a domain shift, as is evident for machine M9, where two orange clusters correspond to two different sensors. From Fig. 4(b), it is also visible that model machines (train) and prototypes (test) in particular are separated. In other words, there is a stronger domain shift between models and prototypes. The varying sampling rates could, of course, contribute to the domain shifts. However, down-sampling is used in the signal pre-processing, which compensates for this effect. In contrast, the runner speed has a stronger impact on the domain shift. The models are physically small and have a high runner speed of about 1500 min⁻¹. In comparison, the larger prototypes have a slower runner speed below 375 min⁻¹. Depending on the runner speed, the blade passing frequency (BPF) differs per machine. The acoustic emissions of the blade passing are measured, resulting in signal characteristics that are significant for the occurrence of cavitation. Due to the large gap between the BPFs, the amount of information about cavitation within a measurement of fixed duration is very different. Thus, the signals have to be modified in the pre-processing phase to align the amount of information per measurement. This clearly reduces the domain shift. However, as visible in Fig. 4, a domain shift still remains, which is addressed by the domain-alignment training.
In addition to an extensive data set, proper data pre-processing is also crucial for successful deep learning model training. The raw signals are transformed into a representation such that the neural network can extract appropriate features from the data in order to achieve a good generalization. Also the pre-processing is able to reduce the domain shift to a certain level. During the pre-processing, the digital raw signal measured by the AE sensors is transformed in several steps into a spectrogram. The complete pipeline is shown as a block diagram in Fig. 5. The basic steps of the pre-processing follow standard processes for acoustic scene classification.29 Then, the pipeline is adapted and extended to fit the problem setup of this work.
First, the raw signal is processed in the time domain. Because of the minimal sampling frequency of 1 MHz mentioned in Sec. II A, the Nyquist criterion limits the resolvable frequency range to 500 kHz. In our data set, frequencies higher than 450 kHz do not have relevant amplitudes. Frequencies below 100 kHz are also excluded to reduce the influence of machine noise and other environmental noise. Thus, a bandpass filter with a frequency band between 100 and 450 kHz is applied. Then, the varying sampling frequencies are aligned by downsampling them to a common value of 1 MHz. Afterward, the amplitudes of the time signal x are normalized to zero mean and unit variance via $\tilde{x} = (x - \mu_x)/\sigma_x$, where $\mu_x$ and $\sigma_x$ are the mean and the standard deviation of x. This reduces the impact of amplitudes with different magnitudes, caused by various signal paths due to the sensor placement. Next, a Hilbert transformation is applied to get the envelope of the processed time signal. In the following steps, both signals, the envelope and the time signal, are processed separately as two parallel channels. The additional use of the envelope increases the expressiveness of the signal, which is a common approach in vibration analysis.30 In the last step, the signals are split into snippets in such a way that the signal snippets of different machines contain the same amount of information. We have found that a time period of four runner revolutions is a proper setup for our application. With four runner revolutions, the signal captures enough information about the stochastically occurring cavitation. Longer time horizons increase the size of the individual input samples but do not improve the results. An example is shown in the bottom row of Fig. 5, where the time signal consists of five major pulses, indicating five rotations of the runner. The signal snippet in the time–frequency domain displayed in the middle shows only four pulses after cutting the signal.
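A minimal sketch of the normalization, the envelope computation, and the cut to four runner revolutions is given below. It is an illustration under stated assumptions, not the pipeline used in this work: the bandpass filter and the downsampling are omitted, the envelope is computed with an FFT-based analytic signal, the snippet is simply taken from the start of the signal, and all function names are hypothetical.

```python
import numpy as np

def normalize(x):
    """Zero-mean, unit-variance normalization of a time signal."""
    return (x - x.mean()) / x.std()

def envelope(x):
    """Magnitude of the analytic signal (FFT-based Hilbert transform)."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0      # double the positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * weights))

def snippet_length(runner_speed_rpm, fs_hz, revolutions=4):
    """Number of samples that cover the given number of runner revolutions."""
    return int(round(revolutions * 60.0 / runner_speed_rpm * fs_hz))

fs = 1_000_000                                   # common sampling rate after downsampling
t = np.arange(200_000) / fs
raw = np.sin(2 * np.pi * 200_000 * t)            # synthetic 200 kHz test tone
x = normalize(raw)
n = snippet_length(runner_speed_rpm=1500, fs_hz=fs)
channels = np.stack([x[:n], envelope(x)[:n]])    # time signal and envelope as two channels
```

At 1500 min⁻¹ four revolutions take 0.16 s (160 000 samples at 1 MHz), whereas at 375 min⁻¹ they take 0.64 s (640 000 samples), which illustrates why the snippet length varies so strongly between models and prototypes.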
Because of the discrepancy regarding the runner speed of the different machines, the length of the signal snippets is strongly varying, which is challenging for the transformation into the time–frequency domain.
In the next step, the one-dimensional signal from the time domain is transformed into the time–frequency domain as a two-dimensional representation, called a spectrogram. The height or y-axis of the spectrogram represents the frequency f, while the length or x-axis carries the time information t. Considering the two channels of the time-domain signal, this results in a representation of shape $2 \times f \times t$. This transformation is commonly done by a short-time Fourier transformation (STFT). A standard STFT has a constant window size on which the fast Fourier transformation (FFT) is applied and a constant step size to move the window over the time signal. Due to the strongly varying length of the signal snippets, a standard STFT is not reasonable. Applying an STFT with a constant window and step size to signal snippets with varying length results in spectrograms with varying length along the time axis to cover the period of four runner rotations. Since neural networks require a constant input size, such a spectrogram is not suitable as input data. Moreover, the time resolution of the spectrogram is strongly affected by its varying length. The time signal of a measurement at a prototype with a slow runner speed includes many more time bins than that of a fast running model. Thus, the spectrogram of a prototype would have a higher resolution than the spectrogram of the model when using a standard STFT. This is one reason for the domain shift between models and prototypes. To handle this resolution problem, an adaptive short-time Fourier transformation (aSTFT) is used, which is one essential element of this work. The aSTFT still has a constant window size of 2048, which results in a one-sided spectrogram with 1024 frequency bins. However, the aSTFT uses an adaptive step size, which is calculated from the total signal length and the runner speed, such that the resulting spectrogram has exactly 1024 time bins to cover four runner revolutions of the time signal with a variable length.
After applying the aSTFT, the resulting spectrograms have a constant shape of $2 \times 1024 \times 1024$ and a more comparable resolution along the time axis.
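The idea of the adaptive step size can be sketched as follows. This is only one possible realization under stated assumptions: the frame starts are spaced evenly over the already cut snippet (so the runner speed enters through the snippet length), a Hann window is used, and the Nyquist bin is dropped to obtain 1024 frequency bins. None of these details are confirmed by the text, and `adaptive_stft` is a hypothetical name.

```python
import numpy as np

def adaptive_stft(x, n_window=2048, n_frames=1024):
    """Magnitude spectrogram with a fixed window size and a signal-dependent step size."""
    if len(x) < n_window:
        raise ValueError("signal is shorter than one window")
    # Evenly spaced frame starts so that exactly n_frames windows span the snippet.
    starts = np.linspace(0, len(x) - n_window, n_frames).round().astype(int)
    window = np.hanning(n_window)                 # assumption: Hann window
    frames = np.stack([x[s:s + n_window] for s in starts]) * window
    spectrum = np.fft.rfft(frames, axis=1)        # n_window // 2 + 1 frequency bins
    # Drop the Nyquist bin (assumption) and return as (frequency, time).
    return np.abs(spectrum[:, :n_window // 2]).T

rng = np.random.default_rng(0)
model_snippet = rng.normal(size=160_000)      # four revolutions at 1500 min^-1, 1 MHz
prototype_snippet = rng.normal(size=640_000)  # four revolutions at 375 min^-1, 1 MHz
spec_model = adaptive_stft(model_snippet)
spec_prototype = adaptive_stft(prototype_snippet)
```

Both snippets yield spectrograms of identical size although the prototype snippet is four times longer, because its step size is correspondingly larger.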
Afterward, a harmonic/percussive source separation31 (HPSS) is applied to the spectrograms to extract more information from them.32 The HPSS filters a spectrogram to extract the horizontally (harmonic) or vertically (percussive) dominant part, respectively. This yields an additional spectrogram representation for the harmonic and for the percussive part, which increases the total number of channels from two to six. Then, the magnitude of the amplitudes of all six channels is transformed from linear to logarithmic scale. Still, the resolution of the spectrogram with $1024 \times 1024$ pixels is high and shows many details of the signal. Small time–frequency patterns and the measurement noise are spread over the whole spectrogram without holding relevant information. This impedes the extraction of useful features and, thus, the generalization. To solve this, triangular filter banks are used to combine groups of adjacent pixels by a triangular weighted average. Using 64 frequency filters and 128 time filters reduces the size of the spectrogram, resulting in the final representation shape of $6 \times 64 \times 128$. This smooths the spectrogram while keeping the significant amplitudes, as visible in the spectrogram in Fig. 5. Also, this representation of the spectrograms is now more comparable between different machines, which strongly reduces the domain shift. Further examples of the resulting spectrograms for the cavitation and cavitation-free condition corresponding to Kaplan, Francis, and Pump-turbine are shown in Fig. 12 in Appendix A. This demonstrates that the representation of the spectrograms is comparable, but each machine still has its unique patterns, which causes the remaining domain shift within the data.
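The logarithmic scaling and the reduction with triangular filter banks can be sketched as below. The sketch assumes linearly spaced, overlapping triangles that are normalized to sum to one, and a small offset inside the logarithm; the actual filter spacing is not stated in the text, the HPSS step is not shown, and the function names are hypothetical.

```python
import numpy as np

def triangular_filters(n_in, n_out):
    """Bank of n_out overlapping triangles over n_in bins, each row normalized to sum to one."""
    centers = np.linspace(0, n_in - 1, n_out + 2)  # includes the two outer edge points
    bins = np.arange(n_in)
    bank = np.zeros((n_out, n_in))
    for k in range(n_out):
        left, center, right = centers[k], centers[k + 1], centers[k + 2]
        rising = (bins - left) / (center - left)
        falling = (right - bins) / (right - center)
        bank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank / bank.sum(axis=1, keepdims=True)

def reduce_spectrogram(spec, n_freq=64, n_time=128):
    """Triangular weighted average along the frequency and time axes of a (C, F, T) array."""
    f_bank = triangular_filters(spec.shape[-2], n_freq)
    t_bank = triangular_filters(spec.shape[-1], n_time)
    return f_bank @ spec @ t_bank.T

rng = np.random.default_rng(0)
magnitudes = rng.random((6, 1024, 1024))   # stand-in for the six magnitude channels
log_spec = np.log10(magnitudes + 1e-10)    # small offset avoids log(0) (assumption)
reduced = reduce_spectrogram(log_spec)     # shape (6, 64, 128)
```

Because every filter row sums to one, each output pixel is a weighted average of its neighborhood, so isolated noisy pixels are smoothed while broad high-amplitude regions are preserved.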
This chapter first provides an overview of the used network architecture. Then, the basic training setup is described with data augmentation and the usage of network ensembles. Afterward, the domain-alignment training is presented as a core element of this work. Last, different evaluation metrics and methods are presented.
A. Network architecture
For the classification of the acoustic emissions, a common state-of-the-art convolutional neural network with residual connections33 (ResNet) is applied. The standard ResNet architecture is modified to fit the problem setup of this work. The used architecture is shown in Fig. 6, where each layer is specified by its layer name, the used settings, and the resulting output shape of the respective layer.
For neural networks, it is common to normalize the inputs to zero mean and unit variance, as used here, or to rescale them to a fixed value range. In general, the normalization statistics are calculated from the whole training data set and are transferred to the test data. According to Ref. 34, it is beneficial to use instance normalization35 (IN) as the first network layer in the case of strong domain shifts. For IN, the normalization statistics (mean and standard deviation) are calculated individually for each input sample and separately per channel of the input. Applying IN to the input layer helps the classifier to be invariant to channel-wide shifts and scaling of the pixel intensities. Thus, IN can help to reduce the divergence between the data domains, while keeping a globally optimal classifier.
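The effect of instance normalization on channel-wide shifts and scaling can be illustrated with a short sketch. The function `instance_norm` and the stabilizing constant `eps` are assumptions of this example; learnable affine parameters, which some implementations add, are left out.

```python
import numpy as np

def instance_norm(batch, eps=1e-5):
    """Normalize every sample and channel of a (N, C, H, W) batch over its spatial axes."""
    mean = batch.mean(axis=(2, 3), keepdims=True)
    var = batch.var(axis=(2, 3), keepdims=True)
    return (batch - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 64, 128))      # four spectrogram samples with six channels
shifted = 3.0 * x + 5.0                   # same data with a different gain and offset
out_a = instance_norm(x)
out_b = instance_norm(shifted)            # nearly identical to out_a
```

Since the statistics are computed per sample, no information from the training domains is needed at test time, which is what makes this layer attractive under domain shift.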
Afterward, four identical blocks are used, where each block has a convolutional layer, followed by a batch normalization, an activation function, and a pooling layer. All four convolution layers use the same settings to extract features from the data; the filter size and the step size with which the filters are shifted in both spatial dimensions across the data samples are specified in Fig. 6. The first convolution increases the number of channels from the six input channels to a hidden representation with 64 channels. Afterward, the hidden representation is again normalized by the batch norm layer to further reduce the internal covariate shift, which otherwise causes slower convergence and unstable training. Then, the rectified linear unit (ReLU) is applied as non-linear activation function. Last, a max-pooling layer is used to aggregate neighboring pixels by passing only the pixel with the maximum value in each window and discarding the remaining ones. This reduces the size of the hidden representation by a factor of two in both spatial dimensions. With increasing depth of the network, this operation increases the receptive field of the convolution filters with respect to the input data.
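The effect of the activation and pooling steps on the size of the hidden representation can be sketched as follows. The sketch assumes a 2 × 2 pooling window with a step size of 2 and convolutions that preserve the spatial size; both are assumptions of this example, and the convolution and batch normalization themselves are not shown.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative values are set to zero."""
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a (C, H, W) array with even H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

rng = np.random.default_rng(0)
features = rng.normal(size=(64, 64, 128))   # hypothetical hidden representation
for _ in range(4):                          # four blocks, each halving both spatial axes
    features = max_pool_2x2(relu(features))
```

Under these assumptions, a 64 × 128 input is reduced to 4 × 8 feature maps after the four blocks.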
The residual connections of the ResNet are displayed in Fig. 6 as skip connections. This operation is formulated as $h_{l+1} = h_l + F(h_l)$, where $h_l$ is the hidden representation of the respective layer, and $F$ indicates a layer block. Thus, the information from the previous hidden representation skips this layer block and is directly added to the input of the next block without modification. This operation improves the gradient flow during the backpropagation through the network.
The convolution blocks are only used to extract visual features from the data samples. The actual classification is performed afterward in the last layers. The feature representation resulting from the last convolution block carries the high-level features. These feature maps are then reduced to single values in the average pooling layer, which forms a feature vector of size 64 by averaging each feature map over its spatial dimensions.
Then, a linear layer is used to assign the feature vector to a class. This raw output is called the logit. In the case of a binary classification, a single output neuron is sufficient. A sigmoid function is applied to the logit to obtain the class probability with values in the range of [0, 1]. Finally, the class label is obtained by a decision boundary that divides the class probability into a binary class label of 0 or 1. The probability has to be above 50% to assign class 1, which corresponds to a threshold of 0.5 for the training samples. A higher threshold is used for the test data, so the confidence of the network must be higher when predicting a test sample as cavitation. The second branch after the average pooling layer is only used in the domain-alignment training, which is described in Sec. III C. The domain branch has two linear layers and six output neurons, one per domain, resulting in a multi-class classification. This requires a softmax function to transform the output logits into a class probability distribution. In this case, an argmax function has to be used instead of a decision boundary to get the predicted domain label as the index of the neuron with the highest probability.
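The two decision rules can be sketched in a few lines. The function names are hypothetical, and the stricter threshold of 0.8 in the usage example is an arbitrary illustrative value, not the threshold used for the test data in this work.

```python
import math

def sigmoid(logit):
    """Map a raw logit to a probability in the range [0, 1]."""
    return 1.0 / (1.0 + math.exp(-logit))

def predict_cavitation(logit, threshold=0.5):
    """Return class 1 (cavitation) if the probability exceeds the threshold, else class 0."""
    return 1 if sigmoid(logit) > threshold else 0

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict_domain(logits):
    """Return the index of the domain neuron with the highest probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

label_default = predict_cavitation(1.0)                 # probability of about 0.73
label_strict = predict_cavitation(1.0, threshold=0.8)   # hypothetical stricter threshold
domain = predict_domain([0.1, 2.0, -1.0, 0.3, 0.0, 1.5])
```

With the same logit, the default threshold assigns the cavitation class, while the stricter threshold does not, which shows how a higher threshold demands more confidence before cavitation is reported.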
B. Training setup
Neural networks in general are optimized via gradient descent methods. The optimizer only uses a random subset (mini-batch) of the whole data set per iteration to estimate the gradients. Thus, it is called stochastic gradient descent (SGD). In this work, an adaptive SGD method called Adam36 is applied.
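The network is trained in a batch-wise fashion with the binary cross-entropy as the loss function

$$L_y = -\frac{1}{B}\sum_{i=1}^{B}\left[y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right)\right], \tag{1}$$

where $y_i$ is the true label and $\hat{y}_i$ is the predicted probability of sample $i$. The batch size $B$ gives the number of samples used for a single iteration step in the optimizer.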
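As a worked example of the binary cross-entropy averaged over a mini-batch, the following sketch evaluates the loss in plain Python. The function name, the clipping constant `eps`, and the example values are assumptions made for illustration only.

```python
import math
from typing import Sequence


def binary_cross_entropy(y_true: Sequence[int], y_prob: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over a mini-batch of labels and probabilities."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("labels and probabilities must be non-empty and equally long")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Hypothetical mini-batch: 1 = cavitation, 0 = cavitation-free.
labels = [1, 0, 1, 0]
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
undecided = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
```

Predictions that stay at 50% yield a loss of log(2) per sample, while confident and correct predictions yield a smaller loss.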
Training is performed for 100 epochs, which means that the whole training data set is passed 100 times through the network to optimize the weights. Within 100 epochs, the training converges to a minimum of the loss function. Larger numbers of epochs result in overfitting of the model. A cosine annealing37 schedule is chosen for the optimizer, which reduces the learning rate from its initial value to its final value over the first 75 epochs, followed by a constant phase over the last 25 epochs. Since the cavitation data set used in this work is comparatively small, overfitting is an issue that has to be considered. To reduce the overfitting, a strong weight decay is applied. Weight decay adds an L2 regularization to the loss function. This restricts the parameters of the model and can lead to an increased generalization. All hyperparameters used for training are derived by a grid search.
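The learning rate schedule described above (cosine annealing over 75 epochs, then constant for 25 epochs) can be sketched as follows. The values `LR_START` and `LR_END` are placeholders chosen only for illustration, not the values used in this work, and the function name is hypothetical.

```python
import math


def learning_rate(epoch: int, lr_start: float, lr_end: float,
                  anneal_epochs: int = 75) -> float:
    """Cosine annealing from lr_start to lr_end, constant at lr_end afterward."""
    if epoch >= anneal_epochs:
        return lr_end
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / anneal_epochs))
    return lr_end + (lr_start - lr_end) * cos_term


LR_START = 1e-3  # placeholder, not the value from the paper
LR_END = 1e-5    # placeholder, not the value from the paper
schedule = [learning_rate(epoch, LR_START, LR_END) for epoch in range(100)]
```

The schedule starts at `LR_START`, decreases monotonically, and stays at `LR_END` for epochs 75 to 99.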
To further prevent overfitting, several data augmentation techniques are applied to the input data. Since the input data are spectrograms, the augmentation methods have to be chosen more carefully than for natural images. We found that the following three augmentations work well for the cavitation data.
Flip: The spectrogram is flipped about the frequency axis, which is comparable to reversing the time direction.
Time mask: One part of the spectrogram is randomly masked with zeros. The mask is covering the whole frequency range and has a random width in the range of [1,40] pixels.
Frequency mask: Mask of zeros, which covers the whole time range and has a height in the range of [1,20] pixels.
An example of each augmentation together with the original spectrogram is shown in Fig. 7. Each augmentation is randomly applied to the input spectrogram with a probability of 50%. Data augmentation virtually increases the amount of data by increasing the variance of the data. This helps to reduce the overfitting of the model to a certain degree.
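The three augmentations can be sketched in plain Python as follows. The layout of the spectrogram (rows as frequency bins, columns as time frames), the function names, and the toy spectrogram size are assumptions; the mask ranges of [1, 40] and [1, 20] pixels and the probability of 50% follow the description above.

```python
import random
from typing import List

Spectrogram = List[List[float]]  # assumed layout: rows = frequency, columns = time


def flip_time(spec: Spectrogram) -> Spectrogram:
    """Reverse the time direction of every frequency row."""
    return [row[::-1] for row in spec]


def time_mask(spec: Spectrogram, rng: random.Random, max_width: int = 40) -> Spectrogram:
    """Zero a random block of time frames over the whole frequency range."""
    n_time = len(spec[0])
    width = rng.randint(1, min(max_width, n_time))
    start = rng.randint(0, n_time - width)
    return [[0.0 if start <= t < start + width else v for t, v in enumerate(row)]
            for row in spec]


def frequency_mask(spec: Spectrogram, rng: random.Random, max_height: int = 20) -> Spectrogram:
    """Zero a random block of frequency bins over the whole time range."""
    n_freq = len(spec)
    height = rng.randint(1, min(max_height, n_freq))
    start = rng.randint(0, n_freq - height)
    return [[0.0] * len(row) if start <= f < start + height else list(row)
            for f, row in enumerate(spec)]


def augment(spec: Spectrogram, rng: random.Random, p: float = 0.5) -> Spectrogram:
    """Apply each augmentation independently with probability p."""
    out = [list(row) for row in spec]
    if rng.random() < p:
        out = flip_time(out)
    if rng.random() < p:
        out = time_mask(out, rng)
    if rng.random() < p:
        out = frequency_mask(out, rng)
    return out


rng = random.Random(0)
# Toy spectrogram with strictly positive entries (32 frequency bins, 64 frames).
spec = [[float(f * 100 + t + 1) for t in range(64)] for f in range(32)]
augmented = augment(spec, rng)
```

Since every augmentation is drawn independently, a single spectrogram can receive none, one, or several of the transformations in one training iteration.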
To improve the classification results, a network ensemble with eight members is used in this work. Ensemble in this sense means a collection of independent networks. The number of eight members is chosen to keep the inference time as small as possible while having a sufficiently large number of networks to provide robust results. Each network of the ensemble is trained individually, but with the identical setup as described above. The training of a neural network is not a deterministic process. In each of the eight training runs, the split into training and validation set is selected randomly, and thus, different train data are available in every run. Also, the initial values of the network weights are chosen randomly such that the training starts from a different point in each run. Finally, the SGD uses randomly selected batches of data samples per iteration. In general, the training runs and the resulting models show the same behavior, but for these reasons, the models are not exactly the same. Each of them converges to a different optimum with a slightly varying behavior, especially at the border regions of the data distributions. Thus, a combined prediction from all eight networks is more robust. In this work, the outputs of all networks are averaged to form the collective prediction $\hat{y} = \frac{1}{M}\sum_{m=1}^{M}\hat{y}_m$, with $M = 8$ members.
C. Domain alignment
Section III B only covers the standard training procedure and does not consider the domain shift during the training. To handle the gap between the data domains, the domain-adversarial training28 (DAT) is adopted to align multiple domains during training. The basic concept is displayed in Fig. 8, which has the same network architecture as shown in Fig. 6. The convolutional blocks are summarized in the feature extractor $G_f$. The classifier for the cavitation detection is given as $G_y$, called label predictor hereafter, and outputs the cavitation class label $\hat{y}$. Both parts $G_f$ and $G_y$ of the network depend on the respective network weights $\theta_f$ and $\theta_y$. So far, this branch is identical to the standard training described above.
The goal of the DAT is to train a feature extractor that finds domain-invariant features. Therefore, a domain classifier $G_d$, parameterized by the weights $\theta_d$, is added as a second branch to the network. The exact architecture is outlined in Fig. 6. The domain classifier is trained to predict the domain of the input. Since the train set is built from six machines, there are six classes representing the domains of the train data. Thus, each data sample has two labels, the cavitation label $y$ and the domain label $d$. Being able to predict the domain correctly implies that the network can separate the domains perfectly, i.e., the feature extractor is trained to find individual features per domain. This is exactly the opposite of the desired behavior, since domain-invariant features are required, i.e., features that are not able to separate the domains. Therefore, a gradient reversal layer (GRL) is placed between $G_f$ and $G_d$. During training, the gradients of the loss functions are calculated with respect to the network weights to optimize the weights. The GRL multiplies the gradients of the domain loss with respect to the weights of the feature extractor by a negative factor, i.e., it reverses their sign. This leads to a saddle point problem, since $G_d$ tries to separate the domains, while the inverted gradients lead to features in $G_f$ that are not able to separate the domains. Thus, the features in $G_f$ become domain invariant. In addition to the domain classification, $G_y$ is used in parallel to classify the cavitation. Thus, $G_f$ has to find joint features that are domain invariant and, at the same time, can be used to detect cavitation.
The DAT uses an extended loss function

$$L = \frac{1}{2}\left(L_y + 2\,L_d\right), \tag{2}$$

where $L_y$ is the binary cross-entropy loss as stated in Eq. (1), used for the cavitation labels. The second term $L_d$ is the cross-entropy loss used for multi-class problems, as in the case of multiple domains. The loss averages the losses of the label predictor and the domain classifier, while the domain loss is weighted by a factor of 2. This is necessary to enforce a stronger focus on domain invariance. Otherwise, the influence of $L_d$ is too weak to learn invariant features. A stronger weighting, on the other hand, worsens the label predictor.
During inference, only the feature extractor $G_f$ and the label predictor $G_y$ are used to obtain the cavitation label $\hat{y}$. The second path with the domain classifier is ignored, since the domain prediction has no relevance for the test data or for live data during daily usage.
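The two ingredients of the domain-adversarial training, the gradient reversal layer and the combined loss, can be sketched without a deep learning framework. This is one possible reading of the description above (average of both losses with the domain loss weighted by 2), not the authors' implementation. The class `GradientReversal`, its `scale` parameter, and all example values are assumptions.

```python
import math
from typing import List


class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale  # assumed reversal strength

    def forward(self, features: List[float]) -> List[float]:
        return list(features)

    def backward(self, grad_from_domain_loss: List[float]) -> List[float]:
        # The feature extractor receives the negated domain gradient.
        return [-self.scale * g for g in grad_from_domain_loss]


def label_loss(y: int, p: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one cavitation label."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def domain_loss(d: int, probs: List[float], eps: float = 1e-12) -> float:
    """Multi-class cross-entropy for one domain label (index of the true domain)."""
    return -math.log(max(probs[d], eps))


def dat_loss(y: int, p: float, d: int, d_probs: List[float],
             domain_weight: float = 2.0) -> float:
    """Average of label loss and weighted domain loss (assumed reading)."""
    return 0.5 * (label_loss(y, p) + domain_weight * domain_loss(d, d_probs))


grl = GradientReversal()
features = grl.forward([1.0, -2.0])
reversed_grad = grl.backward([0.5, -0.25])

# Hypothetical sample: cavitation (y = 1) from the third of six machines (d = 2).
loss = dat_loss(y=1, p=0.9, d=2, d_probs=[0.1, 0.1, 0.5, 0.1, 0.1, 0.1])
```

In an automatic differentiation framework, the `backward` step would be registered as the custom gradient of the identity operation placed between the feature extractor and the domain classifier.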
D. Model evaluation
For the evaluation of the classifier, different metrics and methods are used. To measure the performance of the classifier, three metrics are applied, i.e., balanced accuracy, precision, and recall. These metrics quantify the classification result of the whole data set with a single value. Different metrics are required to highlight different aspects of the classification result. For example, the accuracy can be low, but the precision is high, which means that many detections are missed, but those that are detected are mostly correct.
So far, only the binary classification results are considered for evaluation. To recap, the network outputs probabilities, which are transformed into binary class labels by a decision boundary. Looking at these predictive probabilities of the individual test samples is another evaluation method. It gives insight into the confidence of the network predictions. Samples with a probability close to 0% or 100% are predicted with high confidence. Values closer to the decision boundary indicate lower confidence. Thus, regions where the classification results are not fully reliable can be highlighted, and transition regions from one class to the other can be found.
In addition to looking at the confidence of the network, it can be helpful to investigate the predictive uncertainties. These are related to the predictive probabilities but provide an additional way to evaluate the expressiveness of the network. Comparable to a sample with a probability close to the decision boundary indicating low confidence, a critical sample can have high uncertainty. A high predictive uncertainty indicates that a prediction may not be trusted. Thus, the predictive uncertainty can be seen as an explanation for a misclassification. As mentioned above, an evaluation metric can be misleading. The same holds for the predictive probabilities, e.g., when samples are misclassified with high confidence. Such samples may be identified by high uncertainties. The evaluation metrics and methods are explained in more detail below.
Since the labels in the test sets of the individual machines are imbalanced, a balanced accuracy given by the equation

$$\text{Balanced accuracy} = \frac{1}{2}\left(\frac{\mathrm{TP}}{\mathrm{P}} + \frac{\mathrm{TN}}{\mathrm{N}}\right) \tag{3}$$

has to be used, where the numbers of true positives (TP) and true negatives (TN) count the correctly classified samples of the positive (cavitation) and negative (cavitation-free) class, respectively. TP and TN are divided by the total numbers of samples labeled as positive (P) and negative (N). Thus, each class contributes with a weight of 0.5 to the accuracy, which compensates for the label imbalance. The precision is defined as

$$\text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \tag{4}$$

where the number of false positives (FP) counts the false alarms for the positive (cavitation) class. Therefore, the TP are divided by the total number of samples classified as positive. The precision rates the accuracy of all samples predicted as positive. The recall is given by the equation

$$\text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \tag{5}$$

where the number of false negatives (FN) counts the missed samples of the positive class. In the case of recall, the TP are divided by the total number of samples truly belonging to the positive class. The recall is equal to the first term of the balanced accuracy, which is the accuracy of the positively labeled samples only.
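The three metrics can be computed directly from the confusion counts. The following sketch uses hypothetical labels for an imbalanced test set. The function names and the convention of returning 0.0 for an undefined ratio are assumptions.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for binary labels, with 1 = cavitation."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division; an undefined ratio is reported as 0.0 (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return 0.5 * (_ratio(tp, tp + fn) + _ratio(tn, tn + fp))


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, _, fp, _ = confusion_counts(y_true, y_pred)
    return _ratio(tp, tp + fp)


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return _ratio(tp, tp + fn)


# Hypothetical imbalanced test set: 4 cavitation and 8 cavitation-free samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

ba = balanced_accuracy(y_true, y_pred)
prec = precision(y_true, y_pred)
rec = recall(y_true, y_pred)
```

In this example, half of the cavitation samples are missed (recall of 0.5), whereas two of the three raised alarms are correct (precision of about 0.67), which mirrors the situation discussed for several of the prototypes.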
According to Ref. 38, an ensemble of networks is a simple and scalable method for estimating the predictive uncertainty. A scoring rule is necessary to measure the quality of the predictive uncertainty by a numerical value and to reward better calibrated predictions over worse ones. Examples are the log-likelihood and the Brier score, both of which are based on the predictive probability. Another common measure is the entropy of the predictive probability distribution. When using the entropy for uncertainty quantification, it can be approximated by the variance of the predictive probabilities of the network ensemble.39 The predictive uncertainty can be used to verify the confidence in an estimate, e.g., a corrupted measurement will result in a high uncertainty. Even though the variance of the predictive probabilities takes numerically small values, it is still meaningful. To separate the variance values into low and high uncertainties, they are calibrated using the validation and test data sets.
After successfully training the CNN with the DAT, the network is evaluated in this section. First, the classification results for all prototypes are presented and evaluated with different metrics. Then, the predictive probabilities are investigated, which help to interpret the predictions of the network. Last, the predictive uncertainties of the network are shown to give more insights into predictions and especially into misclassifications.
A. Classification results
As stated in Sec. III A, a decision boundary is used to get the binary class labels $\hat{y}$. For the test data from the prototypes, a threshold of 0.8 is used instead of 0.5 as for the train data, which means that a probability of at least 80% is necessary to predict a sample as cavitation. A probability close to 0% indicates cavitation-free samples. With this threshold, the network has to be more confident to predict a sample as cavitation, which reduces the number of false alarms within the test data. We found this threshold to be the most suitable one across all prototypes, producing an overall good result.
The overall classification results for all prototypes are plotted in their respective H–Q operating range charts in Fig. 9. Each column represents the results of a separate machine with the naming below the column. The bottom row represents the ground truth used as reference for the evaluation. The classified labels for two different sensors per machine are displayed in the top and middle row, where the sensor position is indicated in their diagram title. Some machines have sensors in the same machine section, but with a circumferential offset. In the fourth column (d), the sensors are placed on the draft tubes of two different units from the same power plant. They have the same cavitation limit and similar operation points, and hence, their ground truth is jointly plotted. However, their classification results look different and, thus, are plotted separately. The results of the evaluation metrics, i.e., balanced accuracy, precision, and recall, are listed in Table II. The ordering of the columns is equal to those in Fig. 9, so a unique naming can be used in the following. The individual results are discussed below.
[Table II. Balanced accuracy, precision, and recall per sensor. Columns: Kaplan 1, Francis, Kaplan 2, Kaplan 3, Pump-turbine.]
The Kaplan 1 machine in the first column (a) has two sensors, both mounted at the inner head cover but shifted by 180°. Both sensors behave similarly with respect to their cavitation detection. The samples without cavitation are well separated from the cavitation samples slightly above the cavitation limit shown in the upper part of the respective plots. The classification boundary is close to the true cavitation limit. Cavitation is also detected by the CNN in part load operation, located in the lower right corner. However, cavitation is not constantly present in the part load region, which makes it difficult to define in the ground truth. Therefore, a certain amount of missed detections is acceptable in this region, and the detections are still sufficient to identify it as a critical region with respect to the occurrence of cavitation. This result is confirmed by the accuracies of 71% and 73%. In particular, the exact detection of the cavitation limit is verified by the high precisions of 88% and 92%, indicating that there are nearly no false alarms. The recalls of 47% and 50%, however, show that many samples with cavitation are missed. For our application, this effect is not that relevant, since collecting more data over time will identify the cavitation regions anyway.
The Francis turbine in the second column (b) has two sensors: one at the head cover (HC) and the other at the draft tube (DT). The strongly differing sensor positions lead to diverging classification results. The HC sensor again detects the cavitation limit well for both overload and part load operation, except in the center-right area, where the cavitation is overestimated, leading to an increased false alarm rate. An accuracy of 77% is achieved, but with a lower precision of 61% because of the false alarms in the center-right area. In return, the recall is higher at 68%. The DT sensor cannot extract much information from the AE signals, and thus, cavitation is detected only in deep part load operation. At least, a precision of 100% shows that there are no false alarms.
The Kaplan 2 turbine in the third column (c) has two sensors mounted at the inner head cover but shifted by 90°. Even though the sensors are placed at the same machine section, their classification results differ visibly. So far, we do not have an explanation for this behavior. The sensor in the upper row generally does not detect many cavitation samples. Cavitation is mainly detected in an area slightly below the cavitation limit. The accuracy is still at 50% because of the TN, but the precision and recall are worse. In contrast, the sensor in the middle row detects the cavitation limit and the samples above it well. The accuracy reaches only 63%, since the recall of 30% is low, which means that many cavitation samples are missed. However, the precision of 82% is high, which is more important for identifying cavitation regions.
The Kaplan 3 machine in the fourth column (d) shows the results of two draft tube sensors mounted at two different units of the same power plant. A similar classification result for both units is expected, since the operation points are similar, although not exactly the same. The sensor in the upper row has an accuracy of just 57% because the recall is low at 16%. However, this sensor still identifies cavitation regions very well with a precision of 97%. The sensor of the second unit in the middle row achieves the overall best result with an accuracy of 92%, a precision of 98%, and a recall of 89%. Not only is the cavitation region properly identified, but almost all samples are also captured with very few false alarms.
The Pump-turbine in the fifth column (e) has the fewest samples. The sensors are mounted at the HC and DT. Their classification results are comparable, with accuracies of 74% and 69%, respectively. False alarms are not an issue either, with nearly perfect precisions of 98% and 100%. Again, the cavitation limit is identified correctly. However, with recalls of 50% and 36%, both sensors miss many cavitation samples.
In summary, the cavitation boundaries can be accurately detected and the cavitation regions are identified for all five prototypes. This is visualized by plotting the classification results in the H–Q operating range charts and confirmed by the high precision values. The goal is not to capture every cavitation sample, because missed detections are compensated by collecting more data over time to identify the cavitation regions.
It has to be mentioned that a uniform threshold of 0.8 is chosen as a trade-off to achieve a high generalization to a broad range of hydraulic machines. This results in a plug-and-play system that can be deployed directly to identify critical cavitation regions at other machines without the need for any calibration. By adapting the threshold, the results for individual machines and sensors can be improved.
B. Predictive probabilities
So far, a decision boundary is used to get binary class labels from the output probabilities of the CNN. This is necessary if the cavitation detection is connected to another system that requires binary classification values for its processes. However, binary class labels are not well interpretable. Instead, the predicted probabilities provide helpful insights into the predictions of the network. Also, a threshold makes a hard decision and does not consider how far a probability lies from the threshold. Thus, it can be beneficial to look at the predictive probabilities, which are seen as the confidence of the network. Especially in our application, where the threshold of 0.8 is high, probabilities slightly below the threshold offer information about the cavitation. As a reminder, a probability of 0% indicates that a sample is cavitation-free, while a probability of 100% stands for cavitation. By looking at the probabilities, it is possible to identify transition regions and, in some cases, to rank the cavitation intensity, because the probability is a continuous value. Similar to the classification results, the predictive probabilities for all prototypes are plotted in their respective H–Q charts in Fig. 10, where the columns indicate different machines.
Both sensors of Kaplan 1 show high probabilities above the cavitation limit, indicating that the network predicts cavitation with high confidence. Samples from part load operation also have raised probabilities, but not as high as at the cavitation limit. Thus, samples in part load are classified less confidently. This behavior matches the fact that cavitation is not constantly present during part load operation. In addition, the probabilities for this machine are seen as an indicator for the cavitation intensity, which would imply that the cavitation intensity above the cavitation limit is higher than in part load.
For the head cover sensor of the Francis turbine, it is observed that the overall level of the probabilities is high. This is an indication that the threshold has to be increased, which would result in fewer false alarms in the area between the cavitation boundaries. By looking at the probabilities, the cavitation-free areas are clearly separable from those with cavitation. In contrast, the draft tube sensor would benefit from a lower threshold. However, it is also visible that this sensor still would not be able to detect the overload area.
The inner head cover sensor in the upper row of Kaplan 2 mainly detects samples slightly below the cavitation limit, which is also confirmed by the high probabilities in this region. Looking at the trend of the probabilities, they have higher values in the upper part of the plot, but they are not high enough to exceed the threshold. The other inner head cover sensor displayed in the middle row shows a clear trend. The probabilities increase toward the cavitation limit. Raised values slightly below the cavitation limit indicate that cavitation already occurs before the cavitation limit is reached. After passing the limit, the probabilities increase further. In this case, the probabilities can again be seen as an indicator for the cavitation intensity.
The draft tube sensor of Kaplan 3 in the upper row shows low probabilities for low flow rates and high probabilities for high values of . However, in the range of , the probabilities are also high, although no cavitation should occur there. These samples are below the threshold and, thus, do not cause many false alarms; however, this area should be observed further. Also this area does not offer many samples, which hinders an extensive investigation. The second sensor in the middle row is showing an ideal continuous increasing trend of the probabilities toward the cavitation limit and above. In this case, the probabilities are a strong indicator for the cavitation intensity.
Both sensors of the Pump-turbine have low probabilities spread over the whole H–Q chart. This leads to several missed detections in the low flow rate and low head region. Decreasing the threshold could not improve the performance in this case. Only the number of TP would increase slightly, but instead the number of false alarms would increase even more.
In summary, it is shown that the predictive probability is helpful to interpret the predictions of the network. It provides the confidence of the prediction instead of binary class labels and can be used to tune the threshold, which can improve the performance for some test cases. Furthermore, it can be seen as an indicator for the cavitation intensity and can be used to identify transition regions.
C. Predictive uncertainties
Another useful tool to investigate the predictions of a neural network is the predictive uncertainty. As explained in Sec. III B, the predicted class label is defined as the average from the outputs of eight network ensemble members. By calculating the variance of the individual predictions, the predictive uncertainty for this sample is obtained. Both the predictive probability and the predictive uncertainty are an indication for the reliability of the prediction. Compared to the predictive probability, which shows how close the prediction is to 0% or 100%, the predictive uncertainty defines the disagreement of the individual ensemble members. The predictive uncertainty thus captures only the uncertainty with respect to the neural networks of the ensemble, but does not capture the uncertainty about the data.
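The relation between the ensemble prediction and the predictive uncertainty can be made concrete with a short sketch. It assumes that the ensemble members output probabilities that are averaged and that the population variance serves as uncertainty. The function name and the example probabilities are hypothetical, and the threshold of 0.8 is the test threshold used in this work.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def ensemble_predict(member_probs: Sequence[float],
                     threshold: float = 0.8) -> Tuple[float, float, int]:
    """Return (mean probability, variance across members, binary label)."""
    mean_prob = fmean(member_probs)
    uncertainty = pvariance(member_probs, mu=mean_prob)
    label = 1 if mean_prob > threshold else 0
    return mean_prob, uncertainty, label


# Hypothetical outputs of eight ensemble members for two samples.
agreeing = [0.95, 0.93, 0.96, 0.94, 0.95, 0.97, 0.92, 0.94]
disagreeing = [0.95, 0.10, 0.90, 0.20, 0.85, 0.15, 0.99, 0.30]

mean_a, unc_a, label_a = ensemble_predict(agreeing)
mean_d, unc_d, label_d = ensemble_predict(disagreeing)
```

The first sample is classified as cavitation with a very small variance, whereas the members disagree strongly on the second sample, which leads to a mean probability below the threshold and a much larger variance.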
Once again, the results of the predictive uncertainties for all prototypes are plotted in their respective H–Q charts in Fig. 11. In the case of the predictive uncertainties, low values indicate that the neural network is certain about its prediction and high values stand for an uncertain prediction.
Beginning with the Kaplan 1 machine, both inner head cover sensors have high probabilities and low uncertainties close to the cavitation limit and in the part load region. In contrast, the cavitation-free region has increased uncertainties, which indicates that the predictions in the region could be wrong. The probabilities in this region have values in the range of 40–60%, which are not close to 0%. Thus, this region is still classified correctly, but the network is uncertain about those predictions.
The same behavior is observed for the head cover sensor of the Francis turbine. The cavitation regions have high probabilities and low uncertainties. However, the probabilities in the cavitation-free region in the midsection of the H–Q chart are also increased tending to produce false alarms. Again, this region is identified as a critical section by the uncertainties. The result of the draft tube sensor is less clear. The deep part load region is correctly classified with low uncertainty. Closer to the part load limit, many detections are missed, which is correctly assessed by high uncertainties. The overload region is classified as cavitation-free but has high uncertainty. Also, the network is certain that the center-left region is cavitation-free. However, the center-right region cannot be fully interpreted by the probabilities and uncertainties.
Both inner head cover sensors of the Kaplan 2 machine are missing cavitation detection above the cavitation limit, which is identified by the uncertainties. Close to the cavitation limit, the probabilities are higher, confirmed by low uncertainties. In the lower half of both H–Q charts, the network is certain about the samples being cavitation-free. For both sensors, the probabilities start to increase in the region of , which should be cavitation-free. Particularly, the sensor in the middle row indicates this through increased uncertainties.
The draft tube sensor at the first unit of the Kaplan 3 machine displayed in the upper row detects the cavitation limit well, but the probabilities are just slightly above the decision boundary. In this case, also the associated uncertainties are raised. The region between and is cavitation-free, but the probabilities are too high. Also the region below is cavitation-free but still has increased probabilities. The whole part is clearly identified by high uncertainties to be critical. The draft tube sensor at the second unit is well classified and confirmed by low uncertainties. Only the transition region has raised uncertainties, but this region is not clearly cavitation-free, so this is a valid result.
The head cover sensor at the Pump-turbine shows increased uncertainties in the cavitation-free region, as well as for the missed detections below the cavitation limit. This result matches the behavior of the probabilities. For the draft tube sensor, nearly all the uncertainties are low, even though there are many missed detections in the cavitation region. In this case, the uncertainties cannot capture this behavior.
In summary, the predictive uncertainty provides a deeper insight into the predictions of the network. The predictive uncertainty is used to assess the reliability of the predictions and to identify weak points of the network. Thus, samples classified as cavitation or cavitation-free can be confirmed, and missed detections and false alarms can be identified. It is shown that the predictive uncertainties are consistent with the statements of the predictive probabilities.
We propose a framework for cavitation detection, which is applicable to a variety of hydraulic machines. The focus of this work is to obtain a strong generalization to many machines, instead of a maximum accuracy for an individual machine. To enable this, the domain shift within the data has to be handled, which is realized by an advanced pre-processing pipeline and a domain-alignment training to find domain-invariant features. It is shown that the properly trained network can be transferred to five different prototypes and can analyze their daily operational process. These results suggest that this framework can be applied as a diagnostic tool to other unknown machines. Furthermore, the results are investigated by the use of the predictive probability and the predictive uncertainty. Both methods help to interpret the predictions with respect to many aspects, e.g., they provide the confidence of the predictions, identify false alarms and missed detections, or indicate the cavitation intensity.
As mentioned in Sec. II A, data from additional sensors at the five prototypes are available, but were excluded in this work because of their worse signal quality. In the next step, a method for a priori detection of corrupted signals has to be implemented. Further, preliminary investigations show that wavelet filters can be applied in the pre-processing to improve the signal quality. This enables the usage of sensors even with a low SNR.
The authors would like to thank Voith Hydro Holding GmbH & Co. KG for partial funding and for providing a part of the measurement data. Special thanks go to Dr. Alexander Jung and Dr. Jörg Necker for their collaboration and support.
Conflict of Interest
The authors have no conflicts to disclose.
Lukas Gaisser: Conceptualization (lead); Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (lead); Resources (lead); Software (lead); Validation (lead); Visualization (lead); Writing – original draft (lead). Oliver Kirschner: Project administration (equal); Supervision (equal); Writing – review & editing (equal). Stefan Riedelbauch: Funding acquisition (lead); Project administration (equal); Supervision (equal); Writing – review & editing (equal).
The data that support the findings of this study are available from measurements by the University of Stuttgart and Voith Hydro Holding GmbH & Co. KG. Restrictions apply to the availability of these data, which were used under license for this study. Data are available from the authors upon reasonable request and with the permission of University of Stuttgart and Voith Hydro Holding GmbH & Co. KG.
APPENDIX A: SPECTROGRAMS
Examples of the resulting spectrograms for the cavitation and cavitation-free condition corresponding to a Kaplan, Francis, and Pump-turbine are shown in Fig. 12. This demonstrates that the representation of the spectrograms is comparable, but each machine still has its unique patterns, which yields the domain shift within the data.
APPENDIX B: MULTI-SOURCE MULTI-TARGET PROBLEM
In classic machine learning, one key assumption is that the data distributions of the train and test data are similar. Two simplified distributions in a fictitious two-dimensional feature space are shown in Fig. 13(a). The train and test distributions are marked in blue and red, respectively. It is visible that both distributions are similar, so the features learned from the train data can directly be transferred to the test data. In contrast, Fig. 13(b) shows several distributions in the same fictitious feature space, which are only partially overlapping. In this case, not only does the train distribution differ from the test distribution, but the train distributions themselves are also split and differ from each other. The same holds for the test distributions. This is called a multi-source multi-target problem.