Machine learning (ML) has driven remarkable advances in data science, medical research, and many engineering fields. Vortex-induced vibration (VIV), a complex combination of fluid dynamics, fluid–structure interaction, and structural vibration, has always been costly to study experimentally and highly time-consuming to solve through numerical simulations. The current study aims to bridge this gap by applying a range of recent ML techniques to the same problem to improve the prediction of results. The dataset used for training and testing the models was self-generated, validated, and published, and is hence considered suitable for identifying techniques that predict the vortex-induced vibration phenomenon effectively and efficiently. The study applies a host of supervised learning techniques, including artificial neural networks (ANNs), support vector machines (SVMs), decision trees, ensemble methods, and Gaussian process regression (GPR), to the same dataset. The ANN was analyzed using multiple training–testing ratios. Three variations of decision trees were analyzed, i.e., coarse, medium, and fine. Six SVM algorithms were tested: linear, quadratic, cubic, coarse Gaussian, medium Gaussian, and fine Gaussian. Both bagging and boosting ensemble methods were tested, and four GPR algorithms were examined, namely, exponential, squared exponential, rational quadratic, and Matern 5/2. The results are analyzed primarily using mean squared error (MSE), root mean squared error (RMSE), R-squared (R2), and mean absolute error (MAE).
The results show that even a training–testing ratio of 30:70 may provide sufficiently credible predictions although for a ratio of 50:50, the accuracy of predictions shows diminishing returns and hence is a sufficiently high training–testing ratio. Fine decision trees, fine Gaussian SVM, boosting ensemble method, and Matern 5/2 GPR algorithms showed the best results within their own techniques while the GPR techniques provided the best predictions of all the different techniques tested.
INTRODUCTION
Fluid dynamics is a rapidly evolving field with numerous applications in civil engineering and beyond, necessitating the generation and analysis of vast amounts of data through experiments, fieldwork, and simulations.1 These traditional methods are often time-consuming and resource-intensive. A key phenomenon in fluid dynamics is vortex-induced vibrations (VIV), which occur when structures such as cylinders experience oscillations due to vortex formation and shedding in fluid flow. Recently, machine learning (ML) techniques have been applied to VIV, enhancing our understanding, prediction, and control of these vibrations by providing more efficient and precise fluid flow predictions with fewer resources. This research explores the implementation of machine learning techniques on datasets produced from simulation work to predict fluid dynamics. The advent of big data has underscored the need for processing large volumes of data, and advancements in data analysis have facilitated easier storage, compression, and analysis of such data. Machine learning accelerates this process by offering quick and accurate data analysis, with techniques generally categorized into supervised, unsupervised, and semi-supervised learning. Supervised learning involves training a model on labeled data, where each input has a corresponding output, whereas unsupervised learning works without predefined labels, categorizing data based on inherent attributes. Semi-supervised learning combines these approaches, utilizing both labeled and unlabeled data. The focus of this research is on two supervised learning techniques, classification and regression, to predict the Reynolds number (Re) from input data. Classification involves training models, such as neural networks and support vector machines, to categorize inputs into predefined classes. 
Clustering, an unsupervised learning technique, is applied to segment data into clusters based on their values, enhancing the analysis and understanding of fluid dynamics. Figure 1 shows all the ML techniques with their major categorization.
LITERATURE REVIEW
VIV phenomena are prevalent in diverse fields, such as civil engineering, marine applications, and pipeline systems. Analyzing these complex fluid dynamics has traditionally been challenging due to the intricate nature of structures involved.1–5 Historically, combining theoretical and experimental methods has been used to model fluid dynamics, leading to semi-theoretical and semi-experimental models. However, these approaches often demand substantial computational resources and time, alongside generating large datasets.6,7 To address the inefficiencies of traditional methods, machine learning algorithms have been introduced to VIV studies, effectively managing large datasets and improving prediction accuracy.8–11
The integration of deep neural networks (DNNs) with embedded invariance into turbulence modeling in computational fluid dynamics (CFD) has shown superior performance compared to traditional Reynolds-averaged Navier–Stokes (RANS) models, enhancing accuracy in capturing complex flow dynamics.12 Machine learning models based on the minimum description length (MDL) principle demonstrate effective turbulence characterization across various Reynolds numbers, offering improved robustness and generalization over traditional models.14 A comprehensive survey highlights the transformative role of machine learning in CFD, addressing challenges such as data availability and model interpretability while suggesting future research directions and emerging trends.15 Novel deep learning techniques are applied to quantify turbulence predictability in fluid dynamics, showing the ability to accurately predict turbulent flow behavior and identify regions with higher predictability.16 Deep learning techniques effectively model turbulent flow separation over airfoils, capturing complex interactions between airfoil geometry and flow conditions to enhance aerodynamic performance predictions.17 Deep learning methods have been applied to predict VIV in circular cylinders, demonstrating accuracy in capturing the dynamics of vortex shedding and cylinder oscillations across various flow conditions and geometries.18 Machine learning algorithms, including regression and neural networks, have been utilized to predict VIV in circular cylinders, achieving reliable predictions by learning complex fluid–structure interactions.19–21
Artificial neural networks (ANNs) are computational systems modeled after the neural networks in the human brain. They excel at identifying and learning from intricate, nonlinear patterns in data. However, optimizing their hyperparameters can be quite demanding and typically necessitates advanced algorithms. ANNs find significant applications in areas such as predicting flow, modeling turbulence, controlling flow, and creating reduced-order models.22
Support vector machines (SVM) are a supervised learning technique used for classifying data in high-dimensional spaces by identifying the optimal hyperplane that separates different classes. They are particularly effective for binary classification tasks. However, the computational cost can be quite high when dealing with large datasets. SVMs are commonly applied in flow classification and anomaly detection.23
Decision trees are a supervised learning algorithm that partitions data based on if-else rules. This approach allows for easy interpretation and visualization of data. However, decision trees can suffer from overfitting, especially when applied to large datasets. Despite this limitation, they are effectively used in turbulence modeling and flow classification.24
Random forests are an ensemble method that enhances accuracy by combining multiple decision trees. This technique is robust and less prone to overfitting compared to individual decision trees. However, computational cost can be significant when processing large datasets. Random forests are widely used in turbulence modeling, flow classification, and uncertainty quantification.13
Gaussian processes are non-parametric, probabilistic models used to represent complex functions while providing uncertainty estimates for predictions. Although they offer valuable insights into prediction uncertainties, their computational cost can become high with large datasets. Gaussian processes are particularly useful in surrogate modeling, uncertainty quantification, and flow prediction.25
The k-nearest neighbors (k-NN) algorithm is a straightforward classification method that assigns data points to a class based on the majority class among its k closest neighbors. It is easy to understand and has relatively low computational costs. However, its accuracy can be affected by outliers, making it crucial to have clean data. In addition, k-NN is commonly used in flow classification and pattern recognition.26
Convolutional neural networks (CNNs) are machine learning algorithms that excel at processing image-based data and are also effective for statistical data analysis. They are particularly adept at capturing spatial patterns and features. However, they require large amounts of training data to achieve optimal results. CNNs are widely used in image-based flow analysis and modeling turbulence from velocity fields.27
Recurrent neural networks (RNNs) are a type of neural network designed to process sequential and time-dependent data effectively. They excel at modeling temporal dependencies in dynamic systems. However, they can suffer from the issue of vanishing gradients. RNNs are frequently applied in time series prediction, flow simulation, and turbulence modeling in unsteady flows.28
Long Short-Term Memory (LSTM) is a specialized type of recurrent neural network (RNN) that effectively mitigates the vanishing gradient problem, making it suitable for handling long-range dependencies in sequential data. It is particularly adept at capturing and monitoring long-term dependencies in time series data. Despite being complex and computationally expensive, LSTMs are extensively utilized in tasks such as time series prediction, simulating unsteady flows, and modeling turbulence in dynamic fluid systems.29
RESEARCH FLOW
Machine learning is broadly classified into two categories: supervised and unsupervised learning. In supervised learning, labels are provided with the input data and the trained model returns labels as output, whereas in unsupervised learning, the model divides the data into categories on its own. Supervised learning techniques are further divided into classification techniques and regression techniques.
In classification techniques, the model outputs a label, whereas in regression techniques, the output is a number. In this research, different supervised learning techniques are tested on a particular dataset to determine which technique gives the best results. Neural networks are tested as the classification method, whereas support vector machines (SVMs), decision trees, ensemble methods (bagging and boosting), and Gaussian process regression (GPR) are tested as regression methods on the same dataset. Figure 2 shows the categorization of all the machine learning techniques tested in this research.
NEURAL NETWORKS
The process begins with the dataset, which is meticulously recorded in an Excel file containing all the data generated through simulation. Initially, the data undergo a preprocessing phase that involves cleaning and feature extraction to ensure that only relevant information is processed by machine learning techniques.
For parameter estimation in this nonlinear problem, the Levenberg–Marquardt algorithm is employed, which is an optimization method specifically designed for solving nonlinear least squares problems. This algorithm is particularly favored in numerical optimization and curve fitting due to its robustness and efficiency. The primary goal of the Levenberg–Marquardt algorithm is to determine the parameters of a nonlinear model that minimize the sum of squared differences between the observed data and the model's predicted values, an objective known as the least squares function. This iterative approach combines elements of both the steepest descent and Gauss–Newton methods, utilizing a damping parameter (also referred to as the regularization parameter) to balance between these directions in each iteration, thus ensuring effective and stable convergence.
The algorithm follows these general steps.
1. Initialization: begin with an initial guess for the model parameters.
2. Jacobian calculation: compute the Jacobian matrix, which contains partial derivatives of the model with respect to each parameter.
3. Residual evaluation: calculate the residuals by finding the difference between the observed data and the model’s predictions.
4. Gradient and Hessian calculation: derive the gradient and the Hessian matrix using the Jacobian and residuals.
5. Damping parameter adjustment: modify the damping parameter to determine the step size for the subsequent iteration.
6. Parameter update: adjust the model parameters using the damping-adjusted step.
7. Iteration: repeat steps 2 to 6 until convergence criteria are met.
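The steps listed for the Levenberg–Marquardt algorithm can be illustrated with a minimal sketch. It is not the implementation used in this study: the two-parameter model y = a·exp(b·x), the synthetic data, the starting values, and the factor-of-ten damping schedule are all assumptions chosen for illustration. Each iteration solves (JᵀJ + λI)δ = Jᵀr for the step δ, where J is the Jacobian, r the residual vector, and λ the damping parameter.

```python
import math

def model(x, a, b):
    """Hypothetical two-parameter model, used only for illustration."""
    return a * math.exp(b * x)

def sum_sq(xs, ys, a, b):
    """Least squares objective: sum of squared residuals."""
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))

def levenberg_marquardt(xs, ys, a, b, lam=1e-2, max_iter=200, tol=1e-12):
    cost = sum_sq(xs, ys, a, b)
    for _ in range(max_iter):
        # Accumulate J^T J (2x2) and J^T r (2x1), with r = y - model
        h11 = h12 = h22 = g1 = g2 = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            ja, jb = e, a * x * e          # d(model)/da and d(model)/db
            r = y - a * e
            h11 += ja * ja
            h12 += ja * jb
            h22 += jb * jb
            g1 += ja * r
            g2 += jb * r
        # Solve (J^T J + lam * I) delta = J^T r for the damped step
        m11, m22 = h11 + lam, h22 + lam
        det = m11 * m22 - h12 * h12
        da = (m22 * g1 - h12 * g2) / det
        db = (m11 * g2 - h12 * g1) / det
        new_cost = sum_sq(xs, ys, a + da, b + db)
        if new_cost < cost:
            # Accept the step and reduce damping (closer to Gauss-Newton)
            a, b = a + da, b + db
            improvement = cost - new_cost
            cost = new_cost
            lam *= 0.1
            if improvement < tol:
                break
        else:
            # Reject the step and increase damping (closer to steepest descent)
            lam *= 10.0
    return a, b, cost

# Synthetic, noise-free data generated with a = 2.0 and b = 0.5 (assumed values)
xs = [0.25 * i for i in range(9)]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a_fit, b_fit, final_cost = levenberg_marquardt(xs, ys, a=1.0, b=0.0)
```

Rejected steps raise the damping parameter and accepted steps lower it, which is how the method moves between steepest-descent-like and Gauss–Newton-like behavior.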
After preprocessing, the data are divided into three subsets: training, testing, and validation. A machine learning model is then constructed with the initial parameter and hyperparameter settings and trained on the training data. The model is subsequently validated using a specified percentage of validation data. Based on the validation results, the model undergoes further training with parameter adjustments to enhance performance, a process known as model training and validation. The refined model is then evaluated using the testing data. Once validated, the model is ready for deployment and can be used to make predictions on real-world data.
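The division into training, validation, and testing subsets can be sketched as follows. The function name, the fixed random seed, and the default 70/15/15 proportions (taken from the first scheme reported in the results) are assumptions for illustration; the study's own partitioning routine is not described in the text.

```python
import random

def split_dataset(samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle the samples and cut them into train/validation/test subsets."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = round(train_frac * n)
    n_val = round(val_frac * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to testing
    return train, val, test

# 48 sample indices, matching the number of samples per class in the dataset
train_set, val_set, test_set = split_dataset(range(48))
```

The other schemes examined in the study (60:40, 50:50, 40:60, and 30:70) would correspond to different values of `train_frac` and `val_frac`.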
Delving deeper into the machine learning model, the proposed neural network consists of three fundamental layers. The initial layer, known as the input layer, receives data in the form of features, denoted as x1, x2, x3, …, xn. These input features are then fed into the hidden layer, carrying their initially assigned weights. The hidden layer’s role is to process the inputs using artificial neurons, which compute outputs through an activation function. Each neuron is assigned a bias value b, and its output is computed by applying the activation function f to the weighted sum of its inputs, i.e., f(wx + b); the weights w and biases b are updated during training.
The resulting values are subsequently transferred to the final layer, known as the output layer, which generates the model’s predictions. Figure 3 shows the whole process in a flow chart.
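A minimal sketch of the forward pass through such a three-layer network is given here. The tanh activation, the linear output neuron, and all numerical weights and biases are assumptions made for illustration; they are not the trained values from this study.

```python
import math

def neuron(inputs, weights, bias, activation=math.tanh):
    """Single artificial neuron: activation applied to the weighted sum plus bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def forward(inputs, hidden_layer, output_weights, output_bias):
    """Input layer -> hidden layer -> output layer (linear output for regression)."""
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    return neuron(hidden, output_weights, output_bias, activation=lambda z: z)

# Two input features, two hidden neurons given as (weights, bias), one output
hidden_layer = [([0.5, -0.2], 0.1), ([0.3, 0.8], -0.4)]
y = forward([1.0, 2.0], hidden_layer, [1.0, -1.0], 0.0)
```

During training, an optimizer such as Levenberg–Marquardt adjusts the weights and biases so that the output of `forward` matches the target values.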
SUPPORT VECTOR MACHINE (SVM)
The support vector machine (SVM) algorithm was applied to address a multiclass classification problem. A dataset comprising N samples and D features was used, with each sample assigned to one of K unique classes. Prior to implementing the support vector machine, one-hot encoding was applied to transform the class labels into a binary format. This encoding method converts class labels into binary vectors of length K, where each vector has a single element set to 1 to denote the class, and all other elements set to 0.
For each class k (where k ranges from 1 to K), a binary classifier is trained using SVM. The purpose of this classifier is to distinguish class k from the other classes. The binary label for class k is denoted as yk, a vector of length N, where yk(i) = 1 if sample i belongs to class k, and yk(i) = 0 otherwise, consistent with the one-hot encoding above.
The training dataset (x, yk) for each binary classifier is then processed by using the SVM, solving the optimization problem for class k. This approach allows the SVM to effectively learn decision boundaries that separate each class from the others in the feature space. Figure 4 shows this whole process in a flow chart.
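The construction of the per-class binary targets can be sketched as follows. The function names and the example class names are hypothetical, and the sketch covers only the label encoding, not the SVM optimization itself. Many SVM formulations represent the negative class as −1 rather than 0; the 1/0 convention here simply follows the one-hot encoding described in the text.

```python
def one_hot(labels, classes):
    """Encode each label as a binary vector of length K = len(classes)."""
    return [[1 if label == c else 0 for c in classes] for label in labels]

def one_vs_rest_targets(labels, classes):
    """Column k of the one-hot matrix is the binary target y_k for classifier k."""
    encoded = one_hot(labels, classes)
    return {c: [row[k] for row in encoded] for k, c in enumerate(classes)}

# Hypothetical class names for four samples
labels = ["low", "high", "mid", "low"]
classes = ["low", "mid", "high"]
targets = one_vs_rest_targets(labels, classes)
```

Each entry of `targets` would then be paired with the feature matrix to train one binary classifier, giving K classifiers in total.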
DECISION TREES
Decision trees are another technique utilized in this research. They require consideration of several key elements:
Decision Nodes: these nodes represent feature tests or decisions. For a decision node n, the splitting feature is denoted as Fn and the decision threshold as Tn. The decision node compares the value of that feature in the input data against the threshold to determine the next path to follow in the tree.
Leaf Nodes: these nodes indicate the final prediction or outcome. For classification tasks, each leaf node corresponds to a specific class label, whereas for regression tasks, each leaf node represents a numerical value.
Splitting Criteria: the algorithm employs a metric to select the optimal feature and threshold for splitting the data at each decision node. In classification tasks, metrics such as Gini impurity or entropy might be used, while for regression tasks, mean squared error is commonly employed.
The mathematical representation of these concepts is as follows.
For decision node n:
Splitting feature: Fn.
Threshold: Tn.
Left child: a subtree where the condition Fn ≤ Tn holds.
Right child: a subtree where the condition Fn > Tn holds.
For leaf node n:
As decision trees are used for regression tasks in this study, the leaf nodes represent an output variable.
By considering these elements, decision trees can effectively model complex relationships within the data, providing valuable insights and predictions for regression tasks. Figure 5 shows the whole process in the flow chart.
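The choice of threshold Tn at a decision node can be sketched for a single feature. The sketch assumes mean squared error as the splitting criterion, as described above for regression tasks; the function names and the four-point dataset are hypothetical.

```python
def mse(values):
    """Mean squared error of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, weighted MSE) of the best split x <= T for one feature."""
    best = (None, float("inf"))
    candidates = sorted(set(xs))
    # Candidate thresholds lie midway between consecutive distinct feature values
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]    # left child: F <= T
        right = [y for x, y in zip(xs, ys) if x > t]    # right child: F > T
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.1, 0.9, 0.9]
threshold, score = best_split(xs, ys)
```

A full tree applies the same search recursively to each child, and each leaf stores the mean of the output values that reach it.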
Ensemble methods
Ensemble methods are a powerful technique in machine learning that combine the predictions of multiple models to improve the overall performance and robustness of the system. By leveraging the strengths of different models, ensemble methods can often achieve higher accuracy and generalization compared to individual models. There are two general types of ensemble methods used in this research.
Bagging
Bagging is short for bootstrap aggregating; it is an ensemble learning technique that enhances the stability and accuracy of machine learning models by reducing variance and preventing overfitting. For this particular problem, the bagging technique is applied to the same dataset. It involves creating multiple subsets of the training data through bootstrapping, a method of random sampling with replacement. Each subset is used to train a separate model, typically a strong and complex learner such as a decision tree. These models are then aggregated to form an ensemble.
As discussed above, the dataset used in this research is of the numeric type, so this is the case of regression tasks, and the final prediction is obtained by averaging the predictions of all the models. By training multiple models on different subsets of data, bagging mitigates the impact of outliers and noise, leading to improved generalization. The whole process is shown in Fig. 6.
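A minimal sketch of bagging for regression follows. For brevity, the base learner is a one-split regression tree rather than the deeper trees typically used; the function names, the number of models, and the toy data are assumptions for illustration.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree on one feature; returns a predict function."""
    best = None
    pts = sorted(set(xs))
    for lo, hi in zip(pts, pts[1:]):
        t = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                       # all x identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train one base learner per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Regression: the ensemble prediction is the average over all models."""
    return sum(m(x) for m in models) / len(models)

xs = [float(i) for i in range(10)]
ys = [0.0] * 5 + [1.0] * 5
models = bagging_fit(xs, ys)
low, high = bagging_predict(models, 0.0), bagging_predict(models, 9.0)
```

Because each bootstrap sample omits some points and repeats others, the individual models differ, and averaging them smooths out their individual errors.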
Boosting
Boosting is an ensemble learning technique designed to enhance the accuracy of predictive models by sequentially combining multiple weak learners to form a strong learner. Unlike bagging, which focuses on reducing variance by training models independently, boosting emphasizes reducing bias by training models sequentially. Boosting is also applied to the same dataset in this study. The process begins by training a weak learner on the entire dataset and evaluating its performance. In subsequent iterations, the algorithm increases the weights of the misclassified samples, making them more influential in the training of the next model. Each subsequent model is trained to correct the errors made by its predecessor, effectively learning from the mistakes. This iterative process continues until a predefined number of weak learners are trained or the model performance stabilizes. The final prediction is a weighted average or vote of the predictions from all the weak learners, resulting in a model with improved accuracy and robustness. Figure 7 shows the whole process in the form of a flow chart.
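The sequential error-correcting idea can be sketched for regression. Note that this sketch uses least-squares boosting, in which each weak learner is fitted to the residuals of the current ensemble, rather than the sample-reweighting scheme described above; it is a related variant chosen because it is the simplest to show for numeric targets. The function names, learning rate, number of rounds, and toy data are assumptions.

```python
def fit_stump(xs, ys):
    """Weak learner: a one-split regression tree on one feature."""
    best = None
    pts = sorted(set(xs))
    for lo, hi in zip(pts, pts[1:]):
        t = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                       # all x identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def boost_fit(xs, ys, n_rounds=50, learning_rate=0.5):
    """Fit each new stump to the residuals left by the current ensemble."""
    base = sum(ys) / len(ys)               # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def boost_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(s(x) for s in stumps)

xs = [float(i) for i in range(8)]
ys = [x * x for x in xs]
model = boost_fit(xs, ys)
mean_y = sum(ys) / len(ys)
mse_before = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_after = sum((y - boost_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Comparing `mse_after` with `mse_before` shows how much the sequence of weak learners reduces the training error relative to predicting the mean alone.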
Gaussian process regression (GPR)
Gaussian process regression (GPR) is a non-parametric, Bayesian approach to regression that is particularly well-suited for modeling complex, non-linear relationships. Unlike traditional regression methods that assume a fixed functional form, GPR treats the regression problem as a probabilistic inference problem, where the goal is to infer a distribution over functions that are consistent with the observed data.
The model used for experimentation is defined by a mean function, often set to zero, and a covariance function (or kernel), which encodes assumptions about the smoothness and amplitude of the target function. The kernel determines the correlation structure between data points, influencing the shape and smoothness of the inferred functions. Four kernels are used, namely, exponential, squared exponential [also known as the radial basis function (RBF)], rational quadratic, and Matern 5/2; as shown in Table III, the Matern 5/2 kernel gives the best results.
GPR provides not only a predictive mean but also a measure of uncertainty around each prediction, making it particularly useful in applications where understanding prediction confidence is crucial, such as in active learning and Bayesian optimization.
Despite its strengths, GPR has computational limitations, scaling cubically with the number of data points, which makes it challenging to apply directly to large datasets. Careful selection and tuning of the kernel and its hyperparameters are essential to maximize the model’s performance. Figure 8 shows the process of predicting values through the GPR method.
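A minimal sketch of GPR prediction with a Matern 5/2 kernel and a zero mean function is given here. The length scale, signal variance, noise level, and training data are assumed values for illustration, and the hyperparameters are fixed rather than tuned; the sketch is not the implementation used in this study.

```python
import math

def matern52(x1, x2, length_scale=1.0, sigma_f=1.0):
    """Matern 5/2 covariance between two scalar inputs."""
    s = math.sqrt(5.0) * abs(x1 - x2) / length_scale
    return sigma_f ** 2 * (1.0 + s + s * s / 3.0) * math.exp(-s)

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def gpr_predict(x_train, y_train, x_new, noise=1e-6):
    """Posterior mean and variance at x_new under a zero-mean GP prior."""
    n = len(x_train)
    # Covariance of the training inputs, with noise added on the diagonal
    k = [[matern52(x_train[i], x_train[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [matern52(x_new, xi) for xi in x_train]
    alpha = solve(k, y_train)
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    v = solve(k, k_star)
    var = matern52(x_new, x_new) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, max(var, 0.0)

x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [0.0, 0.8, 0.9, 0.1]
mean_near, var_near = gpr_predict(x_train, y_train, 1.0)    # at a training point
mean_far, var_far = gpr_predict(x_train, y_train, 100.0)    # far from all data
```

The predictive variance is small near the training points and returns to the prior variance far from them, which is the uncertainty estimate referred to above. The two linear solves are the source of the cubic scaling with the number of data points.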
EXPERIMENTATION
System specifications
The experimentation was conducted using an HP Pavilion gaming laptop equipped with an AMD Ryzen 7 4800H processor and Radeon graphics system. The system operated on a 64-bit Windows 10 operating system with 256 GB of primary memory.
Dataset
The dataset employed in this study was generated via simulation spanning Reynolds numbers ranging from 70 to 150. The simulation results were validated against experimental findings and subsequently detailed in two published research articles.30,31 The dataset is numerical and contains 17 classes (labels) with 48 samples in each class, i.e., 17 labels for each of the 48 selected Reynolds numbers; the size of the dataset is therefore 17 × 48 entries. The details of the dataset labels are presented in Table I.
No. of classes | 17 |
---|---|
Labels | Reynolds number, natural frequency, Prandtl number, density, reduced velocity, thermal conductivity, Strouhal number (oscillating), oscillating shedding frequency, frequency ratio, RMS drag coefficient, RMS lift coefficient, RMS in-line oscillation, RMS transverse oscillation, maximum transverse oscillation, maximum Nusselt number, mean Nusselt number, RMS Nusselt number |
Total no. of samples | 48 samples of each class |
Reynolds number range | 70 < Re < 150 |
Performance evaluation
The performance of all the experiments undertaken is evaluated using four different parameters.
Accuracy
MSE
RMSE
R2
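The error metrics reported in the results can be computed as in the following sketch. The function name and the four-point example are hypothetical; the formulas are the standard definitions of MSE, RMSE, MAE, and R2 (coefficient of determination).

```python
import math

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE, and R2 for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                     # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total sum of squares
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "R2": 1.0 - ss_res / ss_tot,
    }

metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

By these definitions, RMSE is always the square root of MSE, and R2 equals 1 for a perfect fit, can be negative for a model that fits worse than the mean of the data, and cannot exceed 1.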
RESULTS AND DISCUSSION
Classification
To perform classification, neural networks were used, trained with the Levenberg–Marquardt algorithm. Neural networks work on the same principle as the human brain: just as a brain responds better with more information, a neural network generally performs better with more training data. To check this behavior, the dataset was divided according to five different training–testing schemes. In the first scheme, 70% of the data was used for training and the remaining 30% was split equally between testing (15%) and validation (15%). The results were evaluated using five parameters (R, R-squared, RMSE, MSE, and MAE); broadly, the errors are lower and the accuracy is higher when more data are used for training. For the 70:30 scheme, the value of R is 0.99 and the mean absolute error is 0.0028. When the data were divided 50:50, R changed little, but the mean absolute error increased to 0.0056, which indicates a higher chance of erroneous predictions. When the data were divided 30:70, so that only 30% of the data was used for training, the model did not perform well: R dropped to 0.65, whereas the mean squared error increased to 0.36. Table II presents the values of all parameters at the different training–testing ratios.
Supervised learning, classification: neural networks (Levenberg–Marquardt)
Parameter | 70:30 | 60:40 | 50:50 | 40:60 | 30:70 |
---|---|---|---|---|---|
R | 0.994 86 | 0.982 11 | 0.998 53 | 0.957 38 | 0.654 7 |
R-squared | 0.989 74 | 0.964 53 | 0.997 07 | 0.916 58 | 0.428 63 |
RMSE | 0.016 68 | 0.044 35 | 0.014 36 | 0.115 05 | 0.604 57 |
MSE | 0.000 28 | 0.001 97 | 0.000 21 | 0.013 24 | 0.365 5 |
MAE | 0.002 8 | 0.015 | 0.005 6 | 0.020 1 | 0.023 8 |
Figure 9 compares the actual and predicted values, with the Reynolds number as the input parameter and ymax as the output parameter. Various training and testing ratios were employed across the experiments. In these graphs, the dotted red lines represent the predictions made by our artificial neural network (ANN) model, while the solid blue lines depict the actual observed values.
While increasing the amount of training data can improve validation accuracy, test accuracy may decrease if the model is overfed with data beyond its capacity. Figure 10 shows these results, where the diagonal line represents the actual output values, and the small circles indicate predictions made by using the ANN model. With a training–testing ratio of 50:50, the model achieves optimal testing accuracy of 0.998. However, when the training data are reduced to 30%, model accuracy decreases to 0.94. Conversely, increasing the training data from 50% to 70% leads to overfitting, causing the testing accuracy to decline from 0.998 to 0.994. These findings suggest that having a large dataset is not always necessary to achieve optimal results, and excellent performance can be obtained with a smaller amount of data.
Regression
Table III presents a comparison of four supervised learning regression techniques, applied to identify the most effective method for the specific problem. The input variables, Reynolds number (Re) and frequency ratio, were consistent across all experiments, while the output variable varied. Initially, ymax served as the output variable, analyzed using three types of decision trees: fine, medium, and coarse. Performance was evaluated using RMSE, R2, MSE, and MAE. The results indicate that the fine decision tree was the most suitable, achieving an R2 value of 91%.
Supervised learning, regression
Technique | Algorithm | RMSE | R-squared | MSE | MAE |
---|---|---|---|---|---|
Decision trees | Fine | 0.066 2 | 0.91 | 0.004 39 | 0.034 4 |
Decision trees | Medium | 0.156 | 0.5 | 0.024 4 | 0.118 |
Decision trees | Coarse | 0.221 | 0 | 0.048 7 | 0.203 |
Support vector machines | Linear | 0.238 | 1.16 | 0.056 5 | 0.127 |
Support vector machines | Quadratic | 0.074 7 | 0.89 | 0.005 58 | 0.054 0 |
Support vector machines | Cubic | 0.096 7 | 0.81 | 0.009 35 | 0.047 4 |
Support vector machines | Fine Gaussian | 0.051 6 | 0.95 | 0.002 66 | 0.040 8 |
Support vector machines | Medium Gaussian | 0.075 2 | 0.88 | 0.005 66 | 0.048 6 |
Support vector machines | Coarse Gaussian | 0.199 | 0.19 | 0.0397 | 0.124 |
GPR | Rational quadratic | 0.019 5 | 0.99 | 0.000 381 | 0.008 48 |
GPR | Squared exponential | 0.020 3 | 0.99 | 0.000 410 | 0.008 69 |
GPR | Matern 5/2 | 0.017 9 | 0.99 | 0.000 319 | 0.007 51 |
GPR | Exponential | 0.018 8 | 0.99 | 0.000 353 | 0.007 039 |
Ensemble methods | Bagging | 0.092 7 | 0.82 | 0.0086 | 0.065 0 |
Ensemble methods | Boosting | 0.088 6 | 0.84 | 0.007 86 | 0.048 8 |
The same input and output parameters were tested using six SVM techniques: linear, quadratic, cubic, fine Gaussian, medium Gaussian, and coarse Gaussian. Among these, the fine Gaussian SVM yielded the best results with an R2 value of 95%.
The experiments were then repeated using four Gaussian process regression (GPR) techniques: rational quadratic, squared exponential, Matern 5/2, and exponential. Again, performance was measured using RMSE, R2, MSE, and MAE. The Matern 5/2 technique demonstrated superior performance with an R2 value of 99%. At the end, ensemble methods were evaluated using both bagging and boosting techniques, with the same input and output parameters and evaluation metrics. The results revealed that boosting trees were the most effective ensemble method for regression analysis on this dataset, achieving an R2 value of 84%.
The response graphs from these experiments are shown in Fig. 11. These graphs, also known as regression plots or fitted line plots, illustrate the relationship between independent and dependent variables, showing how predicted values change with variations in the independent variable.
The comparison of actual and predicted values in Fig. 11 shows that, while the fine decision tree follows a pattern similar to the actual values, a consistent deviation is observed, which lowers the accuracy of its predictions. The best predictions are obtained with the Matern 5/2 algorithm of the Gaussian process regression model, which is highly accurate, especially in the lock-in or resonance (high oscillation amplitude) flow regime. The fine Gaussian support vector machine algorithm and the boosted ensemble method also produced satisfactory results, but not as accurate as the Matern 5/2 GPR.
Figure 12 shows the relationship between actual and predicted values, where the straight line represents the true or actual response, and the blue dots denote the model’s predictions. The Gaussian process regression (GPR) technique demonstrates superior performance compared to the other methods tested. The figure shows that most of the predicted values, indicated by the blue dots, align closely with the diagonal line of actual responses, highlighting the accuracy of the GPR model.
This figure shows that, for the Matern 5/2 kernel, most of the predicted residuals (shown as orange dots) are clustered close to the horizontal line at zero, indicating that the model’s predictions are closely aligned with the actual values and the residuals are minimal. This suggests that the Matern 5/2 kernel provides the best fit among the tested models, with residuals being well-distributed around zero.
CONCLUSION
The future trajectory is increasingly oriented toward artificial intelligence and machine learning, with numerous applications spanning various domains. In the field of fluid dynamics, which is rapidly evolving and has significant implications for engineering and other fields, the necessity for processing extensive datasets derived from experiments, field studies, and simulations is paramount. Traditional methods for data collection and analysis are often characterized by their time-consuming and resource-intensive nature. To address these challenges and explore future directions, this research evaluates and compares various machine learning techniques to determine their efficacy for numeric data generated through simulations and experimental work.
The results show that even a training–testing ratio of 30:70 may provide sufficiently credible predictions, although a ratio of 50:50 gives the most credible predictions and is hence a sufficiently high training–testing ratio.
Higher training–testing ratios tended to show overfitting tendencies due to the small dataset size used. It is, therefore, beneficial to use smaller training–testing ratios for smaller datasets.
Fine decision trees were observed to generate the best predictions among all the decision trees tested.
From among the six SVM algorithms tested, fine Gaussian SVM provided the best predictions.
Ensemble methods generally produced the lowest quality results, but of the two types checked, the boosted ensemble method generated the better results.
The Matern 5/2 GPR algorithm showed the best overall results among all the models and algorithms tested.
The GPR techniques generally provided the best predictions among all the different techniques tested.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
A. Ijaz: Conceptualization (equal); Data curation (lead); Formal analysis (lead); Investigation (lead); Methodology (lead); Resources (equal); Software (lead); Validation (lead); Visualization (lead); Writing – original draft (lead); Writing – review & editing (equal). S. Manzoor: Conceptualization (equal); Funding acquisition (lead); Project administration (lead); Resources (equal); Supervision (lead); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.