Background: Type 2 diabetes mellitus (T2DM) is the most common type of diabetes, accounting for around 90% of all diabetes cases. Studies have found that dietary habits and biochemical metabolic changes are closely related to T2DM surveillance, but existing early screening tools lack specificity and accuracy. This paper aims to provide a reliable, high-accuracy artificial intelligence model for the clinical diagnosis of T2DM. Methods: A cross-sectional dataset comprising 8981 individuals from the First Affiliated Hospital of Guangxi Medical University was analyzed with a model fusion framework. The framework combines four machine learning (ML) models using the stacking method. By leveraging the strengths of different algorithms to capture complex patterns in the data, the framework can effectively combine questionnaire data and blood test data to predict diabetes. Results: The experimental results show that the stacking model achieves significant prediction results in diabetes detection. Compared with single machine learning algorithms, the stacking model improves accuracy, recall, and F1-score. The test set accuracy is 0.90, and the precision, recall, F1-score, area under the curve, and average precision (AP) are 0.91, 0.90, 0.90, 0.90, and 0.85, respectively. Additionally, this study showed that HbA1c (P < 0.001, OR = 2.203), fasting blood glucose (FBG) (P < 0.001, OR = 1.586), P2hPG (P < 0.001, OR = 1.190), age (P < 0.001, OR = 1.018), Han nationality (P < 0.001, OR = 1.484), and carbonated beverages (P = 0.001, OR = 1.347) were important predictors of T2DM. Conclusion: This study demonstrates that stacking models show great potential in diabetes detection. By integrating multiple machine learning algorithms, stacking models can significantly improve the accuracy and stability of diabetes prediction and provide strong support for disease prevention, early diagnosis, and individualized treatment.
I. INTRODUCTION
About 422 million people worldwide have diabetes, the majority living in low- and middle-income countries, and 1.5 million deaths are directly attributed to diabetes each year. Both the number of cases and the prevalence of diabetes have been steadily increasing over the past few decades.1 According to International Diabetes Federation statistics, China has become the country with the largest number of adults aged 20–79 years with diabetes. This huge diabetic population places a great burden on modern healthcare systems and a great economic burden on the families of patients with diabetes. Diabetes accounts for more than $966 billion in health spending, and this amount is still increasing.2
Type 2 diabetes is the most common type, a chronic metabolic disease characterized by insulin resistance and elevated blood glucose levels.3 It not only imposes health burdens on the patient, such as a significantly increased risk of complications including cardiovascular disease, neuropathy, and retinopathy, but also exerts substantial socioeconomic pressure through rising healthcare costs, a reduced workforce, and an increased burden on families and social support systems. However, the symptoms of the disease are not obvious at onset, making it difficult to detect and confirm the diagnosis.4 Prevention and detection are essential means of coping with diabetes. Therefore, establishing effective prediction models to assess an individual's risk of diabetes is essential.
Machine learning (ML) is widely used in various fields. Shaukat et al.5,6 explored the potential of ML techniques to improve the detection of unknown and polymorphic security attacks, and Alam et al.7,8 proposed a new framework to identify prognostic factors for malignant mesothelioma through association rule mining. (All abbreviations used in the paper are listed in the Nomenclature.) The value of machine learning and deep learning (DL) algorithms in healthcare has been demonstrated.9 Many scholars now use ML for disease detection: Kumar et al.10 applied ML and DL models to a speech recognition task to detect dementia; Abdullah et al.11 demonstrated the potential of deep convolutional neural networks (CNNs) and sequential CNNs for disease detection and classification; Srinivas et al.12 proposed three transfer-learning-based CNN models to localize brain tumors; and Alsubaie et al.13 proposed a novel CNN architecture called ConvADD for detecting Alzheimer's disease. Shaukat et al.14–16 applied machine learning and deep learning in the field of network security, where progress over the past decade has been remarkable: the scope of ML applications has expanded from static analysis to dynamic analysis and then to DL-based malware detection. Combining DL and ML eliminates the need for intensive feature engineering and domain knowledge; such a fusion reached 99.06% accuracy in malware detection and can combine the advantages of both approaches to improve model performance and robustness.
This work consists of the following contributions:
Propose a new stacking machine learning framework to analyze the 8981-case cross-sectional dataset from the First Affiliated Hospital of Guangxi Medical University. Data problems are solved by feature engineering and data preprocessing methods, hyperparameters are optimized by learning curves and grid search, and model performance is evaluated using cross-validation and medical statistical methods.
The proposed machine learning framework in this work is superior to any separately constructed machine learning methods and ensemble models. The effect of data preprocessing on the model was also examined using the PIMA database to demonstrate the stability of the model. Statistical analysis was performed to show that our proposed stacking model is capable of better detection efficiency. The reliability of the model was demonstrated using feature importance visualization, and the potential value of other features for diabetes diagnosis was explored.
II. RELATED WORKS
The application of machine learning (ML) technology in medical diagnostic systems has matured, and the technology has been proven accurate in diagnosis, successful in treatment, and cost effective.17 In this research, we conducted a thorough search of the PubMed and Web of Science databases using the following search terms to identify recent and relevant studies: type 2 diabetes mellitus (T2DM), type 2 diabetes, diabetes mellitus, machine learning, stacking model, fusion model, and ensemble model. Nineteen relevant studies were adopted. Many scholars have performed research using the Pima Indians dataset to improve the accuracy of models in clinical prediction. Joshi et al.18 achieved 78.26% accuracy on the Pima Indians dataset using logistic regression (LR) and decision trees (DT). Chang et al.19 used random forest (RF) to improve accuracy to 79.57%. Furthermore, Adua et al.20 recruited 219 patients with type 2 diabetes mellitus (T2DM) and 219 healthy individuals and used four ML classification algorithms, namely, Naïve Bayes (NB), k-nearest neighbor (KNN), support vector machine (SVM), and DT, to predict T2DM; the NB classifier yielded 94% accuracy.
Traditional ML methods can obtain satisfactory results on small datasets. However, in terms of accuracy, the development of traditional ML models is near saturation, so many scholars use traditional ML algorithms only as baselines for new models and focus their attention on ensemble classifiers and neural network models. On the Pima Indians dataset, Khanam and Foo used a neural network with two hidden layers to increase accuracy to 88.6%.21 Edeh et al.22 presented a supportive diagnostic system based on the comparison of four predictive algorithms for predicting diabetes in two different databases (the Frankfurt Hospital diabetes dataset and the Pima Indians dataset); based on several performance measures, such as accuracy, recall, and F1 score, the authors concluded that RF provides more accurate predictions and higher performance than the other models. Xie et al.23 constructed an ML prediction model using diabetes data from 138 146 participants, and the experimental results showed that the neural network model had the best performance, with an area under the curve (AUC) of 0.7949 and an accuracy of 82.4%. A single machine learning model may face challenges such as overfitting, underfitting, and a lack of generalization ability, which can result in good performance on the training data but inadequate performance on new data.24,25 To address these concerns, an ensemble machine learning approach can be utilized.
Medical data are usually imbalanced, which affects model performance, and many scholars have used various methods to solve this problem. Khushi et al.26 explored the performance effects of 23 class-imbalance methods and three classical classifiers using two datasets, the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial dataset and the National Lung Screening Trial (NLST) dataset (with imbalance ratios of 24.7 and 25.0, respectively); the results showed that class-imbalance learning can significantly improve the classification ability of a model. Alam et al. encountered the class imbalance problem in a number of studies27,28 and addressed it with the synthetic minority over-sampling technique (SMOTE), borderline SMOTE, data augmentation, and other methods; their experimental results improved significantly after class-imbalance learning. Yang et al.29 used an extremely imbalanced dataset (imbalance ratio of 143.7) in their study and employed 23 class-imbalance learning methods combined with machine learning models to improve the accuracy of early screening for ovarian cancer.
Ensemble ML is a powerful and widely used technique that aims to improve the predictive performance and stability of a model by constructing and combining multiple learners to accomplish a learning task. Ali et al.30 addressed data imbalance using resampling algorithms such as random under-sampling, random over-sampling, and SMOTE and applied four ensemble machine learning algorithms to predict schistosomiasis; the CatBoost model performed best, with 87.1% accuracy. Devnath et al.31 applied a deep ensemble learning technique for the automatic detection of chest x-ray pneumoconiosis in coal miners and showed that the ensemble combining deep learning methods outperformed the other methods, achieving 91.50% accuracy. Ensemble ML methods, such as LightGBM and CatBoost, have also been applied to diabetes prediction. Liu et al.32 predicted T2DM in older Chinese adults by applying random under-sampling to address category imbalance and Shapley additive explanations to calculate and visualize feature importance; the XGBoost model with 21 features performed best. Yang et al.33 conducted related research using the 2011–2017 dataset of patients with diabetes in Luzhou City, China, where the XGBoost model also performed excellently. Xue et al.34 compared DT, RF, adaptively boosted DT (AdaBoost), and extreme gradient boosted DT (XGBoost) in a cross-sectional study of 584 168 adult subjects from a national physical examination and showed that XGBoost was the best classifier, with an AUC of 0.9680. Dong et al.35 compared the XGBoost and LightGBM models on the PLA General Hospital dataset, and the results showed that LightGBM outperformed XGBoost. Rufo et al.36 explored diabetes data from Zewditu Memorial Hospital in Addis Ababa, Ethiopia, and constructed a LightGBM model, validated against KNN, SVM, NB, bagging, RF, and XGBoost, that yielded 98.1% accuracy. In the field of clinical prediction, machine learning models such as LightGBM, CatBoost, and XGBoost have proven effective. However, bias and variance remain a concern. A stacking model that combines various models can reduce bias and variance, ultimately improving the overall performance and generalization of the model.
Stacking-based ensemble techniques offer diversity, stability, and outstanding performance, as illustrated in many recent studies. Xiong et al.37 used voting to combine five ML models to predict diabetes on the dataset from Nanjing Drum Tower Hospital, with 91% accuracy and 0.97 AUC. Sumathi and Meganathan used voting to predict gestational DM, and their study showed that the fusion model is superior to classical ML models, achieving a precision of 94%, a recall of 94%, an accuracy of 94.24%, and an F1 score of 94%.38 Deberneh and Kim used LR, RF, SVM, XGBoost, and model fusion methods (stacking and soft voting) to train on and predict DM from electronic health records collected by a private medical institution and achieved effective results.39 However, without deep exploration, the superiority of model fusion is not obvious.
Table I outlines the development of ML in diabetes diagnostics. As the table shows, fusion models can integrate the benefits of single models to better predict outcomes, and the potential of fusion models in disease diagnosis and prognosis remains to be explored. In this paper, we try to prove that the stacking model has an advantage in obtaining higher prediction accuracy. The study flow chart is depicted in Fig. 1.
Authors | Year | Models | Data sources | Accuracy (%) & AUC
---|---|---|---|---
Xie et al.23 | 2019 | SVM, DT, LR, RF, NN, and Gaussian NB | 2014 BRFSS dataset | 82.4% |
Xiong et al.37 | 2019 | MLP, AdaBoost, RF, SVM, GTB, and voting | Nanjing Drum Tower Hospital dataset | 91% |
Xue et al.34 | 2020 | DT, RF, AdaBoost, and XGBoost | National physical examination | 90.6% |
Yang et al.29 | 2020 | DT | PLCO | 95.32% |
Joshi et al.18 | 2021 | LR and DT | PIDD | 78.26% |
Adua et al.20 | 2021 | NB, KNN, SVM, and DT | A hospital and community for African populations in Ghana | 93% |
Khanam and Foo21 | 2021 | NN, DT, KNN, RF, NB, AB, and LR | PIDD | 88.6% |
Yang et al.33 | 2021 | XGBoost | 2011–2017 dataset of patients with diabetes in Luzhou City, China | 87.68% |
Rufo et al.36 | 2021 | LightGBM | Zewditu Memorial Hospital in Addis Ababa, Ethiopia | 98.1% and 98.1% |
Deberneh and Kim39 | 2021 | LR, RF, SVM, XGBoost, CIM, stacking classifier, and soft Voting | Private medical institutions | 73% |
Sumathi and Meganathan38 | 2021 | MLP, SVM, LR, and stacking | PIDD | 78.2% |
Chang et al.19 | 2022 | NB, RF, and J48DT | PIDD | 79.57% |
Alam et al.27 | 2022 | AlexNet, InceptionV3, and RegNetY-320 | MNIST: HAM10 000 dataset | 91% |
Onyema Edeh et al.22 | 2022 | RF, SVM, NB, DT, and K-means | Frankfurt Hospital dataset and PIDD | 97.6% |
Ali et al.30 | 2022 | Gradient boosting, light gradient boosting, extreme gradient boosting, and CatBoost | Hubei Institute of Schistosomiasis Prevention and Control, China | 87.1% |
Devnath et al.31 | 2022 | Simple averaging, multi-weighted averaging, and majority voting (MVOT) | CSIRO dataset, NIOSH teaching chest x-ray dataset, and ILO Standard Radiographs | 91.50%
Liu et al.32 | 2022 | LR, DT, RF, and XGBoost | Health screening data of adults older than 65 years in Wuhan, China from 2018–2020 | 75.03% and 78.05% |
Dong et al.35 | 2022 | LightGBM, XGBoost, AdaBoost, NN, DT, SVM, and LR | PLA General Hospital | 81.5% |
III. RESULTS
This section is divided into five parts to fully demonstrate the results of each step of the experiments. Section III A describes the results of class-imbalance learning and feature selection. Section III B compares the performance of stacking with other models and shows that the stacked model achieves better results than any single ensemble learning model. Section III C presents external validation, and Sec. III D presents model-comparison experiments on partitions of the dataset; these results further validate the reliability of the model. Section III E uses statistical analysis for feature evaluation and for validation of the proposed model against other models.
A. Class imbalance and feature selection
After class-imbalance processing, our dataset has 8630 samples: 2215 T2DM and 6415 non-diabetic. After feature selection, the retained features were Age, Female, Male, HAN, ZHUANG, Smoke, Drink alcohol, Tea, Carbonate Beverages, Coffee, Hypertension, Retinopathy, Hyperlipidemia, Snore, Hypotensive Drugs, SBP, DBP, BMI, WC, HC, CRP, HDL, LDL, TCHO, TG, AST, γ-GT, FBG, P2hPG, HbA1C, and FINS. Figure 2 illustrates the relationship between the number of features and model accuracy. The curve shows that model improvement begins to slow once 22 features are reached and that accuracy peaks at 30 features.
B. Comparison of the model performance
Table III shows a comparison of the performance of the five models. As the table shows, the stacking model can combine the advantages of its base learners to produce better results. The stacking model performed best on this dataset, with a test set accuracy of 0.90; the precision, recall, F1-score, AUC, and AP were 0.91, 0.90, 0.90, 0.90, and 0.85, respectively. The ROC curves and PR curves are shown in Fig. 3.
CatBoost, XGBoost, and LightGBM are recognized as the three leading implementations of gradient boosting decision trees (GBDT), each representing significant advancements within the GBDT framework. These models have become indispensable tools in machine learning, especially for structured data tasks, and are highly efficient, delivering state-of-the-art performance across a wide range of machine learning challenges. Each offers distinct advantages depending on the characteristics of the dataset and the requirements of the task. In this study, these three models were compared with the stacking model; the results are shown in Fig. 4. The ROC curve of the stacking model is not inferior to those of the three models and in places surpasses XGBoost, CatBoost, and LightGBM.
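A minimal sketch of a Fig. 4-style comparison is given below, plotting ROC curves for the three GBDT implementations. It assumes the xgboost, lightgbm, and catboost Python packages are installed and uses synthetic placeholder data rather than the study's dataset; it is illustrative only, not the authors' code.

```python
# Illustrative ROC comparison of the three GBDT implementations discussed
# above. Data are synthetic placeholders, not the study's clinical data.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ax = plt.gca()
for name, clf in [("XGBoost", XGBClassifier(eval_metric="logloss")),
                  ("LightGBM", LGBMClassifier()),
                  ("CatBoost", CatBoostClassifier(verbose=0))]:
    clf.fit(X_tr, y_tr)
    # Each call fits one model and adds its ROC curve to the shared axes.
    RocCurveDisplay.from_estimator(clf, X_te, y_te, name=name, ax=ax)
plt.show()
```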
C. External validation
Table IV outlines the stacking model's performance on the Pima Indians (PIMA) diabetes dataset and compares it with the other models. As shown in the table, the stacking model again reached the best performance on the test set, with accuracy, precision, recall, and F1-score of 0.74, 0.73, 0.74, and 0.73, respectively.
D. Comparison across dataset partitions
Table V shows the model performance in the four experiments; the stacking model performs best in each. Figure 5 presents the contributions of the features to the model output, ranked for the four experiments. Permutation feature importance is an effective method for explaining black-box models and helps us identify features that are significant risk factors for incident T2DM. Figure 6 provides an overview of the distribution of the effects of age and BMI on diabetes. In the figure, the yellow dots are mostly concentrated between ages 60 and 80, and the trend line rises with age, although the amplitude is not pronounced.
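For readers unfamiliar with permutation feature importance, the sketch below shows how such a ranking can be computed with scikit-learn's model-agnostic implementation. The model and data here are placeholders standing in for the fitted stacking model and the held-out test split; this is an assumption-laden illustration, not the study's code.

```python
# Sketch: permutation feature importance for a fitted classifier.
# A feature's importance is the drop in test accuracy when its column
# is randomly shuffled, repeated n_repeats times.
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)  # placeholder model

result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=0, scoring="accuracy")
# Print features from most to least important, as in a Fig. 5-style ranking.
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```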
E. Statistical analysis
According to the univariate logistic regression analysis in Table VI, Age, Carbonate Beverages, Han, AST, FBG, P2hPG, and HbA1c are significant predictors of the occurrence of T2DM in the overall population (P < 0.05). Tea drinking was not statistically significant in our dataset; therefore, we did not include tea in the multivariate regression. As can be seen from Table VII, the salient features identified in the univariate analysis were included in the multivariate logistic regression analysis, and the calculated odds ratios (ORs) indicate the relative risk of T2DM. The results showed that Age, Carbonate Beverages, Han, FBG, P2hPG, and HbA1c were independent predictors of T2DM.
IV. DISCUSSION
With the development of artificial intelligence, machine learning has been widely integrated into the field of medical diagnosis,15–21,31–36 and the stacking model has been widely applied in diabetes diagnostics.37,38,40,41 In this retrospective study, we applied four machine learning models to build a stacked model of the risk of type 2 diabetes in the Guangxi area. Model performance comparison and external validation showed that the stacking model performed best in predicting type 2 diabetes, with a test set accuracy of 0.90; the precision, recall, F1-score, AUC, and AP were 0.91, 0.90, 0.90, 0.90, and 0.85, respectively. This suggests that the stacking model can use questionnaire data and blood test data to predict early type 2 diabetes, which could benefit the prevention and control of diabetes. Compared with previous research,40,41 we used random forest to fill missing values in the dataset, SMOTEENN to handle class-imbalanced data, and a wrapper for feature selection, and model performance was confirmed through external validation. In addition, we added the HbA1c indicator; its high importance ranking in the model shows that the model predicts the occurrence of diabetes based on currently recognized indicators.
This study also designed four experiments to explore the effect of data preprocessing on the model. Experiment 1 used neither imbalance processing nor feature selection; experiment 2 used feature selection but no imbalance processing; experiment 3 used imbalance processing without feature selection; and experiment 4 used both imbalance processing and feature selection. The performance of the stacked fusion model was observed under each combination. We verified these hypotheses using the Pima Indians dataset. As shown in Table IV, the model's detection performance on the Pima Indians dataset improved by about 18% after imbalance treatment. We conclude that data imbalance affects the performance of the stacked fusion model.
Despite increasing knowledge regarding risk factors for type 2 diabetes and evidence for successful prevention programs, the incidence and prevalence of the disease continue to rise globally.42 Designing screening programs for early detection and safe, effective treatment will be key to reducing diabetes morbidity and mortality. Notably, feature importance analysis is an important way to study the factors that influence diabetes in its early stages. To rule out possible randomness in the contribution of diabetes factors to model predictions, we divided the dataset and visualized feature importance. In our study, HbA1c, FBG, P2hPG, Age, Tea, Han, Carbonate Beverages, AST, etc., made substantial contributions to the prediction model.
We used statistical methods to verify the important features of the model, and the results showed that Age, Carbonate Beverages, Han, FBG, P2hPG, and HbA1c were all risk factors for diabetes mellitus (OR > 1) and were statistically significant (P < 0.05). Tea drinking is not statistically significant in our data, but many studies report that polyphenolic compounds in tea can effectively inhibit diabetes,43,44 which is consistent with our model's finding that tea drinking has an impact on diabetes. HbA1c concentration is a stable diagnostic measure for type 2 diabetes.45,46 However, it is not available in all regions, and in developing countries, fasting plasma glucose and HbA1c concentrations are inconsistent across ethnicities and with age.47 This makes it worthwhile to look for early diagnostic factors for type 2 diabetes in different regions and ethnicities. As the present study shows, HbA1c concentration occupies the most important position in the model as a reliable indicator for diagnosing type 2 diabetes, followed by FBG and blood glucose 2 h after a meal. Studies have shown that individuals with higher blood glucose have a greater likelihood of developing diabetes,31,38 and these features can also serve as diagnostic indicators for type 2 diabetes detection.46 In our study, however, FBG proved more reliable for diagnosing type 2 diabetes than blood glucose 2 h after a meal. AST also occupies a certain position, but no relevant studies have shown an association between AST and diabetes.
Obesity and age are important factors in diabetes.42 We plotted the distribution of the effects of age and BMI on diabetes. As can be seen from Fig. 6, diabetics are concentrated between the ages of 60 and 80. In addition, BMI increases slightly with age, and more diabetic patients have a BMI above 30. Another interesting finding of our study is that people who snore are more likely to develop diabetes and that the Han population is more likely to develop diabetes than the Zhuang population, which agrees with results from earlier studies.17,20,48 Our findings further support the views of those studies.
There are limitations to this paper. We explain the model only in terms of feature importance, which is one-sided. Due to data issues, we cannot prove an association between AST and diabetes. In addition, because of follow-up constraints, this study used data from the whole of 2011, which may be somewhat dated; however, as a chronic disease, diabetes is largely independent of the timing of the sampled data. Our external validation also relies on the already publicly available PIMA dataset. Furthermore, the machine learning framework used in this paper does not incorporate deep learning, which could potentially limit model performance. Time complexity qualitatively describes the running time of an algorithm and can measure the efficiency of its execution;49 we do not discuss time complexity here and will address it in future work.
In future research, we will explore the robustness of the model against adversarial attacks. In addition, the integration of machine learning algorithms and deep learning models still deserves further exploration by tuning the parameters of the model pairs, including learning rate, batch size, and network structure, in order to find the optimal model configuration. With the continuous updating of data, the models will be continuously trained and updated to adapt to new data distributions and task requirements.
V. CONCLUSIONS
In this retrospective study, we propose a model fusion framework to analyze a cross-sectional dataset of 8981 cases from the First Affiliated Hospital of Guangxi Medical University. Data problems are solved by feature engineering and data preprocessing, hyperparameters are optimized using learning curves and grid search, and model performance is evaluated using cross-validation and medical statistical methods. This paper compares other machine learning models with the fusion model, and the results demonstrate that the fusion model outperforms every individually constructed machine learning method and ensemble model. The effects of data imbalance handling and feature selection on the model were tested, and the study examined the effect of data preprocessing using the PIMA database to demonstrate the robustness of the model. Statistical tests verified that the proposed model generalizes better. Feature importance visualization demonstrates the reliability of the fusion model and explores the potential value of other features for diabetes diagnosis. The model fusion framework proposed in this paper can assist in diabetes detection and prevention; in addition, the fusion model can be used in applications or websites to provide early warning for diabetic patients.
VI. METHODS
In this work, we built a two-layer stacking model and demonstrated that the stacking model has more advantages in obtaining a higher predictive accuracy for type 2 diabetes prediction. Figure 1 shows the learning process of the whole study. It includes dataset selection, data preprocessing (data cleaning, class imbalance learning, and feature selection), and model selection. The details of the work are described in the following.
A. Data source
In this retrospective cohort study, the raw data were derived from the Endocrine Department of the First Affiliated Hospital of Guangxi Medical University. Ethical approval was granted by the Ethics Committee of the First Affiliated Hospital of Guangxi Medical University with grant number 2011–14. Data samples were reviewed, and samples containing unreasonable values were removed based on medical criteria; however, samples with overly high blood test values were retained, since these outliers belong to valid patients. In addition, samples with too many missing features (≥12 features) were deleted. The features in the dataset were carefully selected based on the available variables, clinical expertise, and prior literature evidence of their associations with T2DM. A dataset containing 8981 samples was finally obtained. It contains 30 unique features; 1596 samples were diagnosed with T2DM and 7385 were non-diabetic. Table II outlines the description of the database attributes used in this study. A detailed statistical description of the nominal characteristics and a statistical analysis of the numerical attributes, including missing values, measures of central tendency, standard deviations, minimums, and maximums, are presented in Appendix Tables VIII and IX.
Type of data | Feature | Description
---|---|---
Questionnaire data | Age | Age at the time of sampling. |
Gender | Male (1) or Female (2) | |
Race | Han (1) or Zhuang (2) or other (3) | |
Smoke | Whether or not you smoke? No (1) or Yes, but not often (2) or Yes, every day (3) | |
Drink alcohol | Whether or not you drink alcohol? No (1) or Yes, but not often (2) or Yes, every week (3) | |
Tea | Whether or not you drink tea in the past year? Never or almost never (1) or occasional drinking (2) or drinking tea often in the past (3) or drinking tea often now (4) | |
Carbonated beverages | Whether or not you drink carbonated beverages? Yes (1) or No (2) | |
Coffee | Whether or not you drink coffee? Yes (1) or No (2) | |
Hypertension | Whether or not you have hypertension? Yes (1) or No (2) | |
Retinopathy | Whether or not you have retinopathy? Yes (1) or No (2) | |
Hyperlipidemia | Whether or not you have hyperlipidemia? Yes (1) or No (2) | |
FLD | Whether or not you have fatty liver? Yes (1) or No (2) | |
Snore | Whether or not you snore? Often (1) or Occasionally (2) or Never (3) or Unclear (4) | |
Hypotensive Drugs | Whether or not to take hypotensive drugs today? Yes (1) or No (2) | |
SBP | Systolic pressure | |
DBP | Diastolic pressure | |
BMI | Body mass index, a commonly used measure of body fatness and health status | |
WC | Waist circumference | |
HC | Hip circumference | |
Blood test data | CRP | C-reactive protein |
HDL | High-density lipoprotein | |
LDL | Low-density lipoprotein | |
TCHO | Total cholesterol | |
TG | Triglyceride | |
AST | Aspartate aminotransferase | |
γ-GT | γ-glutamyl transpeptidase | |
FBG | Fasting plasma glucose | |
P2hPG | Blood glucose 2 h after meals | |
HbA1C | Glycated hemoglobin | |
FINS | Fasting insulin |
The Pima Indians dataset was used for external validation. It was downloaded from Kaggle (https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database) and is available under a CC0 public domain license. The dataset is properly anonymized and does not contain any identifiable features of the subjects. It comprises 768 samples, including 500 patients without diabetes and 268 patients with diabetes, together with their eight characteristics and corresponding classifications.
B. Data preprocessing
Substantial instances of missing data are a serious problem that undermines the scientific credibility of causal conclusions from clinical trials.50 This study built a random forest regression model to fill in missing values.
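The paper states that a random forest regression model was used for imputation; one standard way to realize this in scikit-learn is IterativeImputer with a RandomForestRegressor estimator. The sketch below is an assumption about the implementation, not the authors' exact code, and the toy matrix is a placeholder.

```python
# Sketch: random-forest-based imputation of missing values.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy feature matrix with gaps, standing in for the clinical data.
X = np.array([[54.0, 131.0, np.nan],
              [61.0, np.nan, 24.8],
              [48.0, 125.0, 22.1],
              [70.0, 140.0, np.nan]])

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_filled = imputer.fit_transform(X)  # each missing cell regressed from the rest
print(X_filled)
```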
Class imbalance is naturally inherent in many real-world applications, and the way imbalanced data are treated has an important impact on model performance.51,52 Since the classes of incident T2DM in the dataset were imbalanced, SMOTEENN23,53 was applied to the training set to mitigate the effect of class imbalance. The synthetic minority oversampling technique (SMOTE)54 analyzes minority samples and synthesizes new samples based on them to add to the dataset. Edited nearest neighbors (ENN)23 tests each instance with k-NN against the remaining samples; instances that are incorrectly classified are discarded, and the remaining samples form the edited dataset. The sampling-strategy hyperparameter was set to 0.3, a ratio obtained after many repeated experiments.
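A minimal sketch of this resampling step, using imbalanced-learn's SMOTEENN with the sampling strategy of 0.3 reported above, follows. The input data here are synthetic placeholders for the training split; only the resampler configuration reflects the text.

```python
# Sketch: SMOTE oversampling followed by ENN cleaning (SMOTEENN),
# applied to the training set only, as described above.
from collections import Counter
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification

# Hypothetical imbalanced data standing in for the training split.
X, y = make_classification(n_samples=5000, weights=[0.82, 0.18],
                           n_features=10, random_state=0)
print("before:", Counter(y))

# sampling_strategy=0.3: SMOTE raises the minority class to 30% of the
# majority class; ENN then removes noisy majority samples.
resampler = SMOTEENN(sampling_strategy=0.3, random_state=0)
X_res, y_res = resampler.fit_resample(X, y)
print("after: ", Counter(y_res))
```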
Strongly correlated features, whether positively or negatively correlated, cause data redundancy, reduce model accuracy, and increase computational cost.55 Feature selection is therefore an important task in data mining and ML applications: it removes irrelevant and redundant features to improve model learning performance.56
The wrapper approach to feature selection scores candidate feature subsets with an objective function; compared with other feature selection methods, it generally finds feature subsets with better classification performance and relatively high accuracy. This study used recursive feature elimination (RFE)57 as the objective function. The learning curve of the wrapper is plotted in Fig. 2.
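The sketch below illustrates how such a wrapper works: RFE around a tree ensemble, scanning subset sizes to trace an accuracy-vs-number-of-features curve like Fig. 2. The base estimator and the synthetic data are assumptions for illustration.

```python
# Sketch: RFE-based wrapper feature selection with a learning curve
# over the number of retained features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=12, random_state=0)

for k in range(5, 31, 5):
    # RFE recursively drops the weakest features until k remain.
    selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
                   n_features_to_select=k)
    X_k = selector.fit_transform(X, y)
    acc = cross_val_score(RandomForestClassifier(random_state=0),
                          X_k, y, cv=5).mean()
    print(f"{k:2d} features -> CV accuracy {acc:.3f}")
```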
C. Model development
Model fusion refers to building and combining multiple well-performing learners to accomplish a learning task. Different models have their own strengths, and model fusion makes it possible to exploit the strengths of each model so that relatively weak learners can be combined by some strategy into a relatively strong learner. Model fusion is derived from, but superior to, model ensembling; the main difference is that model fusion uses better-performing learners, while model ensembling uses multiple base learners, so the training bases differ. From a statistical point of view, model fusion reduces the risk of choosing wrong assumptions, improves the likelihood of capturing real data patterns, and improves the likelihood of better generalization. There are two main ensemble learning methods, boosting and bagging. Stacking combines these approaches by applying multiple primary learners to the raw data and then feeding the features learned by the primary learners to a meta-learner for fitting.
In this study, we built a two-layer stacking model. Random forest (RF), Extra-Trees (ET), gradient boosting (GBDT), and AdaBoost serve as base learners in the first layer, and RF serves as the meta-learner in the second layer. Figure 7 outlines the specific model structure. To avoid overfitting, we use fivefold cross-validation and set random seeds. This study uses learning curves and the GridSearchCV hyperparameter tuning method to find the best hyperparameters; GridSearchCV loops through all candidate parameter values, trying every combination to find the best-performing hyperparameters. To ensure the stability of the hyperparameters, we chose tenfold cross-validation.
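A minimal sketch of this architecture in scikit-learn is shown below: the four named base learners in the first layer, an RF meta-learner in the second, fivefold cross-validation inside the stack, and a tenfold GridSearchCV on top. The data and the (deliberately tiny) parameter grid are placeholders, not the tuned values from the paper.

```python
# Sketch: two-layer stacking (RF, ET, GBDT, AdaBoost -> RF meta-learner)
# with GridSearchCV tuning, mirroring the setup described above.
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("et", ExtraTreesClassifier(random_state=42)),
        ("gbdt", GradientBoostingClassifier(random_state=42)),
        ("ada", AdaBoostClassifier(random_state=42)),
    ],
    final_estimator=RandomForestClassifier(random_state=42),
    cv=5,  # fivefold CV generates out-of-fold meta-features
)

# Illustrative grid only; the paper's actual search space is not given here.
grid = GridSearchCV(stack,
                    param_grid={"final_estimator__n_estimators": [100, 300]},
                    cv=10, scoring="accuracy")
grid.fit(X_tr, y_tr)
print("best params:", grid.best_params_)
print("test accuracy:", grid.best_estimator_.score(X_te, y_te))
```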
D. Model evaluation
Model performance was evaluated on a test set using accuracy, precision, recall, F1-score, P-R curve, and AUC as the evaluation criteria. Each metric is based on four counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):
True positive (TP): the number of positive samples correctly predicted as positive.
True negative (TN): the number of negative samples correctly predicted as negative.
False positive (FP): the number of negative samples incorrectly predicted as positive (type 1 error).
False negative (FN): the number of positive samples incorrectly predicted as negative (type 2 error).
PR Curve (Precision–Recall Curve): the PR curve demonstrates the trade-off between precision and recall of the model at different thresholds. By analyzing the PR curve, the optimal model threshold can be determined and the performance of the model under different thresholds can be understood.
AUC (Area Under the Curve): AUC provides a single value for comparing the performance of different models, with higher AUC indicating better model performance in classification tasks.
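As a concrete illustration, the sketch below computes these metrics with scikit-learn; y_true and y_score are toy placeholders for the test labels and the model's predicted probabilities. Weighted averaging is used to match the tables in this paper.

```python
# Sketch: the evaluation metrics listed above, computed on toy predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score,
                             precision_recall_curve)

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.9, 0.2, 0.6])  # P(class = 1)
y_pred = (y_score >= 0.5).astype(int)  # default 0.5 decision threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="weighted"))
print("recall   :", recall_score(y_true, y_pred, average="weighted"))
print("F1-score :", f1_score(y_true, y_pred, average="weighted"))
print("AUC      :", roc_auc_score(y_true, y_score))
print("AP       :", average_precision_score(y_true, y_score))

# Points of the PR curve, from which an operating threshold can be chosen.
prec, rec, thr = precision_recall_curve(y_true, y_score)
```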
E. Model explanation
In this paper, the model is built using stacking and compared to the RF, ET, GBDT, and AdaBoost models. Table III presents the performance metric scores of different methods.
Learner | Test set accuracy | Precision (weighted) | Recall (weighted) | F1-score (weighted) | AUC | AP
---|---|---|---|---|---|---
RF | 0.82 | 0.84 | 0.82 | 0.82 | 0.89 | 0.83 |
Extra-trees | 0.81 | 0.83 | 0.81 | 0.82 | 0.88 | 0.81 |
GBDT | 0.90 | 0.90 | 0.89 | 0.89 | 0.90 | 0.83 |
AdaBoost | 0.88 | 0.88 | 0.88 | 0.87 | 0.88 | 0.81 |
Stacking | 0.90 | 0.91 | 0.90 | 0.90 | 0.90 | 0.85 |
To verify the robustness of the model, external validation was set up: the model was tested on the Pima Indians dataset. Table IV shows the performance of the models on this dataset.
Learner | Test set accuracy | Precision (weighted) | Recall (weighted) | F1-score (weighted)
---|---|---|---|---
RF | 0.72 | 0.73 | 0.72 | 0.72 |
Extra-trees | 0.70 | 0.71 | 0.70 | 0.70 |
GBDT | 0.72 | 0.72 | 0.72 | 0.72 |
AdaBoost | 0.72 | 0.72 | 0.72 | 0.72 |
Stacking | 0.74 | 0.73 | 0.74 | 0.73 |
To validate the reliability of the model and identify the key factors in diagnosing diabetes from different characteristics, the processed data were divided into four parts for experiments. In experiment one, the model was trained and tested on the questionnaire data; in experiment two, on the blood test data; in experiment three, on the dataset after excluding the gold standard for detecting diabetes (HbA1C); and in experiment four, on all the data. To show the impact of the different feature sets on model performance, the models were compared across these datasets; Table V outlines the details. To evaluate the practical significance of the model, a visual interpretation was performed: ranking feature importance reveals the risk factors most relevant to diabetes.
Experiment | Learner | Test set accuracy | Precision (weighted) | Recall (weighted) | F1-score (weighted)
---|---|---|---|---|---
Experiment 1 (questionnaire Data) | RF | 0.69 | 0.75 | 0.69 | 0.70 | |
Extra-trees | 0.68 | 0.74 | 0.68 | 0.69 | ||
GBDT | 0.81 | 0.81 | 0.81 | 0.77 | ||
AdaBoost | 0.80 | 0.80 | 0.80 | 0.77 | ||
Stacking | 0.81 | 0.81 | 0.81 | 0.77 | ||
Experiment 2 (blood test data) | RF | 0.72 | 0.80 | 0.72 | 0.74 | |
Extra-trees | 0.72 | 0.81 | 0.72 | 0.74 | ||
GBDT | 0.88 | 0.89 | 0.88 | 0.87 | ||
AdaBoost | 0.86 | 0.86 | 0.86 | 0.85 | ||
Stacking | 0.89 | 0.89 | 0.89 | 0.88 | ||
Experiment 3 (remove HbA1C) | RF | 0.82 | 0.83 | 0.82 | 0.82 | |
Extra-trees | 0.80 | 0.82 | 0.80 | 0.80 | ||
GBDT | 0.89 | 0.89 | 0.89 | 0.88 | ||
AdaBoost | 0.87 | 0.87 | 0.87 | 0.87 | ||
Stacking | 0.89 | 0.89 | 0.89 | 0.88 | ||
Experiment 4 (all data) | RF | 0.82 | 0.84 | 0.82 | 0.82 | |
Extra-trees | 0.81 | 0.83 | 0.81 | 0.82 | ||
GBDT | 0.89 | 0.90 | 0.89 | 0.89 | ||
AdaBoost | 0.88 | 0.88 | 0.88 | 0.87 | ||
Stacking | 0.90 | 0.91 | 0.90 | 0.90 |
Characteristics were assessed using univariate and multivariate logistic regression. Logistic regression analysis was applied to calculate odds ratios (ORs) with 95% confidence intervals (CIs); P < 0.05 was considered to indicate statistical significance. The results are shown in Tables VI and VII. To explore the impact of certain characteristics on diabetes, we present trends in the form of scatter plots.
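A minimal sketch of this analysis step follows, fitting a logistic regression with statsmodels and exponentiating the coefficients to obtain ORs with 95% CIs. The data, outcome rule, and column names are toy placeholders for illustration only.

```python
# Sketch: odds ratios with 95% CIs from a logistic regression fit.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.normal(55, 10, 500),
    "FBG": rng.normal(6.0, 1.5, 500),
})
# Hypothetical outcome loosely tied to FBG, purely for demonstration.
df["T2DM"] = (df["FBG"] + rng.normal(0, 1, 500) > 7).astype(int)

X = sm.add_constant(df[["Age", "FBG"]])
fit = sm.Logit(df["T2DM"], X).fit(disp=0)

# OR = exp(beta); the CI bounds are the exponentiated coefficient CIs.
or_table = pd.DataFrame({
    "OR": np.exp(fit.params),
    "CI 2.5%": np.exp(fit.conf_int()[0]),
    "CI 97.5%": np.exp(fit.conf_int()[1]),
    "P-value": fit.pvalues,
})
print(or_table.round(3))
```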
Variables | Level | OR | 95% CI | P-value
---|---|---|---|---
Age | 1.053 | (1.047,1.058) | <0.001 | |
Han | (No) | 1 (reference) | ||
(Yes) | 1.799 | (1.586,2.041) | <0.001 | |
Tea | (Never) | 1 (reference) | ||
(Occasional) | 0.907 | (0.800,1.029) | 0.129 | |
(Used to drink tea) | 0.972 | (0.602,1.568) | 0.906 | |
(Now often) | 0.901 | (0.780,1.039) | 0.152 | |
Carbonated Beverages | (No) | 1 (reference) | ||
(Yes) | 1.913 | (1.660,2.204) | <0.001 | |
AST | 1.015 | (1.011,1.020) | <0.001 | |
FBG | 2.708 | (2.547,2.879) | <0.001 | |
P2hPG | 1.450 | (1.470,1.480) | <0.001 | |
HbA1c | 5.194 | (4.703,5.736) | <0.001 |
Variables | Level | OR | 95% CI | P-value
---|---|---|---|---
Age | 1.018 | (1.011,1.024) | <0.001 | |
Han | (No) | 1 (reference) | ||
(Yes) | 1.484 | (1.263,1.743) | <0.001 | |
Carbonated beverages | (No) | 1 (reference) | ||
(Yes) | 1.347 | (1.129,1.607) | 0.001 | |
AST | 1.004 | (0.997,1.011) | 0.236 | |
FBG | 1.586 | (1.485,1.693) | <0.001 | |
P2hPG | 1.190 | (1.159,1.222) | <0.001 | |
HbA1c | 2.203 | (1.916,2.469) | <0.001 |
SUPPLEMENTARY MATERIAL
See the supplementary material for the following: “an epidemiological investigation project on the risk of complications in patients with diabetes mellitus in the community baseline survey informed consent” and “epidemiological study of the risk of complications in patients with type 2 diabetes mellitus in China baseline questionnaire.”
ACKNOWLEDGMENTS
We thank all participants who agreed to participate in this study.
This work was supported in part by the National Natural Science Foundation of China (Nos. 62341601 and 81860604); the National Key Research & Development Plan for Precision Medicine Key Program (Nos. 2016YFC0901200 and 2016YFC0901205); the National Health Commission of China Public Welfare Research Project (No. 201502007); the Innovation Project of the Clinical Research Climbing Plan of the First Affiliated Hospital of Guangxi Medical University (No. YYZS2020012); and the Guangxi Medical and Health Appropriate Technology Development and Promotion Application Project (No. S2017026).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Ethics Approval
Ethics approval for the experiments on human subjects reported in the submitted manuscript was granted, and all patients signed an informed consent form. All experiments were performed in accordance with relevant guidelines and regulations. The study was approved by the Medical Ethics Committee of Ruijin Hospital, Shanghai Jiaotong University (approval number: 2011–14).
Author Contributions
Yong Fu, Xinghuan Liang, and Xi Yang contributed equally to this work.
Yong Fu: Conceptualization (equal); Methodology (equal); Software (equal); Writing – original draft (equal); Writing – review & editing (equal). Xinghuan Liang: Data curation (equal); Investigation (equal); Writing – original draft (equal). Xi Yang: Data curation (equal); Investigation (equal); Writing – original draft (equal); Writing – review & editing (equal). Li Li: Data curation (equal); Investigation (equal); Resources (equal). LiHeng Meng: Data curation (equal); Methodology (equal). Yuekun Wei: Data curation (equal). Daizheng Huang: Conceptualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Yingfen Qin: Conceptualization (equal); Funding acquisition (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding authors upon reasonable request.
NOMENCLATURE
List of Abbreviations
- AUC: area under curve
- CI: confidence interval
- ENN: edited nearest neighbors
- ET: extra-trees
- GBDT: gradient boosting decision tree
- KNN: k-nearest neighbor
- LR: logistic regression
- ML: machine learning
- NB: Naïve Bayes
- OR: odds ratio
- RF: random forest
- RFE: recursive feature elimination
- ROC: receiver operating characteristic curve
- SMOTE: synthetic minority oversampling technique
- SVM: support vector machine
- T2DM: type 2 diabetes mellitus
APPENDIX: Statistical Analysis of Numerical and Nominal Attributes in a Diabetic Population
Table VIII gives the statistical description of the numerical attributes; Table IX gives the statistical analysis of the nominal attributes.
Feature | State | Count | Mean | Std | Min | Max
---|---|---|---|---|---|---
Age | NonT2DM | 7385 | 54.71 | 112.174 | 29 | 100 |
T2DM | 1596 | 60.77 | 106.321 | 35 | 91 | |
SBP | NonT2DM | 7277 | 131.37 | 427.896 | 66 | 216 |
T2DM | 1575 | 138.58 | 437.075 | 70 | 228 | |
DBP | NonT2DM | 7277 | 78.53 | 144.965 | 33 | 139 |
T2DM | 1575 | 79.69 | 142.326 | 49 | 125 | |
BMI | NonT2DM | 7231 | 23.9897 | 10.550 | 13.333 | 45.6538 |
T2DM | 1568 | 24.8180 | 12.276 | 15.623 | 44.7087 | |
WC | NonT2DM | 7218 | 82.442 | 80.630 | 53.0 | 125.0 |
T2DM | 1560 | 85.666 | 81.454 | 58.0 | 130.2 | |
HC | NonT2DM | 7181 | 94.002 | 43.893 | 59.2 | 140.0 |
T2DM | 1550 | 95.398 | 50.378 | 70.0 | 133.0 | |
CRP | NonT2DM | 7310 | 69.205 | 331.450 | 19.6 | 358.7 |
T2DM | 1583 | 72.544 | 305.577 | 19.8 | 259.1 | |
HDL | NonT2DM | 7311 | 1.3039 | 0.170 | 0.13 | 3.00 |
T2DM | 1583 | 1.3246 | 0.147 | 0.31 | 2.77 | |
LDL | NonT2DM | 7310 | 2.9151 | 0.995 | 0.18 | 10.46 |
T2DM | 1583 | 3.1779 | 0.934 | 0.62 | 11.93 | |
TCHO | NonT2DM | 7311 | 4.9387 | 1.957 | 0.40 | 13.03 |
T2DM | 1582 | 5.3598 | 1.614 | 1.39 | 12.17 | |
TG | NonT2DM | 7304 | 1.4754 | 1.402 | 0.10 | 15.59 |
T2DM | 1579 | 1.8708 | 2.352 | 0.25 | 14.76 | |
AST | NonT2DM | 7288 | 19.60 | 106.358 | 3 | 254 |
T2DM | 1583 | 21.43 | 98.971 | 3 | 136 | |
YGT | NonT2DM | 7265 | 27.21 | 833.690 | 4 | 595 |
T2DM | 1581 | 33.97 | 1668.779 | 4 | 768 | |
FBG | NonT2DM | 7102 | 5.51803 | 0.727 | 0.110 | 17.060 |
T2DM | 1579 | 7.5948 | 8.323 | 1.88 | 26.95 | |
P2hPG | NonT2DM | 7015 | 6.99696 | 4.630 | 3.010 | 23.200 |
T2DM | 1554 | 11.6580 | 32.563 | 3.10 | 33.86 | |
HbA1C | NonT2DM | 7218 | 5.533 | 0.356 | 2.7 | 19.5 |
T2DM | 1579 | 6.847 | 2.952 | 4.0 | 15.5 | |
FINS | NonT2DM | 7312 | 8.426 | 32.041 | 0.1 | 163.9 |
T2DM | 1582 | 11.283 | 185.873 | 0.4 | 238.7 |
Feature | State | Count | Values (count)
---|---|---|---
Gender | NonT2DM | 7385 | Male (2850), Female (4535) |
T2DM | 1596 | Male (575), Female (1021) | |
Race | NonT2DM | 7183 | Han (4609), Zhuang (2487), other (87) |
T2DM | 1552 | Han (1186), Zhuang (341), other (25) | |
Smoke | NonT2DM | 7062 | Non-smoke (5935), occasionally (264), smoke (863) |
T2DM | 1509 | Non-smoke (1327), Occasionally (32), smoke (150) | |
Drink alcohol | NonT2DM | 7113 | Non-drink alcohol (4654) occasionally (1780), drink alcohol (679) |
T2DM | 1533 | Non-drink alcohol (1122) occasionally (294), drink alcohol (117) | |
Tea | NonT2DM | 7307 | Non-tea (3404) occasional (2262) often (89) drinking tea (1552) |
T2DM | 1580 | Non-tea (774) occasional (468) often (20) drinking tea (318) | |
Carbonate beverages | NonT2DM | 7059 | Carbonated beverages (1704), non-carbonated beverages (5355) |
T2DM | 1536 | Carbonated beverages (204), non-carbonated beverages (1332) | |
Coffee | NonT2DM | 7046 | Coffee (643), non- coffee (6403) |
T2DM | 1530 | Coffee (102), non- coffee (1428) | |
Hypertension | NonT2DM | 7324 | Hypertension (1291), non- hypertension (6033) |
T2DM | 1590 | Hypertension (494), non- hypertension (1096) | |
Retinopathy | NonT2DM | 7324 | Retinopathy (24), non-retinopathy (7289) |
T2DM | 1587 | Retinopathy (19), non-retinopathy (1566) | |
Hyperlipidemia | NonT2DM | 7050 | Hyperlipidemia (735), non-hyperlipidemia (6580) |
T2DM | 1510 | Hyperlipidemia (274), non-hyperlipidemia (1312) | |
FLD | NonT2DM | 7176 | Fatty liver (768), non-fatty liver (6556) |
T2DM | 1546 | Fatty liver (249), non-fatty liver (1338) | |
Snore | NonT2DM | 7324 | Often (1328) or occasionally (2569) or never (2402) or unclear (751). |
T2DM | 1587 | Often (372) or occasionally (521) or never (468) or unclear (149). | |
Hypotensive drugs | NonT2DM | 7050 | Hypotensive drugs (317), non-hypotensive drugs (6859) |
T2DM | 1510 | Hypotensive drugs (164), non-hypotensive drugs (1382) |