Cardiovascular disease is an increasingly serious threat to human health, and accurate prediction of a patient's condition is important for early prevention. This paper describes our research on predicting patients' risk of cardiovascular disease from their physical examination reports. We quantify this risk with five indicators: systolic pressure, diastolic pressure, triglyceride, high-density lipoprotein cholesterol and low-density lipoprotein cholesterol. To extract useful information from the medical records, we use natural language processing (NLP) methods. To convert sentences into numerical data, we use the term frequency-inverse document frequency (TF-IDF) algorithm, which extracts the major information from the medical reports. Principal component analysis (PCA) is then used to reduce the high dimensionality of the resulting text features. Additionally, we extract numerical and categorical features that are easy to transform. Combining all these features, we use the xgboost algorithm to make the final predictions. The results show that the mean square error and relative error can be kept at an acceptably low level.
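The TF-IDF step mentioned above can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes the reports have already been split into tokens, uses the plain weighting tf × log(N / df) (libraries often apply smoothed variants), and the function name `tfidf_vectors` and the sample tokens are hypothetical. The PCA and xgboost stages are not shown.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors): one TF-IDF vector per tokenized document."""
    vocabulary = sorted({term for doc in documents for term in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Plain inverse document frequency (assumption: no smoothing).
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        # Term frequency is the share of the document taken up by the term.
        vectors.append(
            [(counts[term] / total) * idf[term] if total else 0.0
             for term in vocabulary]
        )
    return vocabulary, vectors


# Hypothetical, already tokenized report fragments (not data from this study).
reports = [
    ["blood", "pressure", "normal"],
    ["blood", "lipid", "elevated"],
    ["liver", "normal"],
]
vocabulary, vectors = tfidf_vectors(reports)
```

Each report becomes a vector whose length equals the vocabulary size, which is why a dimensionality reduction step such as PCA is needed before the text features are combined with the numerical and categorical ones.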
