In recent years, the rise in computer threats, especially malware attacks, has prompted research into ways to detect and contain them. Malware detection approaches can be static, dynamic, or hybrid. Windows Portable Executable (WPE) is the file format Microsoft Windows uses for executable files. Previous work based on static features of WPE files provides acceptable accuracy, but it cannot detect or assess malicious behavior that appears only during execution. This study uses the EMBER dataset, which consists of labeled benign and malicious samples of WPE files. Features based on API import calls are used to predict whether a WPE file is malicious or benign. Random forest, XGBoost, and LightGBM classifiers are applied to the collected dataset of 1.55 million samples, each described by 1000 API call features. Chi-square and Gini importance-based feature selection techniques are used to find the top 200 features, and machine learning models are then trained on different feature subsets. Models trained on features chosen by hybrid feature selection performed better than those trained on features chosen by Chi-square-based selection. All models are evaluated using standard performance measures; random forest performed best, with an accuracy of 90% using 150 features.
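The feature selection pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say how Chi-square and Gini importance are combined, so this assumes one plausible reading: a Chi-square pre-filter followed by re-ranking with random forest (Gini) importance. The data is synthetic, not EMBER, and the names `make_synthetic_imports` and `hybrid_select`, the feature counts, and all hyperparameters are hypothetical choices for illustration. Only scikit-learn's random forest is used; XGBoost and LightGBM are omitted to keep the example self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def make_synthetic_imports(n_samples=2000, n_features=300, n_informative=20, seed=0):
    """Toy stand-in for binary API-import features (1 = API is imported).

    This is NOT the EMBER dataset; it only mimics its binary feature shape.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_samples)
    X = (rng.random((n_samples, n_features)) < 0.2).astype(np.int8)
    # Make the first n_informative columns more likely to be set for label 1.
    boost = rng.random((n_samples, n_informative)) < 0.3
    X[:, :n_informative] |= (boost & (y[:, None] == 1)).astype(np.int8)
    return X, y


def hybrid_select(X, y, n_chi2=100, n_final=50, seed=0):
    """Assumed hybrid scheme: Chi-square pre-filter, then Gini re-ranking."""
    # Step 1: keep the n_chi2 features with the highest Chi-square score.
    scores, _ = chi2(X, y)
    scores = np.nan_to_num(scores)  # constant columns yield NaN scores
    chi2_idx = np.argsort(scores)[::-1][:n_chi2]
    # Step 2: rank the survivors by random forest (Gini) importance.
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X[:, chi2_idx], y)
    order = np.argsort(rf.feature_importances_)[::-1][:n_final]
    return chi2_idx[order]


X, y = make_synthetic_imports()
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Select features on the training split only, to avoid leaking test labels.
selected = hybrid_select(X_tr, y_tr)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr[:, selected], y_tr)
acc = accuracy_score(y_te, clf.predict(X_te[:, selected]))
print(f"features kept: {len(selected)}, test accuracy: {acc:.3f}")
```

Accuracy on this toy data says nothing about the results reported in the study; the sketch only shows the order of operations. Varying `n_final` gives the kind of feature-subset comparison the abstract describes.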
