The telecommunications industry faced challenges with its datasets, primarily high dimensionality, imbalanced classes, and missing values. When the datasets were not handled properly, these deficiencies led to inaccurate predictions and degraded model performance. Because the churned customer class was far smaller than the active customer class, the accuracy paradox arose: although the model's accuracy reached 90%, this figure merely mirrored the underlying class distribution rather than any real ability to identify churners. In addition, the large number of features, many of them redundant or unnecessary, added noise, hindered the learning process, and significantly prolonged learning and computation time. The purpose of this study was therefore to determine the effect of feature selection, data imputation, and techniques for dealing with imbalanced data on model performance. The study proposed improving the development of voluntary churn models by combining techniques for handling class imbalance and missing data with techniques for handling high dimensionality. Compared with the other combinations tested, the combination of Decision Trees + Mode Imputation + SMOTE with Random Undersampling, with Random Forest as the classifier, produced the highest classification accuracy, AUC, and F1-score. The study also suggested using Dask or PySpark to process the large telecommunication dataset, since parallel computing allows other machine learning algorithms in Python to run faster and more effectively.
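The accuracy paradox described above can be illustrated with a small worked example. The sketch below is not the study's data or code: it assumes a hypothetical set of 1,000 customers with a 90/10 active-to-churned split, and compares a trivial "always active" predictor with a hypothetical model that finds most churners at the cost of some false alarms. All counts and variable names are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (churned) class; 0.0 when it is undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Hypothetical labels: 900 active customers (0) followed by 100 churned (1).
y_true = [0] * 900 + [1] * 100

# Baseline that never predicts churn.
y_majority = [0] * 1000

# Hypothetical model: flags 120 active customers by mistake,
# but correctly identifies 80 of the 100 churners.
y_model = [1] * 120 + [0] * 780 + [1] * 80 + [0] * 20

majority_accuracy = accuracy(y_true, y_majority)
majority_f1 = f1_score(y_true, y_majority)
model_accuracy = accuracy(y_true, y_model)
model_f1 = f1_score(y_true, y_model)

print(f"majority baseline: accuracy={majority_accuracy:.2f}, F1={majority_f1:.2f}")
print(f"hypothetical model: accuracy={model_accuracy:.2f}, F1={model_f1:.2f}")
```

Under these assumed counts the baseline reaches 90% accuracy while identifying no churners at all (F1 of 0), whereas the hypothetical model scores lower on accuracy but far higher on F1. This is why metrics such as F1-score and AUC are reported alongside accuracy for imbalanced churn data.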
