Gaps and outliers in datasets pose substantial challenges for time series forecasting and other predictive machine learning (ML) tasks. This paper introduces an effective technique for correcting gaps and outliers in data and validates the approach by applying it to datasets with outlier zones drawn from three diverse contexts. By treating the data to mitigate these issues, the technique has the potential to improve the performance of machine learning models, and it contributes a useful tool to the data science toolbox.
