Machine learning is a promising tool for analyzing data and making predictions. However, processing data manually is time-consuming and challenging, and missing values in the data can reduce the accuracy of the models. We propose an automated approach to data pre-processing. It carries out the main steps of the data cleaning process: converting categorical values to label-encoded values, sampling the data, and replacing missing values with an appropriate measure of central tendency. The project takes raw data as input and produces a clean, error-free dataset that is suitable for training machine learning models. It also aims to provide a web application that automates both data pre-processing and the training of classification and regression models. The application tries multiple algorithms and normalization strategies and selects the best model according to the desired metric. Once training is complete, users can download the clean dataset as well as the trained models. The approach is tested on multiple datasets, covering binary classification and regression tasks.
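As an illustration of the cleaning steps described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a table stored as a dictionary of columns, with `None` marking a missing value and each column holding a single type. It fills numeric columns with the median and categorical columns with the mode, then label-encodes the categorical columns. The function names `impute_column`, `label_encode`, and `preprocess` are hypothetical, the choice of median versus mode is an assumption, and the sampling step is omitted.

```python
from statistics import median, mode


def impute_column(values):
    """Replace None entries with the median (numeric) or mode (categorical)."""
    present = [v for v in values if v is not None]
    if not present:
        # Nothing to estimate a central tendency from; leave the column as-is.
        return list(values)
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    # Assumption: median for numeric data, mode for everything else.
    fill = median(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]


def label_encode(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    codes = {category: i for i, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def preprocess(table):
    """Impute every column, then label-encode the columns that hold strings.

    table: dict mapping column name -> list of values (None = missing).
    Returns the cleaned table and the category-to-code mapping per column.
    """
    clean = {}
    encodings = {}
    for name, column in table.items():
        filled = impute_column(column)
        if any(isinstance(v, str) for v in filled):
            filled, encodings[name] = label_encode(filled)
        clean[name] = filled
    return clean, encodings


# Hypothetical toy data, for illustration only.
raw = {
    "age": [25, None, 40, 31],
    "city": ["Pune", "Mumbai", None, "Mumbai"],
}
clean, encodings = preprocess(raw)
print(clean)
print(encodings)
```

Returning the encoding map alongside the cleaned table keeps the label codes reversible, which matters if the trained model's predictions need to be translated back into the original categories.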
