Due to the COVID-19 pandemic, educational institutions have had to rely on e-learning tools, including in programming courses. Automatic graders can speed up the process of evaluating the correctness of submissions. Unfortunately, answers to coding exercises can be easily plagiarized, and manual grading of all student submissions may not be feasible. Therefore, a system that helps detect similar code is needed. The detection can be done by grouping similar source codes based on their structure. Previous research applied this method using an automatic K-means iterations algorithm. Although that algorithm produced decent clusters, it had a long execution time. The purpose of this research is to improve time efficiency and cluster quality by using the bisecting K-means algorithm. The results showed a significant improvement in execution time, from 11.68 seconds to 6.64 seconds. Bisecting K-means also produced fewer clusters, with a slightly better Rand Index than K-means iterations. We also conducted experiments using 2-grams to 6-grams and confirmed that 4-grams result in the best performance.
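The pipeline named in the abstract (n-gram features, bisecting K-means, Rand Index) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the token sequences, the squared Euclidean distance on raw n-gram counts, the deterministic choice of initial centers, and the fixed `k` parameter are all assumptions made for illustration, and every function name here is hypothetical.

```python
from collections import Counter
from itertools import combinations

def ngrams(tokens, n=4):
    """Count the overlapping n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def to_vectors(profiles):
    """Turn n-gram Counters into count vectors over a shared vocabulary."""
    vocab = sorted(set().union(*profiles))
    return [[p[g] for g in vocab] for p in profiles]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def sse(vectors, idx):
    """Sum of squared distances of a cluster's members to its centroid."""
    c = centroid([vectors[i] for i in idx])
    return sum(sq_dist(vectors[i], c) for i in idx)

def two_means(vectors, idx, iters=20):
    """Split one cluster in two with plain 2-means.
    Assumption: centers start at two far-apart members instead of random picks."""
    mean = centroid([vectors[i] for i in idx])
    a = max(idx, key=lambda i: sq_dist(vectors[i], mean))
    b = max(idx, key=lambda i: sq_dist(vectors[i], vectors[a]))
    centers = [vectors[a], vectors[b]]
    groups = None
    for _ in range(iters):
        new = ([], [])
        for i in idx:
            v = vectors[i]
            # index 1 (True) when the point is strictly closer to the second center
            new[sq_dist(v, centers[1]) < sq_dist(v, centers[0])].append(i)
        if not new[0] or not new[1] or new == groups:
            break
        groups = new
        centers = [centroid([vectors[i] for i in g]) for g in groups]
    return groups

def bisecting_kmeans(vectors, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        worst = max(clusters, key=lambda c: sse(vectors, c))
        if sse(vectors, worst) == 0:
            break  # every remaining cluster holds identical points
        clusters.remove(worst)
        clusters.extend(two_means(vectors, worst))
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def rand_index(a, b):
    """Fraction of item pairs on which two labelings agree (same vs. different)."""
    pairs = list(combinations(range(len(a)), 2))
    return sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs) / len(pairs)

# Hypothetical, already-normalized token streams for four submissions.
submissions = [
    "for ( id = num ; id < id ; id ++ ) id += id ;".split(),
    "for ( id = num ; id < id ; id ++ ) id -= id ;".split(),
    "if ( id == num ) return num ; return id * id ( id - num ) ;".split(),
    "if ( id == num ) return num ; return id + id ( id - num ) ;".split(),
]
vectors = to_vectors([ngrams(s, n=4) for s in submissions])
labels = bisecting_kmeans(vectors, k=2)
print(labels, rand_index(labels, [0, 0, 1, 1]))
```

In this toy input the two loop-based submissions differ by a single token, as do the two recursive ones, so each pair should land in its own cluster. A production system would also need a rule for choosing the number of clusters, which the sketch leaves as a parameter.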
