With abundant textual content available on the internet, plagiarism detection has become essential for safeguarding original works and reducing plagiarized content online. In this research, we present methods of plagiarism detection for textual documents using a corpus that mimics the different levels of plagiarism committed by students in academic assignments and theses. The methods use machine learning classifiers trained on similarity features computed between textual documents: cosine similarity, Jaccard similarity, and the Dice coefficient. We evaluated two strategies (models). Strategy-1 uses only the Jaccard similarity and the Dice coefficient for training and testing, and achieved an accuracy of 78.95%. Strategy-2 was developed as an improvement on Strategy-1 and uses all three measures (cosine similarity, Jaccard similarity, and the Dice coefficient) both for creating the target feature and for training and testing. Our results show that Strategy-2 improves on Strategy-1, achieving an accuracy of 97.31%.
Keywords
Plagiarism; Vectorization; TF-IDF; Word embedding; Cosine similarity; Jaccard similarity; Dice coefficient.
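
As an illustration of the three similarity features named above, the following minimal sketch computes them for a pair of documents. It is not the authors' implementation: the function names, the regex-based tokenizer, and the use of raw term counts (rather than TF-IDF weights or word embeddings) for the cosine measure are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens.

    Assumption: a simple regex tokenizer; the paper's preprocessing may differ.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| over the sets of unique tokens."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def dice_coefficient(a, b):
    """2 * |A ∩ B| / (|A| + |B|) over the sets of unique tokens."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine_similarity(a, b):
    """Cosine of the angle between raw term-count vectors.

    Assumption: plain term counts stand in for TF-IDF weights here.
    """
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_features(source, suspect):
    """Return [cosine, jaccard, dice] as a feature vector for a classifier."""
    return [
        cosine_similarity(source, suspect),
        jaccard_similarity(source, suspect),
        dice_coefficient(source, suspect),
    ]


if __name__ == "__main__":
    features = similarity_features(
        "the cat sat on the mat", "the cat lay on the mat"
    )
    print(features)
```

Under these assumptions, each document pair yields a three-element feature vector. A setup in the spirit of Strategy-1 would pass only the Jaccard and Dice values to the classifier, while one in the spirit of Strategy-2 would pass all three.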