Plagiarism Detection using Machine Learning Techniques and Cosine, Jaccard and Dice Similarity Measures

Mr. Shahzeb Khan; Mr. Deepankar Krishna; Mr. Samrailatpam Mukherjee; Mr. Rohit Kumar; Dr. Mohd Tajammul

doi: https://doi.org/10.17492/computology.v4i2.2404

Journal Press India®

www.journalpressindia.com

Author Details (* denotes corresponding author)

1. * Shahzeb Khan, Assistant Professor, Department of Computer Science & Applications, Sharda University, Greater Noida, Uttar Pradesh, India (shahzeb.khan@sharda.ac.in)

2. Deepankar Krishna, Student, Department of Computer Science & Applications, Sharda University, Greater Noida, Uttar Pradesh, India (2023564421.deepankar@ug.sharda.ac.in)

3. Samrailatpam Mukherjee, Student, Department of Computer Science & Applications, Sharda University, Greater Noida, Uttar Pradesh, India (2023420324.samrailatpam@ug.sharda.ac.in)

4. Rohit Kumar, Student, Department of Computer Science & Applications, Sharda University, Greater Noida, Uttar Pradesh, India (2023547376.rohit@ug.sharda.ac.in)

5. Mohd Tajammul, Assistant Professor, Department of Computer Science & Applications, Sharda University, Greater Noida, Uttar Pradesh, India (mohd.tajammul@sharda.ac.in)

With abundant textual content available on the internet, plagiarism detection has become important for safeguarding original works and reducing plagiarized content online. In this research, we present methods for detecting plagiarism in textual documents using a corpus that mimics the different levels of plagiarism committed by students in academic work, assignments and theses. The methods use machine learning classifiers trained on similarity features of the documents: cosine similarity, Jaccard similarity and the Dice coefficient. We evaluated two strategies (models). Strategy-1 uses only the Jaccard similarity and the Dice coefficient for training and testing, and achieved an accuracy of 78.95%. Strategy-2 was developed as an improvement on Strategy-1 and uses all three measures (cosine, Jaccard and Dice) both to create the target feature and for training and testing. Our results show that Strategy-2 improves on Strategy-1, achieving an accuracy of 97.31%.
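The three similarity features named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the use of raw term counts for the cosine measure (rather than TF-IDF or word embeddings), and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine of the angle between the raw term-count vectors of two texts."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def dice_coefficient(a, b):
    """Twice the shared vocabulary size divided by the sum of both vocabulary sizes."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0


def similarity_features(a, b):
    """Feature vector [cosine, Jaccard, Dice] for one pair of documents."""
    return [cosine_similarity(a, b), jaccard_similarity(a, b), dice_coefficient(a, b)]


if __name__ == "__main__":
    source = "the cat sat on the mat"
    suspect = "the cat lay on the mat"
    print(similarity_features(source, suspect))
```

A feature vector of this kind, computed for each source and suspect document pair, is the sort of input a classifier could be trained on. Under the assumptions above, Strategy-1 would correspond to using only the last two values and Strategy-2 to using all three.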

Keywords

Plagiarism; Vectorization; TF-IDF; Word embedding; Cosine similarity; Jaccard similarity; Dice coefficient.
