Journal Press India®

Computology: Journal of Applied Computer Science and Intelligent Technologies
Vol 4, Issue 2, July-December 2024 | Pages: 61-88 | Research Paper

Plagiarism Detection using Machine Learning Techniques and Cosine, Jaccard and Dice Similarity Measures

Author Details (* denotes corresponding author)

1. * Shahzeb Khan, Assistant Professor, Department of Computer Science & Applications, Sharda University, Greater Noida, Uttar Pradesh, India (shahzeb.khan@sharda.ac.in)
2. Deepankar Krishna, Student, Department of Computer Science & Applications, Sharda University, Greater Noida, Uttar Pradesh, India (2023564421.deepankar@ug.sharda.ac.in)
3. Samrailatpam Mukherjee, Student, Department of Computer Science & Applications, Sharda University, Greater Noida, Uttar Pradesh, India (2023420324.samrailatpam@ug.sharda.ac.in)
4. Rohit Kumar, Student, Department of Computer Science & Applications, Sharda University, Greater Noida, Uttar Pradesh, India (2023547376.rohit@ug.sharda.ac.in)
5. Mohd Tajammul, Assistant Professor, Department of Computer Science & Applications, Sharda University, Greater Noida, Uttar Pradesh, India (mohd.tajammul@sharda.ac.in)

With abundant textual content available on the internet, plagiarism detection has become important for safeguarding original works and reducing the amount of plagiarized content online. In this research, we present methods of plagiarism detection for textual documents, using a corpus that mimics the different levels of plagiarism committed by students in academic work, assignments and theses. The methods use machine learning classifiers trained on similarity features: the cosine similarity, Jaccard similarity and Dice coefficient values of the textual documents. We used two strategies (models). The first strategy uses only the Jaccard similarity and Dice coefficient for training and testing, and resulted in an accuracy score of 78.95%. The second strategy was developed as an improvement on the first and involves all three measures, i.e., cosine similarity, Jaccard similarity and Dice coefficient, both for creating the target feature and for training and testing. Our research shows that Strategy-2 improves on Strategy-1, achieving an accuracy score of 97.31%.
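To make the three similarity features concrete, the following is a minimal Python sketch of how they could be computed for a pair of documents. It is an illustration under stated assumptions, not the authors' implementation: the function names (`tokenize`, `jaccard_similarity`, `dice_coefficient`, `cosine_similarity`, `similarity_features`) are hypothetical, tokenization is a simple lowercase word split, and cosine similarity is taken over raw term counts rather than the TF-IDF weights listed in the keywords.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| over the sets of distinct tokens."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def dice_coefficient(a, b):
    """2 * |A intersect B| / (|A| + |B|) over the sets of distinct tokens."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    total = len(set_a) + len(set_b)
    return 2 * len(set_a & set_b) / total if total else 0.0


def cosine_similarity(a, b):
    """Cosine of the angle between raw term-frequency vectors (not TF-IDF)."""
    tf_a, tf_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_features(source, suspect):
    """Return [cosine, jaccard, dice] as one feature row for a classifier."""
    return [
        cosine_similarity(source, suspect),
        jaccard_similarity(source, suspect),
        dice_coefficient(source, suspect),
    ]


if __name__ == "__main__":
    features = similarity_features(
        "the cat sat on the mat",
        "the cat lay on the mat",
    )
    print(features)
```

For the two example sentences, the token sets share 4 of 6 distinct words, so the Jaccard similarity is 4/6 and the Dice coefficient is 8/10, while the cosine similarity over term counts is 7/8. Rows like this, one per document pair, are the kind of input a classifier could be trained on. The first strategy described above would use only the Jaccard and Dice columns, and the second would use all three.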

Keywords

Plagiarism; Vectorization; TF-IDF; Word embedding; Cosine similarity; Jaccard similarity; Dice coefficient.
