Comparison of machine learning algorithms in the presence of class imbalance in categorical data: An application on student success

Authors

Dunder, M., & Dünder, E.

DOI:

https://doi.org/10.5281/zenodo.12637330

Keywords:

Predicting success in education, machine learning algorithms, class imbalance

Abstract

Machine learning algorithms are used in educational sciences for a range of purposes, including the evaluation of student performance. Because educational datasets often consist of categorical data, handling class imbalance in them requires alternative data generation techniques. This study addresses that problem by comparing the performance of several machine learning algorithms in predicting student success. In this application, the SMOTE-NC technique is used to address class imbalance, and the results are evaluated with five different machine learning algorithms. The findings indicate that, once class imbalance is mitigated, machine learning algorithms can be applied successfully to datasets with a limited number of observations.
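
The abstract does not describe how the oversampling was implemented. As an illustration only, the following is a minimal Python sketch of the general SMOTE-NC idea: continuous features of a synthetic minority sample are interpolated between a sample and one of its nearest neighbours, while categorical features take the most frequent value among those neighbours. The function name `smote_nc_sketch`, its parameters, and the student records are hypothetical and are not taken from the study.

```python
import random
from collections import Counter
from statistics import median, pstdev


def smote_nc_sketch(minority, cont_idx, cat_idx, n_new, k=3, seed=0):
    """Generate n_new synthetic minority rows in the style of SMOTE-NC.

    Hypothetical helper for illustration; not the study's implementation.
    """
    rng = random.Random(seed)
    # A categorical mismatch is penalised by the median standard deviation
    # of the continuous features in the minority class.
    penalty = median([pstdev([row[j] for row in minority]) for j in cont_idx])

    def dist(a, b):
        cont = sum((a[j] - b[j]) ** 2 for j in cont_idx)
        mismatches = sum(a[j] != b[j] for j in cat_idx)
        return (cont + mismatches * penalty ** 2) ** 0.5

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the chosen row (excluding itself).
        neighbours = sorted(
            (row for m, row in enumerate(minority) if m != i),
            key=lambda row: dist(base, row),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()
        new_row = list(base)
        # Continuous features: random point on the segment base -> mate.
        for j in cont_idx:
            new_row[j] = base[j] + gap * (mate[j] - base[j])
        # Categorical features: most frequent value among the neighbours.
        for j in cat_idx:
            counts = Counter(row[j] for row in neighbours)
            new_row[j] = counts.most_common(1)[0][0]
        synthetic.append(new_row)
    return synthetic


# Made-up minority-class records: [study_hours, school_type, parent_education]
minority = [
    [2.0, "public", "primary"],
    [3.5, "public", "secondary"],
    [1.0, "private", "primary"],
    [2.5, "public", "primary"],
]
synthetic = smote_nc_sketch(minority, cont_idx=[0], cat_idx=[1, 2], n_new=6)
for row in synthetic:
    print(row)
```

Every synthetic row keeps categorical values that were actually observed in the minority class, which is why this family of methods suits data dominated by categorical variables. In practice an established library implementation would normally be preferred over a hand-written sketch like this one.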

Published

2024-06-30

How to Cite

Dunder, M., & Dünder, E. (2024). Comparison of machine learning algorithms in the presence of class imbalance in categorical data: An application on student success. Journal of Digital Technologies and Education, 3(1), 28–38. https://doi.org/10.5281/zenodo.12637330