KFWAdaBoost-Based Soft Label Learning Framework for Student Performance Prediction
Abstract
Student performance prediction is a core task in educational data mining, as it enables early intervention, personalized learning support, and data-driven decision-making. Although existing machine learning models have shown promising results in this domain, challenges persist due to hard-to-classify samples—particularly students exhibiting borderline performance—and the discrete nature of hard labels, which together limit predictive effectiveness. To overcome these limitations, this paper proposes a KFWAdaBoost-based soft label learning framework that systematically enhances baseline model performance through a two-stage synergistic mechanism. In the first stage, K-means++ clustering is employed to generate similarity features, thereby providing structural awareness of underlying data patterns. In the second stage, probabilistic soft labels are derived from ensemble confidence scores to refine decision boundaries and better handle ambiguous cases. Experimental results on the widely used Mathematics and Portuguese Language course datasets demonstrate that the proposed framework consistently improves baseline performance across Accuracy, Precision, Recall, and F1-Score for models including LDA, Decision Tree, and SVM, with Decision Tree exhibiting the most substantial gains. This framework offers a reliable and effective approach for student performance prediction and holds strong potential for broader applications in educational data analytics.
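The two stages described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the paper's KFWAdaBoost implementation, whose details are not given here. It makes three assumptions: similarity features are taken as a Gaussian kernel of the distance to each K-means++ centre (`similarity_features` and `gamma` are hypothetical names), soft labels are taken as the weight-normalised share of weak-learner votes for the positive class (`soft_label` is likewise hypothetical), and the toy points and vote weights are invented for illustration.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_init(points, k, rng):
    """K-means++ seeding: each new centre is drawn with probability
    proportional to its squared distance from the nearest chosen centre."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:  # every point already coincides with a centre
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centers


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iterations started from K-means++ seeds."""
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centre if a cluster becomes empty
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def similarity_features(point, centers, gamma=1.0):
    """Assumed stage 1 feature: Gaussian similarity to each cluster centre."""
    return [math.exp(-gamma * sq_dist(point, c)) for c in centers]


def soft_label(alphas, votes):
    """Assumed stage 2 label: weighted share of +1 votes, a value in [0, 1].

    alphas are positive weak-learner weights; votes are +1 or -1.
    """
    margin = sum(a * v for a, v in zip(alphas, votes)) / sum(alphas)
    return 0.5 * (1.0 + margin)


# Toy data: two well-separated groups of two-feature records (invented).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = kmeans(points, k=2)

# Stage 1: append one similarity feature per cluster to each record.
augmented = [list(p) + similarity_features(p, centers) for p in points]

# Stage 2: three weak learners disagree on one borderline student.
p_pass = soft_label([0.9, 0.5, 0.3], [1, -1, 1])

print(augmented[0])
print(round(p_pass, 3))
```

A borderline case such as the one above receives a label strictly between 0 and 1 rather than a hard 0 or 1, which is the property the abstract relies on to handle ambiguous students. A baseline model such as a decision tree would then be trained on the augmented features against these soft targets.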
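Keywords
soft label learning, KFWAdaBoost, K-means++ clustering, student performance prediction, educational data mining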
Cite This Article
TY  - JOUR
AU  - Yu, Zhihong
PY  - 2026
DA  - 2026/02/28
TI  - KFWAdaBoost-Based Soft Label Learning Framework for Student Performance Prediction
JO  - ICCK Transactions on Educational Data Mining
T2  - ICCK Transactions on Educational Data Mining
JF  - ICCK Transactions on Educational Data Mining
VL  - 2
IS  - 1
SP  - 1
EP  - 13
DO  - 10.62762/TEDM.2026.459733
UR  - https://www.icck.org/article/abs/TEDM.2026.459733
KW  - soft label learning
KW  - KFWAdaBoost
KW  - K-means++ clustering
KW  - student performance prediction
KW  - educational data mining
AB  - Student performance prediction is a core task in educational data mining, as it enables early intervention, personalized learning support, and data-driven decision-making. Although existing machine learning models have shown promising results in this domain, challenges persist due to hard-to-classify samples—particularly students exhibiting borderline performance—and the discrete nature of hard labels, which together limit predictive effectiveness. To overcome these limitations, this paper proposes a KFWAdaBoost-based soft label learning framework that systematically enhances baseline model performance through a two-stage synergistic mechanism. In the first stage, K-means++ clustering is employed to generate similarity features, thereby providing structural awareness of underlying data patterns. In the second stage, probabilistic soft labels are derived from ensemble confidence scores to refine decision boundaries and better handle ambiguous cases. Experimental results on the widely used Mathematics and Portuguese Language course datasets demonstrate that the proposed framework consistently improves baseline performance across Accuracy, Precision, Recall, and F1-Score for models including LDA, Decision Tree, and SVM, with Decision Tree exhibiting the most substantial gains. This framework offers a reliable and effective approach for student performance prediction and holds strong potential for broader applications in educational data analytics.
SN  - 3070-5843
PB  - Institute of Central Computation and Knowledge
LA  - English
ER  -
@article{Yu2026KFWAdaBoos,
author = {Zhihong Yu},
title = {KFWAdaBoost-Based Soft Label Learning Framework for Student Performance Prediction},
journal = {ICCK Transactions on Educational Data Mining},
year = {2026},
volume = {2},
number = {1},
pages = {1-13},
doi = {10.62762/TEDM.2026.459733},
url = {https://www.icck.org/article/abs/TEDM.2026.459733},
abstract = {Student performance prediction is a core task in educational data mining, as it enables early intervention, personalized learning support, and data-driven decision-making. Although existing machine learning models have shown promising results in this domain, challenges persist due to hard-to-classify samples—particularly students exhibiting borderline performance—and the discrete nature of hard labels, which together limit predictive effectiveness. To overcome these limitations, this paper proposes a KFWAdaBoost-based soft label learning framework that systematically enhances baseline model performance through a two-stage synergistic mechanism. In the first stage, K-means++ clustering is employed to generate similarity features, thereby providing structural awareness of underlying data patterns. In the second stage, probabilistic soft labels are derived from ensemble confidence scores to refine decision boundaries and better handle ambiguous cases. Experimental results on the widely used Mathematics and Portuguese Language course datasets demonstrate that the proposed framework consistently improves baseline performance across Accuracy, Precision, Recall, and F1-Score for models including LDA, Decision Tree, and SVM, with Decision Tree exhibiting the most substantial gains. This framework offers a reliable and effective approach for student performance prediction and holds strong potential for broader applications in educational data analytics.},
keywords = {soft label learning, KFWAdaBoost, K-means++ clustering, student performance prediction, educational data mining},
issn = {3070-5843},
publisher = {Institute of Central Computation and Knowledge}
}
Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.