Improved ALS Biomarker Discovery with SMOTE-Augmented Gene Expression Data
Abstract
Early identification of amyotrophic lateral sclerosis (ALS), a progressive neurological disease, using blood-based transcriptomic biomarkers is gaining attention. However, classifying ALS from blood transcriptomic data remains challenging because of class imbalance and high dimensionality. This study extends previous work that applied machine learning to a microarray dataset by adding synthetic data augmentation with the Synthetic Minority Over-sampling Technique (SMOTE) to improve classification accuracy. After feature selection with Fisher score, t-test, PCA, and Ant Colony Optimization, SMOTE was used to generate synthetic ALS samples and balance the class distribution. Classification performance was assessed with Support Vector Machines (SVM), ensemble techniques, and k-Nearest Neighbors (k-NN). Accuracy improved for all models, with k-NN rising from 77.5% to 82% and SVM rising from 91.3% to 93%. Furthermore, several physiologically significant genes, such as MMP9 and SELL, became more prominent after augmentation and matched known immune-related indicators in ALS. The augmentation technique improves both the predictive performance and the biological validity of the identified biomarkers. These findings demonstrate the utility of SMOTE in enhancing transcriptomic classifiers.
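The core idea behind SMOTE, as used in the abstract to balance the classes, can be sketched in a few lines: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority-class neighbours. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation; the function name `smote_oversample`, the parameter defaults, and the toy feature values are all hypothetical.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by SMOTE-style interpolation.

    minority: list of equal-length feature vectors (lists of floats).
    Returns a list of n_synthetic new feature vectors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment base -> neighbour.
        u = rng.random()
        synthetic.append([b + u * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (made-up values, not from the study): 3 minority vs 6 majority samples.
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4]]
majority_count = 6
new_samples = smote_oversample(minority, majority_count - len(minority), k=2)
balanced_minority = minority + new_samples
print(len(balanced_minority))  # 6
```

Because every synthetic point lies on a segment between two existing minority samples, the augmented class never extends beyond the region already spanned by the real samples.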
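Keywords
amyotrophic lateral sclerosis, synthetic minority over-sampling technique, transcriptome, biomarker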
Cite This Article
TY  - JOUR
AU  - Elmakki, Shimaa M.
AU  - Hashem, Esraa M.
AU  - Hadhoud, Marwa M. A.
AU  - Ghoneim, Vidan F.
PY  - 2026
DA  - 2026/04/07
TI  - Improved ALS Biomarker Discovery with SMOTE-Augmented Gene Expression Data
JO  - Journal of Computational Intelligence in Biomedicine
T2  - Journal of Computational Intelligence in Biomedicine
JF  - Journal of Computational Intelligence in Biomedicine
VL  - 1
IS  - 1
SP  - 1
EP  - 9
DO  - 10.62762/JCIB.2025.140919
UR  - https://www.icck.org/article/abs/JCIB.2025.140919
KW  - amyotrophic lateral sclerosis
KW  - synthetic minority over-sampling technique
KW  - transcriptome
KW  - biomarker
SN  - request pending
PB  - Institute of Central Computation and Knowledge
LA  - English
ER  -
@article{Elmakki2026Improved,
author = {Shimaa M. Elmakki and Esraa M. Hashem and Marwa M. A. Hadhoud and Vidan F. Ghoneim},
title = {Improved ALS Biomarker Discovery with SMOTE-Augmented Gene Expression Data},
journal = {Journal of Computational Intelligence in Biomedicine},
year = {2026},
volume = {1},
number = {1},
pages = {1-9},
doi = {10.62762/JCIB.2025.140919},
url = {https://www.icck.org/article/abs/JCIB.2025.140919},
keywords = {amyotrophic lateral sclerosis, synthetic minority over-sampling technique, transcriptome, biomarker},
issn = {request pending},
publisher = {Institute of Central Computation and Knowledge}
}
Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and Permissions
Copyright © 2026 by the Author(s). Published by Institute of Central Computation and Knowledge. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.