Improved ALS Biomarker Discovery with SMOTE-Augmented Gene Expression Data
Research Article  ·  Published: 07 April 2026
Journal of Computational Intelligence in Biomedicine
Volume 1, Issue 1, 2026: 1-9
Research Article Open Access

1 Department of Biomedical Engineering, Faculty of Engineering, Helwan University, Cairo, Egypt
2 Biomedical Engineering Department, Faculty of Engineering Science and Technology, Misr University for Science and Technology (MUST), 6th of October City, Giza, Egypt
3 Biomedical Engineering Department, College of Engineering, King Faisal University, Al-Ahsa, Saudi Arabia
4 Department of Biomedical Engineering, College of Engineering, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
* Corresponding Author: Esraa M. Hashem, [email protected]

Article Information

Abstract

The early identification of Amyotrophic Lateral Sclerosis (ALS), a progressive neurological disease, using blood-based transcriptome biomarkers is gaining attention. However, classifying ALS from blood transcriptomic data remains challenging because of class imbalance and high dimensionality. This study extends earlier work that applied machine learning to the microarray dataset by adding synthetic data augmentation with the Synthetic Minority Over-sampling Technique (SMOTE) to improve classification accuracy. After feature selection with Fisher Score, t-test, PCA, and Ant Colony Optimization, SMOTE was used to generate synthetic ALS samples and balance the class distribution. Support Vector Machines, ensemble techniques, and k-Nearest Neighbors were used to assess classification performance. Accuracy improved for all models, with k-NN rising from 77.5% to 82% and SVM rising from 91.3% to 93%. Furthermore, several physiologically significant genes, such as MMP9 and SELL, became more prominent after augmentation and matched known immune-related indicators in ALS. The augmentation technique improves both the predictive performance and the biological validity of the identified biomarkers. These findings demonstrate the utility of SMOTE in enhancing transcriptomic classifiers.
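
The oversampling step described in the abstract can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic sample is an interpolation between a minority-class sample and one of its nearest minority-class neighbours. This is not the authors' implementation, which is not given on this page. The function name `smote`, its parameters, and the toy expression values are all assumptions made for illustration, and the code uses only the Python standard library.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority-class neighbours.

    Illustrative sketch only; names and defaults are hypothetical.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data: rows are samples, columns are two already-selected genes.
# All values are made up for illustration and are not taken from GSE112676.
controls = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [1.1, 2.2], [1.0, 1.8], [1.3, 2.0]]
als = [[3.0, 0.5], [3.2, 0.7], [2.9, 0.4]]

# Generate just enough synthetic ALS samples to match the control count.
new_als = smote(als, n_new=len(controls) - len(als), k=2)
balanced_als = als + new_als
print(len(controls), len(balanced_als))
```

Because every synthetic point lies on a line segment between two real minority samples, the new samples stay inside the region already occupied by the minority class. In practice, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would normally be preferred, and oversampling is usually applied to the training data only so that synthetic samples do not leak into the evaluation set.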

Keywords

amyotrophic lateral sclerosis, synthetic minority over-sampling technique, transcriptome, biomarker

Data Availability Statement

The data used in this study are publicly available from the NCBI Gene Expression Omnibus (GEO) repository under accession number GSE112676: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE112676.

Funding

This work received no funding.

Conflicts of Interest

The authors declare no conflicts of interest.

AI Use Statement

The authors declare that no generative AI was used in the preparation of this manuscript.

Ethical Approval and Consent to Participate

Not applicable. The study utilized only publicly available, de-identified data from GEO (GSE112676); thus, ethical approval was not required.

Cite This Article

APA Style
Elmakki, S. M., Hashem, E. M., Hadhoud, M. M. A., & Ghoneim, V. F. (2026). Improved ALS Biomarker Discovery with SMOTE-Augmented Gene Expression Data. Journal of Computational Intelligence in Biomedicine, 1(1), 1–9. https://doi.org/10.62762/JCIB.2025.140919
RIS Format
Compatible with EndNote, Zotero, Mendeley, and other reference managers
TY  - JOUR
AU  - Elmakki, Shimaa M.
AU  - Hashem, Esraa M.
AU  - Hadhoud, Marwa M. A.
AU  - Ghoneim, Vidan F.
PY  - 2026
DA  - 2026/04/07
TI  - Improved ALS Biomarker Discovery with SMOTE-Augmented Gene Expression Data
JO  - Journal of Computational Intelligence in Biomedicine
T2  - Journal of Computational Intelligence in Biomedicine
JF  - Journal of Computational Intelligence in Biomedicine
VL  - 1
IS  - 1
SP  - 1
EP  - 9
DO  - 10.62762/JCIB.2025.140919
UR  - https://www.icck.org/article/abs/JCIB.2025.140919
KW  - amyotrophic lateral sclerosis
KW  - synthetic minority over-sampling technique
KW  - transcriptome
KW  - biomarker
SN  - request pending
PB  - Institute of Central Computation and Knowledge
LA  - English
ER  - 
BibTeX Format
Compatible with LaTeX, BibTeX, and other reference managers
@article{Elmakki2026Improved,
  author = {Shimaa M. Elmakki and Esraa M. Hashem and Marwa M. A. Hadhoud and Vidan F. Ghoneim},
  title = {Improved ALS Biomarker Discovery with SMOTE-Augmented Gene Expression Data},
  journal = {Journal of Computational Intelligence in Biomedicine},
  year = {2026},
  volume = {1},
  number = {1},
  pages = {1-9},
  doi = {10.62762/JCIB.2025.140919},
  url = {https://www.icck.org/article/abs/JCIB.2025.140919},
  keywords = {amyotrophic lateral sclerosis, synthetic minority over-sampling technique, transcriptome, biomarker},
  issn = {request pending},
  publisher = {Institute of Central Computation and Knowledge}
}

Publisher's Note

ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions

CC BY Copyright © 2026 by the Author(s). Published by Institute of Central Computation and Knowledge. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
ISSN: request pending (Online) | ISSN: request pending (Print)