Volume 2, Issue 1, ICCK Transactions on Intelligent Systematics
Academic Editor
Rashid Mirzavand
University of Alberta, Canada
ICCK Transactions on Intelligent Systematics, Volume 2, Issue 1, 2025: 1-13

Free to Read | Research Article | 22 December 2024
Electronic Health Records-Based Data-Driven Diabetes Knowledge Unveiling and Risk Prognosis
Huadong Pang, Li Zhou, Yiping Dong, Peiyuan Chen, Dian Gu, Tianyi Lyu, Hansong Zhang
1 Georgia Institute of Technology, Atlanta, GA 30332, United States
2 Faculty of Management, McGill University, Montreal, QC H3B0C7, Canada
3 Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, United States
4 School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97333, United States
5 University of Pennsylvania, Philadelphia, PA 19104, United States
6 College of Engineering, Northeastern University, Boston, MA 02115, United States
7 Department of Electrical and Computer Engineering, University of California, San Diego, CA 92037, United States
† Huadong Pang and Li Zhou contributed equally to this work
* Corresponding Author: Huadong Pang, [email protected]
ARK: ark:/57805/tis.2025.367320
Received: 17 October 2024, Accepted: 05 December 2024, Published: 22 December 2024
Cited by: 3 (Source: Web of Science), 9 (Source: Scopus), 29 (Source: Google Scholar)
Abstract
In the healthcare sector, the application of deep learning technologies has revolutionized data analysis and disease forecasting. This is particularly evident in diabetes research, where in-depth analysis of Electronic Health Records (EHR) has unlocked new opportunities for early detection and effective intervention strategies. Our research presents an innovative model that synergizes the capabilities of Bidirectional Long Short-Term Memory Networks-Conditional Random Field (BiLSTM-CRF) with a fusion of XGBoost and Logistic Regression. This model is designed to enhance the accuracy of diabetes risk prediction by conducting an in-depth analysis of electronic medical records data. The first phase of our approach involves employing BiLSTM-CRF to delve into the temporal characteristics and latent patterns present in EHR data. This method effectively uncovers the progression trends of diabetes, which are often hidden in the complex data structures of medical records. The second phase leverages the combined strength of XGBoost and Logistic Regression to classify these extracted features and evaluate associated risks. This dual approach facilitates a more nuanced and precise prediction of diabetes, outperforming traditional models, particularly in handling multifaceted and nonlinear medical datasets. Our research demonstrates a notable advancement in diabetes prediction over traditional methods, showcasing the effectiveness of our combined BiLSTM-CRF, XGBoost, and Logistic Regression model. This study highlights the value of data-driven strategies in clinical decision-making, equipping healthcare professionals with precise tools for early detection and intervention. By enabling personalized treatment and timely care, our approach signifies progress in incorporating advanced analytics in healthcare, potentially improving outcomes for diabetes and other chronic conditions.
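
The abstract does not say how the XGBoost and Logistic Regression outputs are fused in the second phase. As a minimal sketch only, the Python below assumes one common option: a weighted average of the two predicted probabilities. The function names, the weight `alpha`, the feature values, and the placeholder boosted-model probability are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logistic_probability(features: Sequence[float],
                         weights: Sequence[float],
                         bias: float) -> float:
    """Risk probability from an already-fitted logistic regression (hypothetical weights)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def fuse_risk(p_boosted: float, p_logistic: float, alpha: float = 0.5) -> float:
    """Weighted average of two risk probabilities (an assumed fusion rule, not the paper's)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_boosted + (1.0 - alpha) * p_logistic


if __name__ == "__main__":
    # Hypothetical features for one patient, standing in for the first-phase output.
    features = [0.2, -1.0, 0.5]
    p_lr = logistic_probability(features, weights=[1.0, 0.5, 2.0], bias=-0.3)
    p_gb = 0.70  # placeholder for a gradient-boosted model's predicted probability
    print(f"fused risk: {fuse_risk(p_gb, p_lr, alpha=0.6):.3f}")
```

Other fusion schemes would fit the same description, for example stacking, where the logistic regression is trained on the boosted model's outputs. The averaging rule here is chosen only because it is the simplest to show.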

Keywords
deep learning
electronic health records
BiLSTM-CRF
XGBoost
healthcare analytics

Data Availability Statement
Data will be made available on request.

Funding
This work received no external funding.

Conflicts of Interest
The authors declare no conflicts of interest.

Ethical Approval and Consent to Participate
This study is a retrospective analysis based solely on fully anonymized historical physical examination records. According to the regulations of the National Health Commission of China and the institutional policy of the data-providing health check center, retrospective studies using completely de-identified routine clinical data are exempt from formal ethics approval and individual informed consent. All data processing strictly followed national regulations on personal information protection and patient privacy.

Cite This Article
APA Style
Pang, H., Zhou, L., Dong, Y., Chen, P., Gu, D., Lyu, T., & Zhang, H. (2024). Electronic Health Records-Based Data-Driven Diabetes Knowledge Unveiling and Risk Prognosis. ICCK Transactions on Intelligent Systematics, 2(1), 1–13. https://doi.org/10.62762/TIS.2025.367320
Export Citation
RIS Format
TY  - JOUR
AU  - Pang, Huadong
AU  - Zhou, Li
AU  - Dong, Yiping
AU  - Chen, Peiyuan
AU  - Gu, Dian
AU  - Lyu, Tianyi
AU  - Zhang, Hansong
PY  - 2024
DA  - 2024/12/22
TI  - Electronic Health Records-Based Data-Driven Diabetes Knowledge Unveiling and Risk Prognosis
JO  - ICCK Transactions on Intelligent Systematics
T2  - ICCK Transactions on Intelligent Systematics
JF  - ICCK Transactions on Intelligent Systematics
VL  - 2
IS  - 1
SP  - 1
EP  - 13
DO  - 10.62762/TIS.2025.367320
UR  - https://www.icck.org/article/abs/TIS.2025.367320
KW  - deep learning
KW  - electronic health records
KW  - BiLSTM-CRF
KW  - XGBoost
KW  - healthcare analytics
SN  - 3068-5079
PB  - Institute of Central Computation and Knowledge
LA  - English
ER  - 
BibTeX Format
@article{Pang2024Electronic,
  author = {Huadong Pang and Li Zhou and Yiping Dong and Peiyuan Chen and Dian Gu and Tianyi Lyu and Hansong Zhang},
  title = {Electronic Health Records-Based Data-Driven Diabetes Knowledge Unveiling and Risk Prognosis},
  journal = {ICCK Transactions on Intelligent Systematics},
  year = {2024},
  volume = {2},
  number = {1},
  pages = {1-13},
  doi = {10.62762/TIS.2025.367320},
  url = {https://www.icck.org/article/abs/TIS.2025.367320},
  keywords = {deep learning, electronic health records, BiLSTM-CRF, XGBoost, healthcare analytics},
  issn = {3068-5079},
  publisher = {Institute of Central Computation and Knowledge}
}

Article Metrics
Citations: Crossref 3 | Scopus 9 | Web of Science 3
Article Access Statistics:
Views: 2450
PDF Downloads: 442

Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
Institute of Central Computation and Knowledge (ICCK) or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
ICCK Transactions on Intelligent Systematics

ISSN: 3068-5079 (Online) | ISSN: 3069-003X (Print)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/icck/