Volume 3, Issue 1, ICCK Transactions on Emerging Topics in Artificial Intelligence
Volume 3, Issue 1, 2026
ICCK Transactions on Emerging Topics in Artificial Intelligence, Volume 3, Issue 1, 2026: 1-8

Open Access | Research Article | 12 November 2025
Hybrid Large Language Model and Rule-Based Framework for Automated PHI De-Identification in Clinical Notes
by Kai Ye 1,*
1 School of Mathematics and Computer Science, Hezhou University, Hezhou 542899, China
* Corresponding Author: Kai Ye, [email protected]
Received: 07 August 2025, Accepted: 25 August 2025, Published: 12 November 2025  
Abstract
The growing demand for secondary use of electronic health records (EHRs) in clinical research has amplified the importance of effective de-identification of protected health information (PHI) to comply with privacy regulations such as HIPAA. Manual annotation remains error-prone, time-consuming, and inconsistent across healthcare institutions, while existing automated systems often face trade-offs between accuracy, interpretability, and computational cost. This study proposes a novel hybrid de-identification framework that integrates neural, statistical, and rule-based approaches to achieve high recall, operational efficiency, and deployment feasibility in real-world healthcare settings.
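
The abstract does not describe the framework's internals, so the following is only a minimal sketch of the general idea of combining rule-based and model-based PHI detection, not the author's implementation. It assumes a recall-oriented design in which spans from both detectors are unioned before redaction. The regular expressions, the `toy_model_spans` stand-in for a neural tagger, and all function names are hypothetical and chosen for illustration.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, PHI label)

# Illustrative patterns only; these are assumptions, not the paper's rule set.
RULES = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Return every span matched by the hand-written patterns."""
    return [
        (m.start(), m.end(), label)
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Union overlapping spans, keeping the label of the earliest one."""
    merged: List[Span] = []
    for start, end, label in sorted(spans):
        if merged and start < merged[-1][1]:
            prev_start, prev_end, prev_label = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_label)
        else:
            merged.append((start, end, label))
    return merged


def deidentify(text: str, model_spans: Callable[[str], List[Span]]) -> str:
    """Replace each detected span with a bracketed label placeholder."""
    spans = merge_spans(rule_spans(text) + model_spans(text))
    out, cursor = [], 0
    for start, end, label in spans:
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def toy_model_spans(text: str) -> List[Span]:
    """Hypothetical stand-in for a neural tagger: flags one hard-coded name."""
    name = "John Smith"
    i = text.find(name)
    return [(i, i + len(name), "NAME")] if i >= 0 else []


note = "John Smith (MRN: 1234567) seen 03/14/2024, call 555-123-4567."
redacted = deidentify(note, toy_model_spans)
print(redacted)
```

Taking the union of both detectors trades some precision for recall: anything either component flags is redacted. Whether the published framework resolves disagreements this way is not stated in the abstract.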

Keywords
PHI de-identification
clinical NLP
large language models
hybrid systems
parameter-efficient fine-tuning (PEFT)
electronic health records
privacy preservation
retrieval-augmented generation (RAG)
rule-based NLP
biomedical text processing

Data Availability Statement
Data will be made available on request.

Funding
This work received no external funding.

Conflicts of Interest
The author declares no conflicts of interest.

Ethical Approval and Consent to Participate
This study involves the secondary analysis of de-identified clinical notes from the Lifespan Health Network (20,000 notes) and the publicly available MIMIC-III v1.4 dataset (1,200 ICU notes), with no direct interaction with human subjects or access to identifiable private information. All data were processed in accordance with HIPAA regulations and IRB-approved data governance protocols at the originating institutions, ensuring compliance with data minimization principles. As this research qualifies as exempt secondary research using de-identified data under 45 CFR 46.104(d)(4), no additional Institutional Review Board (IRB) approval or informed consent was required.

Cite This Article
APA Style
Ye, K. (2025). Hybrid Large Language Model and Rule-Based Framework for Automated PHI De-Identification in Clinical Notes. ICCK Transactions on Emerging Topics in Artificial Intelligence, 3(1), 1–8. https://doi.org/10.62762/TETAI.2025.518010

Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
CC BY Copyright © 2025 by the Author(s). Published by Institute of Central Computation and Knowledge. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
ICCK Transactions on Emerging Topics in Artificial Intelligence

ISSN: 3068-6652 (Online)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/icck/