Reliability Optimization for Large Language Model Training Infrastructure: Challenges, Advances, and Future Directions
Review Article  ·  Published: 02 March 2026
ICCK Transactions on Systems Safety and Reliability
Volume 2, Issue 1, 2026: 36-53

Yuchang Mo, Jian Wan, Hao Peng, Ruiming Fang, Yuan Fan, Chunyu Miao, Mirlan Chynybaev, Faer Gui, Rengui Zhang, Shuying Zhai, Wen Wu, Jifeng Zhu, Jianyong Hu and Jinbin Mu

1 School of Computer Science and Technology, Zhejiang University of Water Resources and Electric Power, Hangzhou 310018, China
2 School of Computer Science and Technology, Zhejiang Normal University, Jinhua 321004, China
3 School of Information Science and Engineering, Huaqiao University, Quanzhou 362021, China
4 Hangzhou Anheng Information Technology Co., Ltd., Hangzhou, China
5 Razzakov Kyrgyz State Technical University, Bishkek 720044, Kyrgyzstan
6 Zhejiang Keepsoft Information Technology Co., Ltd., Hangzhou, China
7 Zhejiang YuGong Information Technology Co., Ltd., Hangzhou, China
8 School of Mathematical Sciences, Huaqiao University, Quanzhou 362021, China
9 Zhejiang Institute of Hydraulics and Estuary, Hangzhou 310020, China
Corresponding Author: Yuchang Mo, [email protected]

Article Information

Abstract

The ultra-large scale and prolonged runtime of Large Language Model (LLM) training—often involving thousands of GPUs and spanning weeks—render reliability a pivotal bottleneck. Hardware failures, stragglers, and runtime issues can waste over 30% of GPU resources, delaying model rollout and driving up costs. This survey focuses on reliability optimization for LLM training systems. The discussion centers on three pillars of reliability: fault detection, fault recovery, and straggler mitigation. For each pillar, we dissect innovative mechanisms, which range from communication-aware fault detection to adaptive load balancing, and we assess their impact on critical reliability metrics such as error-induced downtime reduction, Mean Time to Failure (MTTF), and slowdown mitigation rate. Additionally, we pinpoint open challenges, such as dynamic fault prediction for mixed workloads and cross-layer reliability coordination, and outline future directions to construct more resilient LLM training systems.
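To make the scale of such losses concrete, the short Python sketch below (illustrative only and not taken from the survey: the simple periodic-checkpoint model, the wasted_gpu_fraction helper, and all parameter values are assumptions) estimates how cluster-level MTTF, checkpoint interval, checkpoint cost, and restart time combine into the fraction of GPU-hours that contributes no training progress. With plausible values for a multi-thousand-GPU job, the estimate quickly reaches the tens of percent quoted above.

# Illustrative sketch (not from the survey): back-of-envelope estimate of how
# cluster-level MTTF, checkpoint interval, and restart time translate into the
# fraction of GPU-hours lost to failures. All parameter values are assumptions.

def wasted_gpu_fraction(mttf_hours: float,
                        checkpoint_interval_hours: float,
                        checkpoint_cost_hours: float,
                        restart_hours: float) -> float:
    """Expected fraction of wall-clock GPU time that makes no training
    progress, under a simple periodic-checkpoint failure model."""
    # On average a failure strikes midway through a checkpoint interval,
    # so roughly half an interval of work is lost, plus detection/restart time.
    lost_per_failure = checkpoint_interval_hours / 2 + restart_hours
    failures_per_hour = 1.0 / mttf_hours
    # Overhead of writing the checkpoints themselves.
    checkpoint_overhead = checkpoint_cost_hours / checkpoint_interval_hours
    return failures_per_hour * lost_per_failure + checkpoint_overhead


if __name__ == "__main__":
    # Hypothetical numbers: a failure somewhere in the cluster every 4 hours,
    # checkpoints every 30 minutes costing 2 minutes each, and 20 minutes to
    # detect the fault and restart the job.
    frac = wasted_gpu_fraction(mttf_hours=4.0,
                               checkpoint_interval_hours=0.5,
                               checkpoint_cost_hours=2 / 60,
                               restart_hours=20 / 60)
    print(f"Estimated wasted GPU time: {frac:.1%}")  # about 21% here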

Keywords

reliability optimization; large language model; training infrastructure; fault detection; failure recovery; straggler mitigation

Data Availability Statement

Not applicable.

Funding

This work was supported by the Joint Fund of Zhejiang Provincial Natural Science Foundation of China under Grant LGEZ26F030002 and by the Scientific Research Foundation of Zhejiang University of Water Resources and Electric Power under Grant JBGS2025009.

Conflicts of Interest

Yuan Fan and Chunyu Miao are affiliated with Hangzhou Anheng Information Technology Co., Ltd., Hangzhou, China; Faer Gui is affiliated with Zhejiang Keepsoft Information Technology Co., Ltd., Hangzhou, China; and Rengui Zhang is affiliated with Zhejiang YuGong Information Technology Co., Ltd., Hangzhou, China. The authors declare that these affiliations had no influence on the study design, data collection, analysis, interpretation, or the decision to publish, and that no other competing interests exist.

AI Use Statement

The authors declare that no generative AI was used in the preparation of this manuscript.

Ethical Approval and Consent to Participate

Not applicable.


Cite This Article

APA Style
Mo, Y., Wan, J., Peng, H., Fang, R., Fan, Y., Miao, C., Chynybaev, M., Gui, F., Zhang, R., Zhai, S., Wu, W., Zhu, J., Hu, J., & Mu, J. (2026). Reliability Optimization for Large Language Model Training Infrastructure: Challenges, Advances, and Future Directions. ICCK Transactions on Systems Safety and Reliability, 2(1), 36–53. https://doi.org/10.62762/TSSR.2025.806733


Publisher's Note

ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions

Institute of Central Computation and Knowledge (ICCK) or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
ICCK Transactions on Systems Safety and Reliability
ISSN: 3069-1087 (Online)
Preserved at Portico