Reliability Optimization for Large Language Model Training Infrastructure: Challenges, Advances, and Future Directions
Abstract
The ultra-large scale and prolonged runtime of Large Language Model (LLM) training—often involving thousands of GPUs and spanning weeks—render reliability a pivotal bottleneck. Hardware failures, stragglers, and runtime issues can waste over 30% of GPU resources, delaying model rollout and driving up costs. This survey focuses on reliability optimization for LLM training systems. The discussion centers on three pillars of reliability: fault detection, fault recovery, and straggler mitigation. For each pillar, we dissect innovative mechanisms, which range from communication-aware fault detection to adaptive load balancing, and we assess their impact on critical reliability metrics such as error-induced downtime reduction, Mean Time to Failure (MTTF), and slowdown mitigation rate. Additionally, we pinpoint open challenges, such as dynamic fault prediction for mixed workloads and cross-layer reliability coordination, and outline future directions to construct more resilient LLM training systems.
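The reliability metrics named above can be made concrete with the classic first-order checkpointing model (Young, 1974), which the survey's bibliography also covers. The sketch below estimates MTTF from observed failure counts, derives the Young/Daly checkpoint interval, and approximates the fraction of GPU time lost to checkpoint overhead plus post-failure rework. The function names and example numbers are illustrative assumptions, not taken from the paper.

```python
import math

def mttf_hours(total_gpu_hours: float, failures: int) -> float:
    """Mean Time To Failure estimated from an observed failure count."""
    return total_gpu_hours / failures

def young_daly_interval(checkpoint_cost_h: float, mttf_h: float) -> float:
    """First-order optimal checkpoint interval: sqrt(2 * C * MTTF)."""
    return math.sqrt(2.0 * checkpoint_cost_h * mttf_h)

def wasted_fraction(interval_h: float, checkpoint_cost_h: float,
                    restart_cost_h: float, mttf_h: float) -> float:
    """Approximate fraction of GPU time lost to periodic checkpointing
    plus expected recomputation and restart after each failure."""
    checkpoint_overhead = checkpoint_cost_h / interval_h
    # On average, half an interval of work is lost per failure,
    # plus the fixed restart cost, amortized over the MTTF.
    rework_per_hour = (interval_h / 2.0 + restart_cost_h) / mttf_h
    return checkpoint_overhead + rework_per_hour
```

For instance, with a job-level MTTF of 8 hours and a checkpoint cost of 0.1 hours, the approximation suggests checkpointing roughly every `sqrt(2 * 0.1 * 8) ≈ 1.26` hours; adding a 0.2-hour restart cost, the model predicts close to a fifth of GPU time lost, which is consistent in magnitude with the 30% waste figure the abstract cites for unmitigated failures.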
Keywords
reliability optimization; large language model; training infrastructure; fault detection; failure recovery; straggler mitigation
Cite This Article
@article{Mo2026Reliabilit,
author = {Yuchang Mo and Jian Wan and Hao Peng and Ruiming Fang and Yuan Fan and Chunyu Miao and Mirlan Chynybaev and Faer Gui and Rengui Zhang and Shuying Zhai and Wen Wu and Jifeng Zhu and Jianyong Hu and Jinbin Mu},
title = {Reliability Optimization for Large Language Model Training Infrastructure: Challenges, Advances, and Future Directions},
journal = {ICCK Transactions on Systems Safety and Reliability},
year = {2026},
volume = {2},
number = {1},
pages = {36-53},
doi = {10.62762/TSSR.2025.806733},
url = {https://www.icck.org/article/abs/TSSR.2025.806733},
keywords = {reliability optimization, large language model, training infrastructure, fault detection, failure recovery, straggler mitigation},
issn = {3069-1087},
publisher = {Institute of Central Computation and Knowledge}
}
Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.