Academic Profile

No academic profile information available at the moment.

Editorial Roles

No Editorial Roles

This user currently does not serve as an editor for any ICCK journals.

ICCK Publications

Total Publications: 1
Free Access | Review Article | 02 March 2026
Reliability Optimization for Large Language Model Training Infrastructure: Challenges, Advances, and Future Directions
ICCK Transactions on Systems Safety and Reliability | Volume 2, Issue 1: 36-53, 2026 | DOI: 10.62762/TSSR.2025.806733
Abstract
The ultra-large scale and prolonged runtime of Large Language Model (LLM) training, often involving thousands of GPUs and spanning weeks, render reliability a pivotal bottleneck. Hardware failures, stragglers, and runtime issues can waste over 30% of GPU resources, delaying model rollout and driving up costs. This survey focuses on reliability optimization for LLM training systems. The discussion centers on three pillars of reliability: fault detection, fault recovery, and straggler mitigation. For each pillar, we dissect innovative mechanisms, which range from communication-aware fault detection to adaptive load balancing, and we assess their impact on critical reliability metrics such as...