Academic Profile

No academic profile information available at the moment.

Editorial Roles

No Editorial Roles

This user currently does not serve as an editor for any ICCK journals.

ICCK Publications

Total Publications: 2
Free Access | Review Article | 02 March 2026
Reliability Optimization for Large Language Model Training Infrastructure: Challenges, Advances, and Future Directions
ICCK Transactions on Systems Safety and Reliability | Volume 2, Issue 1: 36-53, 2026 | DOI: 10.62762/TSSR.2025.806733
Abstract
The ultra-large scale and prolonged runtime of Large Language Model (LLM) training, often involving thousands of GPUs and spanning weeks, render reliability a pivotal bottleneck. Hardware failures, stragglers, and runtime issues can waste over 30% of GPU resources, delaying model rollout and driving up costs. This survey focuses on reliability optimization for LLM training systems. The discussion centers on three pillars of reliability: fault detection, fault recovery, and straggler mitigation. For each pillar, we dissect innovative mechanisms, which range from communication-aware fault detection to adaptive load balancing, and we assess their impact on critical reliability metrics such as...
Free Access | Review Article | 31 October 2025 | Cited: 1, Scopus 1
Performability Analysis for Large-Scale Multi-State Computing Systems: Methodologies, Advances, and Future Directions
ICCK Transactions on Systems Safety and Reliability | Volume 1, Issue 2: 81-97, 2025 | DOI: 10.62762/TSSR.2025.527003
Abstract
Large-scale computing systems, such as cloud data centers, grid infrastructures, and high-performance computing clusters, are the backbone of modern information technology ecosystems. These systems typically consist of numerous heterogeneous, multi-state computing nodes that exhibit varying performance levels due to component failures, degradation, or dynamic resource allocation. Performability analysis, which integrates both system reliability and performance evaluations to quantify the probability of the system operating at a specified performance level, is critical for ensuring the efficient, reliable, and cost-effective operation of these complex systems. This paper presents a comprehens...
