Abstract
Large-scale computing systems, such as cloud data centers, grid infrastructures, and high-performance computing clusters, are the backbone of modern information technology ecosystems. These systems typically consist of numerous heterogeneous, multi-state computing nodes that exhibit varying performance levels due to component failures, degradation, or dynamic resource allocation. Performability analysis integrates reliability and performance evaluation to quantify the probability that a system operates at a specified performance level, and it is critical for ensuring the efficient, reliable, and cost-effective operation of these complex systems. This paper presents a comprehensive review of advances in performability analysis for large-scale multi-state computing systems over the past decade. It classifies existing research into three core methodological categories: binary decision diagram (BDD)-based approaches, multi-valued decision diagram (MDD)-based approaches, and comparative benchmarking against traditional methods such as continuous-time Markov chains (CTMC) and the universal generating function (UGF). For each category, the paper details key methodologies, algorithmic innovations, and practical applications. It also proposes future directions to address emerging challenges, such as handling dynamic system behaviors, integrating real-time data, and optimizing resource allocation for performability. This review provides a reference for researchers, system designers, and operators seeking to enhance the performability of large-scale computing systems and mitigate the risk of service level agreement (SLA) violations.
Keywords
performability analysis
large-scale computing systems
multi-state systems
binary decision diagrams (BDD)
multi-valued decision diagrams (MDD)
reliability
system performance
Data Availability Statement
Not applicable.
Funding
This work received no funding.
Conflicts of Interest
Yuan Fan and Chunyu Miao are employees of Hangzhou Anheng Information Technology Co., Ltd., Hangzhou 310051, China; Faer Gui is an employee of Zhejiang Keepsoft Information Technology Corp., Ltd., Hangzhou 310051, China; Rengui Zhang is an employee of Zhejiang YuGong Information Technology Co., Ltd., Hangzhou 310002, China; Jianyong Hu is an employee of Engineering Research Center of Digital Twin Basin of Zhejiang Province, Hangzhou 310018, China; Jinbin Mu is an employee of Zhejiang Institute of Hydraulics and Estuary, Hangzhou 310020, China.
Ethical Approval and Consent to Participate
Not applicable.
Cite This Article
APA Style
Mo, Y., Fan, Y., Miao, C., Chynybaev, M., Gui, F., Zhang, R., Hu, J., Mu, J., & Chymyrov, A. (2025). Performability Analysis for Large-Scale Multi-State Computing Systems: Methodologies, Advances, and Future Directions. ICCK Transactions on Systems Safety and Reliability, 1(2), 81–97. https://doi.org/10.62762/TSSR.2025.527003
Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and Permissions
Institute of Central Computation and Knowledge (ICCK) or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.