Performability Analysis for Large-Scale Multi-State Computing Systems: Methodologies, Advances, and Future Directions

Yuchang Mo; Yuan Fan; Chunyu Miao; Mirlan Chynybaev; Faer Gui; Rengui Zhang; Jianyong Hu; Jinbin Mu; Akylbek Chymyrov

doi:10.62762/TSSR.2025.527003

Article Information

Published in: ICCK Transactions on Systems Safety and Reliability

Volume/Issue: Volume 1, Issue 2, 2025

Pages: 81-97

Cited by: 2 (Crossref), 1 (Scopus)

Abstract

Large-scale computing systems, such as cloud data centers, grid infrastructures, and high-performance computing clusters, are the backbone of modern information technology ecosystems. These systems typically consist of numerous heterogeneous, multi-state computing nodes whose performance levels vary because of component failures, degradation, or dynamic resource allocation. Performability analysis combines reliability and performance evaluation to quantify the probability that a system operates at a specified performance level, and it is critical for ensuring the efficient, reliable, and cost-effective operation of these complex systems. This paper presents a comprehensive review of advances in performability analysis for large-scale multi-state computing systems over the past decade. It classifies existing research into three core methodological categories: binary decision diagram (BDD)-based approaches, multi-valued decision diagram (MDD)-based approaches, and comparative benchmarking against traditional methods such as continuous-time Markov chains (CTMC) and the universal generating function (UGF). For each category, the paper details key methodologies, algorithmic innovations, and practical applications. It also proposes future directions to address emerging challenges, such as handling dynamic system behaviors, integrating real-time data, and optimizing resource allocation for performability. The review is intended as a reference for researchers, system designers, and operators seeking to improve the performability of large-scale computing systems and to mitigate the risk of service level agreement (SLA) violations.
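To make the definition of performability concrete, the following is a minimal sketch in Python, not the method of any paper surveyed here. It assumes that nodes are statistically independent, that system performance is the sum of node performance levels, and that each node has three hypothetical states (failed, degraded, full) with made-up probabilities. The function names `combine`, `system_distribution`, and `performability` are illustrative only. The sketch enumerates the system performance distribution by pairwise combination, in the spirit of the UGF approach mentioned above, and then sums the probability mass at or above a demand level.

```python
from collections import defaultdict


def combine(dist_a, dist_b):
    """Distribution of the summed performance of two independent nodes.

    Each distribution maps a performance level to its probability.
    Summation as the composition rule is an assumption of this sketch.
    """
    out = defaultdict(float)
    for perf_a, p_a in dist_a.items():
        for perf_b, p_b in dist_b.items():
            out[perf_a + perf_b] += p_a * p_b
    return dict(out)


def system_distribution(nodes):
    """Fold all node distributions into one system-level distribution."""
    total = {0: 1.0}  # neutral element: zero performance with certainty
    for node in nodes:
        total = combine(total, node)
    return total


def performability(nodes, demand):
    """Probability that total system performance meets or exceeds demand."""
    dist = system_distribution(nodes)
    return sum(p for perf, p in dist.items() if perf >= demand)


# Hypothetical three-state node: failed (0), degraded (50), full (100).
node = {0: 0.05, 50: 0.15, 100: 0.80}
nodes = [node, node, node]

result = performability(nodes, demand=200)
print(f"P(system performance >= 200) = {result:.4f}")
```

Because like performance levels are merged after every combination, the intermediate distribution stays small when many nodes share the same levels. Direct enumeration of this kind still grows quickly for heterogeneous systems, which is the scalability problem that decision-diagram methods aim to address.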

Keywords

performability analysis; large-scale computing systems; multi-state systems; binary decision diagrams (BDD); multi-valued decision diagrams (MDD); reliability; system performance

Data Availability Statement

Not applicable.

Funding

This work received no external funding.

Conflicts of Interest

Yuan Fan and Chunyu Miao are employees of Hangzhou Anheng Information Technology Co., Ltd., Hangzhou 310051, China; Faer Gui is an employee of Zhejiang Keepsoft Information Technology Corp., Ltd., Hangzhou 310051, China; Rengui Zhang is an employee of Zhejiang YuGong Information Technology Co., Ltd., Hangzhou 310002, China; Jianyong Hu is an employee of the Engineering Research Center of Digital Twin Basin of Zhejiang Province, Hangzhou 310018, China; Jinbin Mu is an employee of the Zhejiang Institute of Hydraulics and Estuary, Hangzhou 310020, China.

Ethical Approval and Consent to Participate

Not applicable.

Cite This Article

APA Style

Mo, Y., Fan, Y., Miao, C., Chynybaev, M., Gui, F., Zhang, R., Hu, J., Mu, J., & Chymyrov, A. (2025). Performability Analysis for Large-Scale Multi-State Computing Systems: Methodologies, Advances, and Future Directions. ICCK Transactions on Systems Safety and Reliability, 1(2), 81–97. https://doi.org/10.62762/TSSR.2025.527003

Export Citation

RIS Format

Compatible with EndNote, Zotero, Mendeley, and other reference managers

TY  - JOUR
AU  - Mo, Yuchang
AU  - Fan, Yuan
AU  - Miao, Chunyu
AU  - Chynybaev, Mirlan
AU  - Gui, Faer
AU  - Zhang, Rengui
AU  - Hu, Jianyong
AU  - Mu, Jinbin
AU  - Chymyrov, Akylbek
PY  - 2025
DA  - 2025/10/31
TI  - Performability Analysis for Large-Scale Multi-State Computing Systems: Methodologies, Advances, and Future Directions
JO  - ICCK Transactions on Systems Safety and Reliability
T2  - ICCK Transactions on Systems Safety and Reliability
JF  - ICCK Transactions on Systems Safety and Reliability
VL  - 1
IS  - 2
SP  - 81
EP  - 97
DO  - 10.62762/TSSR.2025.527003
UR  - https://www.icck.org/article/abs/TSSR.2025.527003
KW  - performability analysis
KW  - large-scale computing systems
KW  - multi-state systems
KW  - binary decision diagrams (BDD)
KW  - multi-valued decision diagrams (MDD)
KW  - reliability
KW  - system performance
SN  - 3069-1087
PB  - Institute of Central Computation and Knowledge
LA  - English
ER  -

BibTeX Format

Compatible with LaTeX, BibTeX, and other reference managers

@article{Mo2025Performabi,
  author = {Yuchang Mo and Yuan Fan and Chunyu Miao and Mirlan Chynybaev and Faer Gui and Rengui Zhang and Jianyong Hu and Jinbin Mu and Akylbek Chymyrov},
  title = {Performability Analysis for Large-Scale Multi-State Computing Systems: Methodologies, Advances, and Future Directions},
  journal = {ICCK Transactions on Systems Safety and Reliability},
  year = {2025},
  volume = {1},
  number = {2},
  pages = {81-97},
  doi = {10.62762/TSSR.2025.527003},
  url = {https://www.icck.org/article/abs/TSSR.2025.527003},
  keywords = {performability analysis, large-scale computing systems, multi-state systems, binary decision diagrams (BDD), multi-valued decision diagrams (MDD), reliability, system performance},
  issn = {3069-1087},
  publisher = {Institute of Central Computation and Knowledge}
}

Publisher's Note

ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions

Institute of Central Computation and Knowledge (ICCK) or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

ICCK Transactions on Systems Safety and Reliability

ISSN: 3069-1087 (Online)

Preserved at
Portico
