Volume 1, Issue 2, ICCK Transactions on Systems Safety and Reliability
Volume 1, Issue 2, 2025
ICCK Transactions on Systems Safety and Reliability, Volume 1, Issue 2, 2025: 81-97

Free to Read | Review Article | 31 October 2025
Performability Analysis for Large-Scale Multi-State Computing Systems: Methodologies, Advances, and Future Directions
1 School of Computer Science and Technology, Zhejiang University of Water Resources and Electric Power, Hangzhou 310018, China
2 Hangzhou Anheng Information Technology Co., Ltd., Hangzhou 310051, China
3 Razzakov Kyrgyz State Technical University, Bishkek 720044, Kyrgyzstan
4 Zhejiang Keepsoft Information Technology Corp., Ltd., Hangzhou 310051, China
5 Zhejiang YuGong Information Technology Co., Ltd., Hangzhou 310002, China
6 Engineering Research Center of Digital Twin Basin of Zhejiang Province, Hangzhou 310018, China
7 Zhejiang Institute of Hydraulics and Estuary, Hangzhou 310020, China
* Corresponding Author: Yuchang Mo, [email protected]
Received: 09 September 2025, Accepted: 26 September 2025, Published: 31 October 2025  
Abstract
Large-scale computing systems, such as cloud data centers, grid infrastructures, and high-performance computing clusters, are the backbone of modern information technology ecosystems. These systems typically consist of numerous heterogeneous, multi-state computing nodes whose performance levels vary because of component failures, degradation, or dynamic resource allocation. Performability analysis integrates reliability and performance evaluation to quantify the probability that a system operates at a specified performance level, and it is critical for ensuring the efficient, reliable, and cost-effective operation of these complex systems. This paper presents a comprehensive review of advances in performability analysis for large-scale multi-state computing systems over the past decade. It classifies existing research into three core methodological categories: binary decision diagram (BDD)-based approaches, multi-valued decision diagram (MDD)-based approaches, and comparative benchmarking against traditional methods such as continuous-time Markov chains (CTMC) and the universal generating function (UGF). For each category, the paper details key methodologies, algorithmic innovations, and practical applications. It also proposes future directions to address emerging challenges, such as handling dynamic system behaviors, integrating real-time data, and optimizing resource allocation for performability. This review provides a reference for researchers, system designers, and operators seeking to enhance the performability of large-scale computing systems and mitigate the risks associated with service level agreement (SLA) violations.
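
To make the quantity described in the abstract concrete, the sketch below computes the probability that a small multi-state system delivers at least a demanded performance level. It is only an illustration in the spirit of the universal generating function (UGF) baseline named above, not the BDD or MDD algorithms the review covers. The function names, the node state distributions, and the assumption that node capacities simply add up are all hypothetical choices made for the example.

```python
from collections import defaultdict


def combine(dist_a, dist_b):
    """Combine two independent performance distributions.

    Each distribution maps a performance level to its probability.
    Assumption for this sketch: levels of independent nodes add up.
    """
    out = defaultdict(float)
    for level_a, prob_a in dist_a.items():
        for level_b, prob_b in dist_b.items():
            out[level_a + level_b] += prob_a * prob_b
    return dict(out)


def system_distribution(nodes):
    """Fold all node distributions into one system-level distribution."""
    total = {0: 1.0}  # neutral element: level 0 with probability 1
    for node in nodes:
        total = combine(total, node)
    return total


def performability(nodes, demand):
    """Probability that total system performance is at least `demand`."""
    dist = system_distribution(nodes)
    return sum(prob for level, prob in dist.items() if level >= demand)


# Hypothetical three-node system; levels and probabilities are made up.
nodes = [
    {0: 0.1, 5: 0.2, 10: 0.7},  # node with failed, degraded, full states
    {0: 0.1, 5: 0.2, 10: 0.7},
    {0: 0.05, 8: 0.95},         # two-state node
]

if __name__ == "__main__":
    print(performability(nodes, demand=15))
```

Merging equal performance levels after every combination step keeps the intermediate distribution small, but the number of distinct levels can still grow quickly with many heterogeneous nodes, which is the kind of scaling pressure that motivates the decision-diagram approaches surveyed in the paper.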

Keywords
performability analysis
large-scale computing systems
multi-state systems
binary decision diagrams (BDD)
multi-valued decision diagrams (MDD)
reliability
system performance

Data Availability Statement
Not applicable.

Funding
This work received no funding.

Conflicts of Interest
Yuan Fan and Chunyu Miao are employees of Hangzhou Anheng Information Technology Co., Ltd., Hangzhou 310051, China; Faer Gui is an employee of Zhejiang Keepsoft Information Technology Corp., Ltd., Hangzhou 310051, China; Rengui Zhang is an employee of Zhejiang YuGong Information Technology Co., Ltd., Hangzhou 310002, China; Jianyong Hu is an employee of the Engineering Research Center of Digital Twin Basin of Zhejiang Province, Hangzhou 310018, China; Jinbin Mu is an employee of the Zhejiang Institute of Hydraulics and Estuary, Hangzhou 310020, China.

Ethical Approval and Consent to Participate
Not applicable.

Cite This Article
APA Style
Mo, Y., Fan, Y., Miao, C., Chynybaev, M., Gui, F., Zhang, R., Hu, J., Mu, J. & Chymyrov, A. (2025). Performability Analysis for Large-Scale Multi-State Computing Systems: Methodologies, Advances, and Future Directions. ICCK Transactions on Systems Safety and Reliability, 1(2), 81–97. https://doi.org/10.62762/TSSR.2025.527003

Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
Institute of Central Computation and Knowledge (ICCK) or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
ICCK Transactions on Systems Safety and Reliability

ISSN: 3069-1087 (Online)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/icck/