Performability Analysis for Large-Scale Multi-State Computing Systems: Methodologies, Advances, and Future Directions
Abstract
Large-scale computing systems, such as cloud data centers, grid infrastructures, and high-performance computing clusters, are the backbone of modern information technology ecosystems. These systems typically consist of numerous heterogeneous, multi-state computing nodes whose performance levels vary with component failures, degradation, or dynamic resource allocation. Performability analysis combines reliability and performance evaluation to quantify the probability that a system operates at a specified performance level, and it is critical for the efficient, reliable, and cost-effective operation of these complex systems. This paper reviews advances in performability analysis for large-scale multi-state computing systems over the past decade. It classifies existing research into three core methodological categories: binary decision diagram (BDD)-based approaches, multi-valued decision diagram (MDD)-based approaches, and comparative benchmarking against traditional methods such as continuous-time Markov chains (CTMC) and the universal generating function (UGF). For each category, the paper details key methodologies, algorithmic innovations, and practical applications. It also proposes future directions that address emerging challenges, such as handling dynamic system behaviors, integrating real-time data, and optimizing resource allocation for performability. The review is intended as a reference for researchers, system designers, and operators seeking to enhance the performability of large-scale computing systems and to mitigate the risk of service level agreement (SLA) violations.
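The notion of performability used in the abstract, the probability that the system operates at a specified performance level, can be illustrated with a minimal Python sketch. It is not the BDD- or MDD-based procedure surveyed in the paper. It is a small enumeration in the spirit of the universal generating function, and it rests on assumptions that do not come from the source: the three nodes and their state probabilities are hypothetical, nodes are taken to be statistically independent, and system performance is taken to be the sum of node performances.

```python
from collections import defaultdict

# Hypothetical multi-state nodes (illustrative values, not from the paper).
# Each node maps a performance level (e.g. jobs per second) to its probability.
NODES = [
    {0: 0.10, 5: 0.20, 10: 0.70},
    {0: 0.05, 10: 0.95},
    {0: 0.20, 4: 0.30, 8: 0.50},
]


def compose(u, v):
    """Combine two performance distributions, assuming independent
    components whose performances add (a UGF-style product)."""
    out = defaultdict(float)
    for perf_u, prob_u in u.items():
        for perf_v, prob_v in v.items():
            out[perf_u + perf_v] += prob_u * prob_v
    return dict(out)


def system_distribution(nodes):
    """Fold all node distributions into one system-level distribution."""
    dist = {0: 1.0}  # neutral element: zero performance with certainty
    for node in nodes:
        dist = compose(dist, node)
    return dist


def performability(dist, demand):
    """Probability that system performance is at least `demand`."""
    return sum(prob for perf, prob in dist.items() if perf >= demand)


DIST = system_distribution(NODES)

if __name__ == "__main__":
    for demand in (10, 20, 28):
        print(f"P(performance >= {demand}) = {performability(DIST, demand):.4f}")
```

Merging states with equal total performance keeps the intermediate distribution compact for this toy case. For systems with many heterogeneous nodes and non-additive structure functions, the number of distinct states can grow quickly, which is the kind of scaling problem the decision-diagram methods reviewed in the paper aim to address.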
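Keywords
performability analysis, large-scale computing systems, multi-state systems, binary decision diagrams (BDD), multi-valued decision diagrams (MDD), reliability, system performance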
Cited By (2)
- Peng Su, Xu Yang, Rui Peng, Linmin Hu, Ting Li. Reliability analysis of multi-state aggregation power grid systems under the Vehicle-to-Grid mode. Reliability Engineering & System Safety, 2026, 272.
- Liudong Gu, Guanjun Wang, Yifan Zhou. Reliability analysis of a complex series-parallel performance sharing system with performance excess failure and storage units. Reliability Engineering & System Safety, 2026, 269.
Cite This Article
TY  - JOUR
AU  - Mo, Yuchang
AU  - Fan, Yuan
AU  - Miao, Chunyu
AU  - Chynybaev, Mirlan
AU  - Gui, Faer
AU  - Zhang, Rengui
AU  - Hu, Jianyong
AU  - Mu, Jinbin
AU  - Chymyrov, Akylbek
PY  - 2025
DA  - 2025/10/31
TI  - Performability Analysis for Large-Scale Multi-State Computing Systems: Methodologies, Advances, and Future Directions
JO  - ICCK Transactions on Systems Safety and Reliability
T2  - ICCK Transactions on Systems Safety and Reliability
JF  - ICCK Transactions on Systems Safety and Reliability
VL  - 1
IS  - 2
SP  - 81
EP  - 97
DO  - 10.62762/TSSR.2025.527003
UR  - https://www.icck.org/article/abs/TSSR.2025.527003
KW  - performability analysis
KW  - large-scale computing systems
KW  - multi-state systems
KW  - binary decision diagrams (BDD)
KW  - multi-valued decision diagrams (MDD)
KW  - reliability
KW  - system performance
SN  - 3069-1087
PB  - Institute of Central Computation and Knowledge
LA  - English
ER  -
@article{Mo2025Performabi,
author = {Yuchang Mo and Yuan Fan and Chunyu Miao and Mirlan Chynybaev and Faer Gui and Rengui Zhang and Jianyong Hu and Jinbin Mu and Akylbek Chymyrov},
title = {Performability Analysis for Large-Scale Multi-State Computing Systems: Methodologies, Advances, and Future Directions},
journal = {ICCK Transactions on Systems Safety and Reliability},
year = {2025},
volume = {1},
number = {2},
pages = {81-97},
doi = {10.62762/TSSR.2025.527003},
url = {https://www.icck.org/article/abs/TSSR.2025.527003},
keywords = {performability analysis, large-scale computing systems, multi-state systems, binary decision diagrams (BDD), multi-valued decision diagrams (MDD), reliability, system performance},
issn = {3069-1087},
publisher = {Institute of Central Computation and Knowledge}
}
Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.