
Journal of Information Science and Engineering, Vol. 33 No. 1, pp. 81-99

A Reliability Analysis for Successful Execution of Parallel DAG Tasks

Department of Computer Science and Technology
Tongji University
Tongji Branch, National Engineering and Technology Center of High Performance Computer
Shanghai, 201804 P.R. China
E-mail: hookk@msn.com; {gszeng; wenjuanliu; wwang}@tongji.edu.cn

    Large-scale parallel computing systems are becoming increasingly failure-prone as the number of computational nodes grows, which creates serious reliability problems for parallel computing. To ensure the successful execution of parallel tasks such as Meta tasks and DAG tasks, it is necessary to perform a reliability analysis before scheduling them. For Meta tasks, we discuss the key factors that impede the successful execution of a single task and then present a reliability formula for Meta tasks. For DAG tasks, hardware failures, software failures, network link failures, and subtask execution order are all taken into account: we calculate not only the reliability of the subtasks but also the reliability of network communication. Two reliability algorithms for DAG tasks are then designed. Finally, experiments are conducted; the results show that our reliability analysis methods are more effective and comprehensive.

Keywords: parallel computing, meta tasks, DAG tasks, successful execution, reliability
