JISE




Journal of Information Science and Engineering, Vol. 31 No. 6, pp. 1937-1959


Fault Tolerant Scheduling for Parallel Loops on Shared Memory Systems


YIZHUO WANG1, ROSARIO CAMMAROTA2 AND ALEXANDRU NICOLAU3 
1School of Computer Science Beijing Institute of Technology 
Beijing, 100081 P.R. China 
E-mail: frankwyz@bit.edu.cn 
2Qualcomm Research 
San Diego, CA 92121, USA 
E-mail: rosario.c@ics.uci.edu 
3Department of Computer Science 
University of California 
Irvine, CA 92697, USA 
E-mail: nicolau@ics.uci.edu


    Multicore and multiprocessor systems achieve significant speedups for many applications by exploiting loop-level parallelism, but they also suffer from growing reliability problems as device sizes continue to scale down. This paper addresses the reliability of loop-dominated applications, aiming to execute parallel loops efficiently in the presence of various types of hardware faults. We present a fault-tolerant work-stealing scheme that makes parallel loop execution resilient to such faults. The scheme applies a lightweight buffer-commit mechanism to ensure that re-executed loop iterations produce correct results. In addition, it splits large failing chunks of loop iterations at runtime to improve load balancing, and it discards a worker thread on which faults occur frequently. We evaluated our techniques on a multi-socket multicore system using a set of loop-dominated benchmarks. The proposed scheme achieves minimal overhead for supporting fault tolerance together with optimal load balancing.
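
The three mechanisms named in the abstract (buffer-commit, splitting of failing chunks, and discarding of fault-prone workers) can be illustrated with a small sketch. It is not the authors' implementation. It is a single-threaded simulation that uses one shared chunk queue in place of per-worker deques with stealing, and all names in it (`run_loop`, `Fault`, `fault_at`, `max_faults`) are hypothetical. Faults are injected at chosen iteration indices and each fires once, which is an assumption made only so that the example terminates deterministically.

```python
from collections import deque


class Fault(Exception):
    """Stands in for a detected hardware fault (hypothetical)."""


def run_loop(n, body, out, num_workers=2, chunk=8, fault_at=(), max_faults=2):
    """Run body(i) for i in range(n); results reach `out` only per clean chunk."""
    # Shared queue of [lo, hi) chunks; a real scheme would use per-worker deques.
    pending = deque((lo, min(lo + chunk, n)) for lo in range(0, n, chunk))
    faults = [0] * num_workers          # fault count per simulated worker
    alive = list(range(num_workers))    # workers that have not been discarded
    remaining = set(fault_at)           # assumption: each injected fault fires once
    turn = 0
    while pending:
        w = alive[turn % len(alive)]    # round-robin stand-in for real threads
        turn += 1
        lo, hi = pending.popleft()
        buf = {}                        # private buffer: nothing visible in `out` yet
        try:
            for i in range(lo, hi):
                if i in remaining:
                    remaining.discard(i)
                    raise Fault(i)
                buf[i] = body(i)
        except Fault:
            faults[w] += 1
            # Discard a worker that faults too often, but keep at least one.
            if faults[w] >= max_faults and len(alive) > 1:
                alive.remove(w)
            if hi - lo > 1:
                # Split the failing chunk so the halves can be re-executed separately.
                mid = (lo + hi) // 2
                pending.appendleft((mid, hi))
                pending.appendleft((lo, mid))
            else:
                pending.appendleft((lo, hi))
            continue                    # buffer is dropped, so re-execution starts clean
        for i, v in buf.items():        # commit: publish the fault-free chunk
            out[i] = v
    return alive, faults


out = [None] * 20
alive, faults = run_loop(20, lambda i: i * i, out, fault_at={3, 5})
```

Because a chunk writes to `out` only after all of its iterations finish without a fault, a re-executed chunk never observes partial results from a failed attempt, which is the property the buffer-commit step is meant to provide.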


Keywords: fault tolerance, loop scheduling, work-stealing, multicore and multiprocessor, self-scheduling
