JISE




Journal of Information Science and Engineering, Vol. 27 No. 3, pp. 1137-1152


Improving MapReduce Performance by Exploiting Input Redundancy


SHIN-GYU KIM, HYUCK HAN, HYUNGSOO JUNG+, HYEONSANG EOM AND HEON Y. YEOM
School of Computer Science and Engineering 
Seoul National University 
Seoul, 151-744, Korea 
+School of Information Technologies 
University of Sydney 
NSW 2006, Australia


    The proliferation of data-parallel programming on large clusters has opened a new research avenue: accommodating numerous types of data-intensive applications with a feasible plan. Across the many research efforts, we observe a nontrivial amount of redundant I/O in the execution of data-intensive applications. This redundancy has emerged as an issue in the recent literature because even the locality-aware scheduling policy of a MapReduce framework is not effective in a cluster environment where storage nodes cannot provide a computation service. In this article, we introduce SplitCache, which improves the performance of data-intensive OLAP-style applications by reducing redundant I/O in a MapReduce framework. The key strategy is to eliminate such I/O redundancy, especially when different applications read common input data within an overlapping time period: SplitCache caches the first input stream in the computing nodes and reuses it for future demands. We also design a cache-aware task scheduler that plays an important role in achieving high cache utilization. In executing the TPC-H benchmark, we achieved 64.3% faster execution and an 83.48% reduction in network traffic on average.
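
    The caching and scheduling idea summarized above can be illustrated with a minimal sketch. It is not the SplitCache implementation: the class name `CacheAwareScheduler`, the round-robin placement on a cache miss, and the node and split identifiers are all assumptions made for illustration. The sketch only shows how a second job reading the same input splits can be routed to nodes that already hold them, so that no additional remote reads are issued.

```python
from collections import defaultdict


class CacheAwareScheduler:
    """Toy scheduler (hypothetical): prefers nodes that already cache a split."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.cache = defaultdict(set)  # node -> ids of splits cached on that node
        self.remote_reads = 0          # reads that had to go to remote storage
        self._next = 0                 # round-robin cursor used on cache misses

    def assign(self, split_id):
        # Cache hit: run the task on a node that already holds the split.
        for node in self.nodes:
            if split_id in self.cache[node]:
                return node
        # Cache miss: pick a node round-robin (an assumed policy), read the
        # split from remote storage once, and keep it cached on that node.
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        self.remote_reads += 1
        self.cache[node].add(split_id)
        return node


# Two jobs that read the same three input splits.
sched = CacheAwareScheduler(["n1", "n2"])
job_a = [sched.assign(s) for s in ["s0", "s1", "s2"]]
job_b = [sched.assign(s) for s in ["s0", "s1", "s2"]]
```

    In this sketch the second job is placed on the same nodes as the first, and the remote read counter stays at three, one read per distinct split. A real scheduler would also have to account for cache capacity, eviction, and node load, which are omitted here.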


Keywords: MapReduce, I/O redundancy, task scheduling, distributed system, cloud computing
