JISE




Journal of Information Science and Engineering, Vol. 37 No. 3, pp. 709-729


Optimizing Read Operations of Hadoop Distributed File System on Heterogeneous Storages


JONGBAEG LEE¹, JONGWUK LEE² AND SANG-WON LEE²
¹Department of Electrical and Computer Engineering
²College of Computing
Sungkyunkwan University
Suwon, 16419 Korea


The key challenge in big data processing frameworks such as the Hadoop distributed file system (HDFS) is to optimize the throughput of read operations. Toward this goal, several studies have been conducted to enhance read performance on heterogeneous storages. Although HDFS has recently supported several storage policies for placing data blocks in heterogeneous storages, it fails to fully utilize the potential of fast storages (e.g., SSD). The primary reason for its suboptimal read performance is that, when distributing read requests, existing HDFS considers only the network distance between the client and datanodes, thereby directing more read requests to slower storages that hold more data (e.g., HDD). In this paper, we propose a new data retrieval policy for distributing read requests on heterogeneous storages in HDFS. Specifically, the proposed policy considers both the unique characteristics of the storages in datanodes and the network environment to distribute read requests efficiently. We develop several policies that balance these two factors, including the proposed policy: random selection, storage type selection, weighted round-robin selection, and dynamic round-robin selection. Our experimental results show that the throughput of the proposed method exceeds that of the existing policies by up to six times on extensive benchmark datasets.
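
As an illustration of one of the policies named above, weighted round-robin selection over the replicas of a block might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation evaluated in the paper: the weights (SSD 3, HDD 1), the `Replica` class, and the `pick_replica` function are hypothetical, the smooth weighted round-robin variant is chosen here only for illustration, and network distance is left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical weights; the abstract does not state the values used in the paper.
STORAGE_WEIGHT = {"SSD": 3, "HDD": 1}


@dataclass
class Replica:
    datanode: str       # hypothetical datanode identifier
    storage_type: str   # "SSD" or "HDD"
    current: int = 0    # running credit used by smooth weighted round-robin


def pick_replica(replicas):
    """Return the replica that should serve the next read request."""
    total = sum(STORAGE_WEIGHT[r.storage_type] for r in replicas)
    # Every replica earns credit in proportion to its storage weight.
    for r in replicas:
        r.current += STORAGE_WEIGHT[r.storage_type]
    # The replica with the most credit serves the read and pays back the total.
    chosen = max(replicas, key=lambda r: r.current)
    chosen.current -= total
    return chosen


replicas = [Replica("dn1", "SSD"), Replica("dn2", "HDD"), Replica("dn3", "HDD")]
order = [pick_replica(replicas).datanode for _ in range(5)]
print(order)
```

With these assumed weights, the SSD replica serves three of every five reads while the two HDD replicas serve one each, so faster storage receives more requests without starving the slower replicas.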


Keywords: Hadoop distributed file system, heterogeneous storage, data retrieval policy, MapReduce, load balancing
