JISE

The key challenge in big data processing frameworks such as Hadoop distributed file system (HDFS) is to optimize the throughput for read operations. Toward this goal, several studies have been conducted to enhance read performance on heterogeneous storages. Recently, although HDFS has supported several storage policies for placing data blocks in heterogeneous storages, it fails to fully utilize the potential of fast storages (e.g., SSD). The primary reason for its suboptimal read performance is that, while distributing read requests, existing HDFS only considers the network distance between the client and datanodes, thereby incurring more read requests to slower storages with more data (e.g., HDD). In this paper, we propose a new data retrieval policy for distributing read requests on heterogeneous storages in HDFS. Specifically, the proposed policy considers both the unique characteristics of storages in datanodes and the network environments, to efficiently distribute read requests. We develop several policies including the proposed policy to balance these two factors such as random selection, storage type selection, weighted round-robin selection, and dynamic round-robin selection. Our experimental results show that the throughput of the proposed method outperforms those of the existing policies by up to six times in extensive benchmark datasets.