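MapReduce is one of the most popular cloud technologies today. It is used to process large volumes of data: data mining, unstructured log files, web indexing, and other scientific research that requires large-scale data processing can all achieve excellent execution efficiency through MapReduce. MapReduce is a distributed batch data processing framework that breaks a job into many smaller map and reduce tasks; map handles each small subproblem, and reduce then consolidates the results to produce the final answer.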
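Hadoop is an open-source MapReduce framework that is widely used in cloud computing centered on large-scale data processing. One of its most important components is the scheduler, the hub of Hadoop, which is responsible for dispatching and assigning tasks and for prioritizing resource allocation. How the scheduler selects and assigns tasks affects both the execution efficiency of MapReduce jobs and the utilization of the whole cluster. Hadoop's current default scheduler schedules tasks in first-in-first-out (FIFO) order. One of the challenges in improving MapReduce performance is how to appropriately assign Mappers and Reducers to each node in the cloud. Although there has been much research on improving MapReduce performance, most of the proposed methods still run into problems in practice, such as dynamic load on worker nodes, data locality, and the heterogeneity of compute nodes. We found that Hadoop does not currently handle these problems well, and that overall performance in such situations still has room for improvement.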
We propose the Data Locality Driven Scheduler (DLDS) and implement it on Hadoop in an attempt to improve scheduler performance. We designed several experiments to compare DLDS with other scheduling algorithms under different conditions. The experimental results show that improving data locality raises performance by 10% to 15% on average. / MapReduce is a programming model for processing large data sets. It is typically used for distributed computing on clusters of computers, such as cloud computing platforms. Examples of big data sets include unstructured logs, web indexing, scientific data, and surveillance data.
MapReduce is a distributed processing framework in which a computing job is broken down into many smaller Map tasks and a Reduce task. Each Map task processes one partition of the given data set, and Reduce aggregates the results of the Map tasks to produce the final result.
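The split into Map tasks and a Reduce step can be illustrated with a minimal single-process sketch. It is not Hadoop code: the word-count job and the names `map_task`, `reduce_task`, and `run_job` are assumptions chosen for illustration, and a real cluster would run the map tasks in parallel on different nodes.

```python
from collections import defaultdict


def map_task(partition):
    """Process one partition of the input and emit (word, 1) pairs."""
    return [(word, 1) for line in partition for word in line.split()]


def reduce_task(pairs):
    """Aggregate the intermediate pairs produced by all map tasks."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_job(partitions):
    """Run one map task per partition, then a single reduce over their output."""
    intermediate = []
    for partition in partitions:
        intermediate.extend(map_task(partition))
    return reduce_task(intermediate)


# Two partitions of a toy data set (hypothetical input).
partitions = [["a b a"], ["b c"]]
result = run_job(partitions)
print(result)
```

Each call to `map_task` sees only its own partition, which is what lets a scheduler place map tasks on different nodes independently.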
Hadoop is an open-source MapReduce architecture that is widely used in many cloud-based services. To make the best use of the computing resources in a cloud server, a task scheduler is essential for assigning tasks to appropriate processors and for prioritizing resource allocation. Hadoop's default scheduler is a first-in-first-out (FIFO) scheduler, which is simple but leaves room for performance improvement. Although much past research has aimed to improve the performance of the MapReduce platform, many issues still hinder performance, such as dynamic load balancing, data locality, and the heterogeneity of computing nodes.
To improve data locality, we propose a new scheduler for the Hadoop platform called the Data Locality Driven Scheduler (DLDS). DLDS improves Hadoop's performance by allocating Map tasks as close as possible to the data blocks they are to process. We evaluated the proposed DLDS against several other schedulers by simulation on a real 8-node Hadoop system. Experimental results show that DLDS can improve data locality by 10-15%, which results in a significant performance improvement.
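The idea of placing a Map task close to its data block can be sketched as follows. This is a generic locality-preferring selection rule, not the DLDS algorithm itself, which the abstract does not specify; the function `pick_task`, the three locality levels, and the toy cluster layout are all assumptions made for illustration.

```python
NODE_LOCAL, RACK_LOCAL, OFF_RACK = 0, 1, 2


def pick_task(node, rack_of, pending, replicas):
    """Choose a pending map task for a node with a free slot.

    Preference order: a task whose block is stored on the node itself,
    then one whose block is on the same rack, then any remaining task.
    Returns (task, locality_level), or (None, None) if nothing is pending.
    """
    best, best_level = None, None
    for task in pending:
        hosts = replicas[task]  # nodes holding a copy of the task's block
        if node in hosts:
            level = NODE_LOCAL
        elif any(rack_of[h] == rack_of[node] for h in hosts):
            level = RACK_LOCAL
        else:
            level = OFF_RACK
        if best_level is None or level < best_level:
            best, best_level = task, level
    return best, best_level


# Hypothetical cluster: n1 and n2 share rack r1, n3 sits on rack r2.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
replicas = {"t1": {"n3"}, "t2": {"n2"}, "t3": {"n1"}}
pending = ["t1", "t2", "t3"]  # queue order, oldest first

choice = pick_task("n1", rack_of, pending, replicas)
print(choice)
```

In this toy queue, a strict FIFO policy would hand `t1` to `n1` even though its block lives on another rack, while the locality-preferring rule selects `t3`, whose block is already on `n1`.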
Identifier | oai:union.ndltd.org:CHENGCHI/G0097971010 |
Creators | 陳耀宗, Chen, Yao Chung |
Publisher | 國立政治大學 |
Source Sets | National Chengchi University Libraries |
Language | Chinese |
Detected Language | English |
Type | text |
Rights | Copyright © nccu library on behalf of the copyright holders |