Global ETD Search

Return to search

Hadoop performance modeling and job optimization for big data analytics

Big data has received a momentum from both academia and industry. The MapReduce model has emerged into a major computing model in support of big data analytics. Hadoop, which is an open source implementation of the MapReduce model, has been widely taken up by the community. Cloud service providers such as Amazon EC2 cloud have now supported Hadoop user applications. However, a key challenge is that the cloud service providers do not a have resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user responsibility to estimate the require amount of resources for their job running in a public cloud. This thesis presents a Hadoop performance model that accurately estimates the execution duration of a job and further provisions the required amount of resources for a job to be completed within a deadline. The proposed model employs Locally Weighted Linear Regression (LWLR) model to estimate execution time of a job and Lagrange Multiplier technique for resource provisioning to satisfy user job with a given deadline. The performance of the propose model is extensively evaluated in both in-house Hadoop cluster and Amazon EC2 Cloud. Experimental results show that the proposed model is highly accurate in job execution estimation and jobs are completed within the required deadlines following on the resource provisioning scheme of the proposed model. In addition, the Hadoop framework has over 190 configuration parameters and some of them have significant effects on the performance of a Hadoop job. Manually setting the optimum values for these parameters is a challenging task and also a time consuming process. This thesis presents optimization works that enhances the performance of Hadoop by automatically tuning its parameter values. It employs Gene Expression Programming (GEP) technique to build an objective function that represents the performance of a job and the correlation among the configuration parameters. For the purpose of optimization, Particle Swarm Optimization (PSO) is employed to find automatically an optimal or a near optimal configuration settings. The performance of the proposed work is intensively evaluated on a Hadoop cluster and the experimental results show that the proposed work enhances the performance of Hadoop significantly compared with the default settings.

http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.655905

005.7

Identifer	oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:655905
Date	January 2015
Creators	Khan, Mukhtaj
Contributors	Li, M.
Publisher	Brunel University
Source Sets	Ethos UK
Detected Language	English
Type	Electronic Thesis or Dissertation
Source	http://bura.brunel.ac.uk/handle/2438/11078

Page generated in 0.0021 seconds

Hadoop performance modeling and job optimization for big data analytics

Description

Links & Downloads

Tags

Additional Fields