11

PStorM: Profile Storage and Matching for Feedback-Based Tuning of MapReduce Jobs

Ead, Mostafa January 2012
The MapReduce programming model has become widely adopted for large scale analytics on big data. MapReduce systems such as Hadoop have many tuning parameters, many of which have a significant impact on performance. The map and reduce functions that make up a MapReduce job are developed using arbitrary programming constructs, which makes them black-box in nature and prevents users from making good parameter tuning decisions for a submitted MapReduce job. Some research projects, such as the Starfish system, aim to provide automatic tuning decisions for input MapReduce jobs. Starfish and similar systems rely on an execution profile of a MapReduce job being tuned, and this profile is assumed to come from a previous execution of the same job. Managing these execution profiles has not been previously studied. This thesis presents PStorM, a profile store that organizes the collected profiling information in a scalable and extensible data model, and a profile matcher that accurately picks the relevant profiling information even for previously unseen MapReduce jobs. PStorM is currently integrated with the Starfish system, providing the necessary profiles that Starfish needs to tune a job. The thesis presents results that demonstrate the accuracy and efficiency of profile matching. The results also show that the profiles returned by PStorM lead to Starfish tuning decisions that are as good as the decisions made by profiles collected from a previous run of the job.
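PStorM's actual data model and matching features are defined in the thesis itself; purely as an illustration of the general idea of feature-based profile matching, a minimal Python sketch might look like the following. The feature names, distance measure, and profile file names here are invented for the example and are not PStorM's design.

```python
import math

# Hypothetical feature set; PStorM's real data model and matching features
# are defined in the thesis and differ from this toy example.
FEATURES = ["input_bytes", "map_output_ratio", "cpu_ms_per_record"]

def distance(job_a, job_b, features=FEATURES):
    """Euclidean distance over raw feature values (no normalization, for brevity)."""
    return math.sqrt(sum((job_a[f] - job_b[f]) ** 2 for f in features))

def match_profile(new_job, profile_store):
    """Return the stored profile whose source job is most similar to new_job."""
    best = min(profile_store, key=lambda entry: distance(new_job, entry["features"]))
    return best["profile"]

# Usage sketch: the store maps features of previously profiled jobs to the
# execution profiles collected for them (file names are invented).
store = [
    {"features": {"input_bytes": 1e9, "map_output_ratio": 0.4, "cpu_ms_per_record": 0.02},
     "profile": "wordcount_profile.xml"},
    {"features": {"input_bytes": 5e10, "map_output_ratio": 1.1, "cpu_ms_per_record": 0.30},
     "profile": "join_profile.xml"},
]
unseen_job = {"input_bytes": 2e9, "map_output_ratio": 0.5, "cpu_ms_per_record": 0.03}
print(match_profile(unseen_job, store))   # -> wordcount_profile.xml
```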
12

Distributed Storage and Processing of Image Data / Distribuerad lagring och bearbeting av bilddata

Dahlberg, Tobias January 2012
Systems operating in a medical environment need to maintain high standards of availability and performance. Large numbers of images are stored and studied to determine what is wrong with a patient, which puts hard requirements on the storage of those images. In this thesis, ways of incorporating distributed storage into a medical system are explored. Products inspired by the success of Google, Amazon, and others are experimented with and compared to the current storage solutions. Several “non-relational databases” (NoSQL) are investigated for storing medically relevant metadata of images, while a set of distributed file systems are considered for storing the actual images. Distributed processing of the stored data is investigated by using Hadoop MapReduce to generate a useful model of the images' metadata.
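The thesis does not specify its MapReduce jobs at this level of detail; as a rough illustration of how a metadata aggregation might be expressed, here is a Hadoop Streaming-style mapper and reducer in Python. The tab-separated schema (image id, modality, size) is assumed for the example only.

```python
#!/usr/bin/env python3
# Hadoop Streaming-style sketch: count images per "modality" field of the
# metadata records. The tab-separated schema (image_id, modality, size_bytes)
# is assumed for this example; the thesis does not publish its schema here.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            print(f"{fields[1]}\t1")           # emit (modality, 1)

def reducer():
    current_key, count = None, 0
    for line in sys.stdin:                      # streaming input is sorted by key
        key, value = line.rstrip("\n").split("\t")
        if key != current_key and current_key is not None:
            print(f"{current_key}\t{count}")
            count = 0
        current_key = key
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

if __name__ == "__main__":
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()
```

With Hadoop Streaming, such scripts would typically be wired up through the streaming jar's -mapper and -reducer options; the sketch is only meant to make the processing idea concrete.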
13

Fusion-based Hadoop MapReduce job for fault tolerance in distributed systems

Ho, Iat-Kei 09 December 2013
The standard recovery mechanism for a failed task in Hadoop is to execute the task again; after a configured number of retries, the task is marked as failed. With significant amounts of data and complicated Map and Reduce functions, recovering corrupted or unfinished data from a failed job can be more efficient than re-executing the whole job. This paper extends [1] by applying a fusion-based technique [7][8] to Hadoop MapReduce task execution to enhance its fault tolerance. Multiple data sets are executed through Hadoop MapReduce, with and without fusion, under various pre-defined failure scenarios for comparison. As the complexity of the Map and Reduce functions relative to the Recover function increases, fusion becomes more efficient, and users can tolerate faults while incurring less than ten percent of extra execution time.
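The fusion technique itself is defined in [7][8]; as a rough illustration of the underlying idea only (a fused backup from which a single lost data set can be reconstructed without re-running the task that produced it), consider the Python sketch below, which uses an element-wise sum as a stand-in fusion operator. The thesis's actual fusion and recovery functions differ.

```python
# Illustrative-only sketch of the fusion idea: keep one "fused" copy
# (here an element-wise sum) of several intermediate outputs so that any
# single lost output can be recovered without re-executing its task.

def fuse(outputs):
    """Element-wise sum of equally sized intermediate outputs."""
    return [sum(vals) for vals in zip(*outputs)]

def recover(fused, surviving_outputs):
    """Reconstruct the single missing output from the fused copy."""
    partial = fuse(surviving_outputs) if surviving_outputs else [0] * len(fused)
    return [f - p for f, p in zip(fused, partial)]

outputs = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # three tasks' intermediate data
backup = fuse(outputs)                         # maintained alongside the job
lost = outputs.pop(1)                          # task 1's output is lost
assert recover(backup, outputs) == lost        # recovered without re-execution
```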
14

"Big Data" Management and Security Application to Telemetry Data Products

Kalibjian, Jeff October 2013
ITC/USA 2013 Conference Proceedings / The Forty-Ninth Annual International Telemetering Conference and Technical Exhibition / October 21-24, 2013 / Bally's Hotel & Convention Center, Las Vegas, NV / "Big Data" [1] and the security challenge of managing "Big Data" are hot topics in the IT world. The term "Big Data" is used to describe very large data sets that cannot be processed by traditional database applications in "tractable" periods of time. Securing data in a conventional database is challenging enough; securing data whose size may exceed hundreds of terabytes or even petabytes is even more daunting! As the size of telemetry products and post-processed telemetry products continues to grow, "Big Data" management techniques and the securing of that data may have ever-increasing application in the telemetry realm. After reviewing "Big Data" concepts and "Big Data" security and management basics, potential applications to post-processed telemetry products will be explored.
15

Improving the throughput of novel cluster computing systems

Wu, Jiadong 21 September 2015
Traditional cluster computing systems such as supercomputers are equipped with specially designed high-performance hardware, which escalates both the manufacturing cost and the energy cost of those systems. Due to such drawbacks and the diversified demand in computation, two new types of clusters have been developed: GPU clusters and Hadoop clusters. A GPU cluster combines a traditional CPU-only computing cluster with general-purpose GPUs to accelerate applications. Thanks to the massively parallel architecture of the GPU, this type of system can deliver much higher performance-per-watt than traditional computing clusters. The Hadoop cluster is another popular type of cluster computing system. It uses inexpensive off-the-shelf components and standard Ethernet to minimize manufacturing cost, and Hadoop systems are widely used throughout industry. Along with their lower cost, these new systems also bring unique challenges. According to our study, GPU clusters are prone to severe under-utilization due to the heterogeneous nature of their computation resources, and Hadoop clusters are vulnerable to network congestion due to their limited network resources. In this research, we aim to improve the throughput of these novel cluster computing systems by increasing workload parallelism and network I/O parallelism.
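The thesis works inside the GPU and Hadoop runtimes themselves; as a generic illustration of one of the ideas named above (raising I/O parallelism by overlapping network transfers with computation), a Python sketch might look like this, with sleeps standing in for transfers and processing. The pipeline shape, not the specific functions, is the point of the example.

```python
# Illustrative-only sketch of overlapping network I/O with computation:
# fetch the next partition while the current one is being processed.
# This is not the thesis's implementation.
import concurrent.futures
import time

def fetch_partition(i):
    time.sleep(0.1)          # stand-in for a network transfer
    return list(range(i, i + 4))

def process_partition(data):
    time.sleep(0.1)          # stand-in for computation
    return sum(data)

def pipeline(num_partitions):
    totals = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        next_fetch = pool.submit(fetch_partition, 0)
        for i in range(num_partitions):
            data = next_fetch.result()
            if i + 1 < num_partitions:               # start the next transfer early
                next_fetch = pool.submit(fetch_partition, i + 1)
            totals.append(process_partition(data))   # overlaps with that transfer
    return totals

print(pipeline(4))
```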
17

Characterization of Performance Anomalies in Hadoop

Gupta, Puja Makhanlal 20 May 2015
No description available.
18

STUDY ON PARALLELIZING PARTICLE FILTERS WITH APPLICATIONS TO TOPIC MODELS

Ding, Erli 01 June 2016
No description available.
19

Policy-Driven YARN Launcher

Giannokostas, Vasileios January 2016
In recent years, there has been a rising demand for IT solutions capable of handling vast amounts of data. Hadoop has become the de facto software framework for distributed storage and distributed processing of huge datasets at a high pace. YARN is the resource management layer of the Hadoop ecosystem, which decouples the programming model from the resource management mechanism. Although Hadoop and YARN create a powerful ecosystem that provides scalability and flexibility, launching applications with YARN currently requires intimate knowledge of YARN's inner workings. This thesis focuses on designing and developing support for a human-friendly YARN application launching environment where the system takes responsibility for allocating resources to applications. This novel idea simplifies the launching process of an application and gives inexperienced users the opportunity to run applications over Hadoop. / In recent years there has been an increased demand for IT solutions capable of handling large amounts of data. Hadoop is one of the most widely used frameworks for storing and processing large datasets in a distributed manner and at a high pace. YARN is a resource management layer for Hadoop that separates the programming model from the resource management mechanism. Even though Hadoop and YARN form a powerful system that provides flexibility and scalability, advanced knowledge of YARN is required to make use of it. This thesis focuses on the design and development of a human-friendly YARN application launching environment where the system takes responsibility for allocating resources to applications. This new idea simplifies the launching of applications and gives inexperienced users the opportunity to run applications over Hadoop.
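The thesis defines its own policy mechanism; as a purely hypothetical illustration of what policy-driven resource selection can look like, the sketch below maps invented application classes to resource requests that a launcher would then hand to YARN's application submission interface. The policy format, class names, and resource values are all assumptions for the example.

```python
# Hypothetical policy table and selection logic: the launcher, not the user,
# decides how much memory and how many vcores to request from YARN.
# The policy format and application classes are invented for illustration.
POLICY = {
    "interactive": {"memory_mb": 2048, "vcores": 1},
    "batch":       {"memory_mb": 8192, "vcores": 4},
    "default":     {"memory_mb": 4096, "vcores": 2},
}

def resources_for(app_class):
    """Pick a resource request for an application class from the policy."""
    return POLICY.get(app_class, POLICY["default"])

def launch(app_name, app_class):
    request = resources_for(app_class)
    # In a real launcher this request would be handed to YARN's application
    # submission interface; here we only print the decision.
    print(f"submitting {app_name}: {request['memory_mb']} MB, {request['vcores']} vcores")

launch("log-aggregation", "batch")
launch("ad-hoc-query", "interactive")
```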
20

Improving Scheduling in Heterogeneous Grid and Hadoop Systems

Rasooli, Oskooei Aysan 10 1900
<p>Advances in network technologies and computing resources have led to the deployment of large scale computational systems, such as those following Grid or Cloud architectures. The scheduling problem is a significant issue in these distributed computing environments, where a scheduling algorithm should consider multiple objectives and performance metrics. Moreover, heterogeneity is increasing at both the application and resource levels. The heterogeneity in these systems can have a huge impact on performance in terms of metrics such as average completion time. However, traditional Grid and Cloud scheduling algorithms neglect heterogeneity in their scheduling decisions. This PhD dissertation studies the scheduling challenges in Computational Grid, Data Grid, and Cloud computing systems, and introduces new scheduling algorithms for each of these systems.</p> <p>The main issue in Grid scheduling is the wide distribution of resources. As a result, gathering full state information can add huge overhead to the scheduler. This thesis introduces a Computational Grid scheduling algorithm which simultaneously addresses minimizing completion times (by considering system heterogeneity), while requiring zero dynamic state information. Simulation results show the promising performance of this algorithm, and its robustness with respect to errors in parameter estimates.</p> <p>In the case of Data Grid schedulers, an efficient scheduling decision should select a combination of resources for a task that simultaneously mitigates the computation and the communication costs. Therefore, these schedulers need to consider a large search space to find an appropriate combination. This thesis introduces a new Data Grid scheduling algorithm, which dynamically makes replication and scheduling decisions. The proposed algorithm reduces the search space, decreases the required state information, and improves the performance by considering the system heterogeneity. Simulation results illustrate the promising performance of the introduced algorithm.</p> <p>Cloud computing can be considered as a next generation of Grid computing. One of the main challenges in Cloud systems is the enormous expansion of data in different applications. The MapReduce programming model and Hadoop framework were designed as a solution for executing large scale data-intensive applications. A large number of (heterogeneous) users, using the same Hadoop cluster, can result in tensions between the various performance metrics by which such systems are measured. This research introduces and implements a Hadoop scheduling system, which uses system information such as estimated job arrival rates and mean job execution times to make scheduling decisions. The proposed scheduling system, named COSHH (Classification and Optimization based Scheduler for Heterogeneous Hadoop systems), considers heterogeneity at both the application and cluster levels. The main objective of COSHH is to improve the average completion time of jobs. However, as it is concerned with other key Hadoop performance metrics, it also achieves competitive performance under minimum share satisfaction, fairness and locality metrics, with respect to other well-known Hadoop schedulers. The proposed scheduler can be efficiently applied in heterogeneous clusters, in contrast to most Hadoop schedulers, which assume homogeneous clusters.</p> <p>A Hadoop system can be described based on three factors: cluster, workload, and user. 
Each factor is either heterogeneous or homogeneous, which reflects the heterogeneity level of the Hadoop system. This PhD research studies the effect of heterogeneity in each of these factors on the performance of Hadoop schedulers. Three schedulers which consider different levels of Hadoop heterogeneity are used for the analysis: FIFO, Fair sharing, and COSHH. Performance issues are introduced for Hadoop schedulers, and experiments are provided to evaluate these issues. The reported results suggest guidelines for selecting an appropriate scheduler for different Hadoop systems. The focus of these guidelines is on systems which do not have significant fluctuations in the number of resources or jobs.</p> <p>There is a considerable challenge in Hadoop to schedule tasks and resources in a scalable manner. Moreover, the potential heterogeneous nature of deployed Hadoop systems tends to increase this challenge. This thesis analyzes the performance of widely used Hadoop schedulers including FIFO and Fair sharing and compares them with the COSHH scheduler. Based on the discussed insights, a hybrid solution is introduced, which selects appropriate scheduling algorithms for scalable and heterogeneous Hadoop systems with respect to the number of incoming jobs and available resources. The proposed hybrid scheduler considers the guidelines provided for heterogeneous Hadoop systems in the case that the number of jobs and resources change considerably.</p> <p>To improve the performance of high priority users, Hadoop guarantees minimum numbers of resource shares for these users at each point in time. This research compares different scheduling algorithms based on minimum share consideration and under different heterogeneous and homogeneous environments. For this evaluation, a real Hadoop system is developed and evaluated using Facebook workloads. Based on the experiments performed, a reliable scheduling algorithm is suggested which can work efficiently in different environments.</p> / Doctor of Philosophy (PhD)
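COSHH's actual algorithm classifies incoming jobs and solves an optimization problem over job classes and heterogeneous resources, also accounting for arrival rates, minimum shares, fairness, and locality. The toy Python sketch below illustrates only the classification-and-matching idea; the threshold, class names, and resource types are invented for the example and are not COSHH's.

```python
# Illustrative-only sketch of classification-based scheduling: jobs are grouped
# into classes by their estimated mean execution time, and each class is steered
# toward the resource type assumed to execute it fastest.

def classify(job, threshold_s=300):
    """Hypothetical two-class split on estimated mean execution time."""
    return "long" if job["est_mean_exec_s"] >= threshold_s else "short"

# Assumed per-class speed ranking of heterogeneous resource types.
PREFERRED_RESOURCES = {
    "short": ["small_node", "large_node"],
    "long":  ["large_node", "small_node"],
}

def schedule(job, free_slots):
    """Assign the job to the first preferred resource type with a free slot."""
    for resource in PREFERRED_RESOURCES[classify(job)]:
        if free_slots.get(resource, 0) > 0:
            free_slots[resource] -= 1
            return resource
    return None  # no capacity; the job waits in its class queue

slots = {"small_node": 2, "large_node": 1}
jobs = [{"name": "query", "est_mean_exec_s": 60},
        {"name": "etl",   "est_mean_exec_s": 1800}]
for j in jobs:
    print(j["name"], "->", schedule(j, slots))
```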
