1 |
An Enhanced MapReduce Workload Allocation Tool for Spot Market Resources. Hudzina, John Stephen, 29 March 2015.
When a cloud user allocates a cluster to execute a map-reduce workload, the user must determine the number and type of virtual machine instances to minimize the workload's financial cost. The cloud user may rent on-demand instances at a fixed price or spot instances at a variable price to execute the workload. Although the cloud user may bid on spot virtual machine instances at a reduced rate, the spot market auction may delay the workload's start or terminate the spot instances before the workload completes. The cloud user requires a forecast for the workload's financial cost and completion time to analyze the trade-offs between on-demand and spot instances.
While existing estimation tools predict map-reduce workloads' completion times and costs, these tools do not provide spot instance estimates because a spot market auction determines the instances' start time and duration. The ephemeral spot instances affect execution time estimates because the spot market auction forces map-reduce workloads to use different storage strategies to persist data after the spot instances terminate. The spot market also reduces the accuracy of existing tools' completion time and cost estimates because the estimates must factor in spot instance wait times and early terminations.
This dissertation updates an existing tool to forecast a map-reduce workload's monetary cost and completion time based on historical spot market traces. The enhanced estimation tool includes three new capabilities over existing tools. First, the tool models the impact of the new storage strategies on execution time. Second, it calculates the additional execution time caused by early spot instance termination. Finally, it predicts the workload's wait time and early termination probabilities from historical traces. Based on two historical Amazon EC2 spot market traces, the enhancements reduce the average completion time prediction error by 96% and the average monetary cost prediction error by 99% compared with existing tools.
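To make the trade-off concrete, the following sketch (a minimal illustration only, not the dissertation's tool; the hourly billing model, variable names, and price trace are assumptions) walks a historical spot-price trace for a given bid and job length, reporting the wait before the bid is first met, the accumulated cost, and whether the instances were reclaimed before the job finished.

```python
# Illustrative sketch of trace-driven spot-cost estimation (not the actual tool).
# Assumptions: hourly spot-price trace, the job runs whenever price <= bid,
# billing at the spot price per full hour; restarting work lost to termination
# is ignored for brevity.

def estimate_spot_run(price_trace, bid, job_hours):
    """Return (wait_hours, run_hours_completed, cost, terminated_early)."""
    wait = 0
    start = None
    # Wait until the spot price first drops to or below the bid.
    for t, price in enumerate(price_trace):
        if price <= bid:
            start = t
            break
        wait += 1
    if start is None:
        return wait, 0, 0.0, True  # bid never met within the trace

    cost, done = 0.0, 0
    for price in price_trace[start:]:
        if price > bid:              # out-bid: instances are reclaimed
            return wait, done, cost, True
        cost += price                # pay the current spot price for this hour
        done += 1
        if done >= job_hours:
            return wait, done, cost, False
    return wait, done, cost, True    # trace ended before the job finished


if __name__ == "__main__":
    trace = [0.12, 0.09, 0.07, 0.08, 0.15, 0.06, 0.06, 0.07]  # $/hour, hypothetical
    print(estimate_spot_run(trace, bid=0.10, job_hours=4))
    # Comparing against a fixed on-demand price is then a one-liner:
    on_demand_price = 0.25  # hypothetical
    print("on-demand cost:", on_demand_price * 4)
```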
|
2 |
Design and implementation of scalable hierarchical density based clustering. Dhandapani, Sankari, 09 November 2010.
Clustering is a useful technique that divides data points into groups, also known as clusters, such that data points in the same cluster exhibit similar properties. Typical clustering algorithms assign each data point to at least one cluster. However, in practical datasets such as microarray gene datasets, only a subset of the genes is highly correlated, and the dataset is often polluted with a huge volume of irrelevant genes. In such cases, it is important to ignore the poorly correlated genes and cluster only the highly correlated genes.
Automated Hierarchical Density Shaving (Auto-HDS) is a non-parametric, density-based technique that partitions only the relevant subset of the dataset into multiple clusters while pruning the rest. Auto-HDS performs a hierarchical clustering that identifies dense clusters of different densities and finds a compact hierarchy of the clusters identified. Some of the key features of Auto-HDS include selection and ranking of clusters using a custom stability criterion and a topologically meaningful 2D projection and visualization of the clusters discovered in the higher-dimensional original space. However, a key limitation of Auto-HDS is that it requires O(n²) storage and O(n² log n) computation, which limits its scalability to only a few tens of thousands of points. In this thesis, two extensions to Auto-HDS are presented for lower-dimensional datasets that generate clusterings identical to Auto-HDS but scale to much larger datasets. We first introduce Partitioned Auto-HDS, which provides a significant reduction in time and space complexity and makes it possible to generate the Auto-HDS cluster hierarchy on much larger datasets with hundreds of millions of data points. Then, we describe Parallel Auto-HDS, which takes advantage of the inherent parallelism available in Partitioned Auto-HDS to scale to even larger datasets without a corresponding increase in actual run time when a group of processors is available for parallel execution. Partitioned Auto-HDS is implemented on top of GeneDIVER, a previously existing Java-based streaming implementation of Auto-HDS, and thus retains all the key features of Auto-HDS, including ranking, automatic selection of clusters, and 2D visualization of the discovered cluster topology.
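A minimal sketch of the density-shaving idea is given below (illustrative only, not the Auto-HDS or GeneDIVER implementation; the parameters and the single-link grouping rule are simplifying assumptions): density is estimated from the distance to each point's k-th nearest neighbour, the lowest-density points are pruned, and the survivors are grouped by proximity.

```python
# Minimal density-shaving sketch (illustrative; not the Auto-HDS implementation).
import numpy as np

def density_shave_clusters(X, k=5, shave_frac=0.3, link_scale=1.5):
    """Keep the densest (1 - shave_frac) fraction of points and group them by proximity."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # O(n^2) distances
    r_k = np.sort(d, axis=1)[:, k]                # distance to k-th nearest neighbour
    keep = np.argsort(r_k)[: int(n * (1 - shave_frac))]          # densest points survive

    # Single-link grouping of survivors: connect pairs closer than link_scale * r_k.
    parent = {i: i for i in keep}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a in keep:
        for b in keep:
            if a < b and d[a, b] <= link_scale * min(r_k[a], r_k[b]):
                parent[find(b)] = find(a)

    return {i: find(i) for i in keep}             # points not in the result were pruned

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blob1 = rng.normal(0, 0.2, (50, 2))
    blob2 = rng.normal(3, 0.2, (50, 2))
    noise = rng.uniform(-2, 5, (30, 2))
    labels = density_shave_clusters(np.vstack([blob1, blob2, noise]))
    print(len(set(labels.values())), "clusters,", len(labels), "points kept")
```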
|
3 |
Parallelizing support vector machines for scalable image annotation. Alham, Nasullah Khalid, January 2011.
Machine learning techniques have facilitated image retrieval by automatically classifying and annotating images with keywords. Among them, Support Vector Machines (SVMs) are used extensively due to their generalization properties. However, SVM training is a notably computationally intensive process, especially when the training dataset is large. In this thesis, distributed computing paradigms are investigated to speed up SVM training by partitioning a large training dataset into small data chunks and processing each chunk in parallel, utilizing the resources of a cluster of computers. A resource-aware parallel SVM algorithm is introduced for large-scale image annotation using a cluster of computers. A genetic algorithm based load balancing scheme is designed to optimize the performance of the algorithm in heterogeneous computing environments. SVM was initially designed for binary classification; however, most classification problems arising in domains such as image annotation usually involve more than two classes. A resource-aware parallel multiclass SVM algorithm for large-scale image annotation on a cluster of computers is therefore introduced. The combination of classifiers leads to a substantial reduction of classification error in a wide range of applications; among these, SVM ensembles with bagging are shown to outperform a single SVM in terms of classification accuracy. However, training SVM ensembles is also computationally intensive, especially when the number of replicated samples generated by bootstrapping is large. A distributed SVM ensemble algorithm for image annotation is introduced, which re-samples the training data by bootstrapping and trains an SVM on each sample in parallel using a cluster of computers. The above algorithms are evaluated in both experimental and simulation environments, showing that the distributed SVM, distributed multiclass SVM, and distributed SVM ensemble algorithms reduce training time significantly while maintaining a high level of classification accuracy.
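The chunking-plus-bagging idea can be illustrated with a short sketch (not the thesis's implementation; it assumes scikit-learn is available and uses Python multiprocessing as a stand-in for a cluster of computers): bootstrap samples are drawn from the training data, one SVM is trained per sample in parallel, and predictions are combined by majority vote.

```python
# Sketch of bagged SVM training over data chunks in parallel (illustrative;
# not the thesis's cluster-based implementation). Assumes scikit-learn.
from multiprocessing import Pool
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

def train_on_chunk(args):
    X_chunk, y_chunk = args
    return SVC(kernel="rbf", gamma="scale").fit(X_chunk, y_chunk)

def bagged_parallel_svm(X, y, n_models=4, workers=4, seed=0):
    rng = np.random.default_rng(seed)
    # Bootstrap one re-sampled chunk per ensemble member.
    chunks = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))
        chunks.append((X[idx], y[idx]))
    with Pool(workers) as pool:                  # parallel training, one SVM per chunk
        models = pool.map(train_on_chunk, chunks)
    return models

def predict_majority(models, X):
    votes = np.stack([m.predict(X) for m in models])
    # Majority vote across the ensemble (binary 0/1 labels assumed here).
    return (votes.mean(axis=0) >= 0.5).astype(int)

if __name__ == "__main__":
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    models = bagged_parallel_svm(X, y)
    acc = (predict_majority(models, X) == y).mean()
    print(f"training-set accuracy of the ensemble: {acc:.3f}")
```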
|
4 |
Green Clusters. Vašut, Marek, January 2015.
The thesis evaluates the viability of reducing the power consumption of a contemporary computer cluster by using more power-efficient hardware components. The cluster in question runs a Map-Reduce implementation, and the worker nodes consist of either systems with an ARM CPU or systems that combine an ARM CPU and an FPGA in a single package. The behavior of such a cluster is discussed from both the performance and the power consumption side. The text discusses the problems and peculiarities of integrating ARM-based and especially combined ARM-FPGA-based systems into the Map-Reduce framework. The performance of the Map-Reduce framework itself is evaluated to identify the gravest performance bottlenecks when using the framework in an environment with ARM systems.
|
5 |
Embarrassingly Parallel Statistics and its Applications: Divide & Recombine Methods for Parallel Computation of Quantiles and Construction of K-D Trees for Big-Data. Aritra Chakravorty (5929565), 16 January 2019.
In Divide & Recombine (D&R), data are divided into subsets, analytic methods are applied to each subset independently with no communication between processes, and the subset outputs for each method are then recombined. For big data, this provides almost all of the analytic tasking needed when data are analyzed. It also provides high computational performance because typically most of the computation is embarrassingly parallel, the simplest form of parallel computation.
Another kind of tasking must address computational performance and numeric accuracy: the computing of functions of all of the data, or "statistics". For data big and small, it is often important to compute such statistics for all of the data; these can be summaries of the data, such as sample quantiles of continuous variables, or can process the data into a form that helps analysis, such as dividing the data into representative subsets. Development of computational methods to compute these statistics can be challenging.
D&R can be a very effective framework for computing statistics. To support this, we introduce the concept of embarrassingly parallel (EP) statistics, both weak and strong. The concept of EP statistics is not entirely new, but has had little development. The existing methodology is mainly sums of sums. For example, this is done when computing the necessary statistics for least squares, where sums of products and cross products are carried out on subsets and then summed across subsets. Our treatment of EP statistics takes the concept much further. The outcome is the ability to use EP statistics in conjunction with a Fourier series to approximate an optimization criterion. The series terms, which are strongly EP statistics, are summed across subsets, and the result is optimized. These are EP-F computational methods.
We have so far developed two EP-F computational methods for two widely used statistic computations. EP-F-Quantile is for quantiles of big data, and EP-F-KDtree is for KD-trees. Speed and accuracy of EP-F-Quantile are compared with those of the well-known binning method, which can also be formulated in terms of EP statistics. EP-F-KDtree is the first parallel KD-tree computational method of which we are aware. EP and EP-F computational methods have potentially many other applications to computing statistics.
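The binning method mentioned above is itself a simple example of an EP statistic: each subset computes a histogram independently, the histograms are summed across subsets, and the quantile is read from the cumulative counts. The sketch below illustrates this baseline only (it is not the EP-F-Quantile method; the bin count and value range are assumptions).

```python
# Embarrassingly parallel quantile via binning (illustrative sketch of the
# binning baseline discussed above, not the EP-F method itself).
import numpy as np

def subset_histogram(subset, edges):
    counts, _ = np.histogram(subset, bins=edges)
    return counts                          # per-subset statistic: a vector of counts

def ep_quantile(subsets, q, lo, hi, n_bins=1000):
    edges = np.linspace(lo, hi, n_bins + 1)
    # "Sum of sums": each subset is processed independently, then recombined.
    total = sum(subset_histogram(s, edges) for s in subsets)
    cum = np.cumsum(total)
    target = q * cum[-1]
    b = int(np.searchsorted(cum, target))      # first bin whose cumulative count reaches target
    return 0.5 * (edges[b] + edges[b + 1])     # bin midpoint as the quantile estimate

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(size=1_000_000)
    subsets = np.array_split(data, 8)          # stand-in for distributed subsets
    print("approx median:", ep_quantile(subsets, 0.5, lo=-6, hi=6))
    print("exact median: ", np.median(data))
```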
|
6 |
IMPLEMENTATION OF A CLOUD SHELL FOR LIGHT-WEIGHT UNIX PROGRAMMABILITY SUPPORT IN A DISTRIBUTED CLOUD ENVIRONMENT. Wei, Tzu-Chieh, 09 February 2012.
This thesis describes the implementation of a UNIX-styled shell environment for cloud systems. This new scripting language, the cloud shell (CLSH), uses a syntax based upon the familiar BASH shell of UNIX systems. This familiar syntax allows users to quickly learn the new environment. The difference, as compared to BASH, is that CLSH gives the user easy access to the parallelism of the cloud. Indeed, the user does not need to explicitly refer to the cloud at all; the cloud becomes simply a virtual file system and the user experience is quite similar to standard bash programming.
This cloud shell is built into Hadoop's HDFS file system. The difference, as compared to HDFS, is that CLSH offers a full range of UNIX-style commands rather than a small subset of simple commands. Moreover, CLSH is a full-fledged scripting language that offers much more control over file management than does HDFS. To achieve comparable behavior within HDFS, the user must use either the Pig Latin tool or Java scripting. Not only are these alternatives harder to use than CLSH, but they also perform more slowly and are incapable of performing certain tasks that CLSH can easily achieve. Moreover, the cloud shell environment simply provides the user with a better cloud interface; it does not preclude the use of Pig Latin or Java scripts.
|
7 |
A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures. Jiang, Wei, 27 August 2012.
No description available.
|
8 |
Scalable and Declarative Information Extraction in a Parallel Data Analytics System. Rheinländer, Astrid, 06 July 2017.
Information extraction (IE) on very large data sets requires highly complex, scalable, and adaptive systems. Although numerous IE algorithms exist, their seamless and extensible combination in a scalable system is still a major challenge. This work presents a query-based IE system for a parallel data analysis platform, which is configurable for specific application domains and scales to terabyte-sized text collections. First, configurable operators are defined for basic IE and Web Analytics tasks, which can be used to express complex IE tasks in the form of declarative queries. All operators are characterized in terms of their properties to highlight the potential and importance of optimizing non-relational, user-defined operators (UDFs) for data flows. Subsequently, we survey the state of the art in optimizing non-relational data flows and highlight that a comprehensive optimization of UDFs is still a challenge. Based on this observation, an extensible logical optimizer (SOFA) is introduced, which incorporates the semantics of UDFs into the optimization process. SOFA analyzes a compact set of operator properties and combines automated analysis with manual UDF annotations to enable a comprehensive optimization of data flows. SOFA is able to logically optimize arbitrary data flows from different application areas, resulting in significant runtime improvements compared to other techniques. Finally, the applicability of the presented system to terabyte-sized corpora is investigated; we systematically evaluate the scalability and robustness of the employed methods and tools in order to pinpoint the most critical challenges in building an IE system for very large data sets.
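The following toy sketch illustrates the kind of property-driven reordering such an optimizer performs (a hypothetical simplification, not SOFA's actual operator model, annotations, or rewrite rules): each operator is annotated with the fields it reads and writes and an estimated selectivity, and a highly selective operator is moved toward the source whenever the read/write sets show that the swap preserves semantics.

```python
# Toy sketch of property-driven dataflow reordering (hypothetical simplification;
# not SOFA's actual operator model or rules).
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str
    reads: set = field(default_factory=set)     # record fields the UDF reads
    writes: set = field(default_factory=set)    # record fields the UDF creates/modifies
    selectivity: float = 1.0                    # estimated output/input ratio

def can_swap(first, second):
    """second may move before first only if neither depends on the other's writes."""
    return (first.writes.isdisjoint(second.reads)
            and second.writes.isdisjoint(first.reads)
            and first.writes.isdisjoint(second.writes))

def push_selective_ops_early(plan):
    """Bubble highly selective operators (e.g. filters) toward the source."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            a, b = plan[i], plan[i + 1]
            if b.selectivity < a.selectivity and can_swap(a, b):
                plan[i], plan[i + 1] = b, a
                changed = True
    return plan

if __name__ == "__main__":
    plan = [
        Operator("sentence_split", reads={"text"}, writes={"sentence"}),
        Operator("entity_tagging", reads={"sentence"}, writes={"entities"}),
        Operator("language_filter", reads={"lang"}, selectivity=0.3),
    ]
    print([op.name for op in push_selective_ops_early(plan)])
    # -> the cheap, selective language_filter runs first; entity_tagging stays
    #    after sentence_split because it reads the field that operator writes.
```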
|
9 |
Massively parallel computing for particle physics. Preston, Ian Christopher, January 2010.
This thesis presents methods to run scientific code safely on a global-scale desktop grid. Current attempts to harness the world's idle desktop computers face obstacles such as donor security, portability of code, and privilege requirements. Nereus, a Java-based architecture, is a novel framework that overcomes these obstacles and allows the creation of a globally scalable desktop grid capable of executing Java bytecode. However, most scientific code is written for the x86 architecture. To enable the safe execution of unmodified scientific code, we created JPC, a pure Java x86 PC emulator. The Nereus framework is applied to two tasks: a trivially parallel data generation task, BlackMax, and a parallelization and fault tolerance framework, Mycelia. Mycelia is an implementation of the Map-Reduce parallel programming paradigm. BlackMax is a microscopic black hole event generator of direct relevance for the Large Hadron Collider (LHC). The Nereus-based BlackMax adaptation dramatically speeds up the production of data, limited only by the number of desktop machines available.
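As a reminder of the paradigm Mycelia implements, the following minimal single-machine sketch shows the map, shuffle, and reduce phases on a word-count example (illustrative only; Nereus and Mycelia are Java frameworks, and this is not their API).

```python
# Minimal single-machine sketch of the Map-Reduce paradigm (illustrative only;
# not the Mycelia/Nereus API).
from collections import defaultdict

def map_phase(documents):
    """Map: emit (key, value) pairs -- here, (word, 1) for every word."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values -- here, sum the counts."""
    return {key: sum(values) for key, values in groups.items()}

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(reduce_phase(shuffle(map_phase(docs))))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```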
|
10 |
Building a scalable distributed data platform using lambda architecture. Mehta, Dhananjay, January 1900.
Master of Science / Department of Computer Science / William H. Hsu / Data is generated all the time by the Internet, system sensors, and mobile devices around us; this is often referred to as ‘big data’. Tapping this data is a challenge for organizations because of its nature, i.e., its velocity, volume, and variety. What makes handling this data a challenge? Traditional data platforms have been built around relational database management systems coupled with enterprise data warehouses, and such legacy infrastructure is either technically incapable of scaling to big data or financially infeasible. Now the question arises: how can we build a system that handles the challenges of big data and caters to the needs of an organization? The answer is Lambda Architecture.
Lambda Architecture (LA) is a generic term used for a scalable and fault-tolerant data processing architecture that ensures real-time processing with low latency. LA provides a general strategy to knit together all the tools necessary for building a data pipeline for real-time processing of big data. LA comprises three layers: the Batch Layer, responsible for bulk data processing; the Speed Layer, responsible for real-time processing of data streams; and the Service Layer, responsible for serving queries from end users. This project draws an analogy between modern data platforms and traditional supply chain management to lay down principles for building a big data platform and shows how major challenges in building data platforms can be mitigated. This project constructs an end-to-end data pipeline for the ingestion, organization, and processing of data and demonstrates how any organization can build a low-cost distributed data platform using Lambda Architecture.
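A toy sketch of how the three layers combine at query time is shown below (the component names, event format, and page-view example are hypothetical, not this project's actual pipeline): the batch layer recomputes a complete view from the master dataset, the speed layer keeps incremental counts for events that arrived after the last batch run, and the serving layer merges both views to answer a query.

```python
# Toy sketch of the three Lambda Architecture layers described above
# (hypothetical component names and event format, not this project's pipeline).
from collections import Counter

class BatchLayer:
    """Recomputes a complete view from the immutable master dataset."""
    def __init__(self):
        self.master_dataset = []
        self.batch_view = Counter()
    def recompute(self):
        self.batch_view = Counter(e["page"] for e in self.master_dataset)

class SpeedLayer:
    """Maintains an incremental view of events not yet covered by a batch run."""
    def __init__(self):
        self.realtime_view = Counter()
    def ingest(self, event):
        self.realtime_view[event["page"]] += 1
    def reset(self):
        self.realtime_view.clear()

class ServingLayer:
    """Answers queries by merging the batch and real-time views."""
    def __init__(self, batch, speed):
        self.batch, self.speed = batch, speed
    def page_views(self, page):
        return self.batch.batch_view[page] + self.speed.realtime_view[page]

if __name__ == "__main__":
    batch, speed = BatchLayer(), SpeedLayer()
    serving = ServingLayer(batch, speed)

    # Events that arrived before the last batch recomputation.
    batch.master_dataset += [{"page": "home"}, {"page": "home"}, {"page": "docs"}]
    batch.recompute()

    # New events handled by the speed layer until the next batch run.
    for e in [{"page": "home"}, {"page": "docs"}, {"page": "docs"}]:
        batch.master_dataset.append(e)   # also appended to the master dataset
        speed.ingest(e)

    print(serving.page_views("home"), serving.page_views("docs"))  # 3 3
```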
|