1 |
A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures. Jiang, Wei, 27 August 2012 (has links)
No description available.
|
2 |
Performance Optimization Techniques and Tools for Data-Intensive Computation Platforms: An Overview of Performance Limitations in Big Data Systems and Proposed Optimizations. Kalavri, Vasiliki, January 2014 (has links)
Big data processing has recently gained a lot of attention from both academia and industry. The term refers to tools, methods, techniques and frameworks built to collect, store, process and analyze massive amounts of data. Big data can be structured, unstructured or semi-structured. Data is generated from various sources and can arrive in the system at various rates. In order to process these large amounts of heterogeneous data in an inexpensive and efficient way, massive parallelism is often used. The common architecture of a big data processing system is a shared-nothing cluster of commodity machines. However, even in such a highly parallel setting, processing is often very time-consuming. Applications may take hours or even days to produce useful results, making interactive analysis and debugging cumbersome. One of the main problems is that good performance requires both good data locality and good resource utilization. A characteristic of big data analytics is that the amount of data processed is typically large in comparison with the amount of computation done on it. In this case, processing can benefit from data locality, which can be achieved by moving the computation close to the data rather than vice versa. Good utilization of resources means that the data processing is done with maximal parallelization. Both locality and resource utilization are aspects of the programming framework’s runtime system. Requiring the programmer to work explicitly with parallel process creation and process placement is not desirable. Thus, providing optimizations that relieve the programmer from low-level, error-prone instrumentation is essential for achieving good performance. The main goal of this thesis is to study, design and implement performance optimizations for big data frameworks. This work contributes methods and techniques to build tools for easy and efficient processing of very large data sets.
It describes ways to make systems faster by shortening job completion times. Another major goal is to facilitate application development on distributed data-intensive computation platforms and to make big-data analytics accessible to non-experts, so that users with limited programming experience can benefit from analyzing enormous datasets. The thesis provides results from a study of existing optimizations in MapReduce- and Hadoop-related systems. The study presents a comparison and classification of existing systems based on their main contribution. It then summarizes the current state of the research field and identifies trends and open issues, while also providing our vision on future directions. Next, this thesis presents a set of performance optimization techniques and corresponding tools for data-intensive computing platforms. PonIC is a project that ports the high-level dataflow framework Pig on top of the data-parallel computing framework Stratosphere. The results of this work show that Pig can benefit substantially from using Stratosphere as the backend system and gain performance, without any loss of expressiveness. The work also identifies the features of Pig that negatively impact execution time and presents a way of integrating Pig with different backends. HOP-S is a system that uses in-memory random sampling to return approximate, yet accurate, query answers. It uses a simple yet efficient random sampling implementation, which significantly improves the accuracy of online aggregation. An optimization that exploits computation redundancy in analysis programs underlies m2r2, a system that stores intermediate results and uses plan matching and rewriting in order to reuse results in future queries. Our prototype on top of the Pig framework demonstrates significantly reduced query response times. Finally, the thesis presents an optimization framework for iterative fixed points, which exploits asymmetry in large-scale graph analysis. The framework uses a mathematical model to explain several optimizations and to formally specify the conditions under which optimized iterative algorithms are equivalent to the general solution. / QC 20140605
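The online-aggregation idea behind HOP-S can be illustrated with a minimal sketch, under the assumption (not taken from the thesis) that the estimator simply scales a uniform sample's aggregate by the inverse sampling fraction; the function name is illustrative:

```python
import random

def approximate_sum(data, sample_fraction=0.1, seed=42):
    """Estimate the sum of `data` from a uniform random sample.

    Illustrative sketch of online aggregation: draw a sample without
    replacement, sum it, and scale up by the inverse sampling fraction.
    Returns the estimate and the sample size used.
    """
    rng = random.Random(seed)
    k = max(1, int(len(data) * sample_fraction))
    sample = rng.sample(data, k)
    scale = len(data) / k       # inverse of the sampling fraction
    return sum(sample) * scale, k
```

For uniformly distributed values, even a 10% sample typically lands within a fraction of a percent of the exact sum, which is why sampling can trade a small accuracy loss for a large reduction in query latency.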
|
3 |
Improving Performance and Programmer Productivity for I/O-Intensive High Performance Computing Applications. Sehrish, Saba, 01 January 2010 (has links)
Due to the explosive growth in the size of scientific data sets, data-intensive computing is an emerging trend in computational science. HPC applications are generating and processing large amounts of data, ranging from terabytes (TB) to petabytes (PB). This growth in data for HPC applications has raised the question of which parallel programming framework can efficiently process large data sets. In this work, we study the applicability of two programming models (MPI/MPI-IO and MapReduce) to a variety of I/O-intensive HPC applications, ranging from simulations to analytics. We identify several performance and programmer productivity related limitations of these existing programming models when used for I/O-intensive applications, and we propose new frameworks that improve both performance and programmer productivity for the emerging I/O-intensive applications. Message Passing Interface (MPI) is widely used for writing HPC applications. MPI/MPI-IO allows fine-grained control over data and task distribution. At the programming framework level, various optimizations have been proposed to improve the performance of MPI/MPI-IO function calls. These performance optimizations are provided as function options to the programmers. To write efficient code, programmers are required to know the exact usage of the optimization functions, which limits programmer productivity. We propose an abstraction called Reduced Function Set Abstraction (RFSA) for MPI-IO to reduce the number of I/O functions and to automate the selection of an appropriate I/O function for writing HPC simulation applications. The purpose of RFSA is to hide the performance optimization functions from the application developer and to relieve the developer from deciding on a specific function. The proposed set of functions relies on a selection algorithm to decide among the most common optimizations provided by MPI-IO.
Additionally, many application scientists are looking to integrate data-intensive computing into compute-intensive High Performance Computing facilities, particularly for data analytics. We have observed several scientific applications that must migrate their data from an HPC storage system to a data-intensive one. There is a gap between the data semantics of HPC storage and data-intensive systems; hence, once migrated, the data must be further refined and reorganized. This reorganization must be performed before existing data-intensive tools such as MapReduce can be effectively used to analyze the data. It requires at least two complete scans through the data set and then at least one MapReduce program to prepare the data before analyzing it. Running multiple MapReduce phases causes significant overhead for the application in the form of excessive I/O operations: for every MapReduce application that must be run to complete the desired data analysis, a distributed read and write operation on the file system must be performed. Our contribution is to extend MapReduce to eliminate the multiple scans and to reduce the number of pre-processing MapReduce programs. We have added additional expressiveness to the MapReduce language in our novel framework called MapReduce with Access Patterns (MRAP), which allows users to specify the logical semantics of their data such that 1) the data can be analyzed without running multiple data pre-processing MapReduce programs, and 2) the data can be simultaneously reorganized as it is migrated to the data-intensive file system. We also provide a scheduling mechanism to further improve the performance of these applications. The main contributions of this thesis are: 1) we implement a selection algorithm for I/O functions like read/write, merge a set of functions for data types and file views, and optimize the atomicity function by automating the locking mechanism in RFSA.
By running different parallel I/O benchmarks on both medium-scale clusters and NERSC supercomputers, we show improved programmer productivity (35.7% on average). This approach incurs an overhead of 2-5% for one particular optimization, and shows a performance improvement of 17% when a combination of different optimizations is required by an application. 2) We provide an augmented MapReduce system (MRAP), which consists of an API and corresponding optimizations, i.e., data restructuring and scheduling. We have demonstrated up to 33% throughput improvement in one real application (read-mapping in bioinformatics), and up to 70% in an I/O kernel of another application (halo catalog analytics). Our scheduling scheme shows a performance improvement of 18% for an I/O kernel of a third application (QCD analytics).
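The spirit of RFSA's selection algorithm can be sketched as a single dispatch function that maps a high-level request to one of the common MPI-IO optimizations. Everything here is a toy assumption, not the thesis's actual rules: the strategy names, the 4 MiB threshold, and the function signature are all illustrative.

```python
def select_io_function(access, data_size, contiguous):
    """Toy stand-in for RFSA's selection algorithm.

    Maps a request described by its access mode ('collective' or
    'independent'), payload size in bytes, and contiguity to one of
    the classic MPI-IO optimizations. The thresholds and names are
    hypothetical; the point is that the programmer calls one function
    and the layer below picks the optimization.
    """
    if not contiguous:
        return "datatype_io"      # derived datatypes for strided access
    if access == "collective" and data_size < 4 << 20:
        return "two_phase_io"     # aggregate many small collective requests
    if access == "collective":
        return "collective_io"
    return "independent_io"
```

The application developer would only ever call the one entry point; the burden of knowing when two-phase aggregation beats plain collective I/O moves into the selection rule.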
|
4 |
Supporting Data-Intensive Scientific Computing on Bandwidth and Space Constrained Environments. Bicer, Tekin, 18 August 2014 (has links)
No description available.
|
5 |
Efficient and parallel evaluation of XQuery. Li, Xiaogang, 22 February 2006 (has links)
No description available.
|
6 |
Building Evolutionary Clustering Algorithms on Spark. Fu, Xinye, January 2017 (has links)
Evolutionary clustering (EC) is a class of clustering algorithms for handling the noise in time-evolving data. By taking history into account, it can track the drift of the true clustering across time. EC tries to make the clustering result fit both the current data and the historical data/model well, so each EC algorithm defines a snapshot cost (SC) and a temporal cost (TC) to reflect both requirements. EC algorithms minimize both SC and TC by different methods, and they differ in their ability to deal with a changing number of clusters, adding/deleting nodes, etc. To date there are more than 10 EC algorithms, but no survey of them; therefore, a survey of EC is presented in this thesis. The survey first introduces the application scenario of EC, the definition of EC, and the history of EC algorithms. Then two categories of EC algorithms, model-level algorithms and data-level algorithms, are introduced one by one, and the algorithms are compared with each other. Finally, a performance prediction of the algorithms is given: algorithms that optimize the whole problem (i.e., optimize the change parameter, or do not use a change parameter for control) and accept a change in the number of clusters should perform best in theory. EC algorithms always process large datasets and include many iterative data-intensive computations, so they are suitable for implementation on Spark. Until now there has been no implementation of an EC algorithm on Spark; hence, four EC algorithms are implemented on Spark in this project. The thesis covers three aspects of the implementation. Firstly, algorithms that parallelize well and have wide application were selected for implementation. Secondly, the program design details for each algorithm are described. Finally, the implementations are verified by correctness and efficiency experiments.
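The SC/TC trade-off can be made concrete with a tiny sketch of one smoothed centroid update, in the style of evolutionary k-means: blend the current data mean (which minimizes snapshot cost) with the previous step's centroid (which minimizes temporal cost). The function name and the specific blending rule are illustrative assumptions, not taken from the thesis.

```python
def smoothed_centroid(points, prev_centroid, cp=0.3):
    """One evolutionary-clustering-style centroid update.

    `points` are the current points assigned to this cluster,
    `prev_centroid` is the centroid from the previous timestep, and
    `cp` in [0, 1] is the change parameter trading snapshot cost
    (cp=0: fit current data only) against temporal cost
    (cp=1: keep the historical model unchanged).
    """
    dim = len(prev_centroid)
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return [(1 - cp) * mean[d] + cp * prev_centroid[d] for d in range(dim)]
```

Because each timestep's update is an independent, data-parallel aggregation over the assigned points, this kind of iterative computation maps naturally onto Spark's model.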
|
7 |
Supporting Fault Tolerance and Dynamic Load Balancing in FREERIDE-G. Bicer, Tekin, 23 August 2010 (has links)
No description available.
|
8 |
Specification, Configuration and Execution of Data-intensive Scientific Applications. Kumar, Vijay Shiv, 14 December 2010 (has links)
No description available.
|
9 |
Analyzing hybrid architectures for massively parallel graph analysis. Ediger, David, 08 April 2013 (has links)
The quantity of rich, semi-structured data generated by sensor networks, scientific simulation, business activity, and the Internet grows daily. The objective of this research is to investigate architectural requirements for emerging applications in massive graph analysis. Using emerging hybrid systems, we will map applications to architectures and close the loop between software and hardware design in this application space. Parallel algorithms and specialized machine architectures are necessary to handle the immense size and rate of change of today's graph data. To highlight the impact of this work, we describe a number of relevant application areas ranging from biology to business and cybersecurity. With several proposed architectures for massively parallel graph analysis, we investigate the interplay of hardware, algorithm, data, and programming model through real-world experiments and simulations. We demonstrate techniques for obtaining parallel scaling on multithreaded systems using graph algorithms that are orders of magnitude faster and larger than the state of the art. The outcome of this work is a proposed hybrid architecture for massive-scale analytics that leverages key aspects of data-parallel and highly multithreaded systems. In simulations, the hybrid systems incorporating a mix of multithreaded, shared memory systems and solid state disks performed up to twice as fast as either homogeneous system alone on graphs with as many as 18 trillion edges.
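A core kernel in much of this graph-analysis work is breadth-first traversal; a minimal sequential sketch shows the level-synchronous structure that parallel and multithreaded versions distribute across a frontier (the representation as a dict-based adjacency list is an illustrative choice, not the systems studied in the thesis):

```python
from collections import deque

def bfs_levels(adj, source):
    """Level-synchronous breadth-first search.

    `adj` maps each vertex to its out-neighbors. Returns a dict
    mapping every reachable vertex to its hop distance from `source`.
    Parallel graph frameworks process each frontier (all vertices at
    the same level) concurrently; this sketch does it one at a time.
    """
    level = {source: 0}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in adj.get(v, ()):
            if w not in level:          # first visit fixes the level
                level[w] = level[v] + 1
                frontier.append(w)
    return level
```

On massive, low-diameter graphs the frontiers grow very large very quickly, which is exactly why highly multithreaded, shared-memory hardware pays off for this workload.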
|
10 |
Performance Evaluation of Data Intensive Computing in the Cloud. Kaza, Bhagavathi, 01 January 2013 (has links)
Big data is a topic of active research in the cloud community. With increasing demand for data storage in the cloud, the study of data-intensive applications is becoming a primary focus. Data-intensive applications involve high CPU usage for processing large volumes of data on the scale of terabytes or petabytes. While some research exists on the performance of data-intensive applications in the cloud, none of it compares the Amazon Elastic Compute Cloud (Amazon EC2) and Google Compute Engine (GCE) clouds using multiple benchmarks. This study performs extensive experiments on the Amazon EC2 and GCE clouds using the TeraSort, MalStone and CreditStone benchmarks on the Hadoop and Sector data layers. Data collected for the Amazon EC2 and GCE clouds measure performance as the number of nodes is varied. This study shows that GCE is more efficient for data-intensive applications than Amazon EC2.
|