91

Scalable Scientific Computing Algorithms Using MapReduce

Xiang, Jingen January 2013 (has links)
Cloud computing systems, like MapReduce and Pregel, provide a scalable and fault tolerant environment for running computations at massive scale. However, these systems are designed primarily for data intensive computational tasks, while a large class of problems in scientific computing and business analytics are computationally intensive (i.e., they require a lot of CPU in addition to I/O). In this thesis, we investigate the use of cloud computing systems, in particular MapReduce, for computationally intensive problems, focusing on two classic problems that arise in scientific computing and also in analytics: maximum clique and matrix inversion. The key contribution that enables us to effectively use MapReduce to solve the maximum clique problem on dense graphs is a recursive partitioning method that partitions the graph into several subgraphs of similar size and running time complexity. After partitioning, the maximum cliques of the different partitions can be computed independently, and the computation is sped up using a branch and bound method. Our experiments show that our approach leads to good scalability, which is unachievable by other partitioning methods since they result in partitions of different sizes and hence lead to load imbalance. Our method is more scalable than an MPI algorithm, and is simpler and more fault tolerant. For the matrix inversion problem, we show that a recursive block LU decomposition allows us to effectively compute in parallel both the lower triangular (L) and upper triangular (U) matrices using MapReduce. After computing the L and U matrices, their inverses are computed using MapReduce. The inverse of the original matrix, which is the product of the inverses of the L and U matrices, is also obtained using MapReduce. Our technique is the first matrix inversion technique that uses MapReduce. We show experimentally that our technique has good scalability, and it is simpler and more fault tolerant than MPI implementations such as ScaLAPACK.
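The block-recursive structure that lets the L and U factors (and their inverses) be computed in independent pieces can be seen in a small single-machine sketch. The NumPy code below illustrates the recursive block LU idea only; it is not the thesis's MapReduce implementation, and the block size, the unpivoted base case, and the diagonally dominant test matrix are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def lu_nopivot(A):
    """Unpivoted Doolittle LU for the base case (assumes no zero pivots)."""
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float).copy()
    for k in range(n - 1):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])
    return L, U

def block_lu(A, min_size=2):
    """Recursive block LU: in a MapReduce setting the block updates
    (U12, L21, and the Schur complement) can become separate parallel jobs."""
    n = A.shape[0]
    if n <= min_size:
        return lu_nopivot(A)
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    L11, U11 = block_lu(A11, min_size)
    U12 = np.linalg.solve(L11, A12)                  # L11 @ U12 = A12
    L21 = np.linalg.solve(U11.T, A21.T).T            # L21 @ U11 = A21
    L22, U22 = block_lu(A22 - L21 @ U12, min_size)   # Schur complement
    L = np.block([[L11, np.zeros((m, n - m))], [L21, L22]])
    U = np.block([[U11, U12], [np.zeros((n - m, m)), U22]])
    return L, U

def tri_inv_lower(L):
    """Block inverse of a lower-triangular matrix (same divide-and-conquer shape)."""
    n = L.shape[0]
    if n == 1:
        return np.array([[1.0 / L[0, 0]]])
    m = n // 2
    L11i, L22i = tri_inv_lower(L[:m, :m]), tri_inv_lower(L[m:, m:])
    return np.block([[L11i, np.zeros((m, n - m))],
                     [-L22i @ L[m:, :m] @ L11i, L22i]])

# A^{-1} = U^{-1} @ L^{-1}; U^{-1} is obtained via the transpose of a
# lower-triangular inverse. The test matrix is made diagonally dominant so
# the unpivoted factorization is safe.
A = np.random.rand(8, 8) + 8 * np.eye(8)
L, U = block_lu(A)
A_inv = tri_inv_lower(U.T).T @ tri_inv_lower(L)
assert np.allclose(A_inv @ A, np.eye(8))
```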
92

Automatic Tuning of Data-Intensive Analytical Workloads

Herodotou, Herodotos January 2012 (has links)
Modern industrial, government, and academic organizations are collecting massive amounts of data ("Big Data") at an unprecedented scale and pace. The ability to perform timely and cost-effective analytical processing of such large datasets in order to extract deep insights is now a key ingredient for success. These insights can drive automated processes for advertisement placement, improve customer relationship management, and lead to major scientific breakthroughs.

Existing database systems are adapting to the new status quo while large-scale dataflow systems (like Dryad and MapReduce) are becoming popular for executing analytical workloads on Big Data. Ensuring good and robust performance automatically on such systems poses several challenges. First, workloads often analyze a hybrid mix of structured and unstructured datasets stored in nontraditional data layouts. The structure and properties of the data may not be known upfront, and will evolve over time. Complex analysis techniques and rapid development needs necessitate the use of both declarative and procedural programming languages for workload specification. Finally, the space of workload tuning choices is very large and high-dimensional, spanning configuration parameter settings, cluster resource provisioning (spurred by recent innovations in cloud computing), and data layouts.

We have developed a novel dynamic optimization approach that can form the basis for tuning workload performance automatically across different tuning scenarios and systems. Our solution is based on (i) collecting monitoring information in order to learn the run-time behavior of workloads, (ii) deploying appropriate models to predict the impact of hypothetical tuning choices on workload behavior, and (iii) using efficient search strategies to find tuning choices that give good workload performance. The dynamic nature enables our solution to overcome the new challenges posed by Big Data, and also makes our solution applicable to both MapReduce and Database systems. We have developed the first cost-based optimization framework for MapReduce systems for determining the cluster resources and configuration parameter settings to meet desired requirements on execution time and cost for a given analytic workload. We have also developed a novel tuning-based optimizer in Database systems to collect targeted run-time information, perform optimization, and repeat as needed to perform fine-grained tuning of SQL queries.
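The profile-predict-search loop in (i)-(iii) can be sketched in a few lines. The snippet below is a toy illustration, not the dissertation's optimizer: the tuning space, the what-if cost model, and the job profile are all invented placeholders.

```python
import random

# Hypothetical tuning space: real spaces are far larger and higher-dimensional.
SPACE = {
    "reduce_tasks": [8, 16, 32, 64],
    "compress_map_output": [False, True],
    "task_memory_mb": [1024, 2048, 4096],
}

def predict_runtime(profile, cfg):
    """Toy what-if model: scale the profiled shuffle and compute costs by the
    hypothetical configuration. A real model is learned from job profiles."""
    shuffle = profile["shuffle_gb"] * (0.6 if cfg["compress_map_output"] else 1.0)
    reduce_time = shuffle / cfg["reduce_tasks"] * 120            # seconds per GB per task
    map_time = profile["input_gb"] * 90 * (2048 / cfg["task_memory_mb"])
    return map_time + reduce_time

def search(profile, samples=50):
    """Random search over the space; return the best predicted configuration."""
    best, best_cost = None, float("inf")
    for _ in range(samples):
        cfg = {k: random.choice(v) for k, v in SPACE.items()}
        cost = predict_runtime(profile, cfg)
        if cost < best_cost:
            best, best_cost = cfg, cost
    return best, best_cost

profile = {"input_gb": 200.0, "shuffle_gb": 80.0}   # collected by monitoring
print(search(profile))
```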
93

Distributed Algorithms for SVD-based Least Squares Estimation

Peng, Yu-Ting 19 July 2011 (has links)
Singular value decomposition (SVD) is a popular decomposition method for solving least-squares estimation problems. However, for large datasets, SVD is very time consuming and memory demanding when computing least-squares solutions. In this thesis, we propose a least-squares estimator based on an iterative divide-and-merge scheme for large-scale estimation problems. The estimator consists of several levels. At each level, the input matrices are subdivided into submatrices. The submatrices are decomposed by SVD, and the results are merged into smaller matrices which become the input of the next level. The process is iterated until the resulting matrices are small enough to be solved directly and efficiently by the SVD algorithm. However, the iterative divide-and-merge algorithm executed on a single machine is still time-demanding on large-scale datasets. We propose two distributed algorithms to overcome this shortcoming by allowing several machines to perform the decomposition and merging of the submatrices at each level in parallel. The first is implemented in MapReduce on the Hadoop distributed platform, which can run the tasks in parallel on a collection of computers. The second is implemented in CUDA, which can run the tasks in parallel using Nvidia GPUs. Experimental results demonstrate that the proposed distributed algorithms can greatly reduce the time required to solve large-scale least-squares problems.
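The reduction used at each level can be illustrated compactly: the SVD of a row block A_i = U_i S_i V_i^T lets that block be replaced by the much smaller factor S_i V_i^T (and b_i by U_i^T b_i) without changing the least-squares solution, since the normal equations are preserved. The NumPy sketch below shows this single-machine version; the block size, the stopping rule, and the synthetic data are assumptions, and it is not the thesis's Hadoop or CUDA code.

```python
import numpy as np

def reduce_block(A_blk, b_blk):
    """Replace (A_blk, b_blk) by an equivalent small problem via the SVD:
    A = U S V^T  ->  (S V^T, U^T b) has the same normal equations."""
    U, s, Vt = np.linalg.svd(A_blk, full_matrices=False)
    return (s[:, None] * Vt), (U.T @ b_blk)

def divide_and_merge_lstsq(A, b, block_rows=256, direct_rows=512):
    """Iteratively shrink the stacked system, then solve the small one directly."""
    while A.shape[0] > direct_rows:
        # In the distributed versions each block is reduced by a separate
        # Map task (Hadoop) or GPU kernel (CUDA); here the loop is sequential.
        pieces = [reduce_block(A[i:i + block_rows], b[i:i + block_rows])
                  for i in range(0, A.shape[0], block_rows)]
        A = np.vstack([p[0] for p in pieces])
        b = np.concatenate([p[1] for p in pieces])
    return np.linalg.lstsq(A, b, rcond=None)[0]

rng = np.random.default_rng(0)
A = rng.normal(size=(4096, 10))
x_true = rng.normal(size=10)
b = A @ x_true + 0.01 * rng.normal(size=4096)
x = divide_and_merge_lstsq(A, b)
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-6)
```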
94

The development of an intelligent, cloud-based remote monitoring management system

Cheng, Wen-Hao 25 October 2012 (has links)
In this thesis, a data collection application based on MapReduce programming is described. This application aims to collect a temperature data stream continuously from a specified set of sensors. Instead of collecting the temperature information of all the sensors on one machine, the sensors are divided into several subsets, each of which is handled as a Map task. In each Map task, the temperature data stream of the assigned sensors is collected continuously and stored in a predefined database. All the Map tasks can run simultaneously on several machines. This method can reduce the delay time and improve the efficiency of the data collection service, especially when a huge number of sensors is monitored remotely by a data center through the Internet. The collected values can also be used to predict the next readings of the remote sensors with methods such as linear regression and K-means, and these predictions can be used to raise system alarms. Experimental results show that the proposed method is effective in temperature data collection and effective in carbon reduction.
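A compact way to picture this design is a pool of workers, each polling its own subset of sensors, followed by a simple linear-regression step over the collected readings. The code below is an illustrative stand-in only: the sensor names, the simulated readings, the in-memory "database", and the alarm threshold are all assumptions, and multiprocessing stands in for the Hadoop Map tasks.

```python
import numpy as np
from multiprocessing import Pool

SENSORS = [f"sensor-{i:03d}" for i in range(12)]
ALARM_THRESHOLD_C = 45.0   # assumed over-temperature threshold

def read_temperature(sensor_id, t):
    """Stand-in for a network read from one remote sensor."""
    rng = np.random.default_rng(hash((sensor_id, t)) % (2**32))
    return 30.0 + 0.5 * t + rng.normal(scale=0.3)

def map_task(sensor_subset, samples=20):
    """One 'Map task': poll its assigned sensors and emit (sensor, readings)."""
    return {s: [read_temperature(s, t) for t in range(samples)]
            for s in sensor_subset}

def predict_next(readings):
    """Linear-regression trend: fit y = a*t + b and extrapolate one step."""
    t = np.arange(len(readings))
    a, b = np.polyfit(t, readings, 1)
    return a * len(readings) + b

if __name__ == "__main__":
    subsets = [SENSORS[i::4] for i in range(4)]        # 4 Map tasks
    with Pool(4) as pool:
        collected = {}
        for part in pool.map(map_task, subsets):       # parallel collection
            collected.update(part)                     # the "database"
    predictions = {s: predict_next(r) for s, r in collected.items()}
    alarms = {s: p for s, p in predictions.items() if p > ALARM_THRESHOLD_C}
    print("predicted over-temperature sensors:", sorted(alarms))
```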
95

A Distributed Graph Mining Framework Based On Mapreduce

Alkan, Sertan 01 January 2010 (has links)
The frequent patterns hidden in a graph can reveal crucial information about the network the graph represents. Existing techniques to mine the frequent subgraphs in a graph database generally rely on the premise that the data can fit into the main memory of the device on which the computation takes place. Even though some algorithms are designed using highly optimized methods, many lack a solution to the problem of scalability. In this thesis work, our aim is to find and enumerate the subgraphs that are at least as frequent as the designated threshold in a given graph. Here, we propose a new distributed algorithm for the frequent subgraph mining problem that can scale horizontally as the computing cluster size increases. The method described here uses a partitioning method and the Map/Reduce programming model to distribute the computation of frequent subgraphs. At the core of this algorithm, we make use of an existing graph partitioning method to split the given data in the distributed file system and to merge and join the computed subgraphs without losing information. The frequent subgraph computation in each split is done using another known method that can enumerate the frequent patterns. Although current algorithms can efficiently find frequent patterns, they are not parallel or distributed algorithms: even when they partition the data, they are designed to work on a single machine. Furthermore, these algorithms are computationally expensive, not fault tolerant, and not designed to work on a distributed file system. Using the Map/Reduce paradigm, we distribute the computation of frequent patterns to every machine in a cluster. Our algorithm first bi-partitions the data via successive Map/Reduce jobs, then invokes another Map/Reduce job to compute the subgraphs in each partition using CloseGraph, and finally recovers the whole result set by invoking a series of Map/Reduce jobs that merge-join the previously found patterns. The implementation uses an open-source Map/Reduce environment, Hadoop. In our experiments, our method scales to large graphs; as the graph data size gets bigger, it performs better than the existing algorithms.
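The overall map/reduce shape — mine each partition locally, then merge the per-partition results — can be sketched in a few lines. The snippet below is only a skeleton under simplifying assumptions: the "local miner" merely counts single-edge patterns in a toy graph database, whereas the thesis runs CloseGraph per partition and merge-joins larger subgraphs across partitions.

```python
from collections import Counter
from itertools import chain

# Toy graph database: each graph is a set of labelled edges (label_u, label_v).
GRAPHS = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("C", "D")},
    {("A", "B"), ("B", "C"), ("C", "D")},
    {("B", "C")},
]
MIN_SUPPORT = 2   # assumed frequency threshold

def mapper(partition):
    """Map phase: mine each partition locally and emit (pattern, count) pairs."""
    local = Counter(chain.from_iterable(partition))
    return list(local.items())

def reducer(pairs):
    """Reduce phase: merge per-partition counts and keep frequent patterns."""
    total = Counter()
    for pattern, count in pairs:
        total[pattern] += count
    return {p: c for p, c in total.items() if c >= MIN_SUPPORT}

partitions = [GRAPHS[:2], GRAPHS[2:]]                        # the bi-partition step
shuffled = chain.from_iterable(mapper(p) for p in partitions)
print(reducer(shuffled))
```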
96

On the design of architecture-aware algorithms for emerging applications

Kang, Seunghwa 30 January 2011 (has links)
This dissertation maps various kernels and applications to a spectrum of programming models and architectures and also presents architecture-aware algorithms for different systems. The kernels and applications discussed in this dissertation have widely varying computational characteristics. For example, we consider both dense numerical computations and sparse graph algorithms. This dissertation also covers emerging applications from image processing, complex network analysis, and computational biology. We map these problems to diverse multicore processors and manycore accelerators. We also use new programming models (such as Transactional Memory, MapReduce, and Intel TBB) to address the performance and productivity challenges in these problems. Our experiences highlight the importance of mapping applications to appropriate programming models and architectures. We also identify several limitations of current system software and architectures, and suggest directions for improvement. The discussion focuses on system software and architectural support for nested irregular parallelism, Transactional Memory, and hybrid data transfer mechanisms. We believe that the complexity of parallel programming can be significantly reduced via collaborative efforts among researchers and practitioners from different domains. This dissertation participates in these efforts by providing benchmarks and suggestions to improve system software and architectures.
97

An Apache Hadoop Framework for Large-Scale Peptide Identification

Donepudi, Harinivesh 01 July 2015 (has links)
Peptide identification is an essential step in protein identification, and Peptide Spectrum Match (PSM) data sets are huge, making them time consuming to process on a single machine. In a typical run of a peptide identification method, PSMs are ranked by a cross-correlation, a statistical score, or a likelihood that the match between the experimental and hypothetical spectra is correct and unique. This process takes a long time to execute, and there is a demand for an increase in performance to handle large peptide data sets. Distributed frameworks are needed to reduce the processing time, but this comes at the price of complexity in developing and executing them. In distributed computing, the program may be divided into multiple parts to be executed. The work in this thesis describes the implementation of an Apache Hadoop framework for large-scale peptide identification using C-Ranker. The Apache Hadoop data processing software is immersed in a complex environment composed of massive machine clusters, large data sets, and several processing jobs. The framework uses the Hadoop Distributed File System (HDFS) and Apache MapReduce to store and process the peptide data, respectively. The proposed framework uses a peptide processing algorithm named C-Ranker, which takes peptide data as input and identifies the correct PSMs. The framework has two steps: execute the C-Ranker algorithm on a Hadoop cluster, and compare the correct PSMs generated via the Hadoop approach with those from the normal execution of C-Ranker. The goal of this framework is to process large peptide datasets using the Apache Hadoop distributed approach.
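One way the map step of such a framework can be realized is as a Hadoop Streaming task that filters its slice of the PSM file. The mapper below is a hypothetical illustration: the tab-separated record layout, the score cutoff, and the classify_psm stand-in are assumptions, and the actual framework invokes C-Ranker rather than a simple threshold.

```python
import sys

SCORE_CUTOFF = 0.9   # assumed threshold, not a C-Ranker parameter

def classify_psm(fields):
    """Placeholder for C-Ranker: keep a PSM if its score passes the cutoff."""
    score = float(fields[2])
    return score >= SCORE_CUTOFF

def main():
    # Each map task receives a slice of the PSM file from HDFS on stdin.
    # Input lines (tab-separated): spectrum_id, peptide_sequence, score.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and classify_psm(fields):
            # key = spectrum id, value = peptide; a reducer can then deduplicate
            print(f"{fields[0]}\t{fields[1]}")

if __name__ == "__main__":
    main()
```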
99

Efficient Skyline Computation with MapReduce

陳家慶, Chen, Chia Ching Unknown Date (has links)
As the issue of big data receives growing attention, more and more large-scale data analysis is carried out with MapReduce. In database querying, the skyline query is a common decision-analysis method whose purpose is to help users find the records in a database whose values in each dimension are close to the user's query conditions. However, with previous query methods over large data, when the number of records is large and the number of query dimensions is also high, query processing tends to be inefficient. Therefore, this study proposes a method for efficiently applying MapReduce to skyline queries over large data sets. Experimental results show that our method is more efficient than previous methods.
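For reference, the standard two-phase MapReduce skyline pattern that such methods build on can be sketched directly: each map task computes the local skyline of its partition, and the reduce side merges the local skylines and filters once more for the global skyline. The code below is this generic pattern under a minimise-every-dimension assumption, not the specific method proposed in the thesis.

```python
def dominates(p, q):
    """p dominates q if p is <= q in every dimension and < in at least one
    (smaller is better in all dimensions here)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def local_skyline(points):
    """Keep only points not dominated by any other point in the partition."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

def skyline_mapreduce(partitions):
    # Map phase: one local skyline per partition (run in parallel on a cluster).
    local = [local_skyline(part) for part in partitions]
    # Reduce phase: merge the candidates and filter again for the global skyline.
    candidates = [p for part in local for p in part]
    return local_skyline(candidates)

# Example: minimise (price, distance); (52, 9) survives its partition locally
# but is pruned globally because (50, 8) dominates it.
partitions = [
    [(50, 8), (60, 3), (70, 9)],
    [(40, 10), (55, 4), (52, 9)],
]
print(skyline_mapreduce(partitions))   # [(50, 8), (60, 3), (40, 10), (55, 4)]
```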
100

Cost-effective and privacy-conscious cloud service provisioning: architectures and algorithms

Palanisamy, Balaji 27 August 2014 (has links)
Cloud computing represents a recent paradigm shift that enables users to share and remotely access high-powered computing resources (both infrastructure and software/services) contained in off-site data centers, thereby allowing a more efficient use of hardware and software infrastructures. This growing trend in cloud computing, combined with the demands for Big Data and Big Data analytics, is driving the rapid evolution of datacenter technologies towards more cost-effective, consumer-driven, privacy-conscious, and technology-agnostic solutions. This dissertation takes a systematic approach to developing system-level techniques and algorithms that tackle the challenges of large-scale data processing in the Cloud and of scaling and delivering privacy-aware services with anytime-anywhere availability. We analyze the key challenges in effective provisioning of Cloud services in the context of MapReduce-based parallel data processing, considering the concerns of cost-effectiveness, performance guarantees, and user privacy, and we develop a suite of solution techniques, architectures, and models to support cost-optimized and privacy-preserving service provisioning in the Cloud. At the cloud resource provisioning tier, we develop a utility-driven MapReduce Cloud resource planning and management system called Cura for cost-optimally allocating resources to jobs. While existing services require users to select a number of complex cluster and job parameters and use those potentially sub-optimal per-job configurations, the Cura resource manager achieves global resource optimization in the cloud by minimizing cost and maximizing resource utilization. We also address the challenges of resource management and job scheduling for large-scale parallel data processing in the Cloud in the presence of the networking and storage bottlenecks commonly experienced in Cloud data centers. We develop Purlieus, a self-configurable, locality-based data and virtual machine management framework that enables MapReduce jobs to access their data either locally or from close-by nodes, including all input, output, and intermediate data, achieving significant improvements in job response time. We then extend our cloud resource management framework to support privacy-preserving data access and efficient privacy-conscious query processing. Concretely, we propose and implement VNCache: an efficient solution for MapReduce analysis of cloud-archived log data for privacy-conscious enterprises. Through a seamless data streaming and prefetching model in VNCache, Hadoop jobs begin execution as soon as they are launched, without requiring any a priori downloading. At the cloud consumer tier, we develop mix-zone-based techniques for delivering anonymous cloud services to mobile users on the move through MobiMix, a novel road-network mix-zone framework that enables real-time, location-based service delivery without disclosing the content or location privacy of the consumers.
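The core of the cost-optimal provisioning idea — pick the cheapest cluster configuration whose predicted completion time meets the job's requirement — can be illustrated with a toy model. Everything in the snippet below (the VM types, prices, scaling model, and example job) is invented for illustration and is not Cura's actual model or algorithm.

```python
VM_TYPES = {                 # name: (cores, price per hour in $) -- assumed values
    "small":  (2, 0.10),
    "medium": (4, 0.20),
    "large":  (8, 0.40),
}

def predicted_hours(total_core_hours, nodes, cores):
    """Assume near-linear scaling with a small per-node coordination overhead."""
    return total_core_hours / (nodes * cores) * (1 + 0.02 * nodes)

def cheapest_config(total_core_hours, deadline_hours, max_nodes=64):
    """Enumerate (VM type, node count) pairs; keep the cheapest that meets the deadline."""
    best = None
    for vm, (cores, price) in VM_TYPES.items():
        for nodes in range(1, max_nodes + 1):
            hours = predicted_hours(total_core_hours, nodes, cores)
            if hours > deadline_hours:
                continue
            cost = nodes * price * hours
            if best is None or cost < best[0]:
                best = (cost, vm, nodes, hours)
    return best

# A job needing ~200 core-hours that must finish within 4 wall-clock hours.
print(cheapest_config(total_core_hours=200, deadline_hours=4))
```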
