11

Um método para paralelização automática de workflows intensivos em dados / A method for automatic parallelization of data-intensive workflows

Watanabe, Elaine Naomi 22 May 2017
The analysis of large-scale datasets is one of the major current computational challenges, present not only in modern science but also in industry and the public sector. In these scenarios, data processing is usually modeled as a set of activities interconnected through data flows, known as workflows. Due to their high computational cost, several strategies have been proposed to improve the efficiency of data-intensive workflows, such as clustering activities to minimize data transfers and parallelizing data processing to reduce makespan, so that two or more activities are performed at the same time on different computational resources. Parallelism, in this case, is defined by the structure of the workflow's activity-composition model. In general, the Workflow Management Systems responsible for coordinating and executing these activities in a distributed environment are unaware of the type of processing each activity will perform, and therefore cannot automatically apply strategies for parallel execution. Parallelizable activities are defined by the user at workflow design time, and creating a structure that makes efficient use of a distributed environment is not a trivial task. This work aims to provide more efficient executions of data-intensive workflows and, to that end, proposes a method for the automatic parallelization of these applications, aimed at users who are not specialists in high-performance computing. The method defines nine semantic annotations that characterize how data is accessed and consumed by each activity and, taking the available computational resources into account, automatically creates strategies that exploit data parallelism. The proposed method generates replicas of annotated activities and also defines an indexing and distribution scheme for the workflow's data that allows greater parallel access. Its efficiency was evaluated on two workflow models with real data, executed on the Amazon cloud platform. A relational DBMS (PostgreSQL) and a NoSQL DBMS (MongoDB) were used to manage up to 20.5 million data objects in 21 scenarios with different data partitioning and replication settings. The experiments showed that the parallelization of activity execution promoted by the method reduced the workflow's makespan by up to 66.6% without increasing its monetary cost.
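The replication mechanism the abstract describes can be pictured with a minimal sketch. The annotation vocabulary below (a single consumes("each") marker) and the thread-based scheduler are assumptions made for brevity: the thesis defines nine annotations and targets distributed cloud resources rather than local threads.

    from concurrent.futures import ThreadPoolExecutor

    def consumes(pattern):
        """Hypothetical annotation marker: declares how an activity
        consumes its input ("each" = object-at-a-time, so replicable)."""
        def wrap(fn):
            fn.access_pattern = pattern
            return fn
        return wrap

    @consumes("each")
    def normalize(record):
        # Example activity: touches one data object at a time.
        return {k: str(v).strip().lower() for k, v in record.items()}

    def run_activity(activity, partitions, workers=4):
        """Replicate an "each"-annotated activity, one replica per data
        partition; otherwise run it once over the whole (merged) input."""
        if getattr(activity, "access_pattern", None) == "each":
            def on_partition(part):
                return [activity(rec) for rec in part]
            with ThreadPoolExecutor(max_workers=workers) as pool:
                results = pool.map(on_partition, partitions)
            return [rec for part in results for rec in part]
        merged = [rec for part in partitions for rec in part]
        return [activity(rec) for rec in merged]

    parts = [[{"Name": " Ada "}], [{"Name": " Grace "}]]
    print(run_activity(normalize, parts))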
12

Performance Optimization Techniques and Tools for Data-Intensive Computation Platforms: An Overview of Performance Limitations in Big Data Systems and Proposed Optimizations

Kalavri, Vasiliki January 2014
Big data processing has recently gained a lot of attention from both academia and industry. The term refers to tools, methods, techniques and frameworks built to collect, store, process and analyze massive amounts of data. Big data can be structured, unstructured or semi-structured. Data is generated from many different sources and can arrive in the system at various rates. In order to process these large amounts of heterogeneous data in an inexpensive and efficient way, massive parallelism is often used. The common architecture of a big data processing system is a shared-nothing cluster of commodity machines. However, even in such a highly parallel setting, processing is often very time-consuming. Applications may take hours or even days to produce useful results, making interactive analysis and debugging cumbersome. One of the main problems is that good performance requires both good data locality and good resource utilization. A characteristic of big data analytics is that the amount of data processed is typically large in comparison with the amount of computation done on it. In this case, processing can benefit from data locality, which can be achieved by moving the computation close to the data rather than vice versa. Good utilization of resources means that the data processing is done with maximal parallelization. Both locality and resource utilization are aspects of the programming framework's runtime system. Requiring the programmer to work explicitly with parallel process creation and placement is not desirable; optimizations that relieve the programmer of such low-level, error-prone instrumentation while still achieving good performance are therefore essential. The main goal of this thesis is to study, design and implement performance optimizations for big data frameworks. This work contributes methods and techniques to build tools for easy and efficient processing of very large data sets, and describes ways to make systems faster by shortening job completion times. Another major goal is to facilitate application development on distributed data-intensive computation platforms and to make big data analytics accessible to non-experts, so that users with limited programming experience can benefit from analyzing enormous datasets. The thesis provides results from a study of existing optimizations in MapReduce and Hadoop-related systems. The study presents a comparison and classification of existing systems based on their main contribution, summarizes the current state of the research field, and identifies trends and open issues, while also providing our vision on future directions. Next, the thesis presents a set of performance optimization techniques and corresponding tools for data-intensive computing platforms. PonIC is a project that ports the high-level dataflow framework Pig on top of the data-parallel computing framework Stratosphere. The results of this work show that Pig can benefit greatly from using Stratosphere as the backend system and gain performance without any loss of expressiveness; the work also identifies the features of Pig that negatively impact execution time and presents a way of integrating Pig with different backends. HOP-S is a system that uses in-memory random sampling to return approximate, yet accurate, query answers. It uses a simple but efficient random sampling implementation, which significantly improves the accuracy of online aggregation. m2r2 is a system that stores intermediate results and uses plan matching and rewriting to reuse them in future queries, together with an optimization that exploits computation redundancy in analysis programs; our prototype on top of the Pig framework demonstrates significantly reduced query response times. Finally, an optimization framework for iterative fixed points exploits asymmetry in large-scale graph analysis. The framework uses a mathematical model to explain several optimizations and to formally specify the conditions under which optimized iterative algorithms are equivalent to the general solution.
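As context for the online-aggregation claim, here is a minimal sketch of the underlying technique: when input arrives in uniformly random order, a running aggregate over the prefix seen so far is an unbiased estimate, and a normal-approximation confidence interval tightens as more data is consumed. HOP-S's actual sampling and accuracy machinery is not described in this abstract; everything below is a generic illustration.

    import math
    import random

    def online_avg(stream, confidence_z=1.96, report_every=1000):
        """Running estimate of the mean with a normal-approximation
        confidence interval: a sketch of online aggregation, not
        HOP-S's actual accuracy machinery."""
        n, total, sq_total = 0, 0.0, 0.0
        for x in stream:
            n += 1
            total += x
            sq_total += x * x
            if n % report_every == 0:
                mean = total / n
                var = max(sq_total / n - mean * mean, 0.0)
                half_width = confidence_z * math.sqrt(var / n)
                yield n, mean, half_width

    data = [random.gauss(50, 10) for _ in range(10_000)]
    random.shuffle(data)  # uniform random order: any prefix is a uniform sample
    for n, mean, hw in online_avg(data):
        print(f"after {n:>5} rows: mean ~ {mean:.2f} +/- {hw:.2f}")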
14

Requirement-driven Design and Optimization of Data-Intensive Flows

Jovanovic, Petar 26 September 2016
Data has become the number-one asset of today's business world, and its exploitation and analysis attract the attention of people from different fields and with different technical backgrounds. Data-intensive flows are central processes in today's business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. However, designing and optimizing such data flows, so as to satisfy both users' information needs and agreed quality standards, is known to be a burdensome task, typically left to the manual efforts of a BI system designer. These tasks have become even more challenging for next-generation BI systems, where data flows typically need to combine data from in-house transactional storage with data coming from external sources in a variety of formats (e.g., social media, governmental data, news feeds). Moreover, to make an impact on business outcomes, data flows are expected to answer unanticipated analytical needs of a broader set of business users and to deliver valuable information in near real-time (i.e., at the right time). These challenges clearly indicate a need for boosting the automation of the design and optimization of data-intensive flows. This PhD thesis aims at providing automatable means for managing the lifecycle of data-intensive flows. The study first analyzes the remaining challenges in the field of data-intensive flows, by surveying the current literature and envisioning an architecture for managing the data-intensive flow lifecycle. Following the proposed architecture, we then focus on automatic techniques covering different phases of that lifecycle. In particular, the thesis first proposes an approach (CoAl) for the incremental design of data-intensive flows by means of multi-flow consolidation. CoAl not only facilitates the maintenance of data flow designs in the face of changing information needs, but also supports the multi-flow optimization of data-intensive flows by maximizing their reuse. Next, in the data warehousing (DW) context, we propose a complementary method (ORE) for the incremental design of the target DW schema, along with systematic tracing of evolution metadata, which can further facilitate the design of back-end data-intensive flows (i.e., ETL processes). The thesis then studies the problem of implementing data-intensive flows in the deployable formats of different execution engines, and proposes the BabbleFlow system for translating logical data-intensive flows into executable formats spanning single or multiple execution engines. Lastly, the thesis focuses on managing the execution of data-intensive flows on distributed data processing platforms and, to this end, proposes an algorithm (H-WorD) that supports the scheduling of data-intensive flows by workload-driven redistribution of data in computing clusters. The overall outcome of this thesis is an end-to-end platform for managing the lifecycle of data-intensive flows, called Quarry. The techniques proposed in this thesis, plugged into the Quarry platform, greatly reduce manual effort and assist users of different technical skills in their analytical tasks. Finally, the results of this thesis contribute to the field of data-intensive flows in today's BI systems, and advocate further attention by both academia and industry to the problems of designing and optimizing data-intensive flows.
/ Doctorate in Engineering Sciences and Technology / info:eu-repo/semantics/nonPublished
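The multi-flow consolidation behind CoAl can be pictured with a small sketch. The assumptions here are deliberate simplifications: flows are linear chains of hashable operator descriptors, and consolidation is prefix sharing, whereas CoAl works on general flow DAGs with a cost model guiding which subflows to merge.

    import json

    def consolidate(flows):
        """Merge flows (each a list of operator descriptors) into a
        prefix tree so that operators shared by several flows appear
        once and their work runs once."""
        root = {}
        for name, ops in flows.items():
            node = root
            for op in ops:
                node = node.setdefault(op, {})
            node.setdefault("__sinks__", []).append(name)
        return root

    flows = {
        "daily_report":  ["extract(orders)", "clean", "join(customers)", "agg(day)"],
        "weekly_report": ["extract(orders)", "clean", "join(customers)", "agg(week)"],
        "audit":         ["extract(orders)", "clean", "filter(flagged)"],
    }
    # All three flows share "extract(orders) -> clean"; two also share the join.
    print(json.dumps(consolidate(flows), indent=2))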
15

Improving Performance and Programmer Productivity for I/O-Intensive High Performance Computing Applications

Sehrish, Saba 01 January 2010
Due to the explosive growth in the size of scientific data sets, data-intensive computing is an emerging trend in computational science. HPC applications are generating and processing large amounts of data, ranging from terabytes (TB) to petabytes (PB). This growth in data for HPC applications raises the question of what an appropriate parallel programming framework for efficiently processing large data sets is. In this work, we study the applicability of two programming models (MPI/MPI-IO and MapReduce) to a variety of I/O-intensive HPC applications ranging from simulations to analytics. We identify several performance and programmer-productivity limitations of these existing programming models when used for I/O-intensive applications, and propose new frameworks that improve both for the emerging I/O-intensive applications. Message Passing Interface (MPI) is widely used for writing HPC applications. MPI/MPI-IO allows fine-grained control of data and task distribution. At the programming-framework level, various optimizations have been proposed to improve the performance of MPI/MPI-IO function calls. These performance optimizations are exposed to programmers as various function options; in order to write efficient code, programmers must know the exact usage of the optimization functions, so programmer productivity is limited. We propose an abstraction called Reduced Function Set Abstraction (RFSA) for MPI-IO, which reduces the number of I/O functions and provides methods to automate the selection of the appropriate I/O function for writing HPC simulation applications. The purpose of RFSA is to hide the performance optimization functions from the application developer and relieve the developer from deciding on a specific function. The proposed set of functions relies on a selection algorithm to decide among the most common optimizations provided by MPI-IO. Additionally, many application scientists are looking to integrate data-intensive computing into computation-intensive High Performance Computing facilities, particularly for data analytics. We have observed several scientific applications that must migrate their data from an HPC storage system to a data-intensive one. There is a gap between the data semantics of HPC storage and data-intensive systems; hence, once migrated, the data must be further refined and reorganized before existing data-intensive tools such as MapReduce can be effectively used to analyze it. This reorganization requires at least two complete scans through the data set and then at least one MapReduce program to prepare the data before analyzing it. Running multiple MapReduce phases causes significant overhead for the application in the form of excessive I/O operations: for every MapReduce application that must be run to complete the desired analysis, a distributed read and write operation on the file system must be performed. Our contribution is to extend MapReduce to eliminate the multiple scans and to reduce the number of pre-processing MapReduce programs.
We have added additional expressiveness to the MapReduce language in our novel framework, MapReduce with Access Patterns (MRAP), which allows users to specify the logical semantics of their data such that 1) the data can be analyzed without running multiple data pre-processing MapReduce programs, and 2) the data can be simultaneously reorganized as it is migrated to the data-intensive file system. We also provide a scheduling mechanism to further improve the performance of these applications. The main contributions of this thesis are: 1) We implement a selection algorithm for I/O functions such as read/write, merge a set of functions for data types and file views, and optimize the atomicity function by automating the locking mechanism in RFSA. By running different parallel I/O benchmarks on both medium-scale clusters and NERSC supercomputers, we show improved programmer productivity (35.7% on average). This approach incurs an overhead of 2-5% for one particular optimization, and shows a performance improvement of 17% when an application requires a combination of different optimizations. 2) We provide an augmented MapReduce system (MRAP), consisting of an API and corresponding optimizations, i.e., data restructuring and scheduling. We have demonstrated up to 33% throughput improvement in one real application (read-mapping in bioinformatics), and up to 70% in an I/O kernel of another application (halo catalogs analytics). Our scheduling scheme shows a performance improvement of 18% for an I/O kernel of a third application (QCD analytics).
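RFSA's point, a single I/O entry point that picks the right MPI-IO optimization on the programmer's behalf, can be sketched as follows. The selection rule shown (use collective I/O when every rank issues a similar request) and the file name are assumptions for illustration; the thesis's selection algorithm covers more functions and criteria. The sketch uses the mpi4py bindings rather than C MPI.

    from mpi4py import MPI
    import numpy as np

    def rfsa_read(fh, offset, buf, comm):
        """Single read entry point in the spirit of RFSA: the caller never
        picks an MPI-IO optimization; a selection rule does. The rule below
        (collective I/O when all ranks read equal-sized chunks) is an
        illustrative assumption, not the thesis's actual algorithm."""
        sizes = comm.allgather(buf.nbytes)
        uniform = len(set(sizes)) == 1  # crude access-pattern probe
        if uniform:
            # Collective read: ranks cooperate, enabling two-phase I/O.
            fh.Read_at_all(offset, buf)
        else:
            # Independent read: no coordination overhead for odd patterns.
            fh.Read_at(offset, buf)
        return buf

    if __name__ == "__main__":
        comm = MPI.COMM_WORLD
        # "input.dat" is a hypothetical input file for the example.
        fh = MPI.File.Open(comm, "input.dat", MPI.MODE_RDONLY)
        chunk = np.empty(1 << 20, dtype=np.uint8)  # 1 MiB per rank
        rfsa_read(fh, comm.Get_rank() * chunk.nbytes, chunk, comm)
        fh.Close()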
16

Supporting Data-Intensive Scientific Computing on Bandwidth and Space Constrained Environments

Bicer, Tekin 18 August 2014
No description available.
17

Efficient and parallel evaluation of XQuery

Li, Xiaogang 22 February 2006
No description available.
18

Studies on Nonlinear Optimal Control System Design Based on Data-Intensive Approach / データ集約的方法に基づく非線形最適制御系設計法の研究

Beppu, Hirofumi 23 March 2022
Kyoto University / New-system doctoral course / Doctor of Engineering / Degree No. Kō 23888 / Engineering Doctorate No. 4975 / Call number 新制||工||1777 (University Library) / Department of Aeronautics and Astronautics, Graduate School of Engineering, Kyoto University / Examining committee: Professor Kenji Fujimoto (chair), Professor Manabu Kano, Associate Professor Ichiro Maruta, Professor Fumitoshi Matsuno / Qualified under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Philosophy (Engineering) / Kyoto University / DGAM
19

Preuve de propriétés dynamiques en B / Proving dynamic properties in B

Diagne, Fama 26 September 2013
The properties that we would like to express on data-intensive applications cannot be limited to static properties, called invariance properties, which relate states of the system taken at the same moment. Indeed, some properties, called dynamic properties, may refer to past or future states of the system. Existing work on the verification of such properties typically uses model checking, whose effectiveness for data-intensive applications is rather limited due to the combinatorial explosion of the state space. Proof-based techniques, for their part, require fairly advanced skills in mathematical reasoning and are difficult to apply, especially as they are not always supported by tools. To overcome these limitations, this thesis proposes proof-based verification approaches that use the B formal method. We are mainly interested in reachability and precedence properties, for which we define rules that generate the proof obligations needed to establish them. A reachability property expresses that there is at least one execution of the system that reaches a target state from a given initial state, while a precedence property ensures that a given system state is always preceded by another state. To make these approaches practical, we have developed a support tool that relieves the user of the long and tedious task of generating proof obligations.
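To make the idea of generated proof obligations concrete, here is a sketch of what a generator for reachability might emit, based on the classical loop-variant rule: each firing of an enabled event preserves the invariant and decreases a natural-number variant, so the target is eventually reached. The rule shape and the B-style syntax are generic illustrations, not the thesis's actual generation rules.

    def reachability_obligations(inv, guard, action, variant, target):
        """Emit B-style proof obligation strings for reaching `target` by
        iterating one event (guard ==> action). A sketch of the general
        loop/variant rule; the thesis's rules for B are richer."""
        return [
            # The event stays within the machine invariant.
            f"{inv} & {guard} => [{action}]({inv})",
            # While the target is not reached, the event is enabled...
            f"{inv} & not({target}) => ({guard})",
            # ...and each firing strictly decreases a natural-number variant.
            f"{inv} & {guard} & {variant} = V0 => [{action}]({variant} < V0)",
            f"{inv} => {variant} : NAT",
        ]

    for po in reachability_obligations(
            inv="balance : NAT",
            guard="balance < goal",
            action="balance := balance + deposit",
            variant="goal - balance",
            target="balance >= goal"):
        print(po)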
20

A portable relational algebra library for high performance data-intensive query processing

Saeed, Ifrah 09 April 2014
A growing number of industries are turning to data warehousing applications such as forecasting and risk assessment to process large volumes of data. These data warehousing applications, which use queries comprised of a mix of arithmetic and relational algebra (RA) operators, currently run on systems built around commodity multi-core CPUs. Given the data-intensive nature of these applications, general-purpose graphics processing units (GPUs), with their high throughput and memory bandwidth, seem to be natural candidates to host them. However, since such relational queries exhibit irregular parallelism and data accesses, their efficient implementation on GPUs remains challenging. Thus, although tailored solutions for individual processors using their native programming environments have evolved, these solutions are not accessible to other processors. This thesis addresses this problem by providing, in the form of a library, a portable implementation of the RA, mathematical, and related primitives required to implement and accelerate relational queries over large data sets. These primitives can run on any modern multi- and many-core architecture that supports OpenCL, thereby enhancing the performance potential of such architectures for warehousing applications. In essence, this thesis describes the implementation of the primitives and the results of their performance evaluation on a range of platforms, and concludes with insights, the identification of opportunities, and lessons learned. One of the major insights from our analysis is that for complex relational queries, the time taken to transfer data between host CPUs and discrete GPUs can render the performance of discrete and integrated GPUs comparable, in spite of the higher computing power and memory bandwidth of discrete GPUs. Therefore, data movement optimization is the key to effectively harnessing the high performance of discrete GPUs; otherwise, cost effectiveness would encourage the use of integrated GPUs. Furthermore, portability also enables the complete utilization of all GPUs and CPUs in the system at run time, by opportunistically using any type of available processor when a kernel is ready for execution.
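As an illustration of what a portable RA primitive looks like, the sketch below implements the flagging half of a selection operator as an OpenCL kernel, driven here through the pyopencl bindings for brevity (an assumption; the thesis's own library is not shown in this abstract). The same kernel source runs unmodified on any OpenCL device, CPU or GPU, which is exactly the portability argument the abstract makes; a real selection primitive would follow the flag vector with a prefix-sum compaction step.

    import numpy as np
    import pyopencl as cl

    KERNEL = """
    __kernel void select_gt(__global const int *col,
                            __global int *flag,
                            const int threshold) {
        // One work-item per tuple: mark rows passing the predicate.
        size_t i = get_global_id(0);
        flag[i] = col[i] > threshold;
    }
    """

    ctx = cl.create_some_context()          # any OpenCL device: CPU or GPU
    queue = cl.CommandQueue(ctx)
    prog = cl.Program(ctx, KERNEL).build()

    col = np.random.randint(0, 100, size=1_000_000).astype(np.int32)
    flags = np.empty_like(col)

    mf = cl.mem_flags
    col_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=col)
    flag_buf = cl.Buffer(ctx, mf.WRITE_ONLY, flags.nbytes)

    # SELECT ... WHERE col > 42: compute the per-row flag vector on device.
    prog.select_gt(queue, col.shape, None, col_buf, flag_buf, np.int32(42))
    cl.enqueue_copy(queue, flags, flag_buf)
    print("selected:", int(flags.sum()), "of", col.size)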
