Global ETD Search

91	Analytical Query Processing Based on Continuous Compression of Intermediates Damme, Patrick 02 October 2020 (has links) Nowadays, increasingly large amounts of data are being collected in numerous areas ranging from science to industry. To gain valueable insights from these data, the importance of Online Analytical Processing (OLAP) workloads is constantly growing. At the same time, the hardware landscape is continuously evolving. On the one hand, the increasing capacities of DRAM allow database systems to store their entire data in main memory. Furthermore, the performance of microprocessors has improved tremendously in recent years through the use of sophisticated hardware techniques, such as Single Instruction Multiple Data (SIMD) extensions promising hitherto unknown processing speeds. On the other hand, the main memory bandwidth has not increased proportionately, such that the data access is now the main bottleneck for an efficient data processing. To face these developments, in-memory column-stores have emerged as a new database architecture. These systems store each attribute of a relation separately in memory as a contiguous sequence of values. It is state-of-the-art to encode all values as integers and apply lossless lightweight integer compression to reduce the data size. This offers several advantages ranging from lower transfer times between RAM and CPU over a better utilization of the cache hierarchy to fast direct processing of compressed data. However, compression also incurs a certain computational overhead. State-of-the-art systems focus on the compression of base data. However, intermediate results generated during the execution of complex analytical queries can exceed the base data in number and total size. Since in in-memory systems, accessing intermediates is as expensive as accessing base data, intermediates should be handled as efficiently as possible, too. While there are approaches trying to avoid intermediates whenever it is possible, we envision the orthogonal approach of efficiently representing intermediates using lightweight integer compression algorithms to reduce memory accesses. More precisely, our vision is a balanced query processing based on lightweight compression of intermediate results in in-memory column-stores. That means, all intermediates shall be represented using a suitable lightweight integer compression algorithm and processed by compression-enabled query operators to avoid a full decompression, whereby compression shall be used in a balanced way to ensure that its benefits outweigh its costs. In this thesis, we address all important aspects of this vision. We provide an extensive overview of existing lightweight integer compression algorithms and conduct a systematical experimental survey of several of these algorithms to gain a deep understanding of their behavior. We propose a novel compression-enabled processing model for in-memory column-stores allowing a continuous compression of intermediates. Additionally, we develop novel cost-based strategies for a compression-aware secondary query optimization to make effective use of our processing model. Our end-to-end evaluation using the famous Star Schema Benchmark shows that our envisioned compression of intermediates can improve both the memory footprint and the runtime of complex analytical queries significantly.:1 Introduction 1.1 Contributions 1.2 Outline 2 Lightweight Integer Compression 2.1 Foundations 2.1.1 Disambiguation of Lightweight Integer Compression 2.1.2 Overview of Lightweight Integer Compression 2.1.3 State-of-the-Art in Lightweight Integer Compression 2.2 Experimental Survey 2.2.1 Related Work 2.2.2 Experimental Setup and Methodology 2.2.3 Evaluation of the Impact of the Data Characteristics 2.2.4 Evaluation of the Impact of the Hardware Characteristics 2.2.5 Evaluation of the Impact of the SIMD Extension 2.3 Summary and Discussion 3 Processing Compressed Intermediates 3.1 Processing Model for Compressed Intermediates 3.1.1 Related Work 3.1.2 Description of the Underlying Processing Model 3.1.3 Integration of Compression into Query Operators 3.1.4 Integration of Compression into the Overall Query Execution 3.1.5 Efficient Implementation 3.1.6 Evaluation 3.2 Direct Integer Morphing Algorithms 3.2.1 Related Work 3.2.2 Integer Morphing Algorithms 3.2.3 Example Algorithms 3.2.4 Evaluation 3.3 Summary and Discussion 4 Compression-Aware Query Optimization Strategies 4.1 Related Work 4.2 Compression-Aware Secondary Query Optimization 4.2.1 Compression-Level: Selecting a Suitable Algorithm 4.2.2 Operator-Level: Selecting Suitable Input/Output Formats 4.2.3 QEP-Level: Selecting Suitable Formats for All Involved Columns 4.3 Evaluation 4.3.1 Compression-Level: Selecting a Suitable Algorithm 4.3.2 Operator-Level: Selecting Suitable Input/Output Formats 4.3.3 Lessons Learned 4.4 Summary and Discussion 5 End-to-End Evaluation 5.1 Experimental Setup and Methodology 5.2 A Simple OLAP Query 5.3 Complex OLAP Queries: The Star Schema Benchmark 5.4 Summary and Discussion 6 Conclusion 6.1 Summary of this Thesis 6.2 Directions for Future Work Bibliography List of Figures List of Tables info:eu-repo/classification/ddc/004 ddc:004
92	Demonstrating Efficient Query Processing in Heterogeneous Environments Karnagel, Tomas, Hille, Matthias, Ludwig, Mario, Habich, Dirk, Lehner, Wolfgang, Heimel, Max, Markl, Volker 30 June 2022 (has links) The increasing heterogeneity in hardware systems gives developers many opportunities to add more functionality and computational power to the system. As a consequence, modern database systems will need to be able to adapt to a wide variety of heterogeneous architectures. While porting single operators to accelerator architectures is well-understood, a more generic approach is needed for the whole database system. In prior work, we presented a generic hardware-oblivious database system, where the operators can be executed on the main processor as well as on a large number of accelerator architectures. However, to achieve fully heterogeneous query processing, placement decisions are needed for the database operators. We enhance the presented system with heterogeneity-aware operator placement (HOP) to take a major step towards designing a database system that can efficiently exploit highly heterogeneous hardware environments. In this demonstration, we are focusing on the placement-integration aspect as well as presenting the resulting database system. info:eu-repo/classification/ddc/004 ddc:004
93	Supporting Advanced Queries on Scientific Array Data Ebenstein, Roee A. 18 December 2018 (has links) No description available. Computer Science DBMS Scientific Array Data Analytical Querying SDBMS Join Array Join Array Data Join Databases Array Databases NoDB Scientific Databases Query Optimization Plan Pruning Bitmap Join Bitmap Indexes Range Joins Data Skew Querying Skewed data
94	Efficient techniques for large-scale Web data management / Techniques efficaces de gestion de données Web à grande échelle Camacho Rodriguez, Jesus 25 September 2014 (has links) Le développement récent des offres commerciales autour du cloud computing a fortement influé sur la recherche et le développement des plateformes de distribution numérique. Les fournisseurs du cloud offrent une infrastructure de distribution extensible qui peut être utilisée pour le stockage et le traitement des données.En parallèle avec le développement des plates-formes de cloud computing, les modèles de programmation qui parallélisent de manière transparente l'exécution des tâches gourmandes en données sur des machines standards ont suscité un intérêt considérable, à commencer par le modèle MapReduce très connu aujourd'hui puis par d'autres frameworks plus récents et complets. Puisque ces modèles sont de plus en plus utilisés pour exprimer les tâches de traitement de données analytiques, la nécessité se fait ressentir dans l'utilisation des langages de haut niveau qui facilitent la charge de l'écriture des requêtes complexes pour ces systèmes.Cette thèse porte sur des modèles et techniques d'optimisation pour le traitement efficace de grandes masses de données du Web sur des infrastructures à grande échelle. Plus particulièrement, nous étudions la performance et le coût d'exploitation des services de cloud computing pour construire des entrepôts de données Web ainsi que la parallélisation et l'optimisation des langages de requêtes conçus sur mesure selon les données déclaratives du Web.Tout d'abord, nous présentons AMADA, une architecture d'entreposage de données Web à grande échelle dans les plateformes commerciales de cloud computing. AMADA opère comme logiciel en tant que service, permettant aux utilisateurs de télécharger, stocker et interroger de grands volumes de données Web. Sachant que les utilisateurs du cloud prennent en charge les coûts monétaires directement liés à leur consommation de ressources, notre objectif n'est pas seulement la minimisation du temps d'exécution des requêtes, mais aussi la minimisation des coûts financiers associés aux traitements de données. Plus précisément, nous étudions l'applicabilité de plusieurs stratégies d'indexation de contenus et nous montrons qu'elles permettent non seulement de réduire le temps d'exécution des requêtes mais aussi, et surtout, de diminuer les coûts monétaires liés à l'exploitation de l'entrepôt basé sur le cloud.Ensuite, nous étudions la parallélisation efficace de l'exécution de requêtes complexes sur des documents XML mis en œuvre au sein de notre système PAXQuery. Nous fournissons de nouveaux algorithmes montrant comment traduire ces requêtes dans des plans exprimés par le modèle de programmation PACT (PArallelization ConTracts). Ces plans sont ensuite optimisés et exécutés en parallèle par le système Stratosphere. Nous démontrons l'efficacité et l'extensibilité de notre approche à travers des expérimentations sur des centaines de Go de données XML.Enfin, nous présentons une nouvelle approche pour l'identification et la réutilisation des sous-expressions communes qui surviennent dans les scripts Pig Latin. Notre algorithme, nommé PigReuse, agit sur les représentations algébriques des scripts Pig Latin, identifie les possibilités de fusion des sous-expressions, sélectionne les meilleurs à exécuter en fonction du coût et fusionne d'autres expressions équivalentes pour partager leurs résultats. Nous apportons plusieurs extensions à l'algorithme afin d’améliorer sa performance. Nos résultats expérimentaux démontrent l'efficacité et la rapidité de nos algorithmes basés sur la réutilisation et des stratégies d'optimisation. / The recent development of commercial cloud computing environments has strongly impacted research and development in distributed software platforms. Cloud providers offer a distributed, shared-nothing infrastructure, that may be used for data storage and processing.In parallel with the development of cloud platforms, programming models that seamlessly parallelize the execution of data-intensive tasks over large clusters of commodity machines have received significant attention, starting with the MapReduce model very well known by now, and continuing through other novel and more expressive frameworks. As these models are increasingly used to express analytical-style data processing tasks, the need for higher-level languages that ease the burden of writing complex queries for these systems arises.This thesis investigates the efficient management of Web data on large-scale infrastructures. In particular, we study the performance and cost of exploiting cloud services to build Web data warehouses, and the parallelization and optimization of query languages that are tailored towards querying Web data declaratively.First, we present AMADA, an architecture for warehousing large-scale Web data in commercial cloud platforms. AMADA operates in a Software as a Service (SaaS) approach, allowing users to upload, store, and query large volumes of Web data. Since cloud users support monetary costs directly connected to their consumption of resources, our focus is not only on query performance from an execution time perspective, but also on the monetary costs associated to this processing. In particular, we study the applicability of several content indexing strategies, and show that they lead not only to reducing query evaluation time, but also, importantly, to reducing the monetary costs associated with the exploitation of the cloud-based warehouse.Second, we consider the efficient parallelization of the execution of complex queries over XML documents, implemented within our system PAXQuery. We provide novel algorithms showing how to translate such queries into plans expressed in the PArallelization ConTracts (PACT) programming model. These plans are then optimized and executed in parallel by the Stratosphere system. We demonstrate the efficiency and scalability of our approach through experiments on hundreds of GB of XML data.Finally, we present a novel approach for identifying and reusing common subexpressions occurring in Pig Latin scripts. In particular, we lay the foundation of our reuse-based algorithms by formalizing the semantics of the Pig Latin query language with extended nested relational algebra for bags. Our algorithm, named PigReuse, operates on the algebraic representations of Pig Latin scripts, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and merges other equivalent expressions to share its result. We bring several extensions to the algorithm to improve its performance. Our experiment results demonstrate the efficiency and effectiveness of our reuse-based algorithms and optimization strategies. Données Web XML Stratégies Traitement des requêtes Entreposage distribué XQuery Optimisation multi-requête Pig Latin Web data XML Commercial cloud services Indexing strategies Query processing Distributed storage Query parallelization XQuery Multi-query optimization Pig Latin
95	Scalable algorithms for cloud-based Semantic Web data management / Algorithmes passant à l’échelle pour la gestion de données du Web sémantique sur les platformes cloud Zampetakis, Stamatis 21 September 2015 (has links) Afin de construire des systèmes intelligents, où les machines sont capables de raisonner exactement comme les humains, les données avec sémantique sont une exigence majeure. Ce besoin a conduit à l’apparition du Web sémantique, qui propose des technologies standards pour représenter et interroger les données avec sémantique. RDF est le modèle répandu destiné à décrire de façon formelle les ressources Web, et SPARQL est le langage de requête qui permet de rechercher, d’ajouter, de modifier ou de supprimer des données RDF. Être capable de stocker et de rechercher des données avec sémantique a engendré le développement des nombreux systèmes de gestion des données RDF.L’évolution rapide du Web sémantique a provoqué le passage de systèmes de gestion des données centralisées à ceux distribués. Les premiers systèmes étaient fondés sur les architectures pair-à-pair et client-serveur, alors que récemment l’attention se porte sur le cloud computing.Les environnements de cloud computing ont fortement impacté la recherche et développement dans les systèmes distribués. Les fournisseurs de cloud offrent des infrastructures distribuées autonomes pouvant être utilisées pour le stockage et le traitement des données. Les principales caractéristiques du cloud computing impliquent l’évolutivité́, la tolérance aux pannes et l’allocation élastique des ressources informatiques et de stockage en fonction des besoins des utilisateurs.Cette thèse étudie la conception et la mise en œuvre d’algorithmes et de systèmes passant à l’échelle pour la gestion des données du Web sémantique sur des platformes cloud. Plus particulièrement, nous étudions la performance et le coût d’exploitation des services de cloud computing pour construire des entrepôts de données du Web sémantique, ainsi que l’optimisation de requêtes SPARQL pour les cadres massivement parallèles.Tout d’abord, nous introduisons les concepts de base concernant le Web sémantique et les principaux composants des systèmes fondés sur le cloud. En outre, nous présentons un aperçu des systèmes de gestion des données RDF (centralisés et distribués), en mettant l’accent sur les concepts critiques de stockage, d’indexation, d’optimisation des requêtes et d’infrastructure.Ensuite, nous présentons AMADA, une architecture de gestion de données RDF utilisant les infrastructures de cloud public. Nous adoptons le modèle de logiciel en tant que service (software as a service - SaaS), où la plateforme réside dans le cloud et des APIs appropriées sont mises à disposition des utilisateurs, afin qu’ils soient capables de stocker et de récupérer des données RDF. Nous explorons diverses stratégies de stockage et d’interrogation, et nous étudions leurs avantages et inconvénients au regard de la performance et du coût monétaire, qui est une nouvelle dimension importante à considérer dans les services de cloud public.Enfin, nous présentons CliqueSquare, un système distribué de gestion des données RDF basé sur Hadoop. CliqueSquare intègre un nouvel algorithme d’optimisation qui est capable de produire des plans massivement parallèles pour des requêtes SPARQL. Nous présentons une famille d’algorithmes d’optimisation, s’appuyant sur les équijointures n- aires pour générer des plans plats, et nous comparons leur capacité à trouver les plans les plus plats possibles. Inspirés par des techniques de partitionnement et d’indexation existantes, nous présentons une stratégie de stockage générique appropriée au stockage de données RDF dans HDFS (Hadoop Distributed File System). Nos résultats expérimentaux valident l’effectivité et l’efficacité de l’algorithme d’optimisation démontrant également la performance globale du système. / In order to build smart systems, where machines are able to reason exactly like humans, data with semantics is a major requirement. This need led to the advent of the Semantic Web, proposing standard ways for representing and querying data with semantics. RDF is the prevalent data model used to describe web resources, and SPARQL is the query language that allows expressing queries over RDF data. Being able to store and query data with semantics triggered the development of many RDF data management systems. The rapid evolution of the Semantic Web provoked the shift from centralized data management systems to distributed ones. The first systems to appear relied on P2P and client-server architectures, while recently the focus moved to cloud computing.Cloud computing environments have strongly impacted research and development in distributed software platforms. Cloud providers offer distributed, shared-nothing infrastructures that may be used for data storage and processing. The main features of cloud computing involve scalability, fault-tolerance, and elastic allocation of computing and storage resources following the needs of the users.This thesis investigates the design and implementation of scalable algorithms and systems for cloud-based Semantic Web data management. In particular, we study the performance and cost of exploiting commercial cloud infrastructures to build Semantic Web data repositories, and the optimization of SPARQL queries for massively parallel frameworks.First, we introduce the basic concepts around Semantic Web and the main components and frameworks interacting in massively parallel cloud-based systems. In addition, we provide an extended overview of existing RDF data management systems in the centralized and distributed settings, emphasizing on the critical concepts of storage, indexing, query optimization, and infrastructure. Second, we present AMADA, an architecture for RDF data management using public cloud infrastructures. We follow the Software as a Service (SaaS) model, where the complete platform is running in the cloud and appropriate APIs are provided to the end-users for storing and retrieving RDF data. We explore various storage and querying strategies revealing pros and cons with respect to performance and also to monetary cost, which is a important new dimension to consider in public cloud services. Finally, we present CliqueSquare, a distributed RDF data management system built on top of Hadoop, incorporating a novel optimization algorithm that is able to produce massively parallel plans for SPARQL queries. We present a family of optimization algorithms, relying on n-ary (star) equality joins to build flat plans, and compare their ability to find the flattest possibles. Inspired by existing partitioning and indexing techniques we present a generic storage strategy suitable for storing RDF data in HDFS (Hadoop’s Distributed File System). Our experimental results validate the efficiency and effectiveness of the optimization algorithm demonstrating also the overall performance of the system. Web sémantique RDF Stratégies d’indexation Systèmes distribués Stockage distribué Traitement des requêtes Optimisation des requêtes MapReduce Hadoop HDFS CliqueSquare AMADA Gestion des données RDF Jointures n-aires Plans plats Semantic Web RDF Commercial cloud services Indexing strategies Distributed systems Distributed storage Query processing Query optimization Query parallelization MapReduce Hadoop HDFS CliqueSquare AMADA RDF data management N-ary joins Flat plans
96	Real-time Business Intelligence through Compact and Efficient Query Processing Under Updates Idris, Muhammad 10 April 2019 (has links) Responsive analytics are rapidly taking over the traditional data analytics dominated by the post-fact approaches in traditional data warehousing. Recent advancements in analytics demand placing analytical engines at the forefront of the system to react to updates occurring at high speed and detect patterns, trends and anomalies. These kinds of solutions find applications in Financial Systems, Industrial Control Systems, Business Intelligence and on-line Machine Learning among others. These applications are usually associated with Big Data and require the ability to react to constantly changing data in order to obtain timely insights and take proactive measures. Generally, these systems specify the analytical results or their basic elements in a query language, where the main task then is to maintain these results under frequent updates efficiently. The task of reacting to updates and analyzing changing data has been addressed in two ways in the literature: traditional business intelligence (BI) solutions focus on historical data analysis where the data is refreshed periodically and in batches, and stream processing solutions process streams of data from transient sources as flow (or set of flows) of data items. Both kinds of systems share the niche of reacting to updates (known as dynamic evaluation); however, they differ in architecture, query languages, and processing mechanisms. In this thesis, we investigate the possibility of a reactive and unified framework to model queries that appear in both kinds of systems. In traditional BI solutions, evaluating queries under updates has been studied under the umbrella of incremental evaluation of updates that is based on relational incremental view maintenance model and mostly focus on queries that feature equi-joins. Streaming systems, in contrast, generally follow the automaton based models to evaluate queries under updates, and they generally process queries that mostly feature comparisons of temporal attributes (e.g., timestamp attributes) along-with comparisons of non-temporal attributes over streams of bounded sizes. Temporal comparisons constitute inequality constraints, while non-temporal comparisons can either be equality or inequality constraints, hence these systems mostly process inequality joins. As starting point, we postulate the thesis that queries in streaming systems can also be evaluated efficiently based on the paradigm of incremental evaluation just like in BI systems in a main-memory model. The efficiency of such a model is measured in terms of runtime memory footprint and the update processing cost. To this end, the existing approaches of dynamic evaluation in both kind of systems present a trade-off between memory footprint and the update processing cost. More specifically, systems that avoid materialization of query (sub) results incur high update latency and systems that materialize (sub) results incur high memory footprint. We are interested in investigating the possibility to build a model that can address this trade-off. In particular, we overcome this trade-off by investigating the possibility of practical dynamic evaluation algorithm for queries that appear in both kinds of systems, and present a main-memory data representation that allows to enumerate query (sub) results without materialization and can be maintained efficiently under updates. We call this representation the Dynamic Constant Delay Linear Representation (DCLR). We devise DCLRs with the following properties: 1) they allow, without materialization, enumeration of query results with bounded-delay (and with constant delay for a sub-class of queries); 2) they allow tuple lookup in query results with logarithmic delay (and with constant delay for conjunctive queries with equi-joins only); 3) they take space linear in the size of the database; 4) they can be maintained efficiently under updates. We first study the DCLRs with the above-described properties for the class of acyclic conjunctive queries featuring equi-joins with projections and present the dynamic evaluation algorithm. Then, we present the generalization of thiw algorithm to the class of acyclic queries featuring multi-way theta-joins with projections. We devise DCLRs with the above properties for acyclic conjunctive queries, and the working of dynamic algorithms over DCLRs is based on a particular variant of join trees, called the Generalized Join Trees (GJTs) that guarantee the above-described properties of DCLRs. We define GJTs and present the algorithms to test a conjunctive query featuring theta-joins for acyclicity and to generate GJTs for such queries. To do this, we extend the classical GYO algorithm from testing a conjunctive query with equalities for acyclicity to test a conjunctive query featuring multi-way theta-joins with projections for acyclicity. We further extend the GYO algorithm to generate GJTs for queries that are acyclic. We implemented our algorithms in a query compiler that takes as input the SQL queries and generates Scala executable code – a trigger program to process queries and maintain under updates. We tested our approach against state of the art main-memory BI and CEP systems. Our evaluation results have shown that our DCLRs based approach is over an order of magnitude efficient than existing systems for both memory footprint and update processing cost. We have also shown that the enumeration of query results without materialization in DCLRs is comparable (and in some cases efficient) as compared to enumerating from materialized query results. info:eu-repo/classification/ddc/004 ddc:004

Page generated in 0.1088 seconds