Global ETD Search

111	Cardinality Estimation with Local Deep Learning Models Woltmann, Lucas, Hartmann, Claudio, Thiele, Maik, Habich, Dirk, Lehner, Wolfgang 14 June 2022 (has links) Cardinality estimation is a fundamental task in database query processing and optimization. Unfortunately, the accuracy of traditional estimation techniques is poor resulting in non-optimal query execution plans. With the recent expansion of machine learning into the field of data management, there is the general notion that data analysis, especially neural networks, can lead to better estimation accuracy. Up to now, all proposed neural network approaches for the cardinality estimation follow a global approach considering the whole database schema at once. These global models are prone to sparse data at training leading to misestimates for queries which were not represented in the sample space used for generating training queries. To overcome this issue, we introduce a novel local-oriented approach in this paper, therefore the local context is a specific sub-part of the schema. As we will show, this leads to better representation of data correlation and thus better estimation accuracy. Compared to global approaches, our novel approach achieves an improvement by two orders of magnitude in accuracy and by a factor of four in training time performance for local models. info:eu-repo/classification/ddc/004 ddc:004
112	Materialized Views in the Presence of Reporting Functions Lehner, Wolfgang, Habich, Dirk, Just, Michael 15 June 2022 (has links) Materialized views are a well-known optimization strategy with the potential for massive improvements in query processing time, especially for aggregation queries over large tables. To realize this potential, the query optimizer has to know how and when to exploit materialized views. Reporting functions represent a novel technique to formulate sequence-oriented queries in SQL. They provide a column-wise ordering, partitioning, and windowing mechanism for aggregation functions and therefore extend the well-known way of grouping and applying simple aggregation functions. Up to now, current work has not considered the frequently used reporting functions in data warehouse environments. In this paper, we introduce materialized reporting function views and show how to rewrite queries with reporting functions as well as aggregation queries to this new kind of materialized views. We demonstrate the efficiency of our approach with a large number of experiments. info:eu-repo/classification/ddc/004 ddc:004
113	Sample synopses for approximate answering of group-by queries Lehner, Wolfgang, Rösch, Philipp 22 April 2022 (has links) With the amount of data in current data warehouse databases growing steadily, random sampling is continuously gaining in importance. In particular, interactive analyses of large datasets can greatly benefit from the significantly shorter response times of approximate query processing. Typically, those analytical queries partition the data into groups and aggregate the values within the groups. Further, with the commonly used roll-up and drill-down operations a broad range of group-by queries is posed to the system, which makes the construction of highly-specialized synopses difficult. In this paper, we propose a general-purpose sampling scheme that is biased in order to answer group-by queries with high accuracy. While existing techniques focus on the size of the group when computing its sample size, our technique is based on its standard deviation. The basic idea is that the more homogeneous a group is, the less representatives are required in order to give a good estimate. With an extensive set of experiments, we show that our approach reduces both the estimation error and the construction cost compared to existing techniques. info:eu-repo/classification/ddc/004 ddc:004
114	Gestion des données dans les réseaux sociaux / Data management in social networks Maniu, Silviu 28 September 2012 (has links) Nous abordons dans cette thèse quelques-unes des questions soulevées par I'émergence d'applications sociales sur le Web, en se concentrant sur deux axes importants: l'efficacité de recherche sociale dans les applications Web et l'inférence de liens sociaux signés à partir des interactions entre les utilisateurs dans les applications Web collaboratives. Nous commençons par examiner la recherche sociale dans les applications de "tag- ging". Ce problème nécessite une adaptation importante des techniques existantes, qui n'utilisent pas des informations sociaux. Dans un contexte ou le réseau est importante, on peut (et on devrait) d'exploiter les liens sociaux, ce qui peut indiquer la façon dont les utilisateurs se rapportent au demandeur et combien de poids leurs actions de "tagging" devrait avoir dans le résultat. Nous proposons un algorithme qui a le potentiel d'évoluer avec la taille des applications actuelles, et on le valide par des expériences approfondies. Comme les applications de recherche sociale peut être considérée comme faisant partie d'une catégorie plus large des applications sensibles au contexte, nous étudions le problème de répondre aux requêtes à partir des vues, en se concentrant sur deux sous-problèmes importants. En premier, la manipulation des éventuelles différences de contexte entre les différents points de vue et une requête d'entrée conduit à des résultats avec des score incertains, valables pour le nouveau contexte. En conséquence, les algorithmes top-k actuels ne sont plus directement applicables et doivent être adaptés aux telle incertitudes dans les scores des objets. Deuxièmement, les techniques adaptées de sélection de vue sont nécessaires, qui peuvent s’appuyer sur les descriptions des requêtes et des statistiques sur leurs résultats. Enfin, nous présentons une approche pour déduire un réseau signé (un "réseau de confiance") à partir de contenu généré dans Wikipedia. Nous étudions les mécanismes pour deduire des relations entre les contributeurs Wikipédia - sous forme de liens dirigés signés - en fonction de leurs interactions. Notre étude met en lumière un réseau qui est capturée par l’interaction sociale. Nous examinons si ce réseau entre contributeurs Wikipedia représente en effet une configuration plausible des liens signes, par l’étude de ses propriétés globaux et locaux du reseau, et en évaluant son impact sur le classement des articles de Wikipedia. / We address in this thesis some of the issues raised by the emergence of social applications on the Web, focusing on two important directions: efficient social search inonline applications and the inference of signed social links from interactions between users in collaborative Web applications. We start by considering social search in tagging (or bookmarking) applications. This problem requires a significant departure from existing, socially agnostic techniques. In a network-aware context, one can (and should) exploit the social links, which can indicate how users relate to the seeker and how much weight their tagging actions should have in the result build-up. We propose an algorithm that has the potential to scale to current applications, and validate it via extensive experiments. As social search applications can be thought of as part of a wider class of context-aware applications, we consider context-aware query optimization based on views, focusing on two important sub-problems. First, handling the possible differences in context between the various views and an input query leads to view results having uncertain scores, i.e., score ranges valid for the new context. As a consequence, current top-k algorithms are no longer directly applicable and need to be adapted to handle such uncertainty in object scores. Second, adapted view selection techniques are needed, which can leverage both the descriptions of queries and statistics over their results. Finally, we present an approach for inferring a signed network (a "web of trust")from user-generated content in Wikipedia. We investigate mechanisms by which relationships between Wikipedia contributors - in the form of signed directed links - can be inferred based their interactions. Our study sheds light into principles underlying a signed network that is captured by social interaction. We investigate whether this network over Wikipedia contributors represents indeed a plausible configuration of link signs, by studying its global and local network properties, and at an application level, by assessing its impact in the classification of Wikipedia articles.javascript:nouvelleZone('abstract');_ajtAbstract('abstract'); Wikipedia Recherche sociale Web collaboratif Algorithme de seuil Réseau signé Traîtement des requêtes Algorithme de grande échelle Wikipedia Social search Collaborative web Threshold algorithm Signed network Query processing Web-scale algorithm
115	Derby/S: A DBMS for Sample-Based Query Answering Klein, Anja, Gemulla, Rainer, Rösch, Philipp, Lehner, Wolfgang 10 November 2022 (has links) Although approximate query processing is a prominent way to cope with the requirements of data analysis applications, current database systems do not provide integrated and comprehensive support for these techniques. To improve this situation, we propose an SQL extension---called SQL/S---for approximate query answering using random samples, and present a prototypical implementation within the engine of the open-source database system Derby---called Derby/S. Our approach significantly reduces the required expert knowledge by enabling the definition of samples in a declarative way; the choice of the specific sampling scheme and its parametrization is left to the system. SQL/S introduces new DDL commands to easily define and administrate random samples subject to a given set of optimization criteria. Derby/S automatically takes care of sample maintenance if the underlying dataset changes. Finally, samples are transparently used during query processing, and error bounds are provided. Our extensions do not affect traditional queries and provide the means to integrate sampling as a first-class citizen into a DBMS. info:eu-repo/classification/ddc/004 ddc:004
116	Adaptive work placement for query processing on heterogeneous computing resources Karnagel, Thomas, Habich, Dirk, Wolfgang 10 November 2022 (has links) The hardware landscape is currently changing from homogeneous multi-core systems towards heterogeneous systems with many di↵erent computing units, each with their own characteristics. This trend is a great opportunity for database systems to increase the overall performance if the heterogeneous resources can be utilized eciently. To achieve this, the main challenge is to place the right work on the right computing unit. Current approaches tackling this placement for query processing assume that data cardinalities of intermediate results can be correctly estimated. However, this assumption does not hold for complex queries. To overcome this problem, we propose an adaptive placement approach being independent of cardinality estimation of intermediate results. Our approach is incorporated in a novel adaptive placement sequence. Additionally, we implement our approach as an extensible virtualization layer, to demonstrate the broad applicability with multiple database systems. In our evaluation, we clearly show that our approach significantly improves OLAP query processing on heterogeneous hardware, while being adaptive enough to react to changing cardinalities of intermediate query results. info:eu-repo/classification/ddc/004 ddc:004
117	Robust Real-time Query Processing with QStream Schmidt, Sven, Legler, Thomas, Schär, Sebastian, Lehner, Wolfgang 08 August 2023 (has links) Processing data streams with Quality-of-Service (QoS) guarantees is an emerging area in existing streaming applications. Although it is possible to negotiate the result quality and to reserve the required processing resources in advance, it remains a challenge to adapt the DSMS to data stream characteristics which are not known in advance or are difficult to obtain. Within this paper we present the second generation of our QStream DSMS which addresses the above challenge by using a real-time capable operating system environment for resource reservation and by applying an adaptation mechanism if the data stream characteristics change spontaneously. info:eu-repo/classification/ddc/004 ddc:004
118	Heterogeneity-Aware Operator Placement in Column-Store DBMS Karnagel, Tomas, Habich, Dirk, Schlegel, Benjamin, Lehner, Wolfgang 02 February 2023 (has links) Due to the tremendous increase in the amount of data efficiently managed by current database systems, optimization is still one of the most challenging issues in database research. Today’s query optimizer determine the most efficient composition of physical operators to execute a given SQL query, whereas the underlying hardware consists of a multi-core CPU. However, hardware systems are more and more shifting towards heterogeneity, combining a multi-core CPU with various computing units, e.g., GPU or FPGA cores. In order to efficiently utilize the provided performance capability of such heterogeneous hardware, the assignment of physical operators to computing units gains importance. In this paper, we propose a heterogeneity-aware physical operator placement strategy (HOP) for in-memory columnar database systems in a heterogeneous environment. Our placement approach takes operators from the physical query execution plan as an input and assigns them to computing units using a cost model at runtime. To enable this runtime decision, our cost model uses the characteristics of the computing units, execution properties of the operators, as well as runtime data to estimate execution costs for each unit. We evaluated our approach on full TPC-H queries within a prototype database engine. As we are going to show, the placement in a heterogeneous hardware system has a high influence on query performance. info:eu-repo/classification/ddc/004 ddc:004
119	Exploring Techniques for Providing Privacy in Location-Based Services Nearest Neighbor Query Asanya, John-Charles 01 January 2015 (has links) Increasing numbers of people are subscribing to location-based services, but as the popularity grows so are the privacy concerns. Varieties of research exist to address these privacy concerns. Each technique tries to address different models with which location-based services respond to subscribers. In this work, we present ideas to address privacy concerns for the two main models namely: the snapshot nearest neighbor query model and the continuous nearest neighbor query model. First, we address snapshot nearest neighbor query model where location-based services response represents a snapshot of point in time. In this model, we introduce a novel idea based on the concept of an open set in a topological space where points belongs to a subset called neighborhood of a point. We extend this concept to provide anonymity to real objects where each object belongs to a disjointed neighborhood such that each neighborhood contains a single object. To help identify the objects, we implement a database which dynamically scales in direct proportion with the size of the neighborhood. To retrieve information secretly and allow the database to expose only requested information, private information retrieval protocols are executed twice on the data. Our study of the implementation shows that the concept of a single object neighborhood is able to efficiently scale the database with the objects in the area. The size of the database grows with the size of the grid and the objects covered by the location-based services. Typically, creating neighborhoods, computing distances between objects in the area, and running private information retrieval protocols causes the CPU to respond slowly with this increase in database size. In order to handle a large number of objects, we explore the concept of kernel and parallel computing in GPU. We develop GPU parallel implementation of the snapshot query to handle large number of objects. In our experiment, we exploit parameter tuning. The results show that with parameter tuning and parallel computing power of GPU we are able to significantly reduce the response time as the number of objects increases. To determine response time of an application without knowledge of the intricacies of GPU architecture, we extend our analysis to predict GPU execution time. We develop the run time equation for an operation and extrapolate the run time for a problem set based on the equation, and then we provide a model to predict GPU response time. As an alternative, the snapshot nearest neighbor query privacy problem can be addressed using secure hardware computing which can eliminate the need for protecting the rest of the sub-system, minimize resource usage and network transmission time. In this approach, a secure coprocessor is used to provide privacy. We process all information inside the coprocessor to deny adversaries access to any private information. To obfuscate access pattern to external memory location, we use oblivious random access memory methodology to access the server. Experimental evaluation shows that using a secure coprocessor reduces resource usage and query response time as the size of the coverage area and objects increases. Second, we address privacy concerns in the continuous nearest neighbor query model where location-based services automatically respond to a change in object*s location. In this model, we present solutions for two different types known as moving query static object and moving query moving object. For the solutions, we propose plane partition using a Voronoi diagram, and a continuous fractal space filling curve using a Hilbert curve order to create a continuous nearest neighbor relationship between the points of interest in a path. Specifically, space filling curve results in multi-dimensional to 1-dimensional object mapping where values are assigned to the objects based on proximity. To prevent subscribers from issuing a query each time there is a change in location and to reduce the response time, we introduce the concept of transition and update time to indicate where and when the nearest neighbor changes. We also introduce a database that dynamically scales with the size of the objects in a path to help obscure and relate objects. By executing the private information retrieval protocol twice on the data, the user secretly retrieves requested information from the database. The results of our experiment show that using plane partitioning and a fractal space filling curve to create nearest neighbor relationships with transition time between objects reduces the total response time. Engineering
120	Partition-based SIMD Processing and its Application to Columnar Database Systems Hildebrandt, Juliana, Pietrzyk, Johannes, Krause, Alexander, Habich, Dirk, Lehner, Wolfgang 19 March 2024 (has links) The Single Instruction Multiple Data (SIMD) paradigm became a core principle for optimizing query processing in columnar database systems. Until now, only the LOAD/STORE instructions are considered to be efficient enough to achieve the expected speedups, while avoiding GATHER/SCATTER is considered almost imperative. However, the GATHER instruction offers a very flexible way to populate SIMD registers with data elements coming from non-consecutive memory locations. As we will discuss within this article, the GATHER instruction can achieve the same performance as the LOAD instruction, if applied properly. To enable the proper usage, we outline a novel access pattern allowing fine-grained, partition-based SIMD implementations. Then, we apply this partition-based SIMD processing to two representative examples from columnar database systems to experimentally demonstrate the applicability and efficiency of our new access pattern. info:eu-repo/classification/ddc/004 ddc:004

Search results