81

Query Processing on Prefix Trees Live

Kissinger, Thomas, Schlegel, Benjamin, Habich, Dirk, Lehner, Wolfgang 17 August 2022 (has links)
Modern database systems have to process huge amounts of data and, at the same time, should provide results with low latency. To achieve this, data is nowadays typically held completely in main memory to benefit from its high bandwidth and low access latency, which disks could never reach. Current in-memory databases are usually column stores that exchange columns or vectors between operators and suffer from a high tuple reconstruction overhead. In this demonstration proposal, we present DexterDB, which implements our novel prefix-tree-based processing model that makes indexes the first-class citizens of the database system. The core idea is that each operator takes a set of indexes as input and builds a new index as output that is indexed on the attribute requested by the successive operator. With that, we are able to build composed operators, like the multi-way select-join-group. Such operators speed up the processing of complex OLAP queries, so that DexterDB outperforms state-of-the-art in-memory databases. Our demonstration focuses on the different optimization options for such query plans. Hence, we built an interactive GUI that connects to a DexterDB instance and allows the manipulation of query-optimization parameters. The generated query plans and important execution statistics are visualized to help the visitor understand our processing model.
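A minimal Python sketch may help make the processing model concrete: each operator consumes an index and emits a new index keyed on the attribute the next operator needs. This is only an illustration of the idea described above, not DexterDB's implementation; the PrefixTree class and the select_and_reindex helper are invented for the example.

class PrefixTree:
    """Toy byte-wise prefix tree over fixed-width integer keys."""
    def __init__(self):
        self.root = {}

    def insert(self, key, tup):
        node = self.root
        for byte in key.to_bytes(4, "big"):          # key bytes form the trie path
            node = node.setdefault(byte, {})
        node.setdefault("tuples", []).append(tup)

    def scan(self):
        stack = [self.root]
        while stack:
            node = stack.pop()
            yield from node.get("tuples", [])
            stack.extend(child for k, child in node.items() if k != "tuples")

def select_and_reindex(index, predicate, next_attr):
    """Filter tuples from `index` and build a new index keyed on `next_attr`,
    i.e. on the attribute requested by the successive operator."""
    out = PrefixTree()
    for tup in index.scan():
        if predicate(tup):
            out.insert(tup[next_attr], tup)
    return out

# Usage: an index keyed on 'custkey' feeds a selection whose output is
# re-indexed on 'orderdate', ready for the next operator.
orders = PrefixTree()
for t in [{"custkey": 1, "orderdate": 20230104, "price": 10},
          {"custkey": 2, "orderdate": 20230105, "price": 99}]:
    orders.insert(t["custkey"], t)
by_date = select_and_reindex(orders, lambda t: t["price"] > 50, "orderdate")
print(list(by_date.scan()))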
82

Learning Techniques For Information Retrieval And Mining In High-dimensional Databases

Cheng, Hao 01 January 2009 (has links)
The main focus of my research is to design effective learning techniques for information retrieval and mining in high-dimensional databases. There are two main aspects in retrieval and mining research: accuracy and efficiency. The accuracy problem is how to return results that better match the ground truth, and the efficiency problem is how to evaluate users' requests and execute learning algorithms as fast as possible. These problems are non-trivial because of the complexity of the high-level semantic concepts, the heterogeneous nature of the feature space, the high dimensionality of data representations, and the size of the databases. My dissertation is dedicated to addressing these issues. Specifically, my work has five main contributions, as follows.

The first contribution is a novel manifold learning algorithm, Local and Global Structures Preserving Projection (LGSPP), which defines salient low-dimensional representations for high-dimensional data. A small number of projection directions are sought in order to properly preserve the local and global structures of the original data. Specifically, two groups of points are extracted for each individual point in the dataset: the first group contains the nearest neighbors of the point, and the other contains a few sampled points far away from it. These two point sets respectively characterize the local and global structures with regard to the data point. The objective of the embedding is to minimize the distances between the points in each local neighborhood and to disperse the points far away from their respective remote points in the original space. In this way, the relationships between the data in the original space are well preserved with little distortion.

The second contribution is a new constrained clustering algorithm. Conventionally, clustering is an unsupervised learning problem, which systematically partitions a dataset into a small set of clusters such that data in each cluster appear similar to each other compared with those in other clusters. In this proposal, partial human knowledge is exploited to find better clustering results. Two kinds of constraints are integrated into the clustering algorithm. One is the must-link constraint, indicating that the two points involved belong to the same cluster; the cannot-link constraint, in contrast, denotes that two points are not within the same cluster. Given the input constraints, data points are arranged into small groups and a graph is constructed to preserve the semantic relations between these groups. The assignment procedure makes a best effort to assign each group to a feasible cluster without violating the constraints. The theoretical analysis reveals that the probability of data points being assigned to the true clusters is much higher with the new proposal than with conventional methods. In general, the new scheme can produce clusters that better match the ground truth and respect the semantic relations between points inferred from the constraints.

The third contribution is a unified framework for partition-based dimension reduction techniques, which allows efficient similarity retrieval in the high-dimensional data space. Recent similarity search techniques, such as Piecewise Aggregate Approximation (PAA), Segmented Means (SMEAN) and Mean-Standard deviation (MS), prove to be very effective in reducing data dimensionality by partitioning dimensions into subsets and extracting aggregate values from each dimension subset. These partition-based techniques have many advantages, including very efficient multi-phased pruning, while being simple to implement. They are, however, not adaptive to the different characteristics of data in diverse applications. In this study, a unified framework for these partition-based techniques is proposed and the issue of dimension partitioning is examined within this framework. An investigation of the relationship between query selectivity and dimension partition schemes reveals indicators that can predict the performance of a partitioning setting. Accordingly, a greedy algorithm is designed to effectively determine a good partitioning of data dimensions so that the performance of the reduction technique is robust with regard to different datasets.

The fourth contribution is an effective similarity search technique for databases of point sets. In the conventional model, an object corresponds to a single vector. In the proposed study, an object is represented by a set of points. In general, this new representation can be used in many real-world applications and carries much more local information, but the retrieval and learning problems become very challenging. The Hausdorff distance is the common distance function for measuring the similarity between two point sets; however, this metric is sensitive to outliers in the data. To address this issue, a novel similarity function is defined to better capture the proximity of two objects, in which a one-to-one mapping is established between the vectors of the two objects. The optimal mapping minimizes the sum of distances between paired points. The overall distance of the optimal matching is robust and has high retrieval accuracy. The computation of the new distance function is formulated as the classical assignment problem. Lower-bounding techniques and an early-stop mechanism are also proposed to significantly accelerate the expensive similarity search process.

The classification problem over point-set data is called Multiple Instance Learning (MIL) in the machine learning community, in which a vector is an instance and an object is a bag of instances. The fifth contribution is to convert the MIL problem into standard supervised learning in the conventional vector space. Specifically, feature vectors of bags are grouped into clusters. Each object is then denoted as a bag of cluster labels, and common patterns of each category are discovered, each of which is further reconstructed into a bag of features. Accordingly, a bag is effectively mapped into a feature space defined by the distances from this bag to all the derived patterns. Standard supervised learning algorithms can then be applied to classify objects into pre-defined categories. The results demonstrate that the proposal has better classification accuracy than other state-of-the-art techniques.

In the future, I will continue to explore my research in large-scale data analysis algorithms, applications and system development. In particular, I am interested in applications that analyze the massive volume of online data.
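To make the fourth contribution's distance function concrete, here is a small sketch comparing the outlier-sensitive Hausdorff distance with an optimal one-to-one matching distance computed via the classical assignment problem. The code is illustrative only and is not taken from the dissertation; it assumes SciPy's linear_sum_assignment as the assignment solver.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def hausdorff(a, b):
    d = cdist(a, b)                        # pairwise Euclidean distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def matching_distance(a, b):
    d = cdist(a, b)                        # cost matrix for the assignment problem
    rows, cols = linear_sum_assignment(d)  # optimal one-to-one mapping
    return d[rows, cols].sum()

a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([[0.1, 0.0], [1.1, 0.0], [9.0, 9.0]])   # last point is an outlier
print(hausdorff(a, b), matching_distance(a, b))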
83

Efficient exploitation of similar subexpressions for query processing

Zhou, Jingren, Larson, Per-Ake, Freytag, Johann Christoph, Lehner, Wolfgang 13 December 2022 (has links)
Complex queries often contain common or similar subexpressions, either within a single query or among multiple queries submitted as a batch. If so, query execution time can be improved by evaluating a common subexpression once and reusing the result in multiple places. However, current query optimizers do not recognize and exploit similar subexpressions, even within the same query. We present an efficient, scalable, and principled solution to this long-standing optimization problem. We introduce a lightweight and effective mechanism to detect potential sharing opportunities among expressions. Candidate covering subexpressions are constructed, and optimization is resumed to determine which, if any, of these subexpressions to include in the final query plan. The chosen subexpression(s) are computed only once and the results are reused to answer other parts of the queries. Our solution automatically applies to the optimization of query batches, nested queries, and the maintenance of multiple materialized views. It is the first comprehensive solution covering all aspects of the problem: detection, construction, and cost-based optimization. Experiments on Microsoft SQL Server show significant performance improvements with minimal overhead.
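As an illustration of the detection step only (the actual mechanism operates inside the optimizer's memo and is not reproduced here), a lightweight way to find sharing candidates is to group subexpressions by a cheap, order-insensitive signature; the Expr class and the table-set signature below are assumptions made for the sketch.

from collections import defaultdict

class Expr:
    def __init__(self, op, tables, predicates, children=()):
        self.op = op
        self.tables = frozenset(tables)
        self.predicates = frozenset(predicates)
        self.children = children

    def signature(self):
        # Operator kind plus the set of tables touched: a cheap fingerprint.
        return (self.op, self.tables)

def subexpressions(root):
    yield root
    for child in root.children:
        yield from subexpressions(child)

def sharing_candidates(roots):
    groups = defaultdict(list)
    for root in roots:
        for e in subexpressions(root):
            groups[e.signature()].append(e)
    # Only signatures seen more than once can be shared across plans.
    return {sig: exprs for sig, exprs in groups.items() if len(exprs) > 1}

q1 = Expr("join", {"orders", "lineitem"}, {"o_ok=l_ok"})
q2 = Expr("agg", {"orders", "lineitem"}, {"o_ok=l_ok", "l_qty>10"},
          children=(Expr("join", {"orders", "lineitem"}, {"o_ok=l_ok"}),))
print(sharing_candidates([q1, q2]).keys())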
84

Teaching In-Memory Database Systems the Detection of Hardware Errors

Lehner, Wolfgang, Habich, Dirk, Kolditz, Till 18 January 2023 (has links)
The key objective of database systems is to reliably manage data, whereby high query throughput and low query latency are core requirements. To satisfy these requirements, database systems constantly adapt to novel hardware features. Although it has been intensively studied and is commonly accepted that hardware error rates, in terms of bit flips, increase dramatically as the underlying chip structures shrink, most database system research has neglected this fact, leaving error (bit flip) detection as well as correction to the underlying hardware. For main memory in particular, silent data corruption (SDC) resulting from transient bit flips that lead to faulty data is mainly detected and corrected at the DRAM and memory-controller layer. However, since future hardware will become less reliable and error detection as well as correction in hardware will become more expensive, this free ride will come to an end in the near future. To continue providing reliable data management, an emerging research direction employs specific, tailored protection techniques at the database-system level. Following this direction, we are currently developing and implementing an adapted system design for state-of-the-art in-memory column stores. In our lightning talk, we will summarize our current state and outline future work.
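One concrete way to realize such database-level protection (named here only as an example of the direction described above, not as this talk's specific design) is AN coding: every value is multiplied by a constant A before being stored, so any stored word that is not divisible by A reveals a corruption. A minimal sketch:

A = 641  # an odd constant; since A is odd, every single-bit flip is detected

def an_encode(value):
    return value * A

def an_check_and_decode(word):
    if word % A != 0:
        raise ValueError("bit flip detected")
    return word // A

stored = an_encode(12345)
assert an_check_and_decode(stored) == 12345
corrupted = stored ^ (1 << 7)          # flip a single bit of the stored word
try:
    an_check_and_decode(corrupted)
except ValueError as e:
    print(e)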
85

Efficient LSM Secondary Indexing for Update-Intensive Workloads

Jaewoo Shin (17069089) 29 September 2023 (has links)
<p dir="ltr">In recent years, massive amounts of data have been generated from various types of devices or services. For these data, update-intensive workloads where the data update their status periodically and continuously are common. The Log-Structured-Merge (LSM, for short) is a widely-used indexing technique in various systems, where index structures buffer insert operations into the memory layer and flush them into disk when the data size in memory exceeds a threshold. Despite its noble ability to handle write-intensive (i.e., insert-intensive) workloads, LSM suffers from degraded query performance due to its inefficiency on index maintenance of secondary keys to handle update-intensive workloads.</p><p dir="ltr">This dissertation focuses on the efficient support of update-intensive workloads for LSM-based indexes. First, the focus is on the optimization of LSM secondary-key indexes and their support for update-intensive workloads. A mechanism to enable the LSM R-tree to handle update-intensive workloads efficiently is introduced. The new LSM indexing structure is termed the LSM RUM-tree, an LSM R-tree with Update Memo. The key insights are to reduce the maintenance cost of the LSM R-tree by leveraging an additional in-memory memo structure to control the size of the memo to fit in memory. In the experiments, the LSM RUM-tree achieves up to 9.6x speedup on update operations and up to 2400x speedup on query operations.</p><p dir="ltr">Second, the focus is to offer several significant advancements in the context of the LSM RUM-tree. We provide an extended examination of LSM-aware Update Memo (UM) cleaning strategies, elucidating how effectively each strategy reduces UM size and contributes to performance enhancements. Moreover, in recognition of the imperative need to facilitate concurrent activities within the LSM RUM-Tree, particularly in multi-threaded/multi-core environments, we introduce a pivotal feature of concurrency control for the update memo. The novel atomic operation known as Compare and If Less than Swap (CILS) is introduced to enable seamless concurrent operations on the Update Memo. Experimental results attest to a notable 4.5x improvement in the speed of concurrent update operations when compared to existing and baseline implementations.</p><p dir="ltr">Finally, we present a novel technique designed to improve query processing performance and optimize storage management in any secondary LSM tree. Our proposed approach introduces a new framework and mechanisms aimed at addressing the specific challenges associated with secondary indexing in the structure of the LSM tree, especially in the context of secondary LSM B+-tree (LSM BUM-tree). Experimental results show that the LSM BUM-tree achieves up to 5.1x speedup on update-intensive workloads and 107x speedup on update and query mixed workloads over existing LSM B+-tree implementations.</p>
86

Query processing on low-energy many-core processors

Lehner, Wolfgang, Ungethüm, Annett, Habich, Dirk, Karnagel, Tomas, Asmussen, Nils, Völp, Marcus, Nöthen, Benedikt, Fettweis, Gerhard 12 January 2023 (has links)
Aside from performance, energy efficiency is an increasing challenge in database systems. To tackle both aspects in an integrated fashion, we pursue a hardware/software co-design approach. To fulfill the energy requirement from the hardware perspective, we utilize a low-energy processor design that offers the possibility of placing hundreds to millions of chips on a single board without any thermal restrictions. Furthermore, we address the performance requirement by developing several database-specific instruction set extensions to customize each core, whereby no single core has all extensions. Therefore, our hardware foundation is a low-energy processor consisting of a high number of heterogeneous cores. In this paper, we introduce our hardware setup at the system level and present several challenges for query processing. Based on these challenges, we describe two implementation concepts and compare them. Finally, we conclude the paper with some lessons learned and an outlook on our upcoming research directions.
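One of the query processing challenges hinted at above is deciding where an operator should run when no core offers every extension. The greedy placement below is purely illustrative; the extension names and the placement policy are assumptions, not the two implementation concepts evaluated in the paper.

def place_operators(operators, cores):
    """Greedy placement: each operator goes to the least-loaded core that has the
    extension it needs; falls back to any core (software path) if none has it."""
    load = {core_id: 0 for core_id in cores}
    placement = {}
    for op, needed_ext in operators:
        candidates = [c for c, exts in cores.items() if needed_ext in exts] or list(cores)
        chosen = min(candidates, key=lambda c: load[c])
        placement[op] = chosen
        load[chosen] += 1
    return placement

cores = {0: {"hash", "bitpack"}, 1: {"sort"}, 2: {"hash"}, 3: set()}
operators = [("scan_part", "bitpack"), ("hash_join", "hash"),
             ("hash_agg", "hash"), ("order_by", "sort")]
print(place_operators(operators, cores))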
87

Hierarchisches gruppenbasiertes Sampling / Hierarchical group-based sampling

Gemulla, Rainer, Berthold, Henrike, Lehner, Wolfgang 12 January 2023 (has links)
In times of increasing database sizes, it is crucial to process queries approximately in order to obtain answers quickly. This article introduces several methods for achieving this goal and then focuses on sampling, which quickly yields adequate results by using only a subset of the actual data. If database queries contain join or group-by operations, the accuracy of many sampling methods drops significantly; in particular, small groups are not recognized. This article is concerned with hierarchical group-based sampling, which combines sampling, grouping, and joins.
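A small sketch of the group-based part of the idea (illustrative only; the article's hierarchical scheme is more involved): drawing the sample per group guarantees that small groups survive, and per-group sampling rates let aggregates be scaled back to unbiased estimates.

import random

def group_based_sample(rows, group_key, per_group=10, seed=0):
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    sample, rates = [], {}
    for g, members in groups.items():
        take = min(per_group, len(members))
        sample.extend(rng.sample(members, take))
        rates[g] = take / len(members)          # needed to scale SUM/COUNT estimates
    return sample, rates

def estimate_count(sample, rates, group_key):
    est = {}
    for row in sample:
        g = row[group_key]
        est[g] = est.get(g, 0) + 1 / rates[g]   # Horvitz-Thompson style scaling
    return est

rows = [{"g": "big", "v": i} for i in range(1000)] + [{"g": "small", "v": i} for i in range(5)]
sample, rates = group_based_sample(rows, "g")
print(estimate_count(sample, rates, "g"))       # roughly {'big': 1000, 'small': 5}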
88

Service-Oriented Sensor-Actuator Networks

Rezgui, Abdelmounaam 09 January 2008 (has links)
In this dissertation, we propose service-oriented sensor-actuator networks (SOSANETs) as a new paradigm for building the next generation of customizable, open, interoperable sensor-actuator networks. In SOSANETs, nodes expose their capabilities to applications in the form of service profiles. A node's service profile consists of a set of services (i.e., sensing and actuation capabilities) that it provides and the quality of service (QoS) parameters associated with those services (delay, accuracy, freshness, etc.). SOSANETs provide the benefits of both application-specific SANETs and generic SANETs. We first define a query model and an architecture for SOSANETs. The proposed query model offers a simple, uniform query interface whereby applications specify sensing and actuation queries independently of any specific deployment of the underlying SOSANET. We then present μRACER (Reliable Adaptive serviCe-driven Efficient Routing), a routing protocol suite for SOSANETs. μRACER consists of three routing protocols, namely SARP (Service-Aware Routing Protocol), CARP (Context-Aware Routing Protocol), and TARP (Trust-Aware Routing Protocol). SARP uses an efficient service-aware routing approach that aggressively reduces downstream traffic by translating service profiles into efficient paths. CARP supports QoS by dynamically adapting each node's routing behavior and service profile according to the current context of that node, i.e., the number of pending queries and the number and type of messages to be routed. Finally, TARP achieves high end-to-end reliability through a scalable reputation-based approach in which each node is able to locally estimate the next hop of the most reliable path to the sink. We also propose query optimization techniques that contribute to the efficient execution of queries in SOSANETs. To evaluate the proposed service-oriented architecture, we implemented TinySOA, a prototype SOSANET built on top of TinyOS with μRACER as its routing mechanism. TinySOA is designed as a set of layers with a loose interaction model that enables several cross-layer optimization options. We conducted an evaluation of TinySOA that included a comparison with TinyDB. The obtained empirical results show that TinySOA achieves significant improvements in many aspects, including energy consumption, scalability, reliability and response time. / Ph. D.
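To illustrate the service-profile and query-matching idea only (the field names and matching rule below are assumptions for the sketch, not TinySOA's actual interface):

node_profiles = {
    "node_7":  {"temperature": {"delay_ms": 120, "accuracy": 0.5, "freshness_s": 2}},
    "node_12": {"temperature": {"delay_ms": 40,  "accuracy": 0.2, "freshness_s": 1},
                "valve_actuation": {"delay_ms": 80}},
}

def matching_nodes(service, required_qos):
    """Return nodes whose advertised QoS meets every requirement (lower is better)."""
    hits = []
    for node, services in node_profiles.items():
        qos = services.get(service)
        if qos and all(qos.get(k, float("inf")) <= v for k, v in required_qos.items()):
            hits.append(node)
    return hits

print(matching_nodes("temperature", {"delay_ms": 100, "accuracy": 0.3}))   # ['node_12']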
89

Design and Analysis of Adaptive Fault Tolerant QoS Control Algorithms for Query Processing in Wireless Sensor Networks

Speer, Ngoc Anh Phan 02 May 2008 (has links)
Data sensing and retrieval in WSNs have great applicability in military, environmental, medical, home and commercial applications. In query-based WSNs, a user issues a query with quality of service (QoS) requirements in terms of reliability and timeliness, and expects a correct response to be returned within the deadline. Satisfying these QoS requirements requires fault tolerance mechanisms based on redundancy, which may cause the energy of the system to deplete quickly. This dissertation presents the design and validation of adaptive fault tolerant QoS control algorithms with the objective of achieving the desired QoS requirements and maximizing the system lifetime in query-based WSNs. We analyze the effect of redundancy on the mean time to failure (MTTF) of query-based, cluster-structured WSNs and show that an optimal redundancy level exists such that the MTTF of the system is maximized. We develop a hop-by-hop data delivery (HHDD) mechanism and an Adaptive Fault Tolerant Quality of Service Control (AFTQC) algorithm in which we utilize "source" and "path" redundancy with the goal of satisfying application QoS requirements while maximizing the lifetime of WSNs. To deal with network dynamics, we investigate proactive and reactive methods for dynamically collecting channel and delay conditions to determine the optimal redundancy level at runtime. AFTQC can adapt to network dynamics that cause changes to the node density, residual energy, sensor failure probability, and radio range due to energy consumption, node failures, and changes of node connectivity. Further, AFTQC can deal with software faults, concurrent query processing with distinct QoS requirements, and data aggregation. To demonstrate feasibility, we compare our design with a baseline design without redundancy that is based on acknowledgement for data transmission and geographical routing for relaying packets. We validate the analytical results with extensive simulation studies. Given the QoS requirements of queries in terms of reliability and timeliness, our AFTQC design identifies optimal "source" and "path" redundancies and applies them dynamically in response to network dynamics, so that not only are query QoS requirements satisfied, as long as adequate resources are available, but the lifetime of the system is also prolonged. / Ph. D.
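The redundancy trade-off described above can be illustrated with a toy model (not the dissertation's analysis): with per-path success probability p, m redundant paths deliver a correct answer with probability 1 - (1 - p)^m, while energy cost grows with m, so the smallest m that meets the reliability requirement is a natural choice.

def min_redundancy(p_path, required_reliability, m_max=10):
    """Smallest number of redundant paths meeting the reliability requirement."""
    for m in range(1, m_max + 1):
        reliability = 1 - (1 - p_path) ** m
        if reliability >= required_reliability:
            return m, reliability
    return None

print(min_redundancy(p_path=0.7, required_reliability=0.99))   # (4, 0.9919)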
90

Préservation de la confidentialité des données externalisées dans le traitement des requêtes top-k / Privacy preserving top-k query processing over outsourced data

Mahboubi, Sakina 21 November 2018 (has links)
Outsourcing corporate or individual data to a cloud provider, e.g. using Database-as-a-Service, is practical and cost-effective, but it introduces a major problem: how to preserve the privacy of the outsourced data while supporting powerful user queries. A simple solution is to encrypt the data before it is outsourced. Then, to answer a query, the user client can retrieve the encrypted data from the cloud, decrypt it, and evaluate the query over plaintext (non-encrypted) data. This solution is not practical, as it does not take advantage of the computing power provided by the cloud for evaluating queries.

In this thesis, we consider an important kind of query, the top-k query, and address the problem of privacy-preserving top-k query processing over encrypted data in the cloud. A top-k query allows the user to specify a number k, and the system returns the k tuples that are most relevant to the query. The relevance degree of tuples to the query is determined by a scoring function.

We first propose a complete system, called BuckTop, that is able to efficiently evaluate top-k queries over encrypted data without having to decrypt it in the cloud. BuckTop includes a top-k query processing algorithm that works on the encrypted data, stored at one cloud node, and returns a set that is proved to contain the encrypted data corresponding to the top-k results. It also comes with an efficient filtering algorithm that is executed in the cloud on encrypted data and removes most of the false positives included in the returned set. When the outsourced data is big, it is typically partitioned over multiple nodes in a distributed system. For this case, we propose two new systems, called SDB-TOPK and SD-TOPK, that can evaluate top-k queries over encrypted distributed data without having to decrypt them at the nodes where they are stored. In addition, SDB-TOPK and SD-TOPK have a powerful filtering algorithm that filters out the false positives as much as possible at the nodes and returns a small set of encrypted data that will be decrypted on the user side. We analyze the security of our system and propose efficient strategies to enforce it.

We validated our solutions through the implementation of BuckTop, SDB-TOPK and SD-TOPK, and compared them to baseline approaches over synthetic and real databases. The results show excellent response time compared to the baseline approaches. They also show the efficiency of our filtering algorithm, which eliminates almost all false positives. Furthermore, our systems yield a significant reduction in communication cost between the distributed system nodes when computing the query result.
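The bucket-based idea can be made concrete with a simplified sketch (an illustration of the description above, not the thesis implementation; the bucket width, the Fernet cipher and the data layout are assumptions): the cloud sees only ciphertexts tagged with bucket ids, ranks buckets, and returns enough of them to cover k results, while the client decrypts and removes the false positives.

import json
from cryptography.fernet import Fernet

key = Fernet.generate_key()
f = Fernet(key)
BUCKET_WIDTH = 10                     # bucket boundaries are known to the cloud

def outsource(tuples):
    """Client side: encrypt each tuple, tag it only with its bucket id."""
    store = {}
    for t in tuples:
        bucket = t["score"] // BUCKET_WIDTH
        store.setdefault(bucket, []).append(f.encrypt(json.dumps(t).encode()))
    return store

def cloud_topk_candidates(store, k):
    """Cloud side: return encrypted tuples from the highest buckets until >= k."""
    out = []
    for bucket in sorted(store, reverse=True):
        out.extend(store[bucket])
        if len(out) >= k:
            break
    return out

def client_topk(candidates, k):
    """Client side: decrypt the candidates and keep the true top-k."""
    rows = [json.loads(f.decrypt(c)) for c in candidates]
    return sorted(rows, key=lambda r: r["score"], reverse=True)[:k]

data = [{"id": i, "score": s} for i, s in enumerate([12, 95, 41, 88, 90, 7, 63])]
store = outsource(data)
print(client_topk(cloud_topk_candidates(store, 3), 3))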
