Global ETD Search

131	Factorisation in relational databases Zavodny, Jakub January 2014 We study representation systems for relational data based on relational algebra expressions with unions, products, and singleton relations. Algebraic factorisation using the distributivity of product over union allows succinct representation of many-to-many relationships; further succinctness is brought by sharing repeated subexpressions. We show that these techniques are especially applicable to results of conjunctive queries. In the first part of the dissertation we derive tight asymptotic size bounds for two flavours of factorised representations of results of conjunctive queries. Any conjunctive query is characterised by rational parameters that govern the factorisability of its results independently of the database instance. We relate these parameters to fractional edge covers and fractional hypertree decompositions. Factorisation naturally extends from relational data to its provenance. We characterise conjunctive queries by tight bounds on their readability, which captures how many times each input tuple is used to contribute to an output tuple, and we define syntactically the class of queries with bounded readability. In the second part of the dissertation we describe FDB, a relational database engine that uses factorised representations at the physical layer to reduce data redundancy and boost query performance. We develop algorithms for optimisation and evaluation of queries with selection, projection, join, aggregation and order-by clauses on factorised representations. By introducing novel operators for factorisation restructuring and a new optimisation objective to maintain intermediate and final results succinctly factorised, we allow query evaluation with lower time complexity than on flat relations. Experiments show that for data sets with many-to-many relationships, FDB can outperform relational engines by orders of magnitude. 005.75
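The size saving that distributivity of product over union gives can be illustrated with a small sketch. It is not taken from the dissertation or from FDB: the function names and the sample data are hypothetical, and it only compares a flat listing of a product with its factorised form (a union of singletons times a union of singletons).

```python
from itertools import product


def flat_size(left, right):
    """Count singleton values in the flat listing of left x right (two values per tuple)."""
    return 2 * len(left) * len(right)


def factorised_size(left, right):
    """Count singleton values in the factorised form (union of left) x (union of right)."""
    return len(left) + len(right)


def enumerate_tuples(left, right):
    """Expand the factorised form back into the flat list of tuples."""
    return list(product(left, right))


# Hypothetical many-to-many data: every order contains every item.
orders = ["o1", "o2", "o3"]
items = ["milk", "bread", "eggs", "tea"]

print(flat_size(orders, items))              # 24
print(factorised_size(orders, items))        # 7
print(len(enumerate_tuples(orders, items)))  # 12
```

The flat form grows with the product of the two input sizes, while the factorised form grows with their sum; the tuples themselves remain recoverable by expansion.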
132	Temporal Join Processing with Hilbert Curve Space Mapping Raigoza, Jaime Antonio 01 January 2013 Management of data with a time dimension increases the overhead of storage and query processing in large database applications, especially with the join operation, which is a commonly used and expensive relational operator whose processing depends on the size of the input relations. An index-based approach has been shown to improve the processing of a join operation, which, in turn, improves the performance of querying historical data. Temporal data consist of tuples associated with a time interval value having a valid life span of different lengths. With join processing on temporal data, since tuples with longer life spans tend to overlap a greater number of joining tuples, they are likely to be accessed more often. The efficient performance of a temporal join depending on index-clustered data is the main theme studied and researched in this work. The presence of intervals having an extended data range in temporal data makes the join evaluation harder because temporal data are intrinsically multidimensional. Some temporal join processing methods create duplicates of tuples with long life spans to achieve clustering of similar data, which improves the performance on tuples that tend to be accessed more frequently. The proposed Hilbert-Temporal Join (Hilbert-TJ) algorithm avoids the need for data duplication by mapping temporal data into Hilbert curve space that is inherently clustered, thus allowing for fast retrieval and storage. A balanced B+ tree index structure was implemented to manage and query the data. The query method identifies data pages containing matching tuples that intersect a multidimensional region. 
Given that data pages consist of contiguously mapped points on the curve, the query process successively traverses along the curve to determine the next page that intersects the query region by iteratively partitioning the data space. The proposed Adaptive Replacement Cache-Temporal Data (ARC-TD) buffer replacement policy is built upon the Adaptive Replacement Cache (ARC) policy by favoring the cache retention of data pages in proportion to the average life span of the tuples in the buffer. By giving preference to tuples having long life spans, a higher cache hit ratio was evident. The caching priority is also balanced between recently and frequently accessed data. An evaluation and comparison study of the proposed Hilbert-TJ algorithm determined the relative performance with respect to a nested-loop join, a sort-merge join, and a partition-based join algorithm that use a multiversion B+ tree (MVBT) index. The metrics are based on a comparison between the processing time (disk I/O time plus CPU time), cache hit ratio, and index storage size needed to perform the temporal join. The study was conducted with comparisons in terms of the Least Recently Used (LRU), Least Frequently Used (LFU), ARC, and the new ARC-TD buffer replacement policy. Under the given conditions, the expected outcome was that by reducing data redundancy and considering the longevity of frequently accessed temporal data, better performance was achieved. Additionally, the Hilbert-TJ algorithm offers support to both valid-time and transaction-time data. arc Hilbert join optimization query temporal Computer Sciences
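The idea of mapping interval data onto a Hilbert curve can be sketched as follows. This is a minimal illustration, not the Hilbert-TJ implementation: it uses the standard iterative conversion from grid coordinates to a Hilbert index, and treating an interval as the 2-D point (start, end) is an assumption made here for the example. The names `hilbert_index` and `interval_key` are hypothetical.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its position on the Hilbert curve."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def interval_key(start, end, n=16):
    """Assumed mapping: an interval [start, end] becomes the 2-D point (start, end)."""
    return hilbert_index(n, start, end)


# Hypothetical intervals; sorting by key gives the order a one-dimensional index would store.
intervals = [(2, 9), (3, 4), (10, 12), (2, 8)]
for iv in sorted(intervals, key=lambda iv: interval_key(*iv)):
    print(iv, interval_key(*iv))
```

Because consecutive positions on the curve are always neighbouring grid cells, keys that are close in a one-dimensional index such as a B+ tree tend to be close in both start and end time, which is the clustering property the abstract relies on.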
133	Main-Memory Query Processing Utilizing External Indexes Truong, Thanh January 2016 Many applications require storage and indexing of new kinds of data in main memory, e.g. color histograms, textures, shape features, gene sequences, sensor readings, or financial time series. Even though many domain index structures have been developed, very few of them are implemented in any database management system (DBMS), which usually offers only B-trees and hash indexes. A major reason is that the manual effort to include a new index implementation in a regular DBMS is very costly and time-consuming because it requires integration with all components of the DBMS kernel. To alleviate this, there are some extensible indexing frameworks. However, they all require re-engineering the index implementations, which is a problem when the index has third-party ownership, when only binary code is available, or simply when the index implementation is complex to re-engineer. Therefore, the DBMS should allow including new index implementations without code changes and performance degradation. Furthermore, for high performance the query processor needs knowledge of how to process queries to utilize a plugged-in index. Moreover, it is important that all functionalities of a plugged-in index implementation are correct. The extensible main-memory database system (MMDB) Mexima (Main-memory External Index Manager) addresses these challenges. It enables transparent plugging in of main-memory index implementations without code changes. Index-specific rewrite rules transform complex queries to utilize the indexes. Automatic test procedures validate their correctness based on user-provided index meta-data. Moreover, the same optimization framework can also optimize complex queries sent to a back-end DBMS by exposing hidden indexes for its query optimizer. Altogether, Mexima is a complete and extensible platform for transparent index integration, utilization, and evaluation. 
Database indexing query processing index structures main-memory index validation
134	Distributed Graph Storage And Querying System Balaji, Janani 12 August 2016 Graph databases offer an efficient way to store and access inter-connected data. However, to query large graphs that no longer fit in memory, it becomes necessary to make multiple trips to the storage device to filter and gather data based on the query. But I/O accesses are expensive operations that immensely slow down query response time and prevent us from fully exploiting the graph-specific benefits that graph databases offer. The storage models of most existing graph database systems view graphs as indivisible structures and hence do not allow a hierarchical layering of the graph. This adversely affects query performance for large graphs as there is no way to filter the graph on a higher level without actually accessing the entire information from the disk. Distributing the storage and processing is one way to extract better performance. But current distributed solutions to this problem are not entirely effective, again due to the indivisible representation of graphs adopted in the storage format. This causes unnecessary latency due to increased inter-processor communication. In this dissertation, we propose an optimized distributed graph storage system for scalable and faster querying of big graph data. We start with our unique physical storage model, in which the graph is decomposed into three different levels of abstraction, each with a different storage hierarchy. We use a hybrid storage model to store the most critical component and restrict the I/O trips to only when absolutely necessary. This lets us actively make use of multi-level filters while querying, without the need for comprehensive indexes. Our results show that our system outperforms established graph databases for several classes of queries. 
We show that this separation also eases the difficulties in distributing graph data and go on to propose a more efficient distributed model for querying general-purpose graph data using the Spark framework. Graph Databases Distributed Graph Databases Distributed Graph Query Processing Spark
135	Static type analysis of XQuery expressions using rewriting calculus Wang, Zhen, 王珍 January 2007 Computer Science / Master / Master of Philosophy XML (Document markup language) Query languages (Computer science) Calculus.
136	Evaluation of Shortest Path Query Algorithm in Spatial Databases Lim, Heechul January 2003 Many variations of algorithms for finding the shortest path in a large graph have been introduced recently due to the needs of applications like the Geographic Information System (GIS) or Intelligent Transportation System (ITS). The primary subjects of those algorithms are materialization and hierarchical path views. Some studies focus on the materialization and sacrifice the pre-computational costs and storage costs for faster computation of a query. Other studies focus on the shortest-path algorithm, which has less pre-computation and storage but takes more time to compute the shortest path. The main objective of this thesis is to accelerate the computation time for the shortest-path queries while keeping the degree of materialization as low as possible. This thesis explores two different categories: 1) the reduction of the I/O-costs for multiple queries, and 2) the reduction of search spaces in a graph. The thesis proposes two simple algorithms to reduce the I/O-costs, especially for multiple queries. To tackle the problem of reducing search spaces, we give two different levels of materializations, namely, the boundary set distance matrix and x-Hop sketch graph, both of which materialize the shortest-path view of the boundary nodes in a partitioned graph. Our experiments show that a combination of the suggested solutions for 1) and 2) performs better than the original Disk-based SP algorithm [7], on which our work is based, and requires much less storage than HEPV [3]. Computer Science Shortest Path Query Spatial Database Pruning Algorithm
137	Causal modeling and prediction over event streams Acharya, Saurav 01 January 2014 In recent years, there has been a growing need for causal analysis in many modern stream applications such as web page click monitoring, patient health care monitoring, stock market prediction, electric grid monitoring, and network intrusion detection systems. The detection and prediction of causal relationships help in monitoring, planning, decision making, and prevention of unwanted consequences. An event stream is a continuous unbounded sequence of event instances. The availability of a large amount of continuous data along with high data throughput poses new challenges related to causal modeling over event streams, such as (1) the need for incremental causal inference for the unbounded data, (2) the need for fast causal inference for the high throughput data, and (3) the need for real-time prediction of effects from the events seen so far in the continuous event streams. This dissertation research addresses these three problems by focusing on utilizing temporal precedence information which is readily available in event streams: (1) an incremental causal model to update the causal network incrementally with the arrival of a new batch of events instead of storing the complete set of events seen so far and building the causal network from scratch with those stored events, (2) a fast causal model to speed up the causal network inference time, and (3) a real-time top-k predictive query processing mechanism to find the most probable k effects with the highest scores by proposing a run-time causal inference mechanism which addresses cyclic causal relationships. In this dissertation, the motivation, related work, proposed approaches, and results are presented for each of the three problems. Bayesian network Causal model Prediction Top-k query Computer Sciences
138	Query evaluation with constant delay / L'évaluation de requêtes avec un délai constant Kazana, Wojciech 16 September 2013 This thesis centers on the problem of query evaluation. Given a query q and a database D, the goal is to compute the set q(D) of all tuples in the output of q on D. However, the set q(D) may be larger than the database itself, as it can have a size of the form n^l where n is the size of the database and l the arity of the query. Computing it entirely can therefore require more than the available resources. The main focus of this thesis is a particular solution to this problem: a scenario where, instead of just computing q(D), we are interested in enumerating it with constant delay. Intuitively, this means that there is a two-phase algorithm working as follows: a preprocessing phase that works in time linear in the size of the database, followed by an enumeration phase outputting one by one all the elements of q(D) with a constant delay (which is independent of the size of the database) between any two consecutive outputs. Additionally, four more problems related to enumeration are also considered in the thesis. These are model-checking (where the query q is boolean), counting (where one wants to compute just the size |q(D)| of the output set), testing (where one is interested in an efficient test for whether a given tuple belongs to the output of the query or not) and j-th solution (where one wants to be able to directly access the j-th element of q(D)). The results presented in the thesis address the above problems with respect to: - first-order queries over the classes of structures with bounded degree, - monadic second-order queries over the classes of structures with bounded treewidth, - first-order queries over the classes of structures with bounded expansion. Databases Query evaluation Logic
139	Event stream analytics Poppe, Olga 05 January 2018 Advances in hardware, software and communication networks have enabled applications to generate data at unprecedented volume and velocity. An important type of such data is event streams generated from financial transactions, health sensors, web logs, social media, mobile devices, and vehicles. The world is thus poised for a sea-change in time-critical applications from financial fraud detection to health care analytics empowered by inferring insights from event streams in real time. Event processing systems continuously evaluate massive workloads of Kleene queries to detect and aggregate event trends of interest. Examples of these trends include check kites in financial fraud detection, irregular heartbeat in health care analytics, and vehicle trajectories in traffic control. These trends can be of any length. Worse yet, their number may grow exponentially in the number of events. State-of-the-art systems do not offer practical solutions for trend analytics and thus suffer from long delays and high memory costs. In this dissertation, we propose the following event trend detection and aggregation techniques. First, we solve the trade-off between CPU processing time and memory usage while computing event trends over high-rate event streams. Namely, our event trend detection approach guarantees minimal CPU processing time given limited memory. Second, we compute online event trend aggregation at multiple granularity levels from fine (per matched event), to medium (per event type), to coarse (per pattern). Thus, we minimize the number of aggregates, reducing both time and space complexity compared to the state-of-the-art approaches. Third, we share intermediate aggregates among multiple event sequence queries while avoiding the expensive construction of matched event sequences. 
In several comprehensive experimental studies, we demonstrate the superiority of the proposed strategies over the state-of-the-art techniques with respect to latency, throughput, and memory costs. algorithms Complex Event Processing Data streaming query optimization
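The claims that the number of trends can grow exponentially, and that aggregates can be computed without constructing the trends, can be illustrated with a small sketch. It is not the dissertation's algorithm. It assumes a single Kleene pattern A+ under which every non-empty subsequence of type-A events, taken in stream order, counts as one trend; the function names are hypothetical.

```python
from itertools import combinations


def count_trends_online(events, event_type="A"):
    """Count trends of the assumed pattern A+ in one pass, keeping a single counter.

    Each new matching event extends every existing trend and also starts a new one,
    so the counter is updated as count = 2 * count + 1.
    """
    count = 0
    for event in events:
        if event == event_type:
            count = 2 * count + 1
    return count


def count_trends_bruteforce(events, event_type="A"):
    """Count the same trends by constructing every one of them (exponential work)."""
    positions = [i for i, event in enumerate(events) if event == event_type]
    return sum(
        1
        for k in range(1, len(positions) + 1)
        for _ in combinations(positions, k)
    )


stream = list("ABAABA")  # hypothetical stream with four events of type A
print(count_trends_online(stream))      # 15
print(count_trends_bruteforce(stream))  # 15
```

With n matching events there are 2^n - 1 trends under this assumption, yet the online counter needs constant memory and one update per event.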
140	State-Slice: A New Stream Query Optimization Paradigm for Multi-query and Distributed Processing Wang, Song 25 March 2008 Modern stream applications necessitate the handling of large numbers of continuous queries specified over high volume data streams. This dissertation proposes novel solutions to continuous query optimization in three core areas of stream query processing, namely state-slice based multiple continuous query sharing, ring-based multi-way join query distribution and scalable distributed multi-query optimization. The first part of the dissertation proposes efficient optimization strategies that utilize the novel state-slicing concept to achieve maximum memory and computation sharing for stream join queries with window constraints. Extensive analytical and experimental evaluations demonstrate that our proposed strategies are capable of minimizing the memory or CPU consumption for multiple join queries. The second part of this dissertation proposes a novel scheme for the distributed execution of generic multi-way joins with window constraints. The proposed scheme partitions the states into disjoint slices in the time domain, and then distributes the fine-grained states in the cluster, forming a virtual computation ring. New challenges in supporting this distributed state-slicing processing are addressed by numerous new techniques. The extensive experimental evaluations show that the proposed strategies achieve significant performance improvements in terms of response time and memory usage for a wide range of configurations and workloads on a real system. Ring-based distributed stream query processing and multi-query sharing are both based on the state-slice concept. The third part of this dissertation combines the first two parts of this dissertation work and proposes a novel distributed multi-query optimization technique. database optimization stream query processing Querying (Computer science)
