61 |
Efficiently Approximating Query Optimizer Diagrams. Dey, Atreyee, 08 1900.
Modern database systems use a query optimizer to identify the most efficient strategy, called the "query execution plan", for executing declarative SQL queries. The role of the query optimizer is especially critical for the complex decision-support queries featured in current data warehousing and data mining applications.
Given an SQL query template that is parametrized on the selectivities of the participating base relations, and a choice of query optimizer, a plan diagram is a color-coded pictorial enumeration of the optimizer's execution plan choices over the query parameter space. Complementary to plan diagrams are cost and cardinality diagrams, which graphically plot the estimated execution costs and cardinalities, respectively, over the same parameter space. These diagrams are collectively known as optimizer diagrams. Optimizer diagrams have proved to be a powerful tool for the analysis and redesign of modern optimizers, and are gaining interest in diverse industrial and academic institutions. However, their utility is adversely impacted by the impractically large computational overheads incurred when standard brute-force approaches are used to produce fine-grained diagrams for high-dimensional query templates.
In this thesis, we investigate strategies for efficiently producing close approximations to complex optimizer diagrams. Our techniques are customized for different classes of optimizers: generic Class I optimizers, which provide only the optimal plan for a query; Class II optimizers, which additionally support costing of sub-optimal plans; and Class III optimizers, which offer enumerated, rank-ordered lists of plans on top of both the former features.
For approximating plan diagrams for Class I optimizers, we first present database-oblivious techniques based on classical random sampling in conjunction with a nearest-neighbor (NN) inference scheme. Next, we propose grid-sampling algorithms that incorporate database-specific knowledge, such as (a) the structural differences between the operator trees of plans at the grid locations, and (b) the parametric query optimization principle. These algorithms become more efficient when modified to exploit the sub-optimal-plan costing feature available with Class II optimizers. The final algorithm, developed for Class III optimizers, assumes plan cost monotonicity and utilizes the rank-ordered lists of plans to efficiently generate completely accurate optimizer diagrams. Subsequently, we provide a relaxed variant that trades approximation quality for a reduction in diagram-generation overhead. Our proposed algorithms are capable of terminating according to a user-given error bound on the plan diagram approximation.
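As a rough illustration of the database-oblivious approach, the sketch below samples a fraction of a 2-D selectivity grid, invokes the optimizer only at the sampled points, and labels every other point with its nearest sampled neighbor's plan. The function `get_optimal_plan`, the grid resolution, and the sampling fraction are placeholders, not the thesis's actual interface.

```python
import math
import random

def approximate_plan_diagram(get_optimal_plan, resolution=100,
                             sample_frac=0.1):
    """Approximate a 2-D plan diagram: invoke the optimizer only at a
    random sample of grid points, then label every non-sampled point
    with the plan of its nearest sampled neighbor (NN inference)."""
    grid = [(x / resolution, y / resolution)
            for x in range(resolution) for y in range(resolution)]
    sampled = random.sample(grid, int(len(grid) * sample_frac))
    plans = {pt: get_optimal_plan(*pt) for pt in sampled}  # costly calls
    diagram = {}
    for pt in grid:
        if pt in plans:
            diagram[pt] = plans[pt]
        else:
            nearest = min(sampled, key=lambda s: math.dist(s, pt))
            diagram[pt] = plans[nearest]   # inherit the neighbor's plan
    return diagram
```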
For approximating cost diagrams, our strategy is based on linear least-squares regression, performed on a mathematical model of plan cost behavior over the parameter space, in conjunction with interpolation techniques. Game-theoretic and linear programming approaches are employed to further reduce the error in the cost approximation.
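A minimal sketch of the cost-diagram idea, assuming a simple bilinear cost model over two selectivity dimensions; the thesis's actual model of plan cost behavior may be richer.

```python
import numpy as np

def fit_cost_model(samples):
    """Fit cost(s1, s2) ~ a + b*s1 + c*s2 + d*s1*s2 by linear least
    squares over sampled (s1, s2, cost) triples, and return a predictor
    usable for interpolation at non-sampled selectivity points."""
    s1, s2, cost = (np.array([t[i] for t in samples]) for i in range(3))
    A = np.column_stack([np.ones_like(s1), s1, s2, s1 * s2])
    coeffs, *_ = np.linalg.lstsq(A, cost, rcond=None)
    return lambda x, y: coeffs @ np.array([1.0, x, y, x * y])
```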
For approximating cardinality diagrams, we propose a novel parametrized mathematical model, a function of the selectivities, to characterize query cardinality behavior. The complete cardinality model is constructed by clustering the data points according to their cardinality values and then fitting the model separately for each cluster through linear least-squares regression. For non-sampled data points, the cardinality is estimated by first determining the cluster they belong to and then interpolating the cardinality value with that cluster's model.
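A sketch of that construction under strong simplifying assumptions: cardinality values are clustered with 1-D k-means, each cluster gets its own least-squares linear model, and a non-sampled point is assigned the cluster of its nearest sampled neighbor. The model form and cluster count are illustrative, not the thesis's actual parametrization.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_cardinality_model(samples, n_clusters=3):
    """Cluster sampled (s1, s2, card) points by cardinality value, then
    fit a separate linear model card ~ a + b*s1 + c*s2 per cluster."""
    pts = np.array([(s1, s2) for s1, s2, _ in samples])
    card = np.array([c for _, _, c in samples])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        card.reshape(-1, 1))
    models = {}
    for k in range(n_clusters):
        mask = labels == k
        A = np.column_stack([np.ones(mask.sum()), pts[mask]])
        models[k], *_ = np.linalg.lstsq(A, card[mask], rcond=None)

    def predict(s1, s2):
        # Assign the query point to the cluster of its nearest sampled
        # neighbor, then interpolate with that cluster's model.
        nearest = np.argmin(((pts - (s1, s2)) ** 2).sum(axis=1))
        a, b, c = models[labels[nearest]]
        return a + b * s1 + c * s2

    return predict
```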
Extensive experimentation with a representative set of TPC-H and TPC-DS-based query templates on industrial-strength optimizers indicates that our techniques are capable of delivering 90%-accurate optimizer diagrams while incurring no more than 20% of the computational overheads of the exhaustive approach. In fact, for full-featured optimizers, we can guarantee zero-error optimizer diagrams, usually at less than 10% of the overheads. Our results show that (a) the approximation is materially faithful to the features of the exact optimizer diagram, with the errors thinly spread across the picture and largely confined to the plan transition boundaries, and (b) the cost increase at non-sampled points due to the assignment of sub-optimal plans is also limited.
These approximation techniques have been implemented in the publicly available Picasso optimizer visualizer tool. We have also modified PostgreSQL's optimizer to incorporate costing of sub-optimal plans and enumeration of rank-ordered lists of plans. In addition, we have designed estimators for predicting the time overhead involved in approximating optimizer diagrams with regard to user-given error bounds.
In summary, this thesis demonstrates that accurate approximations to exact optimizer diagrams can indeed be obtained cheaply and consistently, with typical overheads being an order of magnitude lower than the brute-force approach. We hope that our results will encourage database vendors to incorporate the foreign-plan-costing and plan-rank-list features in their optimizer APIs.
|
62 |
Improving Query Classification by Features’ Weight Learning. Abghari, Arash, January 2013.
This work is an attempt to enhance query classification in call-routing applications. A new method is introduced that learns feature weights from training data by means of a regression model. The work investigates the method with tf-idf weighting, but the approach is not limited to a specific method and can be used with any weighting scheme. Empirical evaluations with several classifiers, including Support Vector Machines (SVM), Maximum Entropy, Naive Bayes, and k-Nearest Neighbor (k-NN), show substantial improvement in both macro and micro F1 measures.
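The abstract does not spell out the regression model, but one plausible reading can be sketched as follows: regress the routing labels on tf-idf features and reuse each term's coefficient magnitude as its learned weight. All API choices here (scikit-learn's `TfidfVectorizer`, ridge regression, binary labels) are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

def learn_feature_weights(queries, labels):
    """Learn per-term weights from labeled routing queries: regress the
    (numeric, e.g. binary) routing labels on tf-idf features and reuse
    each term's coefficient magnitude as its learned weight."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(queries)            # tf-idf feature matrix
    reg = Ridge(alpha=1.0).fit(X, labels)     # regression-based weights
    weights = np.abs(reg.coef_)
    # Reweight the tf-idf matrix before handing it to any downstream
    # classifier (SVM, MaxEnt, Naive Bayes, k-NN, ...).
    return X.multiply(weights.reshape(1, -1)).tocsr(), vec, weights
```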
|
63 |
Example-Based Query Generation for Spontaneous Speech. MURAO, Hiroya; KAWAGUCHI, Nobuo; MATSUBARA, Shigeki; INAGAKI, Yasuyoshi, 02 1900.
No description available.
|
64 |
Deciding Second-order Logics using Database Evaluation Techniques. Unel, Gulay, January 2008.
We outline a novel technique that maps the satisfiability problems of second-order logics, in particular WSnS (weak monadic second-order logic with n successors), S1S (monadic second-order logic with one successor), and the μ-calculus, to the problem of query evaluation for Complex-value Datalog queries. In this dissertation, we propose techniques that use database evaluation and optimization techniques in automata-based decision procedures for the above logics. We show how the use of advanced implementation techniques for deductive databases and logic programs, in particular the use of tabling, yields a considerable improvement in performance over more traditional approaches. We also explore various optimizations of the proposed technique, in particular variants of tabling and goal reordering. We then show that the decision problem for S1S can be mapped to the problem of query evaluation of Complex-value Datalog queries, and we explore optimizations that can be applied to various types of formulas. Last, we propose analogous techniques that allow us to approach the μ-calculus satisfiability problem in an incremental fashion and without the need for re-computation. In addition, we outline a top-down evaluation technique to drive our incremental procedure and propose heuristics that guide the problem partitioning to reduce the size of the problems that need to be solved.
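To make the role of tabling concrete, here is a minimal, self-contained sketch (not the dissertation's actual WSnS encoding): semi-naive evaluation of a recursive Datalog reachability query, where derived facts are tabled once and each round joins only the newly derived facts, avoiding recomputation.

```python
def transitive_closure(edges):
    """Semi-naive evaluation of  reach(X,Y) :- edge(X,Y).
                                 reach(X,Y) :- reach(X,Z), edge(Z,Y).
    The 'table' of derived facts is never recomputed: each round joins
    only the newly derived delta against the edge relation."""
    table = set(edges)          # all facts derived so far (the table)
    delta = set(edges)          # facts new in the last round
    by_src = {}
    for x, y in edges:
        by_src.setdefault(x, set()).add(y)
    while delta:
        new = {(x, z)
               for (x, y) in delta
               for z in by_src.get(y, ())} - table
        table |= new
        delta = new
    return table

# Reachability over a small chain graph:
print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```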
|
66 |
Grenzen der visuellen Query-Konstruktion mittels Faceted Browsing [Limits of Visual Query Construction with Faceted Browsing]. Koßlitz, Marleen, 14 May 2012.
To search and filter a set of data for specific information, search engines and database systems use queries. These queries are often defined by a dedicated language that allows complex expressions to be formed. The systems answer a query in the form of a result set, and complex queries make it possible to find precise results.
Faceted browsing is a user-interface paradigm for searching and filtering data. Queries are constructed visually and refined step by step, without the user needing to know the underlying query language. The simple and intuitive usability of the interface is its recipe for success, which is why faceted browsing is used in many applications, for example in online shops.
To date, such systems are mostly designed so that only queries consisting of conjunctions of disjunctions can be formed. This raises the question of whether more complex queries can also be constructed by means of faceted browsing, and which changes to the interface are necessary for this. Do the changes go so far that the simplicity of the interface must be sacrificed in favor of query complexity, or are there ways to form more complex queries while preserving the simplicity of the interface?
The goal of this thesis is to determine how complex the queries formed by means of faceted browsing can be without losing the simple usability of the facet browser interface. To this end, the current expressive power of facet browser interfaces with respect to query construction is analyzed. Furthermore, more complex queries are examined for their realizability with the help of faceted browsing, and it is considered how current facet browser interfaces would have to change in order to enable the visual construction of such queries. By prototypically extending an existing facet browser with the necessary interface elements, it should become possible to form more complex queries than faceted browsing has allowed so far.
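As a concrete point of reference for the conjunction-of-disjunctions baseline discussed above, here is a minimal sketch of how classic faceted selections could translate into a parameterized SQL query; the table name, facet columns, and values are invented.

```python
def facets_to_sql(table, selections):
    """Build the conjunction-of-disjunctions query that classic faceted
    browsing produces: values inside one facet are OR-ed (IN list), the
    facets themselves are AND-ed. Parameterized to avoid injection."""
    clauses, params = [], []
    for column, values in selections.items():
        placeholders = ", ".join("?" for _ in values)
        clauses.append(f"{column} IN ({placeholders})")  # disjunction
        params.extend(values)
    where = " AND ".join(clauses) or "1=1"               # conjunction
    return f"SELECT * FROM {table} WHERE {where}", params

sql, params = facets_to_sql("products",
                            {"color": ["red", "blue"], "size": ["M"]})
# SELECT * FROM products WHERE color IN (?, ?) AND size IN (?)
```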
|
67 |
A Keyword-Based Association Rule Mining Method for Personal Document Query. Tseng, Chien-Ming, 29 August 2003.
Because of the flourishing growth of the Internet and IT, we are surrounded by far too much information today. We have limited attention but unlimited information, so almost everyone now faces a new problem: information overload. Our precious resource, attention, is not sufficient to digest all the information we encounter. This problem also exists in literature digital libraries.
Today, a literature digital library may hold over one million papers and documents, so users need a good search or recommendation mechanism. Traditional mechanisms are not good enough: their search results require users to spend extra effort selecting what actually matches their true requirements.
This study therefore proposes a new personal document recommendation mechanism to address the problem. The mechanism uses a keyword-based association rule mining method to find association rules between documents. According to these rules and a user's historical preferences, it recommends the documents users really want.
Our evaluations show that the proposed mechanism indeed alleviates part of the information overload problem and performs well on both the precision and recall measures.
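The abstract leaves the mining details open; a minimal Apriori-style sketch of pairwise association rules (e.g., over the keyword sets of documents a user accessed together) with support and confidence thresholds might look like this. The thresholds and transaction model are assumptions.

```python
from collections import Counter
from itertools import combinations

def mine_pair_rules(transactions, min_support=0.1, min_confidence=0.6):
    """Mine pairwise association rules A -> B from transactions, where
    each transaction is a set of items (keywords or document ids)."""
    n = len(transactions)
    item_count = Counter(i for t in transactions for i in set(t))
    pair_count = Counter(p for t in transactions
                         for p in combinations(sorted(set(t)), 2))
    rules = []
    for (a, b), cnt in pair_count.items():
        if cnt / n < min_support:
            continue
        for head, body in ((a, b), (b, a)):   # try both directions
            conf = cnt / item_count[head]
            if conf >= min_confidence:
                rules.append((head, body, cnt / n, conf))
    return rules  # (antecedent, consequent, support, confidence)
```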
|
68 |
An NA-tree Approach to Answering the Spatial Exact Match Query in P2P Systems. Wang, Ching-i, 17 July 2006.
Spatial data occurs in several important and diverse P2P applications, for example, P2P virtual cities, GIS, and development planning. For answering exact-match queries over spatial region data in a P2P environment, an R-tree-based structure might seem a good choice. However, since a peer system is dynamic, the global updates that data insertion/deletion triggers in an R-tree do not work well in a P2P system. Moreover, the overlap problem in an R-tree results in a large number of disk accesses (which translate into a large number of messages in P2P systems). A P2PR-tree-based indexing method has therefore been proposed that requires only local updates to the index structure when data is inserted or deleted. Although the P2PR-tree achieves the goal of local updates for data insertion/deletion, the overlapping phenomenon remains hard to solve.
Recently, the NA-tree has been proposed for region data access; it outperforms R-tree-like data structures and does not suffer from the overlaps that may occur in an R-tree. Basically, an NA-tree does not split the spatial space; it merely classifies the spatial data objects by a set of rules. On the other hand, the Chord system is a well-known structured P2P system in which data search is performed by a hash function, instead of the flooding used in most unstructured P2P systems. Since Chord is a hash-based approach, it easily handles data insertion/deletion with only local updates. However, the current Chord system cannot handle region data well, since it works only with a single key value.
Therefore, in this thesis, we propose to apply an NA-tree within the Chord system to encode spatial region data into the data keys used by the hash function for data search. That is, we still use one Chord hash function to assign nodes to peers, and another hash function for data assignment, applying an NA-tree to encode the spatial region data into data keys. First, we use three bits to represent the first eight children of the NA-tree. Next, we propose two methods to generate the value of the remaining bits. The first method pads the remaining bits with 0's; it is simple and applicable when there are few objects in the P2P system. To avoid the case where one peer owns too many objects, the second method takes the central points of regions into consideration and is applicable when there are many objects in the P2P system. Finally, we concatenate the first three bits with the remaining bits to obtain the key values of objects. In this way, we combine the NA-tree with the Chord system to solve the overlapping problem that the P2PR-tree cannot handle. In our simulation study, we use six different data distributions to compare our method with the P2PR-tree; the results show that the number of visited peers in our approach is less than that in the P2PR-tree.
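The abstract gives the bit layout but not the NA-tree's classification rules, so the sketch below substitutes an invented 8-way classification; only the overall shape follows the abstract: three leading child bits, center-point bits in the spirit of the second method, and consistent hashing onto a Chord ring whose size M is also an assumption.

```python
import hashlib

M = 16  # bits of the Chord identifier space -- an assumption

def region_key(region, key_bits=12):
    """Encode a region (x1, y1, x2, y2) in the unit square as a key:
    3 leading bits for a (hypothetical) NA-tree child, remaining bits
    interleaved from the region's center point."""
    x1, y1, x2, y2 = region
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Hypothetical stand-in for the NA-tree's 8-way classification:
    # quadrant of the center plus whether the region spans the midline.
    child = ((cx >= 0.5) << 2) | ((cy >= 0.5) << 1) | int(x1 < 0.5 <= x2)
    half = (key_bits - 3 + 1) // 2
    bx = format(int(cx * (1 << half)), f"0{half}b")[-half:]
    by = format(int(cy * (1 << half)), f"0{half}b")[-half:]
    center_bits = "".join(a + b for a, b in zip(bx, by))
    return int(format(child, "03b") + center_bits[:key_bits - 3], 2)

def chord_position(key):
    """Hash the key onto the Chord ring (consistent hashing)."""
    digest = hashlib.sha1(str(key).encode()).digest()
    return int.from_bytes(digest, "big") % (1 << M)
```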
|
69 |
Adaptive scheduling algorithm selection in a streaming query system. Pielech, Bradford Charles, January 2003.
Thesis (M.S.)--Worcester Polytechnic Institute. / Keywords: streaming query; query processing; database. Includes bibliographical references (p. 57-59).
|
70 |
Evaluating multi-way joins over discounted hitting time. Zhang, Wangda (张望达), January 2013.
The prevalence of graphs in emerging applications has recently attracted considerable research interest. To extract interesting information hidden in large graphs, tasks including link prediction, collaborative recommendation, and reputation ranking all make use of proximities between graph nodes. The discounted hitting time (DHT), a random-walk similarity measure for pairs of graph nodes, has been shown to be useful in various applications. In this thesis, we examine a novel query, called the multi-way join (or n-way join), over DHT scores. Given a graph and n sets of nodes, the n-way join retrieves a ranked list of the n-tuples with the k highest scores, according to some aggregation function of DHT values. By extracting such top-k results, this query enables the analysis and prediction of various complex relationships among n sets of nodes on a large graph.
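As a point of reference, one common discounted-hitting-time formulation scores a pair by the expected discount factor raised to the walk's hitting time; the Monte Carlo sketch below estimates that quantity and is an illustration, not necessarily the thesis's exact definition. The discount factor, walk count, and length cap are assumptions.

```python
import random

def estimate_dht(adj, source, target, gamma=0.9, walks=1000, max_len=50):
    """Monte Carlo estimate of a discounted-hitting-time similarity:
    the expected value of gamma**T, where T is the number of steps a
    random walk from `source` takes to first reach `target` (counting
    0 for walks that never arrive within max_len steps)."""
    total = 0.0
    for _ in range(walks):
        node, discount = source, 1.0
        for _ in range(max_len):
            node = random.choice(adj[node])   # uniform random step
            discount *= gamma
            if node == target:
                total += discount
                break
    return total / walks  # higher score = closer in random-walk terms

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(estimate_dht(graph, 0, 3))
```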
Since an n-way join is expensive to evaluate, we develop the Partial Join algorithm (PJ). This solution decomposes an n-way join into a number of top-m 2-way joins and combines their results to construct the answer to the n-way join. Since the process of PJ may necessitate the computation of top-(m + 1) 2-way joins, we study an incremental solution, which avoids recomputation by deriving the top-(m + 1) 2-way join results quickly from the top-m results computed earlier. For better performance, we further examine efficient processing algorithms and pruning techniques for 2-way joins. Through extensive experiments on three real graph datasets, we show that the proposed PJ algorithm accurately evaluates n-way joins and is four orders of magnitude faster than basic solutions.
|