31

Efficient Matrix-aware Relational Query Processing in Big Data Systems

Yongyang Yu (5930462) 03 January 2019 (has links)
In the big data era, the use of large-scale machine learning methods is becoming ubiquitous in data exploration tasks ranging from business intelligence and bioinformatics to self-driving cars. In these domains, queries are composed of various kinds of operators, such as relational operators for preprocessing input data and machine learning models for complex analysis. These learning methods typically rely heavily on matrix computations. As a result, it is imperative to develop novel query processing approaches and systems that are aware of big matrix data and the corresponding operators, scale to clusters of hundreds of machines, and leverage distributed memory for high-performance computation. This dissertation introduces several matrix-aware relational query processing strategies and analyzes and optimizes their performance.

The first contribution of this dissertation is MatFast, a matrix computation system for efficiently processing and optimizing matrix-only queries in a distributed in-memory environment. We introduce a set of heuristic rules to rewrite special features of a matrix query for a smaller memory footprint, cost models to estimate the sparsity of sparse matrix multiplications, and a strategy to distribute matrix data partitions among compute workers for communication-efficient execution. We implement and test the query processing strategies in an open-source distributed dataflow engine (Apache Spark).

In the second contribution of this dissertation, we extend MatFast to MatRel, where we study how to efficiently process queries that involve both matrix and relational operators. We identify a series of equivalent transformation rules to rewrite a logical plan when both relational and matrix operations are present. We introduce selection, projection, aggregation, and join operators over matrix data, and propose optimizations to reduce computation overhead. We also design a cost model to distribute matrix data among compute workers for communication-efficient evaluation of relational join operations.

In the third and last contribution of this dissertation, we demonstrate how to leverage MatRel for optimizing complex matrix-aware relational query evaluation pipelines. In particular, we showcase how to efficiently learn model parameters for deep neural networks in various applications with MatRel, e.g., Word2Vec.
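A central piece of such cost models is predicting how dense the result of a sparse matrix multiplication will be before executing it. As a rough illustration of the idea (a simplified textbook estimator, not MatFast's actual model), the sketch below estimates the output density of C = A × B under the common assumption that nonzeros are independently and uniformly distributed; the function name and the test setup are illustrative.

```python
from scipy import sparse

def estimate_product_density(density_a, density_b, inner_dim):
    """Estimate the nonzero density of C = A @ B for A (m x k), B (k x n).

    C[i, j] is zero only if all k products A[i, t] * B[t, j] vanish, so under
    independence each entry is nonzero with probability 1 - (1 - dA*dB)^k.
    """
    return 1.0 - (1.0 - density_a * density_b) ** inner_dim

# Compare the estimate against an actual sparse product.
m, k, n = 1000, 800, 1200
a = sparse.random(m, k, density=0.01, format="csr", random_state=7)
b = sparse.random(k, n, density=0.02, format="csr", random_state=11)

predicted = estimate_product_density(0.01, 0.02, k)
actual = (a @ b).nnz / (m * n)
print(f"predicted density: {predicted:.4f}, actual: {actual:.4f}")
```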
32

Equivalence of Queries with Nested Aggregation

DeHaan, David January 2009 (has links)
Query equivalence is a fundamental problem within database theory. The correctness of all forms of logical query rewriting—join minimization, view flattening, rewriting over materialized views, various semantic optimizations that exploit schema dependencies, federated query processing and other forms of data integration—requires proving that the final executed query is equivalent to the original user query. Hence, advances in the theory of query equivalence enable advances in query processing and optimization. In this thesis we address the problem of deciding query equivalence between conjunctive SQL queries containing aggregation operators that may be nested. Our focus is on understanding the interaction between nested aggregation operators and the other parts of the query body, and so we model aggregation functions simply as abstract collection constructors. Hence, the precise language that we study is a conjunctive algebraic language that constructs complex objects from databases of flat relations. Using an encoding of complex objects as flat relations, we reduce the query equivalence problem for this algebraic language to deciding equivalence between relational encodings output by traditional conjunctive queries (not containing aggregation). This encoding-equivalence cleanly unifies and generalizes previous results for deciding equivalence of conjunctive queries evaluated under various processing semantics. As part of our study of aggregation operators that can construct empty sub-collections—so-called “scalar” aggregation—we consider query equivalence for conjunctive queries extended with a left outer join operator, a very practical class of queries for which the general equivalence problem has never before been analyzed. Although we do not completely solve the equivalence problem for queries with outer joins or with scalar aggregation, we do propose useful sufficient conditions that generalize previously known results for restricted classes of queries. Overall, this thesis offers new insight into the fundamental principles governing the behaviour of nested aggregation.
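The classical machinery this thesis generalizes is the Chandra-Merlin homomorphism test: a conjunctive query Q1 is contained in Q2 exactly when there is a homomorphism from Q2 to Q1 that fixes the head, and equivalence is containment in both directions. The brute-force sketch below implements that test for set semantics; it is an illustrative baseline, not the thesis's encoding-equivalence procedure, and the query representation is an assumption.

```python
from itertools import product

# A conjunctive query is (head variables, body atoms); an atom is
# (relation name, term, term, ...). Lowercase strings are variables.

def is_var(term):
    return isinstance(term, str) and term.islower()

def terms(query):
    head, body = query
    seen = set(head)
    for atom in body:
        seen.update(atom[1:])
    return seen

def contained_in(q1, q2):
    """True iff q1 is contained in q2 under set semantics, i.e. some
    homomorphism maps q2's atoms onto q1's atoms and q2's head onto q1's."""
    head1, body1 = q1
    head2, body2 = q2
    q2_vars = sorted(t for t in terms(q2) if is_var(t))
    targets = sorted(terms(q1))
    for image in product(targets, repeat=len(q2_vars)):
        h = dict(zip(q2_vars, image))
        def apply(t):
            return h.get(t, t)
        if tuple(map(apply, head2)) != head1:
            continue
        if all((atom[0],) + tuple(map(apply, atom[1:])) in body1
               for atom in body2):
            return True
    return False

def equivalent(q1, q2):
    return contained_in(q1, q2) and contained_in(q2, q1)

# Q1(x) :- R(x,y), R(y,z)  is equivalent to  Q2(x) :- R(x,y), R(y,z), R(x,w):
# the extra atom R(x,w) is redundant because w can be mapped to y.
q1 = (("x",), frozenset({("R", "x", "y"), ("R", "y", "z")}))
q2 = (("x",), frozenset({("R", "x", "y"), ("R", "y", "z"), ("R", "x", "w")}))
print(equivalent(q1, q2))  # True
```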
34

Genetic Algorithms For Distributed Database Design And Distributed Database Query Optimization

Sevinc, Ender 01 October 2009 (has links)
The increasing performance of computers, reduced prices, and the ability to connect systems with low-cost gigabit Ethernet LAN and ATM WAN networks make distributed database systems an attractive research area. However, the complexity of distributed database query optimization is still a limiting factor. Exact techniques such as dynamic programming, used in centralized database query optimization, are not feasible because of the increased problem size. Recently developed genetic algorithm (GA) based optimization techniques present a promising alternative. We compared the best-known GA with a random algorithm and showed that it achieves almost no improvement over a random search generating an equal number of solutions. We then analyzed a set of possible GA parameters and determined that a GA using the two-point truncate technique gives the best results. New mutation and crossover operators defined in our GA are experimentally analyzed on a synthetic distributed database with increasing numbers of relations and nodes. The synthetic database has replicated relations but no horizontal or vertical fragmentation; a select-project-join query involving a fragmented relation with N fragments can be translated into a corresponding query with N relations. The results produced by our new GA formulation are only 20% off the optimal results found by exhaustive search, a 50% improvement over the previously known GA-based algorithm.
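To make the setting concrete, here is a minimal GA for one slice of the problem: assigning each join in a query plan to an execution node so that estimated cost is low. The chromosome encoding, the random cost table, and the parameters are illustrative assumptions, not the operators proposed in the thesis; the two-point crossover and truncation-style selection only mirror the flavor of the techniques named above.

```python
import random

random.seed(42)

NUM_JOINS = 8      # joins in the query plan (genes)
NUM_NODES = 4      # candidate execution sites
# Hypothetical cost of running join j at node n.
COST = [[random.randint(1, 100) for _ in range(NUM_NODES)]
        for _ in range(NUM_JOINS)]

def fitness(chrom):
    """Lower total cost is better."""
    return sum(COST[j][node] for j, node in enumerate(chrom))

def two_point_crossover(p1, p2):
    a, b = sorted(random.sample(range(NUM_JOINS), 2))
    return p1[:a] + p2[a:b] + p1[b:]

def mutate(chrom, rate=0.1):
    return [random.randrange(NUM_NODES) if random.random() < rate else g
            for g in chrom]

def evolve(pop_size=40, generations=100, elite=10):
    pop = [[random.randrange(NUM_NODES) for _ in range(NUM_JOINS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)          # truncation-style selection:
        parents = pop[:elite]          # keep only the best individuals
        children = [mutate(two_point_crossover(random.choice(parents),
                                               random.choice(parents)))
                    for _ in range(pop_size - elite)]
        pop = parents + children
    return min(pop, key=fitness)

best = evolve()
print("assignment:", best, "cost:", fitness(best))
```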
35

Querying Data Providing Web Services

Sabesan, Manivasakan January 2010 (has links)
Web services are often used for search computing, where data is retrieved from servers providing information of different kinds. Such data providing web services return a set of objects for a given set of parameters, without any side effects. There is a need for general and scalable search capabilities over data from data providing web services, which is the topic of this thesis. The Web Service MEDiator (WSMED) system automatically provides relational views of any data providing web service operations by reading the WSDL documents describing them. These views can be queried with SQL. Without any knowledge of the costs of executing specific web service operations, the WSMED query processor automatically and adaptively finds an optimized parallel execution plan calling the queried data providing web services. For scalable execution of such queries, an algebra operator, PAP, adaptively parallelizes calls to web service operations in execution plans until no significant performance improvement is measured, based on monitoring the data flow from the web service operations, without any cost knowledge or extensive memory usage. To comply with the Everything as a Service (XaaS) paradigm, WSMED is itself implemented as a web service that provides operations to query and combine data from data providing web services. A web-based demonstration of the WSMED web service supports general SQL queries to any data providing web service operations from a browser. WSMED assumes that all queried data sources are available as web services. To make any data providing system into a data providing web service, WSMED includes a subsystem, the web service generator, which generates and deploys the web service operations needed to access a data source. The WSMED web service itself is generated by the web service generator.
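The heart of the PAP operator, increasing the fan-out of parallel calls until measured throughput stops improving, can be sketched without any cost model. The stub service, timing harness, and stopping rule below are illustrative assumptions rather than WSMED's actual operator.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_service(param):
    """Stand-in for a data providing web service call (~20 ms latency)."""
    time.sleep(0.02)
    return {"param": param, "rows": [param * 2]}

def throughput(fanout, params):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=fanout) as pool:
        list(pool.map(call_service, params))
    return len(params) / (time.perf_counter() - start)

def adaptive_fanout(params, improvement=1.10, max_fanout=64):
    """Double the fan-out while throughput improves by at least 10%."""
    fanout, best = 1, throughput(1, params)
    while fanout * 2 <= max_fanout:
        candidate = throughput(fanout * 2, params)
        if candidate < best * improvement:
            break  # no significant improvement: keep the current fan-out
        fanout, best = fanout * 2, candidate
    return fanout, best

fanout, calls_per_sec = adaptive_fanout(list(range(64)))
print(f"chosen fan-out: {fanout}, throughput: {calls_per_sec:.1f} calls/s")
```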
36

Privacy Preserving Data Mining using Unrealized Data Sets: Scope Expansion and Data Compression

Fong, Pui Kuen 16 May 2013 (has links)
In previous research, the author developed a novel PPDM method, Data Unrealization, that preserves both the privacy and the utility of discrete-valued training samples. That method transforms original samples into unrealized ones and guarantees 100% accurate decision tree mining results. This dissertation extends that research scope with the following accomplishments: (1) it expands the application of Data Unrealization to other data mining algorithms, (2) it introduces data compression methods that reduce storage requirements for unrealized training samples and increase data mining performance, and (3) it adds a second-level privacy protection that works perfectly with Data Unrealization. From an application perspective, this dissertation proves that statistical information (i.e., counts, probabilities, and information entropy) can be retrieved precisely from unrealized training samples, so that Data Unrealization is applicable to all counting-based, probability-based, and entropy-based data mining models with 100% accuracy. For data compression, this dissertation introduces a new number sequence, the J-Sequence, as a means to compress training samples through the J-Sampling process. J-Sampling converts the samples into a list of numbers with many replications. Applying run-length encoding to the resulting list can further compress the samples into constant storage space regardless of the sample size. In this way, the storage requirement of the sample database becomes O(1), as does the time complexity of a statistical database query. J-Sampling also serves as an encryption approach for the unrealized samples already protected by Data Unrealization; meanwhile, data mining can be performed on these samples without decryption. To retain privacy preservation and to handle data compression internally, a column-oriented database management system is recommended for storing the encrypted samples.
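The compression step rests on a standard building block: run-length encoding of a list that contains long runs of repeated values. A minimal sketch follows; the J-Sampling transform itself is not reproduced here, and the sample data is an illustrative assumption.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse runs of equal values into (value, count) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

def run_length_decode(pairs):
    return [v for v, count in pairs for _ in range(count)]

# A list with heavy replication, as J-Sampling is described to produce.
samples = [3] * 5000 + [7] * 12000 + [3] * 8000
encoded = run_length_encode(samples)
print(encoded)                                  # [(3, 5000), (7, 12000), (3, 8000)]
print(run_length_decode(encoded) == samples)    # True
print(f"{len(samples)} values -> {len(encoded)} pairs")
```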
37

Hyperbolic Programming in 0-1 Variables and Bibliographic Database Search Optimization

MARCUS VINICIUS SOLEDADE POGGI DE ARAGAO 31 August 2009 (has links)
In this work we study the solution of problems arising in the field of query optimization for information retrieval from bibliographic databases, through their formulation as mathematical programming problems in 0-1 variables. The first problem studied is the hyperbolic programming problem in 0-1 variables, for which we developed exact linear-time algorithms. The second problem is an extension of the former, here named the hyperbolic sum problem. For this problem we developed simulated annealing and steepest ascent-mildest descent (tabu search) heuristics, as well as exact branch-and-bound algorithms. All these methods were implemented, and numerical results are presented. Concerning query optimization, two basic problems were studied: periodical queries and the synthesis of new queries, which are formulated as a hyperbolic programming problem and a hyperbolic sum problem, respectively. Applications were carried out on a database of the Center of Nuclear Information of CNEN (the Brazilian National Commission of Nuclear Energy).
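For the unconstrained single-ratio case, the classical Dinkelbach scheme solves a 0-1 hyperbolic program max (a0 + Σ ai·xi) / (b0 + Σ bi·xi) with positive denominator by repeatedly maximizing the linearized objective Σ (ai − λ·bi)·xi, which is done exactly by setting xi = 1 whenever ai − λ·bi > 0. The sketch below follows that textbook scheme; it is not the linear-time algorithm developed in the thesis, and the sample coefficients are made up.

```python
def solve_hyperbolic_01(a0, a, b0, b, tol=1e-12):
    """Maximize (a0 + sum a[i]*x[i]) / (b0 + sum b[i]*x[i]) over x in {0,1}^n
    using Dinkelbach iterations; assumes the denominator is always positive."""
    def ratio(x):
        num = a0 + sum(ai * xi for ai, xi in zip(a, x))
        den = b0 + sum(bi * xi for bi, xi in zip(b, x))
        return num / den

    x = [0] * len(a)
    lam = ratio(x)
    while True:
        # Linearized subproblem: maximize sum (a[i] - lam*b[i]) * x[i],
        # solved exactly by picking every index with a positive coefficient.
        x = [1 if ai - lam * bi > 0 else 0 for ai, bi in zip(a, b)]
        new_lam = ratio(x)
        if new_lam <= lam + tol:  # the ratio no longer improves: optimal
            return x, new_lam
        lam = new_lam

x, value = solve_hyperbolic_01(a0=1.0, a=[4.0, -2.0, 3.0, 1.0],
                               b0=2.0, b=[1.0, 2.0, 5.0, 0.5])
print("x =", x, "objective =", round(value, 4))
```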
38

Experimental Study of Conjunctive Query Optimization with Expensive Predicates

RODRIGO SILVA GUARINO 12 July 2004 (has links)
Traditional database query optimization techniques rest on a fundamental heuristic that organizes the predicates of a query into two main types: selection predicates and join predicates. Join predicates are considered much more expensive than selection predicates. In addition, the costs of different selection predicates are assumed not to differ significantly, so the optimizer executes them first, in any order, to reduce the number of tuples the join predicates must process. This assumption, which holds well for most conventional database applications, breaks down for newer applications that execute complex functions over complex data in selection predicates. In these cases, a selection predicate can be more expensive than a join predicate, and the costs of selection predicates can no longer be treated as equivalent. This invalidates the main heuristic of pushing down selections and calls for new optimization techniques for queries containing this new kind of predicate, which came to be called an expensive predicate. This work has two main objectives: to present a framework for the development, testing, and integrated analysis of algorithms for processing expensive predicates, and to analyze the performance of four implementations of algorithms based on the Cherry Picking strategy, which aims at exploiting the data dependencies between input values to expensive predicates. The experiments consider conjunctive (AND) queries, and the general idea is to evaluate the attributes in an order that minimizes the overall cost of evaluating the tuples.
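A standard result frames the ordering problem: for independent predicates in a conjunction, evaluating them in ascending order of rank = (selectivity − 1) / cost-per-tuple minimizes the expected cost per tuple. The sketch below applies that rule to made-up predicates; the numbers are illustrative assumptions, and Cherry Picking's exploitation of data dependencies is not modeled here.

```python
from dataclasses import dataclass

@dataclass
class Predicate:
    name: str
    cost: float         # cost of evaluating the predicate on one tuple
    selectivity: float  # fraction of tuples that pass

    @property
    def rank(self):
        return (self.selectivity - 1.0) / self.cost

def expected_cost_per_tuple(order):
    """Each predicate only runs on tuples that survived all earlier ones."""
    total, pass_rate = 0.0, 1.0
    for p in order:
        total += pass_rate * p.cost
        pass_rate *= p.selectivity
    return total

preds = [
    Predicate("simple_filter",    cost=1.0,  selectivity=0.99),
    Predicate("image_classifier", cost=10.0, selectivity=0.01),
    Predicate("regex_match",      cost=4.0,  selectivity=0.20),
]

by_rank = sorted(preds, key=lambda p: p.rank)
print("listed order:", round(expected_cost_per_tuple(preds), 3))    # ~10.94
print("rank order:  ", round(expected_cost_per_tuple(by_rank), 3))  # ~6.002
```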
39

Cost models and statistics for similarity searching

Marcos Vinícius Naves Bêdo 10 October 2017 (has links)
Similarity searching is a foundational paradigm for many modern computer applications, such as clustering, classification, and information retrieval. Within this context, the meaning of similarity is related to the distance between objects, which can be formally expressed by the theory of metric spaces. Many studies have focused on the inclusion of similarity searching in Database Management Systems (DBMSs) for (i) enabling similarity comparisons to be combined with the DBMS's identity and order comparisons and (ii) providing scalability for very large databases. As a step further, this thesis proposes extending the DBMS query optimizer and, in particular, two of its modules: the Data Distribution Space and the Cost Model. Although the Data Distribution Space enables representations of stored data, such representations are unsuitable for modeling the behavior of similarity comparisons, which requires extending the module to support distance distributions. Likewise, the Cost Model module must be extended to support cost models that depend on distance distributions. The study rests on five contributions. First, a new synopsis for distance distributions, the Compact-Distance Histogram (CDH), is proposed; it enables radius and selectivity estimation for similarity searching, and an experimental comparison showed the gains of the estimates drawn from CDH over several competitors. Second, a cost model based on the CDH synopsis, called Stockpile, is proposed and shown to produce more accurate estimates than competing models. Omni-Histograms are the third contribution: indexing structures constructed according to histogram partition constraints that enable the optimized execution of queries combining similarity, identity, and order comparisons. The fourth contribution is the RVRM model, which indicates to what extent the estimates obtained from distance-based synopses can be used to optimize similarity queries over high-dimensional datasets, and which identifies intervals of dimensionality where such queries can be executed efficiently. Finally, the thesis proposes the integration of the reviewed synopses and cost models into a single system, with a high-level query syntax, that can be coupled to a DBMS query optimizer.
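The role of a distance distribution synopsis can be illustrated with a plain equi-width histogram over sampled pairwise distances, from which the selectivity of a range query (the expected fraction of objects within radius r of a query object) is read off as a cumulative sum. This is a generic baseline under simple assumptions, not the CDH structure itself.

```python
import math
import random

random.seed(1)

def build_distance_histogram(objects, bins=32, pairs=20000):
    """Equi-width histogram over a sample of pairwise Euclidean distances."""
    dists = [math.dist(random.choice(objects), random.choice(objects))
             for _ in range(pairs)]
    d_max = max(dists)
    counts = [0] * bins
    for d in dists:
        counts[min(int(d / d_max * bins), bins - 1)] += 1
    return counts, d_max

def estimate_selectivity(counts, d_max, radius):
    """Estimated fraction of objects within `radius` of a random query."""
    bins = len(counts)
    full = int(radius / d_max * bins)
    covered = sum(counts[:min(full, bins)])
    if full < bins:  # linear interpolation inside the partially covered bin
        covered += counts[full] * (radius / d_max * bins - full)
    return covered / sum(counts)

# 5000 random points in 8 dimensions as a stand-in dataset.
data = [[random.random() for _ in range(8)] for _ in range(5000)]
counts, d_max = build_distance_histogram(data)
for r in (0.5, 1.0, 1.5):
    sel = estimate_selectivity(counts, d_max, r)
    print(f"radius {r}: estimated selectivity {sel:.4f}")
```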
40

Building High Performance Data Analytics Systems based on Scale-out Models

Huai, Yin 21 May 2015 (has links)
No description available.
