311

Efficient Querying and Analytics of Semantic Web Data / Interrogation et Analyse Efficiente des Données du Web Sémantique

Roatis, Alexandra 22 September 2014 (has links)
The utility and relevance of data lie in the information that can be extracted from it. The high rate of data publication and its increased complexity, for instance the heterogeneous, self-describing Semantic Web data, motivate the interest in efficient techniques for data manipulation. In this thesis we leverage mature relational data management technology for querying Semantic Web data. The first part focuses on query answering over data subject to RDFS constraints, stored in relational data management systems. The implicit information resulting from RDF reasoning is required to correctly answer such queries. We introduce the database fragment of RDF, going beyond the expressive power of previously studied fragments. We devise novel techniques for answering Basic Graph Pattern queries within this fragment, exploring the two established approaches for handling RDF semantics, namely graph saturation and query reformulation. In particular, we consider graph updates within each approach and propose a method for incrementally maintaining the saturation. We experimentally study the performance trade-offs of our techniques, which can be deployed on top of any relational data management engine. The second part of this thesis considers the new requirements for data analytics tools and methods emerging from the development of the Semantic Web. We fully redesign, from the bottom up, core data analytics concepts and tools in the context of RDF data. We propose the first complete formal framework for warehouse-style RDF analytics. Notably, we define analytical schemas tailored to heterogeneous, semantic-rich RDF graphs, and analytical queries which (beyond relational cubes) allow flexible querying of the data and the schema, as well as powerful aggregation and OLAP-style operations. Experiments on a fully implemented platform demonstrate the practical interest of our approach.
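
To make the two evaluation strategies concrete, here is a minimal Python sketch of saturation versus reformulation for a single RDFS rule (subclass reasoning). The vocabulary handling and data are deliberately simplified illustrations, not the thesis's database fragment.

```python
# Sketch of the two classic ways to answer queries under RDFS semantics:
# (1) saturation: materialize implicit triples, then query the saturated graph;
# (2) reformulation: rewrite the query into a union capturing implicit answers.
# Simplified to one rule family (rdfs:subClassOf / rdf:type); hypothetical data.

TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"

def saturate(triples):
    """Fixpoint: add (x, rdf:type, C) whenever (x, rdf:type, D) and D subClassOf C."""
    triples = set(triples)
    changed = True
    while changed:
        changed = False
        sub = {(s, o) for s, p, o in triples if p == SUBCLASS}
        for d, c in list(sub):                      # transitivity of subClassOf
            for c2 in [o for s, o in sub if s == c]:
                if (d, SUBCLASS, c2) not in triples:
                    triples.add((d, SUBCLASS, c2)); changed = True
        for x, p, d in list(triples):               # type propagation
            if p == TYPE:
                for s, o in sub:
                    if s == d and (x, TYPE, o) not in triples:
                        triples.add((x, TYPE, o)); changed = True
    return triples

def reformulate(cls, triples):
    """Rewrite 'find instances of cls' into a union over all subclasses of cls."""
    subs, changed = {cls}, True
    while changed:
        changed = False
        for s, p, o in triples:
            if p == SUBCLASS and o in subs and s not in subs:
                subs.add(s); changed = True
    return subs  # the query becomes: x rdf:type C, for any C in subs

g = {("Paper1", TYPE, "ConfPaper"), ("ConfPaper", SUBCLASS, "Publication")}
assert ("Paper1", TYPE, "Publication") in saturate(g)
assert "ConfPaper" in reformulate("Publication", g)
```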
312

[en] RXQEE - RELATIONAL-XML QUERY EXECUTION ENGINE / [pt] RXQEE: UMA MÁQUINA DE EXECUÇÃO DE CONSULTAS DE INTEGRAÇÃO DE DADOS RELACIONAIS E XML

AMANDA VIEIRA LOPES 10 February 2005 (has links)
In the traditional approach to evaluating data integration queries, heterogeneous data in data sources are converted into the global data model by wrappers before being delivered to algebraic operators. Consequently, query execution plans (QEPs) are composed exclusively of operations in the global data model. This work proposes a new data integration query evaluation strategy, named Moving Wrappers, in which data conversion is treated as an operation that can be placed in any part of the QEP, based on a query optimization process. This permits the use of algebraic operators in the data source's native data model, so a QEP may include fragments with operations in different data models, converted to the global data model by inter-algebraic operators. Based on this strategy, a query execution engine (QEE), named RXQEE (Relational-XML Query Execution Engine), was developed as an instance of the QEEF (Query Execution Engine Framework), created in a research project of the TecBD laboratory at PUC-Rio. In particular, RXQEE executes integration queries over Relational and XML data, and therefore implements algebraic operators, in the XML and Relational models, and inter-algebraic operators, permitting the execution of integration QEPs.
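
The Moving Wrappers idea can be illustrated with a toy pipeline in which operators evaluate data in the source's native model and an inter-algebraic conversion operator sits mid-plan. The operator names and data below are hypothetical, not RXQEE's actual API.

```python
# Sketch: operators run in each source's native model, and an inter-algebraic
# conversion operator can be placed anywhere in the plan, so a selection can be
# pushed below the conversion point into the XML model.
import xml.etree.ElementTree as ET

def xml_scan(doc, tag):
    """XML-model operator: yields elements, staying in the XML model."""
    for el in ET.fromstring(doc).iter(tag):
        yield el

def xml_select(elems, pred):
    return (el for el in elems if pred(el))

def to_relational(elems, columns):
    """Inter-algebraic operator: converts XML elements into relational tuples."""
    for el in elems:
        yield tuple(el.findtext(c) for c in columns)

def join(left, right, key_l, key_r):
    """Relational-model operator: simple hash join over tuples."""
    index = {}
    for t in right:
        index.setdefault(t[key_r], []).append(t)
    for t in left:
        for m in index.get(t[key_l], []):
            yield t + m

doc = "<db><emp><name>Ana</name><dept>1</dept></emp></db>"
depts = [("1", "Sales")]
plan = join(to_relational(xml_select(xml_scan(doc, "emp"),
                                     lambda e: e.findtext("dept") == "1"),
                          ["name", "dept"]),
            iter(depts), 1, 0)
print(list(plan))  # [('Ana', '1', '1', 'Sales')]
```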
313

Modelos de custo e estatísticas para consultas por similaridade / Cost models and statistics for similarity searching

Bêdo, Marcos Vinícius Naves 10 October 2017 (has links)
Similarity searching is a foundational paradigm for many modern computer applications, such as clustering, classification, and information retrieval. Within this context, the meaning of similarity is related to the distance between objects, which can be formally expressed by the theory of metric spaces. Many studies have focused on the inclusion of similarity searching into Database Management Systems (DBMSs), aiming to (i) enable similarity comparisons to be combined with the identity and order comparisons DBMSs already support and (ii) provide scalability for very large databases. As a step further, this thesis extends the DBMS query optimizer and, in particular, two of its modules: the Data Distribution Space module and the Cost Model module. Although the Data Distribution Space module enables representations of the stored data, such representations are unsuitable for modeling the behavior of comparisons in metric spaces, so the module must be extended to support distance distributions. Likewise, the Cost Model module must be extended to support cost models that depend on estimates over distance distributions. Our study is organized around five contributions. First, a new synopsis for distance distributions, the Compact-Distance Histogram (CDH), is proposed, from which radius and selectivity estimates for similarity queries can be drawn; an experimental comparison showed the gains of CDH estimates over several competitors. Second, a cost model based on the CDH synopsis, called Stockpile, is proposed, whose estimates proved more accurate than those of other models. Omni-Histograms are the third contribution: indexing structures built from histogram partition constraints that enable the optimized execution of queries combining similarity, identity, and order comparisons. The fourth contribution is the RVRM model, which indicates to what extent the estimates of distance-based synopses can be used to optimize similarity queries on high-dimensional datasets, identifying ranges of dimensionality in which such queries can be executed efficiently. Finally, the thesis proposes the integration of the reviewed synopses and cost models into a single system with a high-level syntax that can be coupled to a DBMS query optimizer.
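
The core use of a distance distribution for optimization can be sketched as follows: build a histogram of pairwise distances, then estimate the selectivity of a range query with radius r as the estimated fraction of distances at most r. The plain equi-width histogram below only illustrates the principle; the thesis's CDH is a compacted structure with stronger guarantees.

```python
# Sketch of distance-distribution-based selectivity estimation, in the spirit of
# a distance histogram such as CDH (here a plain equi-width histogram, not the
# compacted structure the thesis proposes).
import random

def distance_histogram(points, dist, n_bins=16):
    """Histogram of pairwise distances over a sample of the dataset."""
    dists = [dist(a, b) for i, a in enumerate(points) for b in points[i + 1:]]
    width = max(dists) / n_bins
    counts = [0] * n_bins
    for d in dists:
        counts[min(int(d / width), n_bins - 1)] += 1
    return counts, width, len(dists)

def selectivity(hist, radius):
    """Estimated fraction of pairs within `radius`: P(dist <= r)."""
    counts, width, total = hist
    full_bins = int(radius / width)
    est = sum(counts[:full_bins])
    if full_bins < len(counts):  # linear interpolation inside the partial bin
        est += counts[full_bins] * (radius / width - full_bins)
    return est / total

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(200)]
euclid = lambda a, b: ((a[0]-b[0])**2 + (a[1]-b[1])**2) ** 0.5
h = distance_histogram(pts, euclid)
print(f"estimated selectivity at r=0.2: {selectivity(h, 0.2):.3f}")
```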
314

Ontology-based semantic query processing in database systems

Necib, Chokri Ben 11 January 2008 (has links)
Currently, database management systems rely solely on the exact syntax of queries to retrieve data. As a consequence, query answers often do not meet the user's intention. In this thesis we propose an ontology-based semantic query processing approach for database systems. We use ontologies to transform a user query into another query that may provide a more meaningful answer to the user. For this purpose, we define and specify different mappings that relate concepts of an ontology with those of an underlying database, and we develop a set of algorithms that allow us to find these mappings in a semi-automatic way. Moreover, we propose a set of semantic rules for transforming queries using terms derived from the ontology; the main effect of a rule is to replace or enrich the terms of a query with other terms that represent the same ontological concepts. We classify the rules and demonstrate their usefulness with practical examples. Furthermore, we make use of the theory of term rewriting systems to formalize the transformation process and to study the basic properties for applying these rules. Finally, we implement a prototype system using current technologies and evaluate its capability on a real-world application.
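
A minimal sketch of the rule-based rewriting idea: a term in a selection predicate is replaced by the set of terms the ontology relates to the same or a narrower concept. The toy ontology and SQL fragment are illustrative assumptions, not the dissertation's mapping formalism.

```python
# Sketch of ontology-driven query rewriting: a term in a predicate is enriched
# with terms the ontology maps to the same (or a narrower) concept.

ontology = {
    "synonyms": {"car": {"automobile"}},
    "narrower": {"car": {"sedan", "coupe"}},
}

def expand_term(term, onto):
    """Rewrite rule: replace a term by the terms denoting the same concept."""
    return ({term}
            | onto["synonyms"].get(term, set())
            | onto["narrower"].get(term, set()))

def rewrite_predicate(column, term, onto):
    values = sorted(expand_term(term, onto))
    return f"{column} IN ({', '.join(repr(v) for v in values)})"

# "type = 'car'" becomes a broader but intention-preserving predicate:
print(rewrite_predicate("type", "car", ontology))
# type IN ('automobile', 'car', 'coupe', 'sedan')
```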
315

Architecting Query Compilers for Diverse Workloads

Ruby Y Tahboub (6624119) 10 June 2019 (has links)
To leverage modern hardware platforms to their fullest, more and more database systems embrace compilation of query plans to native code. In the research community, there is an ongoing debate about the best way to architect such query compilers. This is perceived to be a difficult task, requiring techniques fundamentally different from traditional interpreted query execution. In this dissertation, we contribute to this discussion by drawing attention to an old but underappreciated idea known as Futamura projections, which fundamentally link interpreters and compilers. Guided by this idea, we demonstrate that efficient query compilation can actually be very simple, using techniques that are no more difficult than writing a query interpreter in a high-level language. We first develop LB2: a high-level query compiler implemented in this style that is competitive with the best compiled query engines both in sequential and parallel execution on the standard TPC-H benchmark.

Query engines process a variety of data types and structures, including text, spatial data, and graphs. Several spatial and graph engines are implemented as extensions to relational query engines to leverage optimized memory, storage, and evaluation. Still, the performance of these extensions is often stymied by the interpretive nature of the underlying data management, generic data structures, and the need to execute domain-specific external libraries. On that basis, compiling spatial and graph queries to native code is a desirable avenue to mitigate existing limitations and improve performance. To support compiling spatial queries, we extend the LB2 main-memory query compiler with spatial predicates, indexing structures, and spatial operators. To support compiling graph queries, we extend LB2 with graph data structures and operators. The spatial extension matches the performance of hand-written code and outperforms relational query engines and map-reduce extensions. Similarly, the graph extension matches, and sometimes outperforms, low-level graph engines.
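
The Futamura-projection insight can be illustrated in a few lines: specializing a query interpreter to one fixed query removes the interpretive overhead and yields compiled code. LB2 achieves this with generative programming in a high-level language; the string-based Python sketch below (with a hypothetical predicate AST) conveys only the flavor.

```python
# Sketch of the first Futamura projection applied to a query engine: specializing
# an interpreter to a fixed query yields a compiled evaluator.

def interpret(pred, row):
    """Interpreter: walks the predicate AST for every row (per-row overhead)."""
    op, a, b = pred
    if op == "and":
        return interpret(a, row) and interpret(b, row)
    if op == "lt":
        return row[a] < b
    if op == "eq":
        return row[a] == b

def compile_pred(pred):
    """Specializer: folds the AST walk away, emitting straight-line code."""
    def gen(p):
        op, a, b = p
        if op == "and":
            return f"({gen(a)}) and ({gen(b)})"
        if op == "lt":
            return f"row[{a!r}] < {b!r}"
        if op == "eq":
            return f"row[{a!r}] == {b!r}"
    src = f"lambda row: {gen(pred)}"
    return eval(src)  # the residual program: no AST traversal at run time

query = ("and", ("lt", "price", 100), ("eq", "brand", "acme"))
rows = [{"price": 50, "brand": "acme"}, {"price": 150, "brand": "acme"}]
fast = compile_pred(query)
assert [interpret(query, r) for r in rows] == [fast(r) for r in rows]
```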
316

Automaton Meets Algebra: A Hybrid Paradigm for Efficiently Processing XQuery over XML Stream

Su, Hong 30 January 2006 (has links)
XML stream applications bring the challenge of efficiently processing queries on sequentially accessible, token-based data streams. The automaton paradigm is naturally suited to pattern retrieval on tokenized XML streams, but requires patches to implement the filtering and restructuring functionalities common in XML query languages. In contrast, the algebraic paradigm is well established for processing self-contained tuples, but does not traditionally support token inputs. This dissertation proposes a framework called Raindrop that accommodates both paradigms so as to take advantage of each. First, we propose an architecture for Raindrop. Raindrop is an algebraic framework that models queries at different abstraction levels. We represent the token-based automaton computations as an algebraic subplan at the high level while exposing the automaton details at the low level. The algebraic subplan modeling automaton computations can thus be integrated with the algebraic subplan modeling the non-automaton computations. Second, we explore a novel optimization opportunity. Other XML stream processing systems always perform all the pattern retrieval of a query in the automaton. In contrast, Raindrop allows a plan to perform some of the pattern retrieval in the automaton and the rest outside it. This opens up an automaton-in-or-out optimization opportunity. We study this optimization in two types of run-time environments, one with stable data characteristics and one with fluctuating data characteristics, and provide search strategies catering to each. We also describe how to migrate from the currently running plan to a new plan at run-time. Third, we optimize the automaton computations using schema knowledge. A set of criteria is established to decide which schema constraints are useful to a given query, and optimization rules utilizing different types of schema constraints are proposed based on these criteria. We design a rule application algorithm that ensures both completeness (i.e., no optimization is missed) and minimality (i.e., no redundant optimization is introduced). Experiments on both real and synthetic data illustrate that these techniques bring significant performance improvement with little overhead.
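
A minimal sketch of the hybrid paradigm: a stack-based automaton consumes SAX-like tokens to retrieve a path pattern and emits self-contained tuples that ordinary algebraic operators consume downstream. The token encoding and query are illustrative, not Raindrop's plan language.

```python
# Sketch: an automaton retrieves the pattern //book/title from a token stream
# and hands tuples to an algebra operator that is unaware of tokens.

def pattern_automaton(tokens):
    """Stack-based automaton: yields the text of <title> elements under <book>."""
    stack, buf = [], None
    for kind, value in tokens:
        if kind == "start":
            stack.append(value)
            if stack[-2:] == ["book", "title"]:
                buf = []
        elif kind == "text" and buf is not None:
            buf.append(value)
        elif kind == "end":
            if stack[-2:] == ["book", "title"]:
                yield ("".join(buf),)   # tuple enters the algebra paradigm
                buf = None
            stack.pop()

def select(tuples, pred):
    """Ordinary algebra operator over self-contained tuples."""
    return (t for t in tuples if pred(t))

stream = [("start", "book"), ("start", "title"), ("text", "XML Streams"),
          ("end", "title"), ("end", "book"),
          ("start", "book"), ("start", "title"), ("text", "Databases"),
          ("end", "title"), ("end", "book")]
print(list(select(pattern_automaton(stream), lambda t: "XML" in t[0])))
# [('XML Streams',)]
```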
317

Metadata-Aware Query Processing over Data Streams

Ding, Luping 22 April 2008 (has links)
Many modern applications need to process queries over potentially infinite data streams to provide answers in real time. This dissertation proposes novel techniques to optimize CPU and memory utilization in stream processing by exploiting metadata on the streaming data or queries. It focuses on four topics: 1) exploiting stream metadata to optimize SPJ query operators via operator configuration, 2) exploiting stream metadata to optimize SPJ query plans via query rewriting, 3) exploiting workload metadata to optimize parameterized queries via indexing, and 4) exploiting event constraints to optimize event stream processing via run-time early termination. The first part of this dissertation proposes algorithms for one of the most common and expensive query operators, namely join, to identify and purge no-longer-needed data from the operator state at runtime based on punctuations; we also study how punctuations can be exploited in combination with commonly used window constraints. Extensive experimental evaluations demonstrate both reductions in memory usage and improvements in execution time due to the proposed strategies. The second part proposes herald-driven runtime query plan optimization techniques. We identify four query optimization techniques and design a lightweight algorithm to efficiently detect optimization opportunities at runtime upon receiving heralds. We propose a novel execution paradigm that supports multiple concurrent logical plans by maintaining one physical plan. An extensive experimental study confirms that our techniques significantly reduce query execution times. The third part deals with the shared execution of parameterized queries instantiated from a query template. We design a lightweight index mechanism that provides multiple access paths to the data to facilitate a wide range of parameterized queries. To withstand workload fluctuations, we propose an index tuning framework that tunes the index configuration in a timely manner. Extensive experimental evaluations demonstrate the effectiveness of the proposed strategies. The last part proposes event query optimization techniques that exploit event constraints, such as exclusiveness or ordering relationships among events, extracted from workflows. Significant performance gains are shown to be achieved by the proposed constraint-aware event processing techniques.
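
The punctuation-based state purging of the first part can be sketched as follows: when a punctuation promises that no further tuples with certain keys will arrive, the join can safely drop the corresponding buffered state. This one-sided toy join is only an illustration under simplified assumptions, not the dissertation's operators.

```python
# Sketch of punctuation-driven state purging in a stream join: a punctuation
# declaring "no more tuples with key <= k will arrive" lets the join release
# the matching state without affecting correctness.

class PunctuatedJoin:
    def __init__(self):
        self.state = {}          # key -> buffered left tuples

    def on_left(self, key, tup):
        self.state.setdefault(key, []).append(tup)

    def on_right(self, key, tup):
        return [l + tup for l in self.state.get(key, [])]

    def on_punctuation(self, max_done_key):
        """Right input promises no future tuples with key <= max_done_key."""
        purged = [k for k in self.state if k <= max_done_key]
        for k in purged:
            del self.state[k]    # memory freed early
        return purged

j = PunctuatedJoin()
j.on_left(1, ("a",)); j.on_left(5, ("b",))
print(j.on_right(1, ("x",)))   # [('a', 'x')]
print(j.on_punctuation(3))     # purges key 1
print(len(j.state))            # 1 (only key 5 remains buffered)
```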
318

Pattern Mining and Sense-Making Support for Enhancing the User Experience

Mukherji, Abhishek 07 December 2018 (has links)
While data mining techniques such as frequent itemset and sequence mining are well established as powerful pattern discovery tools in domains from science and medicine to business, a key shortcoming is the lack of support for interactive exploration of the high numbers of patterns generated with diverse parameter settings, and of the relationships among the mined patterns. To enhance the user experience, real-time query turnaround and improved support for interactive mining are desired. There is also increasing interest in applying data mining to mobile data: patterns mined over mobile data may enable context-aware applications ranging from automating frequently repeated tasks to providing personalized recommendations. Overall, this dissertation addresses three problems that limit the utility of data mining, namely (a) the lack of interactive exploration tools for mined patterns, (b) insufficient support for mining localized patterns, and (c) high computational mining requirements that prohibit mining patterns on smaller compute units such as a smartphone. This dissertation develops interactive frameworks for the guided exploration of mined patterns and their relationships. Contributions include the PARAS preprocessing and indexing framework, which enables analysts to gain key insights into rule relationships in a parameter-space view thanks to a compact storage of rules that allows query-time reconstruction of complete rulesets. Contributions also include the visual rule exploration framework FIRE, which presents an interactive dual view of the parameter space and the rule space that together enable enhanced sense-making of rule relationships. This dissertation also supports the online mining of localized association rules computed on data subsets by selectively deploying alternative execution strategies that leverage a multidimensional itemset-based data partitioning index. Finally, we designed OLAPH, an on-device context-aware service that learns phone usage patterns over mobile context data such as app usage, location, call, and SMS logs to provide device intelligence. Concepts introduced for modeling mobile data as sequences include compressing context logs into intervaled context events, adding generalized time features, and identifying meaningful sequences via filter expressions.
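
The parameter-space indexing idea behind PARAS can be sketched as storing each rule once with its (support, confidence) coordinates and reconstructing the ruleset for any parameter setting at query time instead of re-mining. The index layout below is a simplified illustration, not PARAS's actual storage scheme.

```python
# Sketch of parameter-space rule indexing: rules are stored once with their
# (support, confidence) coordinates; the ruleset for any (min_sup, min_conf)
# point is reconstructed at query time. Illustrative data.
import bisect

class ParameterSpaceIndex:
    def __init__(self, rules):
        # rules: list of (support, confidence, rule_string), kept sorted by support
        self.rules = sorted(rules)

    def query(self, min_sup, min_conf):
        """All rules valid at this point of the (support, confidence) space."""
        lo = bisect.bisect_left(self.rules, (min_sup,))
        return [r for s, c, r in self.rules[lo:] if c >= min_conf]

idx = ParameterSpaceIndex([
    (0.30, 0.90, "bread -> butter"),
    (0.10, 0.95, "milk -> cereal"),
    (0.25, 0.60, "beer -> chips"),
])
print(idx.query(0.20, 0.80))   # ['bread -> butter']
print(idx.query(0.05, 0.80))   # ['milk -> cereal', 'bread -> butter']
```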
319

Semantic Query Optimization for Processing XML Streams with Minimized Memory Footprint

Li, Ming 25 August 2007 (has links)
XML streams have become increasingly prevalent in modern applications, ranging from network traffic monitoring to real-time information publishing. XQuery evaluation over XML streams requires the temporary buffering of XML elements, which not only consumes system buffer and CPU resources but also causes unnecessary output latency. This thesis presents a semantic query optimization solution that minimizes the memory footprint during XQuery evaluation by exploiting XML schema knowledge. In many practical applications, XML streams are generated conforming to pre-defined schema constraints, typically expressed via a DTD or an XML Schema specification. Utilizing such constraints enables us to predict on the fly the non-occurrence of a given pattern within a bounded context. This helps us avoid data buffering and release buffered data earlier, thus achieving a minimized memory footprint. In this work, we focus on one particular class of constraints, namely Pattern Non-Occurrence (PNO) constraints. We develop an automaton-based technique to detect PNO constraints at runtime. For a given query, optimization opportunities that can be triggered by runtime PNO detection are explored for memory footprint minimization. Optimization decisions are encoded using our proposed Condition-Action Graph (CAG). An optimization-embedded execution strategy then executes the optimized plan by detecting PNO constraints at run-time and triggering the corresponding encoded actions when certain predefined conditions are satisfied. To ensure the efficiency of such PNO-triggered optimization, we propose an optimization strategy that shrinks the CAGs by utilizing constraint knowledge during the query plan compilation phase. We implement our optimization technique within the Raindrop XQuery engine. Our system implementation processes XQuery using the Raindrop algebra, efficiently augmented by our optimization module, which uses the Glushkov automaton technique to capture and monitor PNO constraints in parallel with the query-driven pattern retrieval. Finally, we conduct experimental studies using both real and synthetic data streams to illustrate that our techniques bring significant performance improvements in both memory and CPU usage, as well as improved output latency, over state-of-the-art solutions.
320

Privacy-preserving queries on encrypted databases

Meng, Xianrui 07 December 2016 (has links)
In today's Internet, with the advent of cloud computing, there is a natural desire for enterprises, organizations, and end users to outsource increasingly large amounts of data to a cloud provider. Ensuring security and privacy is therefore becoming a significant challenge for cloud computing, especially for users with sensitive and valuable data. Recently, many efficient and scalable methods for query processing over encrypted data have been proposed. Nevertheless, numerous challenges remain to be addressed due to the high complexity of many important queries on encrypted large-scale datasets. This thesis studies the problem of privacy-preserving database query processing on structured data (e.g., relational and graph databases). In particular, it proposes several practical and provably secure structured encryption schemes that allow the data owner to encrypt data without losing the ability to query and retrieve it efficiently for authorized clients. The thesis comprises two parts. The first part investigates graph encryption schemes. We propose a graph encryption scheme for approximate shortest-distance queries, which allows the client to query the shortest distance between two nodes in an encrypted graph securely and efficiently, and we explore how the techniques can be applied to other graph queries. The second part proposes secure top-k query processing schemes on encrypted relational databases, and further develops a scheme for top-k join queries over multiple encrypted relations. Finally, the thesis demonstrates the practicality of the proposed encryption schemes by prototyping the encryption systems to perform queries on real-world encrypted datasets.
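
The structure of a graph encryption scheme for approximate shortest-distance queries can be sketched with a landmark-based distance oracle stored under PRF-randomized labels. Real constructions also encrypt the distance values themselves and come with formal security proofs; this stdlib-only sketch shows the data flow only.

```python
# Sketch: the owner precomputes distances to a few landmarks, stores them under
# PRF-randomized labels, and the client approximates d(u, v) without revealing
# the plaintext graph to the server. Simplified; not a provably secure scheme.
import hashlib, hmac

KEY = b"client-secret-key"

def prf(label):
    return hmac.new(KEY, label.encode(), hashlib.sha256).hexdigest()

def encrypt_oracle(dist_to_landmarks):
    """dist_to_landmarks: {node: {landmark: distance}} -> server-side index."""
    return {prf(f"{node}|{lm}"): d
            for node, lms in dist_to_landmarks.items()
            for lm, d in lms.items()}

def query_distance(server_index, u, v, landmarks):
    """Approximate d(u, v) as min over landmarks of d(u, L) + d(L, v)."""
    best = None
    for lm in landmarks:
        du = server_index.get(prf(f"{u}|{lm}"))
        dv = server_index.get(prf(f"{v}|{lm}"))
        if du is not None and dv is not None:
            best = du + dv if best is None else min(best, du + dv)
    return best

oracle = {"a": {"L1": 2}, "b": {"L1": 3}}       # toy precomputed distances
index = encrypt_oracle(oracle)
print(query_distance(index, "a", "b", ["L1"]))  # 5 (an upper bound on d(a, b))
```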
