1 |
Query Answering over Functional Dependency Repairs
Galiullin, Artur 11 September 2013
Inconsistency often arises in real-world databases and, as a result, critical queries over dirty data may lead users to make ill-informed decisions. Functional dependencies (FDs) can be used to specify intended semantics of the underlying data and aid with the cleaning task. Enumerating and evaluating all the possible repairs to FD violations is infeasible, while approaches that produce a single repair or attempt to isolate the dirty portion of data are often too destructive or constraining. In this thesis, we leverage a recent advance in data cleaning that allows sampling from a well-defined space of reasonable repairs, and provide the user with a data management tool that gives uncertain query answers over this space. We propose a framework to compute probabilistic query answers as though each repair sample were a possible world. We show experimentally that queries over many possible repairs produce results that are more useful than other approaches and that our system can scale to large datasets.
|
3 |
Probabilistic Databases and Their Applications
Zhao, Wenzhong 01 January 2004
Probabilistic reasoning in databases has been an active area of research during the last two decades. However, the previously proposed database approaches, including the probabilistic relational approach and the probabilistic object approach, are not good fits for storing and managing diverse probability distributions along with their auxiliary information. The work in this dissertation extends significantly the initial semistructured probabilistic database framework proposed by Dekhtyar, Goldsmith and Hawkes in [20]. We extend the formal Semistructured Probabilistic Object (SPO) data model of [20]. Accordingly, we also extend the Semistructured Probabilistic Algebra (SP-algebra), the query algebra proposed for the SPO model. Based on the extended framework, we have designed and implemented a Semistructured Probabilistic Database Management System (SPDBMS) on top of a relational DBMS. The SPDBMS is flexible enough to meet the need of storing and manipulating diverse probability distributions along with their associated information. Its query language supports standard database queries as well as queries specific to probabilities, such as conditionalization and marginalization. Currently the SPDBMS serves as a storage backbone for the project Decision Making and Planning under Uncertainty with Constraints, which involves managing large quantities of probabilistic information (this project is partially supported by the National Science Foundation under Grant No. ITR-0325063). We also report our experimental results evaluating the performance of the SPDBMS. We describe an extension of the SPO model for handling interval probability distributions. The Extended Semistructured Probabilistic Object (ESPO) framework improves the flexibility of the original semistructured data model in two important features: (i) support for interval probabilities and (ii) association of context and conditionals with individual random variables. An extended SPO (ESPO) data model has been developed, and an extended query algebra for ESPO has also been introduced to manipulate probability distributions for probability intervals. The Bayesian Network Development Suite (BaNDeS), a system which builds Bayesian networks with full data management support of the SPDBMS, has been described. It allows experts with particular expertise to work only on specific subsystems during the Bayesian network construction process independently and asynchronously while updating the model in real-time. There are three major foci of our ongoing and future work: (1) implementation of a query optimizer and performance evaluation of query optimization, (2) extension of the SPDBMS to handle interval probability distributions, and (3) incorporation of machine learning techniques into the BaNDeS.
|
4 |
Unsupervised Bayesian Data Cleaning Techniques for Structured Data
January 2014
abstract: Recent efforts in data cleaning have focused mostly on problems like data deduplication, record matching, and data standardization; few of these focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which have to be provided by domain experts, or learned from a clean sample of the database). In this thesis, I provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. I thus avoid the necessity for a domain expert or master data. I also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. A Map-Reduce architecture to perform this computation in a distributed manner is also shown. I evaluate these methods over both synthetic and real data. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2014
|
5 |
Most Probable Explanations for Probabilistic Database Queries: Extended Version
Ceylan, Ismail Ilkan; Borgwardt, Stefan; Lukasiewicz, Thomas 28 December 2023
Forming the foundations of large-scale knowledge bases, probabilistic databases have been widely studied in the literature. In particular, probabilistic query evaluation has been investigated intensively as a central inference mechanism. However, despite its power, query evaluation alone cannot extract all the relevant information encompassed in large-scale knowledge bases. To exploit this potential, we study two inference tasks, namely finding the most probable database and the most probable hypothesis for a given query. As natural counterparts of most probable explanations (MPE) and maximum a posteriori hypotheses (MAP) in probabilistic graphical models, they can be used in a variety of applications that involve prediction or diagnosis tasks. We investigate these problems relative to a variety of query languages, ranging from conjunctive queries to ontology-mediated queries, and provide a detailed complexity analysis.
|
6 |
Ontology-Mediated Queries for Probabilistic Databases: Extended Version
Borgwardt, Stefan; Ceylan, Ismail Ilkan; Lukasiewicz, Thomas 28 December 2023
Probabilistic databases (PDBs) are usually incomplete, e.g., contain only the facts that have been extracted from the Web with high confidence. However, missing facts are often treated as being false, which leads to unintuitive results when querying PDBs. Recently, open-world probabilistic databases (OpenPDBs) were proposed to address this issue by allowing probabilities of unknown facts to take any value from a fixed probability interval. In this paper, we extend OpenPDBs by Datalog± ontologies, under which both upper and lower probabilities of queries become even more informative, enabling us to distinguish queries that were indistinguishable before. We show that the dichotomy between P and PP in (Open)PDBs can be lifted to the case of first-order rewritable positive programs (without negative constraints); and that the problem can become NP^PP-complete, once negative constraints are allowed. We also propose an approximating semantics that circumvents the increase in complexity caused by negative constraints.
|
7 |
Query Answering in Probabilistic Data and Knowledge Bases
Ceylan, Ismail Ilkan 04 June 2018
Probabilistic data and knowledge bases are becoming increasingly important in academia and industry. They are continuously extended with new data, powered by modern information extraction tools that associate probabilities with knowledge base facts. The state of the art to store and process such data is founded on probabilistic database systems, which are widely and successfully employed. Beyond all the success stories, however, such systems still lack the fundamental machinery to convey some of the valuable knowledge hidden in them to the end user, which limits their potential applications in practice. In particular, in their classical form, such systems are typically based on strong, unrealistic limitations, such as the closed-world assumption, the closed-domain assumption, the tuple-independence assumption, and the lack of commonsense knowledge. These limitations not only lead to unwanted consequences, but also put such systems on weak footing in important tasks, query answering being a very central one. In this thesis, we enhance probabilistic data and knowledge bases with more realistic data models, thereby allowing for better means for querying them. Building on the long endeavor of unifying logic and probability, we develop different rigorous semantics for probabilistic data and knowledge bases, analyze their computational properties, identify sources of (in)tractability, and design practical scalable query answering algorithms whenever possible. To achieve this, the current work brings together some recent paradigms from logics, probabilistic inference, and database theory.
|
8 |
Ontology-Mediated Query Answering over Log-Linear Probabilistic Data: Extended Version
Borgwardt, Stefan; Ceylan, Ismail Ilkan; Lukasiewicz, Thomas 28 December 2023
Large-scale knowledge bases are at the heart of modern information systems. Their knowledge is inherently uncertain, and hence they are often materialized as probabilistic databases. However, probabilistic database management systems typically lack the capability to incorporate implicit background knowledge and, consequently, fail to capture some intuitive query answers. Ontology-mediated query answering is a popular paradigm for encoding commonsense knowledge, which can provide more complete answers to user queries. We propose a new data model that integrates the paradigm of ontology-mediated query answering with probabilistic databases, employing a log-linear probability model. We compare our approach to existing proposals, and provide supporting computational results.
|
9 |
Query Answering in Probabilistic Data and Knowledge Bases
Ceylan, Ismail Ilkan 29 November 2017
Probabilistic data and knowledge bases are becoming increasingly important in academia and industry. They are continuously extended with new data, powered by modern information extraction tools that associate probabilities with knowledge base facts. The state of the art to store and process such data is founded on probabilistic database systems, which are widely and successfully employed. Beyond all the success stories, however, such systems still lack the fundamental machinery to convey some of the valuable knowledge hidden in them to the end user, which limits their potential applications in practice. In particular, in their classical form, such systems are typically based on strong, unrealistic limitations, such as the closed-world assumption, the closed-domain assumption, the tuple-independence assumption, and the lack of commonsense knowledge. These limitations not only lead to unwanted consequences, but also put such systems on weak footing in important tasks, query answering being a very central one. In this thesis, we enhance probabilistic data and knowledge bases with more realistic data models, thereby allowing for better means for querying them. Building on the long endeavor of unifying logic and probability, we develop different rigorous semantics for probabilistic data and knowledge bases, analyze their computational properties, identify sources of (in)tractability, and design practical scalable query answering algorithms whenever possible. To achieve this, the current work brings together some recent paradigms from logics, probabilistic inference, and database theory.
|