Global ETD Search

1	Utility of Considering Multiple Alternative Rectifications in Data Cleaning January 2013 abstract: Most data cleaning systems aim to go from a given deterministic dirty database to another deterministic but clean database. Such an enterprise presupposes that the cleaning process can uniquely recover the clean version of each dirty tuple. This is not possible in many cases, where the most a cleaning system can do is generate a (hopefully small) set of clean candidates for each dirty tuple. When the cleaning system is required to output a deterministic database, it is forced to pick one clean candidate (say, the "most likely" candidate) per tuple. Such an approach can lead to loss of information. For example, consider a situation where there are three equally likely clean candidates of a dirty tuple: keeping only one of them discards two alternatives that are each just as likely to be correct. An appealing alternative that avoids such information loss is to abandon the requirement that the output database be deterministic. In other words, even though the input (dirty) database is deterministic, I allow the reconstructed database to be probabilistic. Although such an approach does avoid the information loss, it also brings forth several challenges. First, how many alternatives should be kept per tuple in the reconstructed database? Maintaining too many alternatives increases the size of the reconstructed database, and hence the query processing time. Second, while processing queries on the probabilistic database may well increase recall, how does it affect the precision of query processing? In this thesis, I investigate these questions. My investigation is done in the context of a data cleaning system called BayesWipe, which can produce multiple clean candidates for each dirty tuple, along with the probability that each candidate is the correct cleaned version. 
I represent these alternatives as tuples in a tuple-disjoint probabilistic database, and use the Mystiq system to process queries on it. This probabilistic reconstruction (called BayesWipe-PDB) is compared to a deterministic reconstruction (called BayesWipe-DET), in which the most likely clean candidate for each tuple is chosen and the remaining alternatives are discarded. / Dissertation/Thesis / M.S. Computer Science 2013 Computer science; False Positive; Precision; Probabilistic Database; Probabilistic data cleaning; Recall; True Positive
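The contrast between the two reconstructions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BayesWipe or Mystiq code: the tuple ids, candidate values, probabilities, and function names below are all hypothetical. Each dirty tuple maps to mutually exclusive clean candidates (a tuple-disjoint block), and a simple selection query is answered both ways.

```python
from fractions import Fraction

# Hypothetical clean candidates for two dirty tuples (values invented for illustration).
# Alternatives of one tuple are mutually exclusive and their probabilities sum to 1.
candidates = {
    "t1": [("Honda", Fraction(1, 3)), ("Hyundai", Fraction(1, 3)), ("Holden", Fraction(1, 3))],
    "t2": [("Honda", Fraction(9, 10)), ("Hyundai", Fraction(1, 10))],
}


def deterministic(cands):
    """Keep only the most likely alternative per tuple (ties go to the first listed)."""
    return {tid: max(alts, key=lambda a: a[1])[0] for tid, alts in cands.items()}


def select_deterministic(cands, value):
    """Selection on the deterministic reconstruction: ids whose chosen value matches."""
    return [tid for tid, v in deterministic(cands).items() if v == value]


def select_probabilistic(cands, value):
    """Selection on the probabilistic reconstruction: id -> probability of matching."""
    result = {}
    for tid, alts in cands.items():
        p = sum((prob for v, prob in alts if v == value), Fraction(0))
        if p > 0:
            result[tid] = p
    return result
```

With these made-up numbers, a query for "Hyundai" returns nothing on the deterministic reconstruction, while the probabilistic one returns both tuples with their match probabilities. That is the recall gain, and the low-probability answers are where precision may suffer.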
2	Modeling and verification of probabilistic data-aware business processes / Modélisation et vérification des processus métier orientés données probabilistes Li, Haizhou 26 March 2015 A wide range of new applications stresses the need for business process models that can handle imprecise or uncertain data. Because of the presence of probabilistic data, the external behaviours of such business processes are non-Markovian. Few works in the literature have addressed the verification of such systems. This thesis studies the modelling and analysis issues for this kind of business process. As a formal model of process behaviours, it uses a labelled transition system in which transitions are guarded by conditions defined over a probabilistic database. It then proposes a decomposition approach for these processes that makes it possible to test the simulation relation between processes in this context. A complexity analysis reveals that the simulation test problem is in 2-EXPTIME and is EXPTIME-hard in terms of expression complexity, whereas in terms of data complexity it incurs no additional cost beyond that of evaluating Boolean queries over probabilistic databases. The proposed approach is then extended to allow the verification of properties expressed in the logics P-LTL and P-CTL. Finally, a prototype named 'PRODUS' has been implemented and used in an application related to geographic information systems to show the feasibility of the proposed approach. / There is a wide range of new applications that stress the need for business process models that are able to handle imprecise data. This thesis studies the underlying modelling and analysis issues. 
As a formal model of process behaviours, it uses a labelled transition system in which transitions are guarded by conditions defined over a probabilistic database. To tackle verification problems, we decompose this model into a set of traditional automata associated with probabilities, called world-partition automata. Next, this thesis presents an approach for testing the probabilistic simulation preorder in this context. A complexity analysis reveals that the problem is in 2-EXPTIME and is EXPTIME-hard with respect to expression complexity, while it matches probabilistic query evaluation with respect to data complexity. P-LTL and P-CTL model checking methods are then studied to verify this model; in this context, P-LTL and P-CTL model checking is in EXPTIME. Finally, a prototype modelling and verification tool called "PRODUS" is introduced, and a realistic scenario in the domain of GIS (geographic information systems) is modelled using this approach. Probabilistic database; Business processes; Simulation relation test; Model checking
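One way to picture guards defined over a probabilistic database is to enumerate the possible worlds of a tiny database and group them by which guards they enable, so that each group carries a total probability. The sketch below does this in Python. It is only an assumed reading of the decomposition idea, not the thesis's construction or the PRODUS tool: the fact names, guard names, probabilities, and the tuple-independence assumption are all hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical tuple-independent database: each fact is present with the given probability.
facts = {"road_open": Fraction(4, 5), "bridge_open": Fraction(1, 2)}

# Hypothetical transition guards: Boolean conditions over a world (the set of present facts).
guards = {
    "go_direct": lambda w: "road_open" in w and "bridge_open" in w,
    "detour": lambda w: "road_open" in w and "bridge_open" not in w,
}


def worlds(facts):
    """Yield every possible world with its probability (facts assumed independent)."""
    names = list(facts)
    for bits in product([True, False], repeat=len(names)):
        world = frozenset(n for n, b in zip(names, bits) if b)
        p = Fraction(1)
        for n, b in zip(names, bits):
            p *= facts[n] if b else 1 - facts[n]
        yield world, p


def partition_by_guards(facts, guards):
    """Group worlds by the set of guards they satisfy and sum the probability per group."""
    blocks = {}
    for world, p in worlds(facts):
        enabled = frozenset(g for g, cond in guards.items() if cond(world))
        blocks[enabled] = blocks.get(enabled, Fraction(0)) + p
    return blocks
```

Within one group every guard has a fixed truth value, so the process behaves like an ordinary automaton there. The enumeration is exponential in the number of facts, so this is suitable only for toy inputs.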
3	Evaluation of relational algebra queries on probabilistic databases : tractability and approximation Fink, Robert D. January 2014 Query processing is a core task in probabilistic databases: Given a query and a database that encodes uncertainty in data by means of probability distributions, the problem is to compute possible query answers together with their respective probabilities of being correct. This thesis advances the state of the art in two aspects of query processing in probabilistic databases: complexity analysis and query evaluation techniques. A dichotomy is established for non-repeating, conjunctive relational algebra queries with negation that separates #P-hard queries from those with PTIME data complexity. A framework for computing probabilities of relational algebra queries is presented; the probability computation algorithm is based on decomposition methods and provides exact answers in the case of exhaustive decompositions, or anytime approximate answers with absolute or relative error guarantees in the case of partial decompositions. The framework is extended to queries with aggregation operators. An experimental evaluation of the proposed algorithms’ implementations within the SPROUT query engine complements the theoretical results. The SPROUT² system uses this query engine to compute answers to queries on uncertain, tabular Web data. 005.74
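To give a feel for exact probability computation by exhaustive decomposition, the sketch below applies Shannon expansion to the lineage of a query answer. It is a generic textbook technique, shown here in Python under stated assumptions, and is not claimed to be the thesis's algorithm or SPROUT code: the lineage is assumed to be a positive DNF formula over independent tuple variables, and the function name, variable names, and probabilities are hypothetical.

```python
from fractions import Fraction


def probability(dnf, probs):
    """Exact probability that a positive DNF formula is true.

    dnf:   list of clauses, each a frozenset of variable names (a conjunction).
    probs: dict mapping each variable to its marginal probability;
           variables are assumed to be independent.
    """
    if not dnf:
        return Fraction(0)  # an empty disjunction is false
    if any(len(clause) == 0 for clause in dnf):
        return Fraction(1)  # an empty conjunction is true
    # Shannon expansion on one variable x: P(f) = p(x) P(f|x) + (1 - p(x)) P(f|not x).
    x = next(iter(dnf[0]))
    when_true = [clause - {x} for clause in dnf]          # x holds: drop it from clauses
    when_false = [clause for clause in dnf if x not in clause]  # x fails: drop its clauses
    return (probs[x] * probability(when_true, probs)
            + (1 - probs[x]) * probability(when_false, probs))


# Hypothetical lineage (x1 AND y1) OR (x1 AND y2), every tuple present with probability 1/2.
lineage = [frozenset({"x1", "y1"}), frozenset({"x1", "y2"})]
marginals = {"x1": Fraction(1, 2), "y1": Fraction(1, 2), "y2": Fraction(1, 2)}
```

Here `probability(lineage, marginals)` evaluates to 3/8, matching the hand computation 1/2 × (1 − 1/2 × 1/2). Running the expansion to completion takes exponential time in the worst case. Stopping early and bounding the unexpanded parts is the general idea behind the anytime approximations that the abstract mentions.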

