Return to search

A framework for processing correlated probabilistic data

The amount of digitally-born data has surged in recent years. In many scenarios, this data is inherently uncertain (or: probabilistic), such as data originating from sensor networks, image and voice recognition, location detection, and automated web data extraction. Probabilistic data requires novel and different approaches to data mining and analysis, which explicitly account for the uncertainty and the correlations therein. This thesis introduces ENFrame: a framework for processing and mining correlated probabilistic data. Using this framework, it is possible to express both traditional and novel algorithms for data analysis in a special user language, without having to explicitly address the uncertainty of the data on which the algorithms operate. The framework will subsequently execute the algorithm on the probabilistic input, and perform exact or approximate parallel probability computation. During the probability computation, correlations and provenance are succinctly encoded using probabilistic events. This thesis contains novel contributions in several directions. An expressive user language – a subset of Python – is introduced, which allows a programmer to implement algorithms for probabilistic data without requiring knowledge of the underlying probabilistic model. Furthermore, an event language is presented, which is used for the probabilistic interpretation of the user program. The event language can succinctly encode arbitrary correlations using events, which are the probabilistic counterparts of deterministic user program variables. These highly interconnected events are stored in an event network, a probabilistic interpretation of the original user program. Multiple techniques for exact and approximate probability computation (with error guarantees) of such event networks are presented, as well as techniques for parallel computation. Adaptations of multiple existing data mining algorithms are shown to work in the framework, and are subsequently subjected to an extensive experimental evaluation. Additionally, a use-case is presented in which a probabilistic adaptation of a clustering algorithm is used to predict faults in energy distribution networks. Lastly, this thesis presents techniques for integrating a number of different probabilistic data formalisms for use in this framework and in other applications.

Identiferoai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:680401
Date January 2014
Creatorsvan Schaik, Sebastiaan Johannes
ContributorsOlteanu, Dan
PublisherUniversity of Oxford
Source SetsEthos UK
Detected LanguageEnglish
TypeElectronic Thesis or Dissertation
Sourcehttp://ora.ox.ac.uk/objects/uuid:91aa418d-536e-472d-9089-39bef5f62e62

Page generated in 0.0016 seconds