A data stream is a transiently observed sequence of data elements that arrive unordered, with repetitions, and at a very high rate of transmission. Examples include Internet traffic data, networks of banking and credit transactions, and radar-derived meteorological data. Computer science and engineering communities have developed randomised, probabilistic algorithms to estimate statistics of interest over streaming data on the fly, with small computational complexity and storage requirements, by constructing low-dimensional representations of the stream known as data sketches. This thesis combines techniques of statistical inference with algorithmic approaches, such as hashing and random projections, to derive efficient estimators for cardinality, l_{alpha} distance and quasi-distance, and entropy over streaming data. I demonstrate an unexpected connection between two approaches to cardinality estimation that involve indirect record keeping: the first using pseudo-random variates and storing selected order statistics, and the second using random projections. I show that l_{alpha} distances and quasi-distances between data streams, and entropy, can be recovered from random projections that exploit properties of alpha-stable distributions with full statistical efficiency. This is achieved by the method of L-estimation in a single-pass algorithm with modest computational requirements. The proposed estimators have good small-sample performance, improved by the methods of trimming and winsorising; in other words, the value of these summary statistics can be approximated with high accuracy from data sketches of low dimension. Finally, I consider the problem of convergence assessment of Markov chain Monte Carlo methods for simulating from complex, high-dimensional, discrete distributions.
I argue that online, fast, and efficient computation of summary statistics such as cardinality, entropy, and l_{alpha} distances may be a useful qualitative tool for detecting lack of convergence, and illustrate this with simulations of the posterior distribution of a decomposable Gaussian graphical model via the Metropolis-Hastings algorithm.
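As a rough illustration of the first approach to cardinality estimation mentioned above (pseudo-random variates with stored order statistics), the following is a minimal sketch of the widely known k-minimum-values idea. It is not the estimator derived in the thesis. The function names `hash_to_unit` and `kmv_cardinality`, the choice of SHA-256 as the hash, and the default `k=256` are all assumptions made for illustration.

```python
import hashlib
import heapq


def hash_to_unit(item):
    """Map an item to a pseudo-random value in [0, 1] via SHA-256 (an assumed choice)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_cardinality(stream, k=256):
    """Estimate the number of distinct items, storing only the k smallest hash values."""
    heap = []     # max-heap (values negated) holding the k smallest hashes seen so far
    kept = set()  # the same hashes, so repeated items are ignored
    for item in stream:
        u = hash_to_unit(item)
        if u in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -u)
            kept.add(u)
        elif u < -heap[0]:
            # Replace the current largest stored hash with the smaller new one.
            evicted = -heapq.heappushpop(heap, -u)
            kept.discard(evicted)
            kept.add(u)
    if len(heap) < k:
        # Fewer than k distinct hashes seen: the count is exact, barring collisions.
        return float(len(heap))
    # If n distinct items hash uniformly, the k-th smallest hash is near k / n.
    return (k - 1) / -heap[0]
```

Because identical items always hash to the same value, repetitions and arrival order do not affect the sketch, and memory use depends on `k` rather than on the length of the stream. A call such as `kmv_cardinality(stream, k=256)` returns an approximate distinct count whose relative error shrinks roughly like one over the square root of `k`.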
Identifier | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:504352 |
Date | January 2009 |
Creators | Cosma, Ioana Ada |
Contributors | Clifford, Peter |
Publisher | University of Oxford |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | http://ora.ox.ac.uk/objects/uuid:09eafd84-8cb3-4e54-8daf-18db7832bcfc |