The ability to retrieve a value given a key is fundamental in computer science. Unfortunately as the a priori set from which keys are drawn grows in size, any exact data structure must use more space per key. This motivates our interest in approximate data structures. We consider the problem of succinctly encoding a map to support queries with bounded error when the distribution over values is known. We give a lower bound on the space required per key in terms of the entropy of the distribution over values and the error rate and present a generalization of the Bloom filter, the Bloom map, that achieves the lower bound up to a small constant factor. We then develop static and on-line approximation schemes for frequency data that use constant space per key to store frequencies with bounded relative error when these follow a power law. Our on-line construction has constant expected update complexity per observation and requires only a single pass over a data set. Finally we present a simple framework for using a priori knowledge to reduce the error rate of an approximate data structure with one-sided error. We evaluate the data structures proposed here empirically and use them to construct randomized language models that significantly reduce the space requirements of a state-of-the-art statistical machine translation system.
Identifer | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:726449 |
Date | January 2010 |
Creators | Talbot, David |
Publisher | University of Edinburgh |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | http://hdl.handle.net/1842/25235 |
Page generated in 0.0028 seconds