Return to search

Bloom maps for big data

The ability to retrieve a value given a key is fundamental in computer science. Unfortunately as the a priori set from which keys are drawn grows in size, any exact data structure must use more space per key. This motivates our interest in approximate data structures. We consider the problem of succinctly encoding a map to support queries with bounded error when the distribution over values is known. We give a lower bound on the space required per key in terms of the entropy of the distribution over values and the error rate and present a generalization of the Bloom filter, the Bloom map, that achieves the lower bound up to a small constant factor. We then develop static and on-line approximation schemes for frequency data that use constant space per key to store frequencies with bounded relative error when these follow a power law. Our on-line construction has constant expected update complexity per observation and requires only a single pass over a data set. Finally we present a simple framework for using a priori knowledge to reduce the error rate of an approximate data structure with one-sided error. We evaluate the data structures proposed here empirically and use them to construct randomized language models that significantly reduce the space requirements of a state-of-the-art statistical machine translation system.

Identiferoai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:726449
Date January 2010
CreatorsTalbot, David
PublisherUniversity of Edinburgh
Source SetsEthos UK
Detected LanguageEnglish
TypeElectronic Thesis or Dissertation
Sourcehttp://hdl.handle.net/1842/25235

Page generated in 0.0028 seconds