A dip in the reservoir: Maintaining sample synopses of evolving datasets

Perhaps the most flexible synopsis of a database is a random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. In this paper, we study methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions. For “stable” datasets whose size remains roughly constant over time, we provide a novel sampling scheme, called “random pairing” (RP), which maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the almost 40-year-old reservoir sampling algorithm to handle deletions. Experiments show that, when dataset-size fluctuations over time are not too extreme, RP is the algorithm of choice with respect to speed and sample-size stability. For “growing” datasets, we consider algorithms for periodically “resizing” a bounded-size random sample upwards. We prove that no such algorithm can avoid accessing the base data, and provide a novel resizing algorithm that minimizes the time needed to increase the sample size.
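
The abstract does not give the procedure itself, so the following Python sketch is only an illustration of how the pairing idea could be realized: deletions are counted, and each later insertion is "paired" with one outstanding deletion before ordinary reservoir sampling resumes. The class name `RandomPairingSample`, its attributes, and the two counters `c1` and `c2` are assumptions made for this sketch, as is the requirement that items are distinct and hashable; the paper should be consulted for the authoritative algorithm and its proof of uniformity.

```python
import random


class RandomPairingSample:
    """Illustrative sketch of a bounded-size sample under inserts and deletes.

    All names here are hypothetical; items are assumed distinct and hashable.
    """

    def __init__(self, max_size, rng=None):
        self.max_size = max_size   # upper bound on the sample size
        self.sample = set()        # items currently in the sample
        self.dataset_size = 0      # current number of items in the dataset
        self.c1 = 0                # uncompensated deletions that hit the sample
        self.c2 = 0                # uncompensated deletions that missed the sample
        self.rng = rng if rng is not None else random.Random()

    def insert(self, item):
        self.dataset_size += 1
        outstanding = self.c1 + self.c2
        if outstanding == 0:
            # No deletions left to compensate: ordinary reservoir step.
            if len(self.sample) < self.max_size:
                self.sample.add(item)
            elif self.rng.random() < self.max_size / self.dataset_size:
                victim = self.rng.choice(list(self.sample))
                self.sample.remove(victim)
                self.sample.add(item)
        elif self.rng.random() < self.c1 / outstanding:
            # Pair the new item with a deletion that had removed a sample item.
            self.sample.add(item)
            self.c1 -= 1
        else:
            # Pair the new item with a deletion that had not touched the sample.
            self.c2 -= 1

    def delete(self, item):
        self.dataset_size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1
```

In this sketch the sample never exceeds `max_size`, and it returns to full size once every outstanding deletion has been paired with an insertion, which is the sample-size stability the abstract refers to for "stable" datasets.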

Identifier: oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:79145
Date: 30 May 2022
Creators: Gemulla, Rainer; Lehner, Wolfgang; Haas, Peter J.
Publisher: ACM
Source Sets: Hochschulschriftenserver (HSSS) der SLUB Dresden
Language: English
Detected Language: English
Type: info:eu-repo/semantics/acceptedVersion, doc-type:conferenceObject, info:eu-repo/semantics/conferenceObject, doc-type:Text
Rights: info:eu-repo/semantics/openAccess
Relation: 1-59593-385-9
