A dip in the reservoir: Maintaining sample synopses of evolving datasets
Gemulla, Rainer; Lehner, Wolfgang; Haas, Peter J.
30 May 2022
Perhaps the most flexible synopsis of a database is a random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. In this paper, we study methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions. For “stable” datasets whose size remains roughly constant over time, we provide a novel sampling scheme, called “random pairing” (RP), which maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the almost 40-year-old reservoir sampling algorithm to handle deletions. Experiments show that, when dataset-size fluctuations over time are not too extreme, RP is the algorithm of choice with respect to speed and sample-size stability. For “growing” datasets, we consider algorithms for periodically “resizing” a bounded-size random sample upwards. We prove that no such algorithm can avoid accessing the base data, and provide a novel resizing algorithm that minimizes the time needed to increase the sample size.
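
To make the idea of compensating deletions with later insertions concrete, the following is a minimal Python sketch of a random-pairing-style sampler. The abstract does not spell out the bookkeeping, so the two counters (`c1` for uncompensated deletions that hit the sample, `c2` for those that did not), the class name `RandomPairingSample`, and its method names are assumptions made for illustration, not the authors' code. The sketch also assumes items are distinct and that `delete` is only called for items currently in the dataset.

```python
import random


class RandomPairingSample:
    """Bounded-size sample maintained under insertions and deletions (sketch)."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity          # upper bound on the sample size
        self.rng = rng or random.Random()
        self.sample = []                  # current sample items
        self.dataset_size = 0             # number of items in the base dataset
        self.c1 = 0                       # uncompensated deletions that were in the sample
        self.c2 = 0                       # uncompensated deletions that were not

    def insert(self, item):
        self.dataset_size += 1
        pending = self.c1 + self.c2
        if pending == 0:
            # No deletions to compensate: ordinary reservoir-sampling step.
            if len(self.sample) < self.capacity:
                self.sample.append(item)
            elif self.rng.random() < self.capacity / self.dataset_size:
                victim = self.rng.randrange(len(self.sample))
                self.sample[victim] = item
        elif self.rng.random() < self.c1 / pending:
            # Pair the new item with an earlier deletion from the sample.
            self.sample.append(item)
            self.c1 -= 1
        else:
            # Pair the new item with an earlier deletion outside the sample.
            self.c2 -= 1

    def delete(self, item):
        self.dataset_size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1
```

In this sketch the sample shrinks only while deletions are still uncompensated and returns to its earlier size once enough insertions have arrived, which matches the "stable dataset" setting the abstract describes. The membership test on a list is linear in the sample size; a set or hash index would be the natural replacement in anything beyond a demonstration.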