Return to search

Deferred Maintenance of Disk-Based Random Samples

Random sampling is a well-known technique for approximate processing of large datasets. We introduce a set of algorithms for incremental maintenance of large random samples on secondary storage. We show that the sample maintenance cost can be reduced by refreshing the sample in a deferred manner. We introduce a novel type of log file which follows the intuition that only a “sample” of the operations on the base data has to be considered to maintain a random sample in a statistically correct way. Additionally, we develop a deferred refresh algorithm which updates the sample by using fast sequential disk access only, and which does not require any main memory. We conducted an extensive set of experiments and found, that our algorithms reduce maintenance cost by several orders of magnitude.

Identiferoai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:82216
Date12 January 2023
CreatorsGemulla, Rainer, Lehner, Wolfgang
PublisherSpringer
Source SetsHochschulschriftenserver (HSSS) der SLUB Dresden
LanguageEnglish
Detected LanguageEnglish
Typeinfo:eu-repo/semantics/acceptedVersion, doc-type:conferenceObject, info:eu-repo/semantics/conferenceObject, doc-type:Text
Rightsinfo:eu-repo/semantics/openAccess
Relation978-3-540-32960-2, 978-3-540-32961-9, 10.1007/11687238_27

Page generated in 0.0015 seconds