Global ETD Search

11	Sampling Algorithms for Evolving Datasets
Gemulla, Rainer, 20 October 2008
Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up the processing of analytic queries and data-mining tasks, to enhance query optimization, and to facilitate information integration. Most of the existing work on database sampling focuses on how to create or exploit a random sample of a static database, that is, a database that does not change over time. The assumption of a static database, however, severely limits the applicability of these techniques in practice, where data is often not static but continuously evolving. In order to maintain the statistical validity of the sample, any changes to the database have to be appropriately reflected in the sample. In this thesis, we study efficient methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions, updates, and deletions. We consider instances of the maintenance problem that arise when sampling from an evolving set, from an evolving multiset, from the distinct items in an evolving multiset, or from a sliding window over a data stream. Our algorithms completely avoid any accesses to the base data and can be several orders of magnitude faster than algorithms that do rely on such expensive accesses. The improved efficiency of our algorithms comes at virtually no cost: the resulting samples are provably uniform and only a small amount of auxiliary information is associated with the sample. We show that the auxiliary information not only facilitates efficient maintenance, but it can also be exploited to derive unbiased, low-variance estimators for counts, sums, averages, and the number of distinct items in the underlying dataset. In addition to sample maintenance, we discuss methods that greatly improve the flexibility of random sampling from a system's point of view.
More specifically, we initiate the study of algorithms that resize a random sample upwards or downwards. Our resizing algorithms can be exploited to dynamically control the size of the sample when the dataset grows or shrinks; they facilitate resource management and help to avoid under- or oversized samples. Furthermore, in large-scale databases with data being distributed across several remote locations, it is usually infeasible to reconstruct the entire dataset for the purpose of sampling. To address this problem, we provide efficient algorithms that directly combine the local samples maintained at each location into a sample of the global dataset. We also consider a more general problem, where the global dataset is defined as an arbitrary set or multiset expression involving the local datasets, and provide efficient solutions based on hashing.
info:eu-repo/classification/ddc/004 ddc:004
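As a point of reference for the maintenance problem described in this abstract, the following is a minimal sketch of classic reservoir sampling (Algorithm R), which keeps a uniform sample of fixed size under insertions only and never revisits the base data. It is not one of the thesis's algorithms, which also handle updates and deletions. The function names `reservoir_sample` and `estimate_sum` are hypothetical, and the sum estimator is the standard scale-up estimator for a uniform sample, assuming the dataset size is known.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of size k from an insertion-only stream.

    Classic reservoir sampling (Algorithm R); shown as a baseline only.
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir.
            sample.append(item)
        else:
            # Item n enters the sample with probability k / n,
            # replacing a uniformly chosen current member.
            j = rng.randrange(n)
            if j < k:
                sample[j] = item
    return sample


def estimate_sum(sample, population_size):
    """Scale-up estimate of the dataset sum from a uniform sample."""
    return population_size * sum(sample) / len(sample)


if __name__ == "__main__":
    data = range(1, 10001)
    s = reservoir_sample(data, 100, random.Random(42))
    print(len(s), estimate_sum(s, 10000))
```

Deletions are what make the general problem harder: removing a sampled item shrinks the reservoir, and refilling it without touching the base data requires additional bookkeeping of the kind the abstract refers to as auxiliary information.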
12	Shift gray codes
Williams, Aaron Michael, 11 December 2009
Combinatorial objects can be represented by strings, such as 21534 for the permutation (1 2) (3 5 4), or 110100 for the binary tree corresponding to the balanced parentheses (()()). Given a string s = s1 s2 .. sn, the right-shift operation shift(s, i, j) replaces the substring si si+1..sj by si+1..sj si. In other words, si is right-shifted into position j by applying the permutation (j j−1 .. i) to the indices of s. Right-shifts include prefix-shifts (i = 1) and adjacent-transpositions (j = i+1). A fixed-content language is a set of strings that contain the same multiset of symbols. Given a fixed-content language, a shift Gray code is a list of its strings where consecutive strings differ by a shift. This thesis asks if shift Gray codes exist for a variety of combinatorial objects. This abstract question leads to a number of practical answers. The first prefix-shift Gray code for multiset permutations is discovered, and it provides the first algorithm for generating multiset permutations in O(1)-time while using O(1) additional variables. Applications of these results include more efficient exhaustive solutions to stacker-crane problems, which are natural NP-complete traveling salesman variants. This thesis also produces the fastest algorithm for generating balanced parentheses in an array, and the first minimal-change order for fixed-content necklaces and Lyndon words. These results are consequences of the following theorem: Every bubble language has a right-shift Gray code. Bubble languages are fixed-content languages that are closed under certain adjacent-transpositions. These languages generalize classic combinatorial objects: k-ary trees, ordered trees with fixed branching sequences, unit interval graphs, restricted Schröder and Motzkin paths, linear-extensions of B-posets, and their unions, intersections, and quotients.
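The right-shift operation defined in this abstract can be sketched in a few lines. This is a minimal illustration assuming strings are ordinary Python sequences and that i and j are 1-based with i <= j, as in the abstract; the function name `shift` mirrors the abstract's notation, and the error handling is an assumption.

```python
def shift(s, i, j):
    """Right-shift: move the symbol at position i into position j.

    Positions are 1-based with 1 <= i <= j <= len(s). The substring
    s_i s_{i+1} .. s_j is replaced by s_{i+1} .. s_j s_i.
    """
    if not 1 <= i <= j <= len(s):
        raise ValueError("require 1 <= i <= j <= len(s)")
    # Slices keep this working for both str and list inputs.
    return s[:i - 1] + s[i:j] + s[i - 1:i] + s[j:]


if __name__ == "__main__":
    print(shift("21534", 1, 3))  # prefix-shift: prints 15234
    print(shift("21534", 2, 3))  # adjacent-transposition: prints 25134
```

A shift only reorders symbols, so the result always has the same multiset of symbols as the input, which is why shifts are a natural step between strings of a fixed-content language.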
Each Gray code is circular and is obtained from a new variation of lexicographic order known as cool-lex order. Gray codes using only shift(s, 1, n) and shift(s, 1, n−1) are also found for multiset permutations. A universal cycle that omits the last (redundant) symbol from each permutation is obtained by recording the first symbol of each permutation in this Gray code. As a special case, these shorthand universal cycles provide a new fixed-density analogue to de Bruijn cycles, and the first universal cycle for the "middle levels" (binary strings of length 2k + 1 with sum k or k + 1).
Keywords: shorthand universal cycles; combinatorial generation; minimal-change order; loopless algorithm; efficient algorithm; combinations; multiset permutations; balanced parentheses; Dyck words; Catalan paths; Schröder paths; Motzkin words; linear-extensions; posets; connected unit interval graphs; inversions; binary trees; k-ary trees; Lyndon words; pre-necklaces; theoretical computer science; discrete mathematics; combinatorics; brute force; de Bruijn cycles; bubble languages; cool-lex order; lexicographic order; combinatorial enumeration; stacker-crane problem; traveling salesman problem; middle levels; fixed-density de Bruijn cycle; fixed-content
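To give a feel for cool-lex order, here is a small sketch for the simplest fixed-content case, binary strings with a fixed number of ones. It uses the successor rule published by Ruskey and Williams for combinations (rotate right by one the shortest prefix ending in 010 or 011, or the whole string if there is none); that rule is not stated in the abstract, so treating it as representative of the thesis's construction is an assumption. The function names are hypothetical.

```python
def coollex_successor(s):
    """Next binary string in cool-lex order (Ruskey-Williams rule).

    Rotates right by one position the shortest prefix that ends in
    010 or 011; if no such prefix exists, rotates the whole string.
    """
    if not s:
        return s
    for end in range(3, len(s) + 1):
        if s[end - 3:end] in ("010", "011"):
            break
    else:
        end = len(s)
    # Move the last symbol of the prefix to the front.
    return s[end - 1] + s[:end - 1] + s[end:]


def coollex_combinations(ones, zeros):
    """List all strings with the given content, starting from 1..10..0."""
    start = "1" * ones + "0" * zeros
    out = [start]
    cur = coollex_successor(start)
    while cur != start:  # the order is circular
        out.append(cur)
        cur = coollex_successor(cur)
    return out


if __name__ == "__main__":
    for word in coollex_combinations(3, 2):
        print(word)
```

Each step moves the last symbol of a prefix to the front, which is the inverse of the prefix-shift shift(s, 1, j) defined in the abstract, so reading the list backwards gives consecutive strings that differ by a prefix-shift. The loop terminating back at the start string reflects the circularity mentioned in the abstract.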
