Global ETD Search

1	Streaming Random Forests Abdulsalam, Hanady 16 July 2008 (has links) Recent research addresses the problem of data-stream mining to deal with applications that require processing huge amounts of data such as sensor data analysis and financial applications. Data-stream mining algorithms incorporate special provisions to meet the requirements of stream-management systems, that is stream algorithms must be online and incremental, processing each data record only once (or few times); adaptive to distribution changes; and fast enough to accommodate high arrival rates. We consider the problem of data-stream classification, introducing an online and incremental stream-classification ensemble algorithm, Streaming Random Forests, an extension of the Random Forests algorithm by Breiman, which is a standard classification algorithm. Our algorithm is designed to handle multi-class classification problems. It is able to deal with data streams having an evolving nature and a random arrival rate of training/test data records. The algorithm, in addition, automatically adjusts its parameters based on the data seen so far. Experimental results on real and synthetic data demonstrate that the algorithm gives a successful behavior. Without losing classification accuracy, our algorithm is able to handle multi-class problems for which the underlying class boundaries drift, and handle the case when blocks of training records are not big enough to build/update the classification model. / Thesis (Ph.D, Computing) -- Queen's University, 2008-07-15 16:12:33.221 Data mining Streams Classification algorithms Streaming algorithms
2	Agglomerative clustering for community detection in dynamic graphs Godbole, Pushkar J. 27 May 2016 (has links) Agglomerative Clustering techniques work by recursively merging graph vertices into communities, to maximize a clustering quality metric. The metric of Modularity coined by Newman and Girvan, measures the cluster quality based on the premise that, a cluster has collections of vertices more strongly connected internally than would occur from random chance. Various fast and efficient algorithms for community detection based on modularity maximization have been developed for static graphs. However, since many (contemporary) networks are not static but rather evolve over time, the static approaches are rendered inappropriate for clustering of dynamic graphs. Modularity optimization in changing graphs is a relatively new field that entails the need to develop efficient algorithms for detection and maintenance of a community structure while minimizing the “Size of change” and computational effort. The objective of this work was to develop an efficient dynamic agglomerative clustering algorithm that attempts to maximize modularity while minimizing the “size of change” in the transitioning community structure. First we briefly discuss the previous memoryless dynamic reagglomeration approach with localized vertex freeing and illustrate its performance and limitations. Then we describe the new backtracking algorithm followed by its performance results and observations. In experimental analysis of both typical and pathological cases, we evaluate and justify various backtracking and agglomeration strategies in context of the graph structure and incoming stream topologies. Evaluation of the algorithm on social network datasets, including Facebook (SNAP) and PGP Giant Component networks shows significantly improved performance over its conventional static counterpart in terms of execution time, Modularity and Size of Change. Dynamic graphs Agglomeration Clustering Modularity Social networks Community detection Stinger Streaming algorithms
3	Computations on Massive Data Sets : Streaming Algorithms and Two-party Communication / Calculs sur des grosses données : algorithmes de streaming et communication entre deux joueurs Konrad, Christian 05 July 2013 (has links) Dans cette thèse on considère deux modèles de calcul qui abordent des problèmes qui se posent lors du traitement des grosses données. Le premier modèle est le modèle de streaming. Lors du traitement des grosses données, un accès aux données de façon aléatoire est trop couteux. Les algorithmes de streaming ont un accès restreint aux données: ils lisent les données de façon séquentielle (par passage) une fois ou peu de fois. De plus, les algorithmes de streaming utilisent une mémoire d'accès aléatoire de taille sous-linéaire dans la taille des données. Le deuxième modèle est le modèle de communication. Lors du traitement des données par plusieurs entités de calcul situées à des endroits différents, l'échange des messages pour la synchronisation de leurs calculs est souvent un goulet d'étranglement. Il est donc préférable de minimiser la quantité de communication. Un modèle particulier est la communication à sens unique entre deux participants. Dans ce modèle, deux participants calculent un résultat en fonction des données qui sont partagées entre eux et la communication se réduit à un seul message. On étudie les problèmes suivants: 1) Les couplages dans le modèle de streaming. L'entrée du problème est un flux d'arêtes d'un graphe G=(V,E) avec n=\|V\|. On recherche un algorithme de streaming qui calcule un couplage de grande taille en utilisant une mémoire de taille O(n polylog n). L'algorithme glouton remplit ces contraintes et calcule un couplage de taille au moins 1/2 fois la taille d'un couplage maximum. Une question ouverte depuis longtemps demande si l'algorithme glouton est optimal si aucune hypothèse sur l'ordre des arêtes dans le flux est faite. Nous montrons qu'il y a un meilleur algorithme que l'algorithme glouton si les arêtes du graphe sont dans un ordre uniformément aléatoire. De plus, nous montrons qu'avec deux passages on peut calculer un couplage de taille strictement supérieur à 1/2 fois la taille d'un couplage maximum sans contraintes sur l'ordre des arêtes. 2) Les semi-couplages en streaming et en communication. Un semi-couplage dans un graphe biparti G=(A,B,E) est un sous-ensemble d'arêtes qui couple tous les sommets de type A exactement une fois aux sommets de type B de façon pas forcement injective. L'objectif est de minimiser le nombre de sommets de type A qui sont couplés aux même sommets de type B. Pour ce problème, nous montrons un algorithme qui, pour tout 0<=ε<=1, calcule une O(n^((1-ε)/2))-approximation en utilisant une mémoire de taille Ô(n^(1+ε)). De plus, nous montrons des bornes supérieures et des bornes inférieurs pour la complexité de communication entre deux participants pour ce problème et des nouveaux résultats concernant la structure des semi-couplages. 3) Validité des fichiers XML dans le modèle de streaming. Un fichier XML de taille n est une séquence de balises ouvrantes et fermantes. Une DTD est un ensemble de contraintes de validité locales d'un fichier XML. Nous étudions des algorithmes de streaming pour tester si un fichier XML satisfait les contraintes décrites dans une DTD. Notre résultat principal est un algorithme de streaming qui fait O(log n) passages, utilise 3 flux auxiliaires et une mémoire de taille O(log^2 n). De plus, pour le problème de validation des fichiers XML qui décrivent des arbres binaires, nous présentons des algorithmes en un passage et deux passages qui une mémoire de taille sous-linéaire. 4) Correction d'erreur pour la distance du cantonnier. Alice et Bob ont des ensembles de n points sur une grille en d dimensions. Alice envoit un échantillon de petite taille à Bob qui, après réception, déplace ses points pour que la distance du cantonnier entre les points d'Alice et les points de Bob diminue. Pour tout k>0 nous montrons qu'il y a un protocole presque optimal de communication avec coût de communication Ô(kd) tel que les déplacements des points effectués par Bob aboutissent à un facteur d'approximation de O(d) par rapport aux meilleurs déplacements de d points. / In this PhD thesis, we consider two computational models that address problems that arise when processing massive data sets. The first model is the Data Streaming Model. When processing massive data sets, random access to the input data is very costly. Therefore, streaming algorithms only have restricted access to the input data: They sequentially scan the input data once or only a few times. In addition, streaming algorithms use a random access memory of sublinear size in the length of the input. Sequential input access and sublinear memory are drastic limitations when designing algorithms. The major goal of this PhD thesis is to explore the limitations and the strengths of the streaming model. The second model is the Communication Model. When data is processed by multiple computational units at different locations, then the message exchange of the participating parties for synchronizing their calculations is often a bottleneck. The amount of communication should hence be as little as possible. A particular setting is the one-way two-party communication setting. Here, two parties collectively compute a function of the input data that is split among the two parties, and the whole message exchange reduces to a single message from one party to the other one. We study the following four problems in the context of streaming algorithms and one-way two-party communication: (1) Matchings in the Streaming Model. We are given a stream of edges of a graph G=(V,E) with n=\|V\|, and the goal is to design a streaming algorithm that computes a matching using a random access memory of size O(n polylog n). The Greedy matching algorithm fits into this setting and computes a matching of size at least 1/2 times the size of a maximum matching. A long standing open question is whether the Greedy algorithm is optimal if no assumption about the order of the input stream is made. We show that it is possible to improve on the Greedy algorithm if the input stream is in uniform random order. Furthermore, we show that with two passes an approximation ratio strictly larger than 1/2 can be obtained if no assumption on the order of the input stream is made. (2) Semi-matchings in Streaming and in Two-party Communication. A semi-matching in a bipartite graph G=(A,B,E) is a subset of edges that matches all A vertices exactly once to B vertices, not necessarily in an injective way. The goal is to minimize the maximal number of A vertices that are matched to the same B vertex. We show that for any 0<=ε<=1, there is a one-pass streaming algorithm that computes an O(n^((1-ε)/2))-approximation using Ô(n^(1+ε)) space. Furthermore, we provide upper and lower bounds on the two-party communication complexity of this problem, as well as new results on the structure of semi-matchings. (3) Validity of XML Documents in the Streaming Model. An XML document of length n is a sequence of opening and closing tags. A DTD is a set of local validity constraints of an XML document. We study streaming algorithms for checking whether an XML document fulfills the validity constraints of a given DTD. Our main result is an O(log n)-pass streaming algorithm with 3 auxiliary streams and O(log^2 n) space for this problem. Furthermore, we present one-pass and two-pass sublinear space streaming algorithms for checking validity of XML documents that encode binary trees. (4) Budget-Error-Correcting under Earth-Mover-Distance. We study the following one-way two-party communication problem. Alice and Bob have sets of n points on a d-dimensional grid [Δ]^d for an integer Δ. Alice sends a small sketch of her points to Bob and Bob adjusts his point set towards Alice's point set so that the Earth-Mover-Distance of Bob's points and Alice's points decreases. For any k>0, we show that there is an almost tight randomized protocol with communication cost Ô(kd) such that Bob's adjustments lead to an O(d)-approximation compared to the k best possible adjustments that Bob could make. Algorithmes de streaming Complexité de la communication Mémoire sous-linéaire Couplage Semi-couplage Xml Validité Distance du cantonnier Streaming algorithms Communication complexity Sublinear space Matching Semi-matching Xml Validity Earth-mover-distance
4	Algorithms for large graphs Das Sarma, Atish 01 July 2010 (has links) No description available. Random walks Distances Distributed computing Graphs PageRank Distributed algorithms Online algorithms Streaming algorithms Algorithms Parallel algorithms Graph algorithms Computer algorithms
5	Computations on Massive Data Sets : Streaming Algorithms and Two-party Communication Konrad, Christian 05 July 2013 (has links) (PDF) In this PhD thesis, we consider two computational models that address problems that arise when processing massive data sets. The first model is the Data Streaming Model. When processing massive data sets, random access to the input data is very costly. Therefore, streaming algorithms only have restricted access to the input data: They sequentially scan the input data once or only a few times. In addition, streaming algorithms use a random access memory of sublinear size in the length of the input. Sequential input access and sublinear memory are drastic limitations when designing algorithms. The major goal of this PhD thesis is to explore the limitations and the strengths of the streaming model. The second model is the Communication Model. When data is processed by multiple computational units at different locations, then the message exchange of the participating parties for synchronizing their calculations is often a bottleneck. The amount of communication should hence be as little as possible. A particular setting is the one-way two-party communication setting. Here, two parties collectively compute a function of the input data that is split among the two parties, and the whole message exchange reduces to a single message from one party to the other one. We study the following four problems in the context of streaming algorithms and one-way two-party communication: (1) Matchings in the Streaming Model. We are given a stream of edges of a graph G=(V,E) with n=\|V\|, and the goal is to design a streaming algorithm that computes a matching using a random access memory of size O(n polylog n). The Greedy matching algorithm fits into this setting and computes a matching of size at least 1/2 times the size of a maximum matching. A long standing open question is whether the Greedy algorithm is optimal if no assumption about the order of the input stream is made. We show that it is possible to improve on the Greedy algorithm if the input stream is in uniform random order. Furthermore, we show that with two passes an approximation ratio strictly larger than 1/2 can be obtained if no assumption on the order of the input stream is made. (2) Semi-matchings in Streaming and in Two-party Communication. A semi-matching in a bipartite graph G=(A,B,E) is a subset of edges that matches all A vertices exactly once to B vertices, not necessarily in an injective way. The goal is to minimize the maximal number of A vertices that are matched to the same B vertex. We show that for any 0<=ε<=1, there is a one-pass streaming algorithm that computes an O(n^((1-ε)/2))-approximation using Ô(n^(1+ε)) space. Furthermore, we provide upper and lower bounds on the two-party communication complexity of this problem, as well as new results on the structure of semi-matchings. (3) Validity of XML Documents in the Streaming Model. An XML document of length n is a sequence of opening and closing tags. A DTD is a set of local validity constraints of an XML document. We study streaming algorithms for checking whether an XML document fulfills the validity constraints of a given DTD. Our main result is an O(log n)-pass streaming algorithm with 3 auxiliary streams and O(log^2 n) space for this problem. Furthermore, we present one-pass and two-pass sublinear space streaming algorithms for checking validity of XML documents that encode binary trees. (4) Budget-Error-Correcting under Earth-Mover-Distance. We study the following one-way two-party communication problem. Alice and Bob have sets of n points on a d-dimensional grid [Δ]^d for an integer Δ. Alice sends a small sketch of her points to Bob and Bob adjusts his point set towards Alice's point set so that the Earth-Mover-Distance of Bob's points and Alice's points decreases. For any k>0, we show that there is an almost tight randomized protocol with communication cost Ô(kd) such that Bob's adjustments lead to an O(d)-approximation compared to the k best possible adjustments that Bob could make. [INFO:INFO_OH] Computer Science/Other [INFO:INFO_OH] Informatique/Autre Streaming algorithms Communication complexity Sublinear space Matching Semi-matching Xml Validity Earth-mover-distance
6	Learning with Complex Performance Measures : Theory, Algorithms and Applications Narasimhan, Harikrishna January 2016 (has links) (PDF) We consider supervised learning problems, where one is given objects with labels, and the goal is to learn a model that can make accurate predictions on new objects. These problems abound in applications, ranging from medical diagnosis to information retrieval to computer vision. Examples include binary or multiclass classi cation, where the goal is to learn a model that can classify objects into two or more categories (e.g. categorizing emails into spam or non-spam); bipartite ranking, where the goal is to learn a model that can rank relevant objects above the irrelevant ones (e.g. ranking documents by relevance to a query); class probability estimation (CPE), where the goal is to predict the probability of an object belonging to different categories (e.g. probability of an internet ad being clicked by a user). In each case, the accuracy of a model is evaluated in terms of a specified `performance measure'. While there has been much work on designing and analyzing algorithms for different supervised learning tasks, we have complete understanding only for settings where the performance measure of interest is the standard 0-1 or a loss-based classification measure. These performance measures have a simple additive structure, and can be expressed as an expectation of errors on individual examples. However, in many real-world applications, the performance measure used to evaluate a model is often more complex, and does not decompose into a sum or expectation of point-wise errors. These include the binary or multiclass G-mean used in class-imbalanced classification problems; the F1-measure and its multiclass variants popular in text retrieval; and the (partial) area under the ROC curve (AUC) and precision@ employed in ranking applications. How does one design efficient learning algorithms for such complex performance measures, and can these algorithms be shown to be statistically consistent, i.e. shown to converge in the limit of infinite data to the optimal model for the given measure? How does one develop efficient learning algorithms for complex measures in online/streaming settings where the training examples need to be processed one at a time? These are questions that we seek to address in this thesis. Firstly, we consider the bipartite ranking problem with the AUC and partial AUC performance measures. We start by understanding how bipartite ranking with AUC is related to the standard 0-1 binary classification and CPE tasks. It is known that a good binary CPE model can be used to obtain both a good binary classification model and a good bipartite ranking model (formally, in terms of regret transfer bounds), and that a binary classification model does not necessarily yield a CPE model. However, not much is known about other directions. We show that in a weaker sense (where the mapping needed to transform a model from one problem to another depends on the underlying probability distribution), a good bipartite ranking model for AUC can indeed be used to construct a good binary classification model, and also a good binary CPE model. Next, motivated by the increasing number of applications (e.g. biometrics, medical diagnosis, etc.), where performance is measured, not in terms of the full AUC, but in terms of the partial AUC between two false positive rates (FPRs), we design batch algorithms for optimizing partial AUC in any given FPR range. Our algorithms optimize structural support vector machine based surrogates, which unlike for the full AUC; do not admit a straightforward decomposition into simpler terms. We develop polynomial time cutting plane solvers for solving the optimization, and provide experiments to demonstrate the efficacy of our methods. We also present an application of our approach to predicting chemotherapy outcomes for cancer patients, with the aim of improving treatment decisions. Secondly, we develop algorithms for optimizing (surrogates for) complex performance mea-sures in the presence of streaming data. A well-known method for solving this problem for standard point-wise surrogates such as the hinge surrogate, is the stochastic gradient descent (SGD) method, which performs point-wise updates using unbiased gradient estimates. How-ever, this method cannot be applied to complex objectives, as here one can no longer obtain unbiased gradient estimates from a single point. We develop a general stochastic method for optimizing complex measures that avoids point-wise updates, and instead performs gradient updates on mini-batches of incoming points. The method is shown to provably converge for any performance measure that satis es a uniform convergence requirement, such as the partial AUC, precision@ and F1-measure, and in experiments, is often several orders of magnitude faster than the state-of-the-art batch methods, while achieving similar or better accuracies. Moreover, for specific complex binary classification measures, which are concave functions of the true positive rate (TPR) and true negative rate (TNR), we are able to develop stochastic (primal-dual) methods that can indeed be implemented with point-wise updates, by using an adaptive linearization scheme. These methods admit convergence rates that match the rate of the SGD method, and are again several times faster than the state-of-the-art methods. Finally, we look at the design of consistent algorithms for complex binary and multiclass measures. For binary measures, we consider the practically popular plug-in algorithm that constructs a classifier by applying an empirical threshold to a suitable class probability estimate, and provide a general methodology for proving consistency of these methods. We apply this technique to show consistency for the F1-measure, and under a continuity assumption on the distribution, for any performance measure that is monotonic in the TPR and TNR. For the case of multiclass measures, a simple plug-in method is no longer tractable, as in the place of a single threshold parameter, one needs to tune at least as many parameters as the number of classes. Using an optimization viewpoint, we provide a framework for designing learning algorithms for multiclass measures that are general functions of the confusion matrix, and as an instantiation, provide an e cient and provably consistent algorithm based on the bisection method for multiclass measures that are ratio-of-linear functions of the confusion matrix (e.g. micro F1). The algorithm outperforms the state-of-the-art SVMPerf method in terms of both accuracy and running time. Overall, in this thesis, we have looked at various aspects of complex performance measures used in supervised learning problems, leading to several new algorithms that are often significantly better than the state-of-the-art, to improved theoretical understanding of the performance measures studied, and to novel real-life applications of the algorithms developed. Multiclass Classification Binary Classification Streaming Algorithms Complex Performance Measures Learning Algorithms Bipartite Ranking Complex Multiclass Measures AUC Complex Binary Measures Computer Science
7	Submodular Order Maximization Subject to a p-Matchoid Constraint / Submodulär ordermaximering som är föremål för ett p-matchoid-begränsningsvillkor Wu, Yizhan January 2022 (has links) Recently, Udwani defined a new class of set functions under monotonicity and subadditivity, called submodular order functions, which is a subfamily of submodular functions. Informally, the submodular order function admits a very limited form of submodularity which is defined over a specific permutation of the ground set. His work pointed out the intriguing connection between streaming submodular maximization and submodular order maximization. Inspired by a 0.25-approximation streaming algorithm for maximizing a monotone submodular function subject to a matroid constraint, Udwani gave a 0.25-approximation algorithm for submodular order functions maximization subject to a matroid constraint. Based on the above results, we would like to explore further in which cases it is feasible to generalize from streaming submodular maximization algorithms to submodular order maximization algorithms. As a more general constraint than matroid, p-matchoid is a collection of p matroids with each matroid defined on some subsets of the ground set. Related work gave a 1/4p-approximation streaming algorithm for monotone submodular functions maximization under a p-matchoid constraint. Inspired by the above algorithms and the intriguing connection, we used some techniques to try to generalize several streaming algorithms for submodular functions to the offline algorithms for submodular order functions, including interleaved partitions and incremental values. Assuming that the objective function f is subadditive and non-negative, we gave a 1/4p-approximation algorithm for monotone submodular order maximization to a p-matchoid constraint. In addition, we summarize the failures of other cases. / Nyligen definierade Udwani en ny klass av mängdfunktioner under monotonicitet och subadditivitet, som kallas submodulära ordningsfunktioner och som är en underfamilj av submodulära funktioner. Informellt sett medger den submodulära ordningsfunktionen en mycket begränsad form av submodularitet som är definierad över en specifik permutation av grundmängden. Hans arbete pekade på det spännande sambandet mellan strömmande submodulär maximering och submodulär ordermaximering. Inspirerad av en strömningsalgoritm med 0.25-approximation för maximering av en monoton submodulär funktion som är föremål för en matroidbegränsning, gav Udwani en algoritm med 0.25-approximation för maximering av submodulära ordningsfunktioner som är föremål för en matroidbegränsning. Baserat på ovanstående resultat skulle vi vilja utforska ytterligare i vilka fall det är möjligt att generalisera från algoritmer för strömning av submodulära maximeringsfunktioner till algoritmer för maximering av submodulära orderfunktioner. Som en mer allmän begränsning än matroid är p-matchoid en samling av p matroider där varje matroid definieras på vissa delmängder av grundmängden. Relaterade arbeten gav en strömmingsalgoritm med 1/4p-tillnärmning för monoton submodulär funktionsmaximering under en p-matchoid-begränsning. Inspirerade av ovanstående algoritmer och det spännande sambandet använde vi vissa tekniker för att försöka generalisera flera strömningsalgoritmer för submodulära funktioner till offline-algoritmer för submodulära ordningsfunktioner, inklusive interleaved partitions och inkrementella värden. Under förutsättning att målfunktionen f är subadditiv och icke-negativ gav vi en algoritm för 1/4p-tillnärmning för monoton submodulär ordermaximering till ett p-matchoid-begränsningsvillkor. Dessutom sammanfattar vi misslyckanden i andra fall. Theoretical computer science approximation algorithms combinatorial optimization matchoid submodular order maximization submodular maximization streaming algorithms Teoretisk datavetenskap approximationsalgoritmer kombinatorisk optimering matchoid submodulär ordermaximering submodulär maximering strömningsalgoritmer Computer and Information Sciences Data- och informationsvetenskap

1

Page generated in 0.0791 seconds