Global ETD Search

1	Big Data Algorithms for Visualization and Supervised Learning Djuric, Nemanja January 2013 (has links) Explosive growth in data size, data complexity, and data rates, triggered by emergence of high-throughput technologies such as remote sensing, crowd-sourcing, social networks, or computational advertising, in recent years has led to an increasing availability of data sets of unprecedented scales, with billions of high-dimensional data examples stored on hundreds of terabytes of memory. In order to make use of this large-scale data and extract useful knowledge, researchers in machine learning and data mining communities are faced with numerous challenges, since the data mining and machine learning tools designed for standard desktop computers are not capable of addressing these problems due to memory and time constraints. As a result, there exists an evident need for development of novel, scalable algorithms for big data. In this thesis we address these important problems, and propose both supervised and unsupervised tools for handling large-scale data. First, we consider unsupervised approach to big data analysis, and explore scalable, efficient visualization method that allows fast knowledge extraction. Next, we consider supervised learning setting and propose algorithms for fast training of accurate classification models on large data sets, capable of learning state-of-the-art classifiers on data sets with millions of examples and features within minutes. Data visualization have been used for hundreds of years in scientific research, as it allows humans to easily get a better insight into complex data they are studying. Despite its long history, there is a clear need for further development of visualization methods when working with large-scale, high-dimensional data, where commonly used visualization tools are either too simplistic to gain a deeper insight into the data properties, or are too cumbersome or computationally costly. We present a novel method for data ordering and visualization. By combining efficient clustering using k-means algorithm and near-optimal ordering of found clusters using state-of-the-art TSP-solver, we obtain efficient algorithm that achieves performance better than existing, computationally intensive methods. In addition, we present visualization method for smaller-scale problems based on object matching. The experiments show that the methods allow for fast detection of hidden patterns, even by users without expertise in the areas of data mining and machine learning. Supervised learning is another important task, often intractable in many modern applications due to time and memory constraints, considering prohibitively large scales of the data sets. To address this issue, we first consider Multi-hyperplane Machine (MM) classification model, and propose online Adaptive MM algorithm which represents a trade-off between linear and kernel Support Vector Machines (SVMs), as it trains MMs in linear time on limited memory while achieving competitive accuracies on large-scale non-linear problems. Moreover, we present a C++ toolbox for developing scalable classification models, which provides an Application Programming Interface (API) for training of large-scale classifiers, as well as highly-optimized implementations of several state-of-the-art SVM approximators. Lastly, we consider parallelization and distributed learning approaches to large-scale supervised learning, and propose AROW-MapReduce, a distributed learning algorithm for confidence-weighted models using MapReduce framework. Experimental evaluation of the proposed methods shows state-of-the-art performance on a number of synthetic and real-world data sets, further paving a way for efficient and effective knowledge extraction from big data problems. / Computer and Information Science Computer Science Big Data Data Mining Data Visualization Large-scale Learning Machine Learning
2	Budgeted Online Kernel Classifiers for Large Scale Learning Wang, Zhuang January 2010 (has links) In the environment where new large scale problems are emerging in various disciplines and pervasive computing applications are becoming more common, there is an urgent need for machine learning algorithms that could process increasing amounts of data using comparatively smaller computing resources in a computational efficient way. Previous research has resulted in many successful learning algorithms that scale linearly or even sub-linearly with sample size and dimension, both in runtime and in space. However, linear or even sub-linear space scaling is often not sufficient, because it implies an unbounded growth in memory with sample size. This clearly opens another challenge: how to learn from large, or practically infinite, data sets or data streams using memory limited resources. Online learning is an important learning scenario in which a potentially unlimited sequence of training examples is presented one example at a time and can only be seen in a single pass. This is opposed to offline learning where the whole collection of training examples is at hand. The objective is to learn an accurate prediction model from the training stream. Upon on repetitively receiving fresh example from stream, typically, online learning algorithms attempt to update the existing model without retraining. The invention of the Support Vector Machines (SVM) attracted a lot of interest in adapting the kernel methods for both offline and online learning. Typical online learning for kernel classifiers consists of observing a stream of training examples and their inclusion as prototypes when specified conditions are met. However, such procedure could result in an unbounded growth in the number of prototypes. In addition to the danger of the exceeding the physical memory, this also implies an unlimited growth in both update and prediction time. To address this issue, in my dissertation I propose a series of kernel-based budgeted online algorithms, which have constant space and constant update and prediction time. This is achieved by maintaining a fixed number of prototypes under the memory budget. Most of the previous works on budgeted online algorithms focus on kernel perceptron. In the first part of the thesis, I review and discuss these existing algorithms and then propose a kernel perceptron algorithm which removes the prototype with the minimal impact on classification accuracy to maintain the budget. This is achieved by dual use of cached prototypes for both model presentation and validation. In the second part, I propose a family of budgeted online algorithms based on the Passive-Aggressive (PA) style. The budget maintenance is achieved by introducing an additional constraint into the original PA optimization problem. A closed-form solution was derived for the budget maintenance and model update. In the third part, I propose a budgeted online SVM algorithm. The proposed algorithm guarantees that the optimal SVM solution is maintained on all the prototype examples at any time. To maximize the accuracy, prototypes are constructed to approximate the data distribution near the decision boundary. In the fourth part, I propose a family of budgeted online algorithms for multi-class classification. The proposed algorithms are the recently proposed SVM training algorithm Pegasos. I prove that the gap between the budgeted Pegasos and the optimal SVM solution directly depends on the average model degradation due to budget maintenance. Following the analysis, I studied greedy multi-class budget maintenance methods based on removal, projection and merging of SVs. In each of these four parts, the proposed algorithms were experimentally evaluated against the state-of-art competitors. The results show that the proposed budgeted online algorithms outperform the competitive algorithm and achieve accuracy comparable to non-budget counterparts while being extremely computationally efficient. / Computer and Information Science Computer Science Kernel Method Large-scale Learning Online Learning Perceptron Support Vector Machine

Search results

Big Data Algorithms for Visualization and Supervised Learning

Budgeted Online Kernel Classifiers for Large Scale Learning