1 |
Evolutionary Granular Kernel Machines
Jin, Bo 03 May 2007
Kernel machines such as Support Vector Machines (SVMs) have been widely used in various data mining applications and show good generalization properties. The performance of SVMs on nonlinear problems depends heavily on the kernel function, and the complexity of SVM training is mainly related to the size of the training dataset. How to design a powerful kernel, how to speed up SVM training, and how to train SVMs on millions of examples remain challenging problems in SVM research. To address these problems, powerful and flexible kernel trees called Evolutionary Granular Kernel Trees (EGKTs) are designed to incorporate prior domain knowledge. A Granular Kernel Tree Structure Evolving System (GKTSES) is developed to evolve the structures of Granular Kernel Trees (GKTs) without prior knowledge. A voting scheme is also proposed to reduce the prediction deviation of GKTSES. To speed up EGKT optimization, a master-slave parallel model is implemented. To help SVMs handle large-scale data mining, a Minimum Enclosing Ball (MEB) based data reduction method is presented, and a new MEB-SVM algorithm is designed. All of these kernel methods are based on Granular Computing (GrC). Overall, Evolutionary Granular Kernel Machines (EGKMs) are investigated as a way to optimize kernels effectively, speed up training greatly, and mine huge amounts of data efficiently.
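The abstract does not say how the MEB-based data reduction works, so the following is only a sketch of the underlying building block: approximating a Minimum Enclosing Ball for a set of points. It uses the well-known Badoiu-Clarkson iteration (repeatedly step the center toward the current farthest point), which is an assumption chosen for illustration and not necessarily the method used in the thesis. The function name `approx_meb` and its parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def approx_meb(points: List[Point], iterations: int = 1000) -> Tuple[List[float], float]:
    """Approximate the minimum enclosing ball of `points`.

    Illustrative only: Badoiu-Clarkson style iteration, not the
    thesis's MEB-SVM algorithm (which the abstract does not describe).
    """
    if not points:
        raise ValueError("points must be non-empty")

    # Start at an arbitrary input point.
    center = list(points[0])

    for i in range(1, iterations + 1):
        # Find the point currently farthest from the center.
        far = max(points, key=lambda p: math.dist(center, p))
        # Move the center a shrinking fraction of the way toward it.
        step = 1.0 / (i + 1)
        center = [c + step * (f - c) for c, f in zip(center, far)]

    # The radius is whatever is needed to cover every point from the final center.
    radius = max(math.dist(center, p) for p in points)
    return center, radius
```

For example, `approx_meb([(0.0, 0.0), (2.0, 0.0), (1.0, 0.5)])` returns a center close to (1, 0) and a radius close to 1. How such a ball would then be used to discard or summarize training examples is not specified in the abstract and is left open here.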
|
2 |
Dataflow parallelism for large scale data mining
Daruru, Srivatsava 20 December 2010
The unprecedented and exponential growth of data along with the advent
of multi-core processors has triggered a massive paradigm shift from traditional
single threaded programming to parallel programming. A number of
parallel programming paradigms have thus been proposed and have become
pervasive and inseparable from any large production environment. Also, with
the massive amounts of data available and with the ever increasing business
need to process and analyze this data quickly at the minimum cost, there is
much more demand for implementing fast data mining algorithms on cheap
hardware.
This thesis explores a parallel programming model called dataflow, the essence of which is computation organized by the flow of data through
a graph of operators. This paradigm exhibits pipeline, horizontal and vertical
parallelism and requires only the data of the active operators in memory at
any given time, allowing it to scale easily to very large datasets. The thesis describes the dataflow implementation of two data mining applications on
huge datasets. We first develop an efficient dataflow implementation of a
Collaborative Filtering (CF) algorithm based on weighted co-clustering and
test its effectiveness on a large and sparse Netflix dataset. This implementation
of the recommender system was able to rapidly train and predict over 100
million ratings within 17 minutes on a commodity multi-core machine. We
then describe a dataflow implementation of a non-parametric density based
clustering algorithm called Auto-HDS to automatically detect small and
dense clusters on a massive astronomy dataset. This implementation was able
to discover dense clusters at varying density thresholds and generate a compact
cluster hierarchy on 100k points in less than 1.3 hours. We also show its ability
to scale to millions of points as we increase the number of available resources.
Our experimental results illustrate the ability of this model to “scale”
well to massive datasets and its ability to rapidly discover useful patterns in
two different applications.
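The dataflow idea described above, computation organized as data flowing through a graph of operators, can be sketched with Python generators. This is a minimal, single-threaded illustration and not the thesis's implementation: it shows only the streaming property (each operator holds one record at a time, and only the final aggregation keeps per-item state), not real pipeline, horizontal, or vertical parallelism. The operator names (`read_ratings`, `keep_valid`, `mean_per_item`, `run_pipeline`) and the `user,item,rating` line format are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[str, str, float]


def read_ratings(rows: Iterable[str]) -> Iterator[Record]:
    """Source operator: parse 'user,item,rating' lines one at a time."""
    for row in rows:
        user, item, rating = row.strip().split(",")
        yield user, item, float(rating)


def keep_valid(records: Iterable[Record], lo: float = 1.0, hi: float = 5.0) -> Iterator[Record]:
    """Filter operator: pass through only ratings inside [lo, hi]."""
    for user, item, rating in records:
        if lo <= rating <= hi:
            yield user, item, rating


def mean_per_item(records: Iterable[Record]) -> Dict[str, float]:
    """Sink operator: keep only a running (sum, count) per item."""
    totals: Dict[str, Tuple[float, int]] = {}
    for _, item, rating in records:
        s, n = totals.get(item, (0.0, 0))
        totals[item] = (s + rating, n + 1)
    return {item: s / n for item, (s, n) in totals.items()}


def run_pipeline(rows: Iterable[str]) -> Dict[str, float]:
    """Wire the operators into a linear graph: source -> filter -> sink."""
    return mean_per_item(keep_valid(read_ratings(rows)))
```

Calling `run_pipeline(["u1,m1,4", "u2,m1,2", "u3,m2,5", "u4,m2,9"])` averages the two valid ratings for `m1` and keeps the single valid rating for `m2`, dropping the out-of-range value. Because `rows` can be any iterable, including an open file, the input never has to fit in memory at once; a real dataflow engine would additionally run the operators concurrently.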
|