Spelling suggestions: "subject:"frequent itemsets mining""
1 |
A distributed approach to Frequent Itemset Mining at low support levelsClark, Neal 22 December 2014 (has links)
Frequent Itemset Mining, the process of finding frequently co-occurring sets of items in a dataset, has been at the core of the field of data mining for the past 25 years. During this time the datasets have grown much faster than the algorithms capacity to process them. Great progress was made at optimizing this task on a single computer however, despite years of research, very little progress has been made on parallelizing this task. FPGrowth based algorithms have proven notoriously difficult to parallelize and Apriori has largely fallen out of favor with the research community.
In this thesis we introduce a parallel, Apriori based, Frequent Itemset Mining algo-
rithm capable of distributing computation across large commodity clusters. Our case study demonstrates that our algorithm can efficiently scale to hundreds of cores, on a standard Hadoop MapReduce cluster, and can improve executions times by at least an order of magnitude at the lowest support levels. / Graduate / 0984 / 0800 / nclark@uvic.ca
|
2 |
TEXT MINER FOR HYPERGRAPHS USING OUTPUT SPACE SAMPLINGTirupattur, Naveen 16 August 2011 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Text Mining is process of extracting high-quality knowledge from analysis of
textual data. Rapidly growing interest and focus on research in many fields is
resulting in an overwhelming amount of research literature. This literature is a vast source of knowledge. But due to huge volume of literature, it is practically impossible for researchers to manually extract the knowledge. Hence, there is a need for automated approach to extract knowledge from unstructured data. Text mining is right approach for automated extraction of knowledge from textual data.
The objective of this thesis is to mine documents pertaining to research literature, to find novel associations among entities appearing in that literature using Incremental Mining. Traditional text mining approaches provide binary associations. But it is important to understand context in which these associations occur. For example entity A has association with entity B in context of entity C. These contexts can be visualized as multi-way associations among the entities which are represented by a Hypergraph. This thesis work talks about extracting such multi-way associations among the entities using Frequent Itemset
Mining and application of a new concept called Output space sampling to extract
such multi-way associations in space and time efficient manner. We incorporated concept of personalization in Output space sampling so that user can specify his/her interests as the frequent hyper-associations are extracted from the text.
|
3 |
Knowledge Accelerated Algorithms and the Knowledge CacheGoyder, Matthew 19 July 2012 (has links)
No description available.
|
4 |
Parallel Closet+ Algorithm For Finding Frequent Closed ItemsetsSen, Tayfun 01 July 2009 (has links) (PDF)
Data mining is proving itself to be a very important field as the data available is increasing exponentially, thanks to first computerization and now internetization. On the other hand, cluster computing systems made up of commodity hardware are becoming widespread, along with the multicore processor architectures. This high computing power is synthesized with data mining to process huge amounts of data and to reach information and knowledge.
Frequent itemset mining is a special subtopic of data mining because it is an integral part of many types of data mining tasks. Often this task is a prerequisite for many other data mining algorithms, most notably algorithms in the association rule mining area. For this reason, it is studied heavily in the literature.
In this thesis, a parallel implementation of CLOSET+, a frequent closed itemset mining algorithm, is presented. The CLOSET+ algorithm has been modified to run on multiple processors simultaneously, in order to obtain results faster. Open MPI and Boost libraries have been used for the communication between different processes and the program has been tested on different inputs and parameters. Experimental results show that the algorithm exhibits high speedup and eficiency for dense data when the support value is higher than a determined value. Proposed parallel algorithm could prove to be useful for application areas where fast response is needed for low to medium number of frequent closed itemsets. A particular application area is the Web where online applications have similar requirements.
|
5 |
Bayesian mixture models for frequent itemset miningHe, Ruofei January 2012 (has links)
In binary-transaction data-mining, traditional frequent itemset mining often produces results which are not straightforward to interpret. To overcome this problem, probability models are often used to produce more compact and conclusive results, albeit with some loss of accuracy. Bayesian statistics have been widely used in the development of probability models in machine learning in recent years and these methods have many advantages, including their abilities to avoid overfitting. In this thesis, we develop two Bayesian mixture models with the Dirichlet distribution prior and the Dirichlet process (DP) prior to improve the previous non-Bayesian mixture model developed for transaction dataset mining. First, we develop a finite Bayesian mixture model by introducing conjugate priors to the model. Then, we extend this model to an infinite Bayesian mixture using a Dirichlet process prior. The Dirichlet process mixture model is a nonparametric Bayesian model which allows for the automatic determination of an appropriate number of mixture components. We implement the inference of both mixture models using two methods: a collapsed Gibbs sampling scheme and a variational approximation algorithm. Experiments in several benchmark problems have shown that both mixture models achieve better performance than a non-Bayesian mixture model. The variational algorithm is the faster of the two approaches while the Gibbs sampling method achieves a more accurate result. The Dirichlet process mixture model can automatically grow to a proper complexity for a better approximation. However, these approaches also show that mixture models underestimate the probabilities of frequent itemsets. Consequently, these models have a higher sensitivity but a lower specificity.
|
6 |
Efficient Temporal Synopsis of Social Media StreamsAbouelnagah, Younes January 2013 (has links)
Search and summarization of streaming social media, such as Twitter, requires the ongoing analysis of large volumes of data with dynamically changing characteristics. Tweets are short and repetitious -- lacking context and structure -- making it difficult to generate a coherent synopsis of events within a given time period. Although some established algorithms for frequent itemset analysis might provide an efficient foundation for synopsis generation, the unmodified application of standard methods produces a complex mass of rules, dominated by common language constructs and many trivial variations on topically related results. Moreover, these results are not necessarily specific to events within the time period of interest. To address these problems, we build upon the Linear time Closed itemset Mining (LCM) algorithm, which is particularly suited to the large and sparse vocabulary of tweets. LCM generates only closed itemsets, providing an immediate reduction in the number of trivial results. To reduce the impact of function words and common language constructs, we apply a filltering step that preserves these terms only when they may form part of a relevant collocation. To further reduce trivial results, we propose a novel strengthening of the closure condition of LCM to retain only those results that exceed a threshold of distinctiveness. Finally, we perform temporal ranking, based on information gain, to identify results that are particularly relevant to the time period of interest. We evaluate our work over a collection of tweets gathered in late 2012, exploring the efficiency and filtering characteristic of each processing step, both individually and collectively. Based on our experience, the resulting synopses from various time periods provide understandable and meaningful pictures of events within those periods, with potential application to tasks such as temporal summarization and query expansion for search.
|
7 |
Efficient Temporal Synopsis of Social Media StreamsAbouelnagah, Younes January 2013 (has links)
Search and summarization of streaming social media, such as Twitter, requires the ongoing analysis of large volumes of data with dynamically changing characteristics. Tweets are short and repetitious -- lacking context and structure -- making it difficult to generate a coherent synopsis of events within a given time period. Although some established algorithms for frequent itemset analysis might provide an efficient foundation for synopsis generation, the unmodified application of standard methods produces a complex mass of rules, dominated by common language constructs and many trivial variations on topically related results. Moreover, these results are not necessarily specific to events within the time period of interest. To address these problems, we build upon the Linear time Closed itemset Mining (LCM) algorithm, which is particularly suited to the large and sparse vocabulary of tweets. LCM generates only closed itemsets, providing an immediate reduction in the number of trivial results. To reduce the impact of function words and common language constructs, we apply a filltering step that preserves these terms only when they may form part of a relevant collocation. To further reduce trivial results, we propose a novel strengthening of the closure condition of LCM to retain only those results that exceed a threshold of distinctiveness. Finally, we perform temporal ranking, based on information gain, to identify results that are particularly relevant to the time period of interest. We evaluate our work over a collection of tweets gathered in late 2012, exploring the efficiency and filtering characteristic of each processing step, both individually and collectively. Based on our experience, the resulting synopses from various time periods provide understandable and meaningful pictures of events within those periods, with potential application to tasks such as temporal summarization and query expansion for search.
|
8 |
Scalable APRIORI-based frequent pattern discoveryChester, Sean 28 April 2009 (has links)
Frequent itemset mining, the task of finding sets of items that frequently occur to-
gether in a dataset, has been at the core of the field of data mining for the past
sixteen years. In that time, the size of datasets has grown much faster than has the
ability of existing algorithms to handle those datasets. Consequentely, improvements
are needed.
In this thesis, we take the classic algorithm for the problem, A Priori, and improve it quite significantly by introducing what we call a vertical sort. We then use the benchmark large dataset, webdocs, from the FIMI 2004 conference to contrast our
performance against several state-of-the-art implementations and demonstrate not
only equal efficiency with lower memory usage at all support thresholds, but also the
ability to mine support thresholds as yet unattempted in literature. We also indicate
how we believe this work can be extended to achieve yet more impressive results.
|
9 |
Discovering Neglected Conditions in Software by Mining Program Dependence GraphsCHANG, RAY-YAUNG January 2009 (has links)
No description available.
|
10 |
Evolutionary algorithms and frequent itemset mining for analyzing epileptic oscillationsSmart, Otis Lkuwamy 28 March 2007 (has links)
This research presents engineering tools that address an important area impacting many persons worldwide: epilepsy. Over 60 million people are affected by epilepsy, a neurological disorder characterized by recurrent seizures that occur suddenly. Surgery and anti-epileptic drugs (AED s) are common therapies for epilepsy patients. However, only persons with seizures that originate in an unambiguous, focal portion of the brain are candidates for surgery, while AED s can lead to very adverse side-effects. Although medical devices based upon focal cooling, drug infusion or electrical stimulation are viable alternatives for therapy, a reliable method to automatically pinpoint dysfunctional brain and direct these devices is needed. This research introduces a method to effectively localize epileptic networks, or connectivity between dysfunctional brain, to guide where to insert electrodes in the brain for therapeutic devices, surgery, or further investigation. The method uses an evolutionary algorithm (EA) and frequent itemset mining (FIM) to detect and cluster frequent concentrations of epileptic neuronal action potentials within human intracranial electroencephalogram (EEG) recordings. In an experiment applying the method to seven patients with neocortical epilepsy (a total of 35 seizures), the approach reliably identifies the seizure onset zone, in six of the subjects (a total of 31 seizures). Hopefully, this research will lead to a better control of seizures and an improved quality of life for the millions of persons affected by epilepsy.
|
Page generated in 0.0844 seconds