Global ETD Search

Mining frequent highly-correlated item-pairs at very low support levels

The ability to extract frequent pairs from a set of transactions is one of the fundamental
building blocks of data mining. When the number of items in a given transaction is
relatively small, the problem is trivial. Even when dealing with millions of transactions, it
is still trivial if the number of unique items in the transaction set is small. The problem
becomes much more challenging when we deal with millions of transactions, each
containing hundreds of items drawn from a set of millions of potential items,
especially when we are looking for highly correlated results at extremely low support
levels.
For 25 years the Direct Hashing and Pruning algorithm of Park, Chen, and Yu (PCY) has been
the principal technique used when there are billions of potential pairs that need to be
counted. In this paper we propose a new approach that takes full advantage of
both multi-core and multi-CPU availability and works in cases where PCY fails, with
excellent performance scaling that continues even when the number of processors, unique
items, and items per transaction are at their highest.
We believe that our approach has much broader applicability in the field of co-occurrence
counting, and can be used to generate much more interesting results when
mining very large data sets. / Graduate
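
For readers unfamiliar with the PCY baseline named in the abstract, the following is a minimal single-machine sketch of two-pass hash-and-prune pair counting. It illustrates the general PCY idea only and is not the approach proposed in the thesis. The function name `pcy_frequent_pairs`, the parameter `num_buckets`, and the use of Python's built-in `hash` are assumptions made for this illustration.

```python
from collections import Counter
from itertools import combinations


def pcy_frequent_pairs(transactions, min_support, num_buckets=1009):
    """Return {pair: count} for pairs seen in at least min_support transactions.

    Illustrative PCY-style sketch; names and defaults are hypothetical.
    """
    def bucket(pair):
        # Any deterministic-within-process hash works for the sketch.
        return hash(pair) % num_buckets

    # Pass 1: count single items and hash every pair into a bucket counter.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket(pair)] += 1

    # Between passes: keep frequent items and a bitmap of frequent buckets.
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    frequent_bucket = [c >= min_support for c in bucket_counts]

    # Pass 2: count only candidate pairs that survive both filters.
    pair_counts = Counter()
    for t in transactions:
        items = sorted(i for i in set(t) if i in frequent_items)
        for pair in combinations(items, 2):
            if frequent_bucket[bucket(pair)]:
                pair_counts[pair] += 1

    return {p: c for p, c in pair_counts.items() if c >= min_support}
```

A bucket's count is always at least the count of any pair hashed into it, so the bitmap never discards a truly frequent pair; it only reduces the number of candidates counted in the second pass. At very low support levels most buckets tend to pass the threshold, which weakens this pruning.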

data mining

park chen yu algorithm

map reduce

mining frequent datasets

Identifier	oai:union.ndltd.org:uvic.ca/oai:dspace.library.uvic.ca:1828/3756
Date	20 December 2011
Creators	Sandler, Ian
Contributors	Thomo, Alex
Source Sets	University of Victoria
Language	English
Detected Language	English
Type	Thesis
Rights	Available to the World Wide Web
