Discovery of {it association rules} is an important problem in the area of data
mining. Given a database of sales transactions, it is desirable to discover
the important associations among items such that the presence of some items
in a transaction will imply the presence of other items in the same
transaction.
Since mining association rules may require to repeatedly scan through a large
transaction database to find different association patterns, the amount of
processing could be huge, and performance improvement is an essential
concern.
Among this problem, how to efficiently {it count large
itemsets} is the major work, where a large itemset is a set of items
appearing in a sufficient number of transactions.
In this thesis, we propose efficient algorithms for mining association
rules based on a high-level set-based approach.
A set-based approach allows a clear expression of what needs to be done
as opposed to specifying exactly how the operations are carried out in a low-level approach, where
a low-level approach means to retrieve one tuple from the database at a time.
The advantage of the set-based approach, like the SETM algorithm,
is simple and stable over the range of parameter values.
However, the SETM algorithm proposed by Houtsma and Swami may generate too many invalid candidate itemsets.
Therefore, in this thesis, we propose a set-based algorithm called SETM*,
which provides the same advantages of the SETM algorithm,
while it avoids the disadvantages of the SETM algorithm.
In the SETM* algorithm, we reduce the size of the candidate database by
modifying the way of constructing it,
where a candidate database is a transaction database formed with candidate
$k$-itemsets.
Then, based on the new way to construct the candidate database in the SETM*
algorithm, we propose SETM*-2K, mbox{SETM*-MaxK} and SETM*-Lmax algorithms.
In the SETM*-2K algorithm, given a $k$, we efficiently construct $L_{k}$
based on $L_{w}$, where $w=2^{lceil log_{2}k
ceil - 1}$,
instead of step by step.
In the SETM*-MaxK algorithm, we efficiently to find the $L_{k}$ based on $L_{w}$,
where $L_{k}
ot= emptyset, L_{k+1}=emptyset$ and $w=2^{lceil log_{2}k
ceil - 1}$,
instead of step by step.
In the SETM*-Lmax algorithm, we use a forward approach to find all maximal large itemsets from $L_{k}$,
and the $k$-itemset is not included in the $k$-subsets of the $j$-itemset,
except $k=MaxK$, where $1 leq k < j leq MaxK$,
$L_{MaxK}
ot= emptyset$ and $L_{MaxK+1}=emptyset$.
We conduct several experiments using different synthetic relational databases.
The simulation results show that the
SETM* algorithm outperforms the SETM algorithm in terms of storage space or the
execution time for all relational database settings.
Moreover, we show that the proposed SETM*-2K
and SETM*-MaxK algorithms also require shorter time to achieve their goals than the SETM or SETM* algorithms.
Furthermore, we also show that the proposed forward approach (SETM*-Lmax)
to find all maximal large itemsets requires shorter time than the backward approach proposed by Agrawal.
Identifer | oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0728100-115617 |
Date | 28 July 2000 |
Creators | Hsieh, Yu-Ming |
Contributors | Ye-In Chang, San-Yih Hwang, Tei-Wei Kuo, Chien-I Lee |
Publisher | NSYSU |
Source Sets | NSYSU Electronic Thesis and Dissertation Archive |
Language | English |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0728100-115617 |
Rights | not_available, Copyright information available at source archive |
Page generated in 0.0021 seconds