• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 1
  • Tagged with
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

An Efficient Bitmap-Based Approach to Mining Sequential Patterns for Large Databases

Wu, Chien-Hui 29 July 2004 (has links)
The task of Data Mining is to find the useful information within the incredible sets of data. One of important research areas of Data Mining is Mining Sequential Patterns. For a transaction database, sequential pattern means that there are some relations between the items bought by customers in a period of time. If we can find these relations by mining sequential patterns, we can provide better selling strategy to gain more customers' attentions. However, since the transaction database contains a lot of data, and it will be scanned during the mining process again and again, to improve the running efficiency is an important topic. In the GSP algorithm proposed by Srikant and Agrawal, they use a complex data structure to store and generate candidates. The generated candidates satisfy a property, ``the subsets of a frequent itemset are also frequent'. The property leads to fewer number of candidates; however, it still spends too much time to counting candidates. In the SPAM algorithm proposed by Aryes et al., they use the bitwise operations to reduce the time for counting candidates. However, it generates too many candidates which will never become frequent itemsets, which decreases the efficiency. In this thesis, we proposed a new bitmap-based algorithm. By modifying the way to generate candidates in the GSP algorithm and applying the bitwise operations in the SPAM algorithm, the proposed algorithm can mine sequential patterns efficiently. That is, we use the similar candidate generation method presented in the GSP algorithm to reduce the number of candidates and the similar counting method proposed in the SPAM algorithm to reduce the time of counting candidates. In the proposed algorithm, we classify the itemsets into two cases, simultaneous occurrence (noted as AB) and sequential occurrence (noted as A-> B). In the case of simultaneous occurrence, the number of candidate is C(n,k) based on the exhausted method. In order to prevent too many candidates generated, we make use of the property, ``the subsets of a frequent itemset are also frequent', to reduce the number of candidates from C(n,k) to C(y,k), k <= y < n. In the case of sequential occurrence, the candidates are generated by using a special join operation which could combine, for example, A->B and B->C to A->B->C. Moreover, we have to consider two other cases: (1) combing A->B and A->C to A->BC; (2) combing A->C and B->C to AB->C. The method of counting candidates is similar to the SPAM algorithm (i.e., bitwise operations). From our simulation results, based on the same bit representation for the transaction database, we show that our proposed algorithm could provide better performance than the SPAM algorithm in terms of the processing time, since our algorithm could generate fewer number of candidates than the SPAM algorithm.

Page generated in 0.0637 seconds