With the rapid development of science and technology, our society has been in the big data era. In human activities, we produce a lot of data in every second and every minute, what contain much information. Then, how to select the useful information from those complicated data is a significant issue. So the association rules mining, a technique of mining patterns or associations between itemsets, comes into being. And this technique aims to find some important associations in data to get useful knowledge. Nowadays, most scholars at home and abroad focus on the frequent pattern mining. However, it is undeniable that the rare pattern mining also plays an important role in many areas, such as the medical, financial, and scientific field. Comparing with frequent pattern mining, studying rare pattern mining is more valuable, because it tends to find unknown, unexpected, and more interesting rules. But the study of rare pattern mining is little difficult because of the scarcity of data used for verifying rules. In the frequent pattern mining, there are two general algorithms of discovering frequent itemsets, i.e., Apriori, the earliest algorithm which is proposed by R.Agrawal in 1994, and FP-Tree, the improved algorithm which reduced the time complexity. And in rare pattern mining, there are also two algorithms, Arima and Rarity, what are similar to Apriori and FP-Tree algorithms, but they still exist some problems, for example, Arima is time-consuming because of repeatedly scanning the large database, and Rarity is space-consuming because of the establishment of the full-combination tree. Therefore, based on the Rarity algorithm, this report presents an improved method to efficiently discover interesting association rules among rare itemsets and aims to get a balance between time and space. It is a top-down strategy which uses the graph structure to indicate all combinations of existing items, defines pattern matrix to record itemsets, and combines the hash table to accelerate calculation process. This method decreases both the time cost and the space cost when comparing with Arima, and reduces the space waste to solve the problem of Rarity, but its searching time of mining rare itemsets is more than Rarity, and we verified the feasibility of this algorithm only on abstract and small databases. Thus in the future, on the one hand, we will continue improving our method to explore how to decrease the searching time in the process and adjust the hash function to optimize the space utilization. And on the other hand, we will apply our method to actual large databases, such as the clinical database of the diabetic patients to mine association rules in diabetic complications.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:miun-35320 |
Date | January 2018 |
Creators | Xiang, Lan |
Publisher | Mittuniversitetet, Avdelningen för informationssystem och -teknologi |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Page generated in 0.0015 seconds