1 |
Knowledge discovery using pattern taxonomy model in text miningWu, Sheng-Tang January 2007 (has links)
In the last decade, many data mining techniques have been proposed for fulfilling various knowledge discovery tasks in order to achieve the goal of retrieving useful information for users. Various types of patterns can then be generated using these techniques, such as sequential patterns, frequent itemsets, and closed and maximum patterns. However, how to effectively exploit the discovered patterns is still an open research issue, especially in the domain of text mining. Most of the text mining methods adopt the keyword-based approach to construct text representations which consist of single words or single terms, whereas other methods have tried to use phrases instead of keywords, based on the hypothesis that the information carried by a phrase is considered more than that by a single term. Nevertheless, these phrase-based methods did not yield significant improvements due to the fact that the patterns with high frequency (normally the shorter patterns) usually have a high value on exhaustivity but a low value on specificity, and thus the specific patterns encounter the low frequency problem. This thesis presents the research on the concept of developing an effective Pattern Taxonomy Model (PTM) to overcome the aforementioned problem by deploying discovered patterns into a hypothesis space. PTM is a pattern-based method which adopts the technique of sequential pattern mining and uses closed patterns as features in the representative. A PTM-based information filtering system is implemented and evaluated by a series of experiments on the latest version of the Reuters dataset, RCV1. The pattern evolution schemes are also proposed in this thesis with the attempt of utilising information from negative training examples to update the discovered knowledge. The results show that the PTM outperforms not only all up-to-date data mining-based methods, but also the traditional Rocchio and the state-of-the-art BM25 and Support Vector Machines (SVM) approaches.
|
2 |
Rough set-based reasoning and pattern mining for information filteringZhou, Xujuan January 2008 (has links)
An information filtering (IF) system monitors an incoming document stream to find the documents that match the information needs specified by the user profiles. To learn to use the user profiles effectively is one of the most challenging tasks when developing an IF system. With the document selection criteria better defined based on the users’ needs, filtering large streams of information can be more efficient and effective. To learn the user profiles, term-based approaches have been widely used in the IF community because of their simplicity and directness. Term-based approaches are relatively well established. However, these approaches have problems when dealing with polysemy and synonymy, which often lead to an information overload problem. Recently, pattern-based approaches (or Pattern Taxonomy Models (PTM) [160]) have been proposed for IF by the data mining community. These approaches are better at capturing sematic information and have shown encouraging results for improving the effectiveness of the IF system. On the other hand, pattern discovery from large data streams is not computationally efficient. Also, these approaches had to deal with low frequency pattern issues. The measures used by the data mining technique (for example, “support” and “confidences”) to learn the profile have turned out to be not suitable for filtering. They can lead to a mismatch problem. This thesis uses the rough set-based reasoning (term-based) and pattern mining approach as a unified framework for information filtering to overcome the aforementioned problems. This system consists of two stages - topic filtering and pattern mining stages. The topic filtering stage is intended to minimize information overloading by filtering out the most likely irrelevant information based on the user profiles. A novel user-profiles learning method and a theoretical model of the threshold setting have been developed by using rough set decision theory. The second stage (pattern mining) aims at solving the problem of the information mismatch. This stage is precision-oriented. A new document-ranking function has been derived by exploiting the patterns in the pattern taxonomy. The most likely relevant documents were assigned higher scores by the ranking function. Because there is a relatively small amount of documents left after the first stage, the computational cost is markedly reduced; at the same time, pattern discoveries yield more accurate results. The overall performance of the system was improved significantly. The new two-stage information filtering model has been evaluated by extensive experiments. Tests were based on the well-known IR bench-marking processes, using the latest version of the Reuters dataset, namely, the Reuters Corpus Volume 1 (RCV1). The performance of the new two-stage model was compared with both the term-based and data mining-based IF models. The results demonstrate that the proposed information filtering system outperforms significantly the other IF systems, such as the traditional Rocchio IF model, the state-of-the-art term-based models, including the BM25, Support Vector Machines (SVM), and Pattern Taxonomy Model (PTM).
|
Page generated in 0.089 seconds