Global ETD Search

Feature Reduction and Multi-label Classification Approaches for Document Data
Jiang, Jung-Yi
08 August 2011

This thesis proposes novel approaches to feature reduction and multi-label classification for text datasets. In text processing, the bag-of-words model is commonly used, with each document modeled as a vector in a high-dimensional space. This model is often called the vector-space model. The dimensionality of the document vector is usually huge, and such high dimensionality can be a severe obstacle for text processing algorithms. To improve the performance of these algorithms, we propose a feature clustering approach that reduces the dimensionality of document vectors. We also propose an efficient algorithm for text classification.

Feature clustering is a powerful method for reducing the dimensionality of feature vectors for text classification. We propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test, so that words similar to each other fall into the same cluster. Each cluster is characterized by a membership function with a statistical mean and deviation. Once all the words have been fed in, a desired number of clusters has been formed automatically. We then have one extracted feature for each cluster; it is a weighted combination of the words contained in that cluster. With this algorithm, the derived membership functions closely match and properly describe the real distribution of the training data. In addition, the user need not specify the number of extracted features in advance, so trial and error in determining an appropriate number of extracted features is avoided. Experimental results show that our method runs faster and obtains better extracted features than other methods.
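The self-constructing clustering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not give the formulas, so the Gaussian product membership, the similarity threshold `rho`, the initial deviation `sigma0`, the incremental mean and deviation update, and the hard (winner-takes-all) weighting in `extract_features` are all hypothetical choices. Each word is assumed to be represented by a small pattern vector, for example its distribution over classes.

```python
import math

def membership(x, mean, dev):
    """Product of per-dimension Gaussian memberships of pattern x to a cluster (assumed form)."""
    return math.prod(math.exp(-((xi - m) / d) ** 2) for xi, m, d in zip(x, mean, dev))

def self_constructing_clusters(patterns, rho=0.5, sigma0=0.25):
    """Feed word patterns one by one; open a new cluster when no existing one is similar enough.

    rho (similarity threshold) and sigma0 (initial deviation) are hypothetical parameters.
    """
    clusters = []
    for x in patterns:
        scores = [membership(x, c["mean"], c["dev"]) for c in clusters]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < rho:
            # Similarity test failed for every cluster: start a new one centred on x.
            clusters.append({
                "n": 1,
                "sum": list(x),
                "sumsq": [v * v for v in x],
                "mean": list(x),
                "dev": [sigma0] * len(x),
            })
        else:
            # Similarity test passed: update the winning cluster's mean and deviation.
            c = clusters[best]
            c["n"] += 1
            c["sum"] = [s + v for s, v in zip(c["sum"], x)]
            c["sumsq"] = [s + v * v for s, v in zip(c["sumsq"], x)]
            c["mean"] = [s / c["n"] for s in c["sum"]]
            c["dev"] = [
                math.sqrt(max(sq / c["n"] - m * m, 0.0)) + sigma0
                for sq, m in zip(c["sumsq"], c["mean"])
            ]
    return clusters

def extract_features(doc_counts, patterns, clusters):
    """Reduce a bag-of-words vector to one feature per cluster (hard weighting assumed)."""
    features = [0.0] * len(clusters)
    for count, x in zip(doc_counts, patterns):
        scores = [membership(x, c["mean"], c["dev"]) for c in clusters]
        winner = max(range(len(scores)), key=scores.__getitem__)
        features[winner] += count
    return features

# Four words described by toy two-class patterns; two clusters emerge without being requested.
patterns = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = self_constructing_clusters(patterns)
reduced = extract_features([2, 1, 0, 3], patterns, clusters)  # [3.0, 3.0]
```

Note that the number of clusters is never passed in: it follows from the data and the threshold, which is the property the abstract highlights. A soft weighting, where each word contributes to every cluster in proportion to its membership, would be an equally plausible reading of "weighted combination".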
We also propose a fuzzy similarity clustering scheme for multi-label text categorization, in which a document can belong to one or more categories. First, feature transformation is performed: an input document is transformed into a fuzzy-similarity vector. Next, the relevance degrees of the input document to a collection of clusters are calculated and then combined to obtain the relevance degree of the input document to each participating category. Finally, the input document is assigned to a category if the associated relevance degree exceeds a threshold. In text categorization, the number of terms involved is usually huge, so an automatic classification system may suffer from large memory requirements and poor efficiency. Our scheme avoids these difficulties. In addition, we allow the region covered by a category to be a combination of several sub-regions that are not necessarily connected. The effectiveness of the proposed scheme is demonstrated by the results of several experiments.

Keywords: multi-label document classification; self-constructing clustering; text classification; dimension reduction; feature clustering
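The three classification steps described in the abstract (cluster relevance, per-category combination, thresholding) can be sketched as below. This is an illustrative sketch only: the abstract does not specify how relevance degrees are computed or combined, so the Gaussian membership, the weighted-sum combination, the `weights` table, the category names, and the threshold value are all assumptions, and the input `z` stands in for an already computed fuzzy-similarity vector.

```python
import math

def cluster_relevance(z, clusters):
    """Relevance of fuzzy-similarity vector z to each cluster (Gaussian form assumed)."""
    return [
        math.prod(math.exp(-((zi - m) / d) ** 2) for zi, m, d in zip(z, c["mean"], c["dev"]))
        for c in clusters
    ]

def classify(z, clusters, weights, threshold=0.5):
    """Return every category whose combined relevance exceeds the threshold."""
    relevance = cluster_relevance(z, clusters)
    labels = []
    for category, w in weights.items():
        # Combine cluster relevances into one score per category (weighted sum assumed).
        score = sum(wi * ri for wi, ri in zip(w, relevance))
        if score > threshold:
            labels.append(category)
    return labels

# Hypothetical clusters and cluster-to-category weights.
clusters = [
    {"mean": [1.0, 0.0], "dev": [0.3, 0.3]},
    {"mean": [0.0, 1.0], "dev": [0.3, 0.3]},
    {"mean": [0.5, 0.5], "dev": [0.3, 0.3]},
]
weights = {"sports": [1, 0, 1], "politics": [0, 1, 1]}

single = classify([1.0, 0.0], clusters, weights)  # ["sports"]
both = classify([0.5, 0.5], clusters, weights)    # ["sports", "politics"]
```

Because a category draws on several clusters, the region it covers is a union of sub-regions that need not touch: here "sports" covers the neighbourhoods of both `[1.0, 0.0]` and `[0.5, 0.5]`. A document near the shared cluster receives both labels, which is what makes the scheme multi-label.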
