1 |
Automatic Attribute Clustering and Feature Selection Based on Genetic AlgorithmsWang, Po-Cheng 21 August 2009 (has links)
Feature selection is an important pre-processing step in mining and learning. A good set of features can not only improve the accuracy of classification, but also reduce the time to derive rules. It is executed especially when the amount of attributes in a given training data is very large. This thesis thus proposes three GA-based clustering methods for attribute clustering and feature selection. In the first method, each feasible clustering result is encoded into a chromosome with positive integers and a gene in the chromosome is for an attribute. The value of a gene represents the cluster to which the attribute belongs. The fitness of each individual is evaluated using both the average accuracy of attribute substitutions in clusters and the cluster balance. The second method further extends the first method to improve the time performance. A new fitness function based on both the accuracy and the attribute dependency is proposed. It can reduce the time of scanning the data base. The third approach uses another encoding method for representing chromosomes. It can achieve a faster convergence and a better result than the second one. At last, the experimental comparison with the k-means clustering approach and with all combinations of attributes also shows the proposed approach can get a good trade-off between accuracy and time complexity. Besides, after feature selection, the rules derived from only the selected features may usually be hard to use if some values of the selected features cannot be obtained in current environments. This problem can be easily solved in our proposed approaches. The attributes with missing values can be replaced by other attributes in the same clusters. The proposed approaches thus provide flexible alternatives for feature selection.
|
2 |
Feature Reduction and Multi-label Classification Approaches for Document DataJiang, Jung-Yi 08 August 2011 (has links)
This thesis proposes some novel approaches for feature reduction and multi-label classification for text datasets. In text processing, the bag-of-words model is commonly used, with each document modeled as a vector in a high dimensional space. This model is often called the vector-space model. Usually, the dimensionality of the document vector is huge. Such high-dimensionality can be a severe obstacle for text processing algorithms. To improve the performance of text processing algorithms, we propose a feature clustering approach to reduce the dimensionality of document vectors. We also propose an efficient algorithm for text classification.
Feature clustering is a powerful method to reduce the dimensionality
of feature vectors for text classification. We
propose a fuzzy similarity-based self-constructing algorithm for
feature clustering. The words in the feature vector of a document
set are grouped into clusters based on similarity test. Words that
are similar to each other are grouped into the same cluster. Each
cluster is characterized by a membership function with statistical
mean and deviation. When all the words have been fed in, a desired
number of clusters are formed automatically. We then have one
extracted feature for each cluster. The extracted feature
corresponding to a cluster is a weighted combination of the words
contained in the cluster. By this algorithm, the derived membership
functions match closely with and describe properly the real
distribution of the training data. Besides, the user need not
specify the number of extracted features in advance, and
trial-and-error for determining the appropriate number of extracted
features can then be avoided. Experimental results show
that our method can run faster and obtain better extracted features than other methods.
We also propose a fuzzy similarity clustering scheme for multi-label
text categorization in which a document can belong to one or more
than one category. Firstly, feature transformation is performed. An
input document is transformed to a fuzzy-similarity vector. Next,
the relevance degrees of the input document to a collection of
clusters are calculated, which are then combined to obtain the
relevance degree of the input document to each participating
category. Finally, the input document is classified to a certain
category if the associated relevance degree exceeds a threshold. In
text categorization, the number of the involved terms is usually
huge. An automatic classification system may suffer from large
memory requirements and poor efficiency. Our scheme can do without
these difficulties. Besides, we allow the region a category covers
to be a combination of several sub-regions that are not necessarily
connected. The effectiveness of our proposed scheme is demonstrated
by the results of several experiments.
|
3 |
A Self-Constructing Fuzzy Feature Clustering for Text CategorizationLiu, Ren-jia 26 August 2009 (has links)
Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In this paper, we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on similarity test. Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation. When all the words have been fed in, a desired number of clusters are formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster.
By this algorithm, the derived membership functions match closely with and describe properly the real distribution of the training data. Besides, the user need not specify the number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can then be avoided. 20 Newsgroups data set and Cade 12 web directory are introduced to be our experimental data. We adopt the support vector machine to classify the documents. Experimental results show that our method can run faster and obtain better extracted features than other methods.
|
Page generated in 0.0805 seconds