This thesis proposes novel approaches to feature reduction and multi-label classification for text datasets. In text processing, the bag-of-words model is commonly used, with each document modeled as a vector in a high-dimensional space. This model is often called the vector-space model. The dimensionality of the document vector is usually huge, and such high dimensionality can be a severe obstacle for text processing algorithms. To improve the performance of these algorithms, we propose a feature clustering approach that reduces the dimensionality of document vectors. We also propose an efficient algorithm for text classification.
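As a small illustration of the vector-space model described above, the following sketch maps documents to term-count vectors. The function names, the whitespace tokenization, and the two sample documents are assumptions made for this example and are not taken from the thesis.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return the sorted list of distinct words across all documents."""
    return sorted({word for doc in docs for word in doc.lower().split()})

def to_vector(doc, vocabulary):
    """Map one document to a term-count vector aligned with the vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical toy corpus; real corpora yield many thousands of dimensions.
docs = ["the cat sat", "the dog sat on the mat"]
vocabulary = build_vocabulary(docs)
vectors = [to_vector(doc, vocabulary) for doc in docs]
```

Each distinct word adds one dimension, so the vector length grows with the vocabulary of the whole collection, which is the dimensionality problem that feature reduction addresses.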
Feature clustering is a powerful method to reduce the dimensionality
of feature vectors for text classification. We
propose a fuzzy similarity-based self-constructing algorithm for
feature clustering. The words in the feature vector of a document
set are grouped into clusters based on a similarity test. Words that
are similar to each other are grouped into the same cluster. Each
cluster is characterized by a membership function with statistical
mean and deviation. When all the words have been fed in, a desired
number of clusters are formed automatically. We then have one
extracted feature for each cluster. The extracted feature
corresponding to a cluster is a weighted combination of the words
contained in the cluster. With this algorithm, the derived membership
functions closely match and properly describe the real
distribution of the training data. In addition, the user need not
specify the number of extracted features in advance, and
trial-and-error for determining the appropriate number of extracted
features can then be avoided. Experimental results show
that our method can run faster and obtain better extracted features than other methods.
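The incremental clustering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not give the form of the word patterns, the membership function, or the similarity test, so the product-of-Gaussians membership, the threshold `rho`, the initial deviation `sigma0`, and the equal-weight feature extraction are all assumed here.

```python
import math

def membership(x, mean, dev):
    """Product of per-dimension Gaussian memberships of pattern x in a cluster."""
    return math.prod(
        math.exp(-((xi - m) / d) ** 2) for xi, m, d in zip(x, mean, dev)
    )

def self_constructing_clusters(patterns, rho=0.5, sigma0=0.25):
    """Feed word patterns one at a time; clusters are created as needed.

    rho (similarity threshold) and sigma0 (initial deviation) are
    assumed parameters, not values taken from the thesis.
    """
    clusters = []
    for idx, x in enumerate(patterns):
        scores = [membership(x, c["mean"], c["dev"]) for c in clusters]
        best = max(range(len(clusters)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < rho:
            # No existing cluster is similar enough: start a new one at x.
            clusters.append({
                "members": [idx],
                "sum": list(x),
                "sumsq": [v * v for v in x],
                "mean": list(x),
                "dev": [sigma0] * len(x),
            })
            continue
        # Otherwise join the most similar cluster and refresh its statistics.
        c = clusters[best]
        c["members"].append(idx)
        n = len(c["members"])
        c["sum"] = [s + v for s, v in zip(c["sum"], x)]
        c["sumsq"] = [s + v * v for s, v in zip(c["sumsq"], x)]
        c["mean"] = [s / n for s in c["sum"]]
        # Population deviation plus sigma0, so the width never shrinks to zero.
        c["dev"] = [
            math.sqrt(max(sq / n - m * m, 0.0)) + sigma0
            for sq, m in zip(c["sumsq"], c["mean"])
        ]
    return clusters

def extract_features(doc_vector, clusters):
    """One extracted feature per cluster: here, the sum of member word counts."""
    return [sum(doc_vector[i] for i in c["members"]) for c in clusters]

# Hypothetical word patterns: one row per word, one column per class.
patterns = [(0.9, 0.1), (0.85, 0.15), (0.1, 0.9), (0.2, 0.8)]
clusters = self_constructing_clusters(patterns)
reduced = extract_features([2, 1, 0, 3], clusters)
```

In this sketch the number of clusters is never passed in; it follows from the threshold and the data, which mirrors the point that the number of extracted features need not be fixed in advance. The equal-weight sum in `extract_features` is the simplest weighted combination; membership degrees could serve as weights instead.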
We also propose a fuzzy similarity clustering scheme for multi-label
text categorization, in which a document can belong to one or more
categories. First, feature transformation is performed. An
input document is transformed to a fuzzy-similarity vector. Next,
the relevance degrees of the input document to a collection of
clusters are calculated, which are then combined to obtain the
relevance degree of the input document to each participating
category. Finally, the input document is assigned to a
category if the associated relevance degree exceeds a threshold. In
text categorization, the number of terms involved is usually
huge, so an automatic classification system may suffer from large
memory requirements and poor efficiency. Our scheme avoids
these difficulties. In addition, we allow the region a category covers
to be a combination of several sub-regions that are not necessarily
connected. The effectiveness of our proposed scheme is demonstrated
by the results of several experiments.
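The classification steps above (cluster relevance degrees, combination into per-category relevance, thresholding) might be sketched as below. The abstract does not specify how relevance degrees are computed or combined, so the Gaussian membership, the weighted-sum combination, the `weights` table, and the threshold value are assumptions for illustration; the input `x` is taken to be an already transformed fuzzy-similarity vector.

```python
import math

def cluster_degree(x, mean, dev):
    """Assumed relevance of vector x to one cluster (product of Gaussians)."""
    return math.prod(
        math.exp(-((xi - m) / d) ** 2) for xi, m, d in zip(x, mean, dev)
    )

def classify(x, clusters, weights, threshold=0.5):
    """Return every category whose combined relevance exceeds the threshold.

    clusters: list of (mean, dev) pairs.
    weights:  category -> list with one weight per cluster (hypothetical).
    """
    degrees = [cluster_degree(x, mean, dev) for mean, dev in clusters]
    relevance = {
        category: sum(w * d for w, d in zip(ws, degrees))
        for category, ws in weights.items()
    }
    return {category for category, r in relevance.items() if r > threshold}

# Hypothetical clusters in a two-dimensional fuzzy-similarity space.
clusters = [
    ((0.9, 0.1), (0.3, 0.3)),
    ((0.1, 0.9), (0.3, 0.3)),
    ((0.5, 0.5), (0.3, 0.3)),
]
# Each category draws on two clusters, so its region is a union of sub-regions.
weights = {"sports": [1.0, 0.0, 1.0], "politics": [0.0, 1.0, 1.0]}

both = classify((0.5, 0.5), clusters, weights)
single = classify((0.9, 0.1), clusters, weights)
```

Because each category is scored independently against the threshold, a document can receive several labels, one label, or none, and a category backed by several clusters can cover sub-regions that are not connected.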
Identifier | oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0808111-134811 |
Date | 08 August 2011 |
Creators | Jiang, Jung-Yi |
Contributors | Chen-Sen Ouyang, Hsien-Liang Tsai, Shing-Tai Pan, Chih-Hung Wu, Chih-Chin Lai, Chie-Jue Lee, Chun-Liang Hou |
Publisher | NSYSU |
Source Sets | NSYSU Electronic Thesis and Dissertation Archive |
Language | English |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0808111-134811 |
Rights | unrestricted, Copyright information available at source archive |