1 |
A Mixed Approach for Multi-Label Document Classification
Tsai, Shian-Chi, 10 August 2010
In single-label document classification, each document belongs to exactly one category. A multi-label document, by contrast, belongs to two or more categories, and classifying such documents accurately has become an active research topic in recent years. In this paper, we propose an algorithm named fuzzy similarity measure multi-label K nearest neighbors (FSMLKNN), which combines a fuzzy similarity measure with the multi-label K nearest neighbors (MLKNN) algorithm for multi-label document classification. The algorithm uses an improved fuzzy similarity measure to calculate the similarity between a document and the cluster centers, and it can significantly improve the performance and accuracy of multi-label document classification. In the experiments, we compare FSMLKNN with existing classification methods, including the C4.5 decision tree, the support vector machine (SVM), and the MLKNN algorithm. The experimental results show that FSMLKNN performs better than the others.
|
2 |
Feature Reduction and Multi-label Classification Approaches for Document Data
Jiang, Jung-Yi, 08 August 2011
This thesis proposes novel approaches to feature reduction and multi-label classification for text datasets. In text processing, the bag-of-words model is commonly used, with each document modeled as a vector in a high-dimensional space. This model is often called the vector-space model. Usually, the dimensionality of the document vector is huge. Such high dimensionality can be a severe obstacle for text processing algorithms. To improve the performance of text processing algorithms, we propose a feature clustering approach to reduce the dimensionality of document vectors. We also propose an efficient algorithm for text classification.
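As a minimal illustration of the bag-of-words representation mentioned above, the following sketch turns documents into count vectors with one dimension per distinct word. The function names `build_vocabulary` and `to_vector` are hypothetical and whitespace tokenization is an assumption; the thesis does not specify its preprocessing.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct word to a column index (sorted for reproducibility)."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}


def to_vector(document, vocabulary):
    """Return the word-count vector of a document over the vocabulary."""
    counts = Counter(document.lower().split())
    # Dicts preserve insertion order, so columns follow the vocabulary indices.
    return [counts.get(w, 0) for w in vocabulary]


docs = ["fuzzy clustering of words", "clustering of documents"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
```

The vector length equals the vocabulary size, which is why realistic corpora yield the very high dimensionality the abstract describes.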
Feature clustering is a powerful method to reduce the dimensionality
of feature vectors for text classification. We
propose a fuzzy similarity-based self-constructing algorithm for
feature clustering. The words in the feature vector of a document
set are grouped into clusters based on a similarity test. Words that
are similar to each other are grouped into the same cluster. Each
cluster is characterized by a membership function with statistical
mean and deviation. When all the words have been fed in, a desired
number of clusters are formed automatically. We then have one
extracted feature for each cluster. The extracted feature
corresponding to a cluster is a weighted combination of the words
contained in the cluster. By this algorithm, the derived membership
functions match closely with and describe properly the real
distribution of the training data. Besides, the user need not
specify the number of extracted features in advance, and
trial-and-error for determining the appropriate number of extracted
features can then be avoided. Experimental results show
that our method can run faster and obtain better extracted features than other methods.
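The incremental idea described above can be sketched roughly as follows. This is a simplified illustration, not the thesis's algorithm: the Gaussian form of the membership function, the threshold `rho`, the initial deviation `sigma0`, and the hard (weight 1) feature extraction are all assumptions made here, and every function name is hypothetical.

```python
import math


def membership(x, mean, dev):
    """Product of per-dimension Gaussian memberships (assumed form)."""
    return math.prod(math.exp(-((xi - m) / d) ** 2)
                     for xi, m, d in zip(x, mean, dev))


def self_constructing_clusters(patterns, rho=0.5, sigma0=0.25):
    """Feed word patterns one at a time; open a new cluster whenever no
    existing cluster passes the similarity test (membership >= rho)."""
    clusters = []
    for idx, x in enumerate(patterns):
        scores = [membership(x, c["mean"], c["dev"]) for c in clusters]
        best = max(range(len(clusters)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < rho:
            clusters.append({"n": 1, "mean": list(x),
                             "sumsq": [xi * xi for xi in x],
                             "dev": [sigma0] * len(x), "members": [idx]})
            continue
        c = clusters[best]
        c["n"] += 1
        c["members"].append(idx)
        c["sumsq"] = [s + xi * xi for s, xi in zip(c["sumsq"], x)]
        c["mean"] = [m + (xi - m) / c["n"] for m, xi in zip(c["mean"], x)]
        # Deviation = sample standard deviation plus sigma0 so it never hits 0.
        c["dev"] = [math.sqrt(max(s / c["n"] - m * m, 0.0)) + sigma0
                    for s, m in zip(c["sumsq"], c["mean"])]
    return clusters


def extract_features(word_counts, clusters):
    """One extracted feature per cluster: here the sum of its words' counts."""
    return [sum(word_counts[i] for i in c["members"]) for c in clusters]


# Each word is described by a pattern vector, e.g. its distribution over classes.
word_patterns = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.12, 0.88]]
clusters = self_constructing_clusters(word_patterns)
reduced = extract_features([2, 1, 0, 3], clusters)
```

Note that the number of clusters is never passed in; it emerges from the threshold, which mirrors the claim that the user need not specify the number of extracted features in advance.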
We also propose a fuzzy similarity clustering scheme for multi-label
text categorization, in which a document can belong to one or more
categories. First, feature transformation is performed. An
input document is transformed to a fuzzy-similarity vector. Next,
the relevance degrees of the input document to a collection of
clusters are calculated, which are then combined to obtain the
relevance degree of the input document to each participating
category. Finally, the input document is classified to a certain
category if the associated relevance degree exceeds a threshold. In
text categorization, the number of the involved terms is usually
huge. An automatic classification system may suffer from large
memory requirements and poor efficiency. Our scheme avoids
these difficulties. In addition, we allow the region a category covers
to be a combination of several sub-regions that are not necessarily
connected. The effectiveness of our proposed scheme is demonstrated
by the results of several experiments.
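The final two steps above (combining cluster relevance degrees per category, then thresholding) can be sketched as below. The abstract does not say how the degrees are combined; taking the maximum over a category's clusters is an assumption made here, chosen because it lets a category cover several disconnected sub-regions. The function name `classify` and the default threshold are hypothetical.

```python
def classify(cluster_relevance, cluster_category, threshold=0.5):
    """Assign every category whose combined relevance exceeds the threshold.

    cluster_relevance: relevance degree of the document to each cluster.
    cluster_category:  the category each cluster belongs to.
    """
    relevance = {}
    for degree, category in zip(cluster_relevance, cluster_category):
        # Assumed combination rule: a category is as relevant as its
        # best-matching cluster (sub-region).
        relevance[category] = max(relevance.get(category, 0.0), degree)
    return sorted(c for c, r in relevance.items() if r > threshold)


labels = classify([0.8, 0.2, 0.6, 0.1],
                  ["sports", "sports", "politics", "finance"])
```

Because each category is tested independently against the threshold, a document can receive several labels, one label, or none.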
|