1 |
Lecture Structure Based Automatic Item Classification on an Examination System
Feng, Chi-hui 19 August 2007 (has links)
In this paper, we present an automatic item classification system called AICS. The system builds a content tree from the content structure provided by the teacher, and this tree correlates examination items with course content. The main tasks of AICS are to classify each item and to find the most similar content. After computing the relationship between an item and the content, AICS can automatically estimate the difficulty of each item and of the examination as a whole. The contributions of this research fall into two categories: 1. The system can show the content related to each item, helping the teacher quickly understand the difficulty of an examination paper. 2. After the examination, the system provides the relevant content to help students understand the items they answered incorrectly.
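As a rough illustration of matching an item to its most similar content, the sketch below scores an exam item against content-tree nodes with TF-IDF cosine similarity; the abstract does not state AICS's actual similarity measure, so the measure and all names here are assumptions.

```python
# A minimal sketch of matching an exam item to the most similar content-tree
# node, assuming a TF-IDF/cosine-similarity measure (illustrative only; the
# abstract does not specify the measure AICS actually uses).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

content_nodes = [
    "binary search trees insertion deletion traversal",
    "sorting algorithms quicksort mergesort complexity",
    "graph traversal breadth-first depth-first search",
]
item = "Which traversal visits the root of a binary search tree last?"

vectorizer = TfidfVectorizer()
node_vecs = vectorizer.fit_transform(content_nodes)
item_vec = vectorizer.transform([item])

scores = cosine_similarity(item_vec, node_vecs)[0]
best = scores.argmax()
print(f"most similar content node: {best}, score {scores[best]:.2f}")
```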
|
2 |
Information and Representation Tradeoffs in Document Classification
Jin, Timothy 23 May 2022 (has links)
No description available.
|
3 |
Head Tail Open: Open Tailed Classification of Imbalanced Document Data
Joshi, Chetan 23 April 2024 (has links) (PDF)
Deep learning models for scanned document image classification and form understanding have made significant progress in the last few years. A model can achieve high accuracy in closed-world classification with the help of copious amounts of labelled training data. However, very little work has been done in the domain of fine-grained, head-tailed (class-imbalanced, with some classes having many data points and others very few), open-world classification for documents. Our proposed method achieves better classification results than the baseline on the head-tail-novel/open dataset. Our techniques include separating the head and tail classes and transferring knowledge from the head data to the tail data. This transfer of knowledge also improves the capability of recognizing a novel category by 15% compared to the baseline.
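The head/tail separation mentioned above can be pictured as a frequency-based split of the label set; the sketch below is a minimal version of that step, with a hypothetical sample-count cutoff, since the thesis's actual criterion is not given in the abstract.

```python
# A minimal sketch of the head/tail split described above, assuming classes
# are partitioned by training-set frequency (the 100-sample cutoff is
# illustrative, not the thesis's actual threshold).
from collections import Counter

labels = ["invoice"] * 500 + ["letter"] * 300 + ["memo"] * 40 + ["resume"] * 12
counts = Counter(labels)

HEAD_MIN_SAMPLES = 100  # hypothetical cutoff
head_classes = {c for c, n in counts.items() if n >= HEAD_MIN_SAMPLES}
tail_classes = set(counts) - head_classes

print("head:", sorted(head_classes))  # trained directly on abundant data
print("tail:", sorted(tail_classes))  # rely on knowledge transferred from head
```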
|
4 |
Entity extraction, animal disease-related event recognition and classification from web
Volkova, Svitlana January 1900 (has links)
Master of Science / Department of Computing and Information Sciences / William H. Hsu / Global epidemic surveillance is an essential task for national biosecurity management and bioterrorism prevention. The main goal is to protect the public from major health threats. To perform this task effectively one requires reliable, timely and accurate medical information from a wide range of sources. Towards this goal, we present a framework for epidemiological analytics that can be used to automatically extract and visualize infectious disease outbreaks from a variety of unstructured web sources. More precisely, in this thesis, we consider several research tasks including document relevance classification, entity extraction and animal disease-related event recognition in the veterinary epidemiology domain. First, we crawl web sources and classify the collected documents by topical relevance using supervised learning algorithms. Next, we propose a novel approach for automated ontology construction in the veterinary medicine domain. Our approach is based on semantic relationship discovery using syntactic patterns. We then apply our automatically constructed ontology to the domain-specific entity extraction task. Moreover, we compare our ontology-based entity extraction results with an alternative sequence labeling approach. We introduce a sequence labeling method for entity tagging that relies on syntactic feature extraction using a sliding window. Finally, we present our novel sentence-based event recognition approach that includes three main steps: entity extraction of animal diseases, species, locations, dates and confirmation-status n-grams; event-related sentence classification into two categories, suspected or confirmed; and automated event tuple generation and aggregation. We show that our document relevance classification results, as well as our entity extraction and disease-related event recognition results, are significantly better than the results reported by other animal disease surveillance systems.
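The sliding-window feature extraction used in the sequence labeling step can be sketched as follows; the window size and feature encoding are illustrative assumptions, not the thesis's exact setup.

```python
# A minimal sketch of sliding-window feature extraction for sequence labeling:
# each token is represented by the tokens in a fixed window around it (the
# window size of 2 and the feature names are illustrative).
def window_features(tokens, i, size=2):
    """Collect surrounding tokens as features for position i."""
    feats = {"word": tokens[i].lower()}
    for off in range(-size, size + 1):
        if off != 0 and 0 <= i + off < len(tokens):
            feats[f"w[{off}]"] = tokens[i + off].lower()
    return feats

sentence = "Avian influenza confirmed in poultry near Jakarta".split()
for i, tok in enumerate(sentence):
    print(tok, window_features(sentence, i))
```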
|
5 |
A probabilistic and incremental model for online classification of documents : DV-INBC
Rodrigues, Thiago Fredes January 2016 (has links)
Recently, the fields of Data Mining and Machine Learning have seen a rapid increase in the creation and availability of data repositories, driven mainly by the rapid creation of such data in social networks. A large part of these data consists of text documents, and the information stored in them can range from a description of a user profile to common textual topics such as politics, sports and science, information that is very useful for many applications. Since much of this data is created in streams, scalable and on-line algorithms are desirable, because tasks like the organization and exploration of large document collections would benefit from them. In this thesis, an incremental, on-line and probabilistic model for document classification is presented as an effort to tackle this problem. The algorithm is called DV-INBC and is an extension of the INBC algorithm. The two main characteristics of DV-INBC are: only a single scan over the training data is needed to create a model of it, and the data vocabulary need not be known a priori. Therefore, little knowledge about the data stream is required. To assess its performance, tests using well-known datasets are presented.
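A minimal sketch of a single-pass, vocabulary-growing naive Bayes learner in the spirit of the DV-INBC description is shown below; it is not the published algorithm (whose internal details the abstract omits), and all names are illustrative.

```python
# A sketch of an incremental naive-Bayes-style text classifier with a
# dynamically growing vocabulary: each document is seen exactly once, and the
# vocabulary need not be known a priori (illustrative; not DV-INBC itself).
from collections import defaultdict
import math

class IncrementalNB:
    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))  # class -> word -> n
        self.class_totals = defaultdict(int)   # class -> total word count
        self.class_docs = defaultdict(int)     # class -> number of documents
        self.vocab = set()                     # grows as documents stream in

    def learn(self, text, label):
        """One-pass update: no second scan over earlier documents is needed."""
        for w in text.lower().split():
            self.word_counts[label][w] += 1
            self.class_totals[label] += 1
            self.vocab.add(w)
        self.class_docs[label] += 1

    def predict(self, text):
        n_docs = sum(self.class_docs.values())
        V = len(self.vocab)
        def score(c):
            s = math.log(self.class_docs[c] / n_docs)
            for w in text.lower().split():
                # Laplace smoothing over the vocabulary seen so far
                s += math.log((self.word_counts[c][w] + 1) /
                              (self.class_totals[c] + V))
            return s
        return max(self.class_docs, key=score)

clf = IncrementalNB()
clf.learn("the match ended in a draw", "sports")
clf.learn("parliament passed the new bill", "politics")
print(clf.predict("the bill on sports funding"))
```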
|
6 |
Extração de metadados utilizando uma ontologia de domínio / Metadata extraction using a domain ontology
Oliveira, Luis Henrique Gonçalves de January 2009 (has links)
The main purpose of the Semantic Web is to provide machine-processable metadata that describes the semantics of resources, to facilitate searching, filtering, condensing, or negotiating knowledge for human users. In this context, digital libraries are applications where the semantic annotation of information available on the Web is beginning. A digital library can be defined as a collection of digital resources selected by some criteria, with some logical organization, and available through distributed network retrieval. To facilitate the retrieval process, metadata are used to describe the stored content. However, manual metadata generation is a complex, time-consuming and error-prone task. Thus, automatic or semi-automatic metadata generation would be of great help to authors, subtracting this task from the document publishing process. The research in this work approached this problem by developing a metadata extractor that populates a document ontology and classifies the document according to a predefined hierarchy. The document ontology OntoDoc was created to store and make available the extracted metadata, as well as the obtained document classification. The implementation focused on Computer Science papers and used the ACM Computing Classification System in the document classification task. A sample set extracted from the ACM Digital Library was generated for training and validation of the implementation. The main contributions of this work are the integrated metadata extraction and classification model and the description of documents through metadata stored in an ontology, OntoDoc.
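The ontology-population step might look roughly like the following rdflib sketch; the OntoDoc namespace, class and property names shown are hypothetical stand-ins, since the abstract does not publish the actual schema.

```python
# A minimal sketch of populating a document ontology with extracted metadata,
# using rdflib. The namespace, class and property names are hypothetical;
# the thesis's real OntoDoc schema is not given in the abstract.
from rdflib import Graph, Literal, Namespace, RDF

ONTO = Namespace("http://example.org/ontodoc#")  # hypothetical namespace
g = Graph()
g.bind("ontodoc", ONTO)

doc = ONTO["paper42"]
g.add((doc, RDF.type, ONTO.ScientificPaper))
g.add((doc, ONTO.title, Literal("A Study of Document Classification")))
g.add((doc, ONTO.author, Literal("Oliveira, L. H. G.")))
g.add((doc, ONTO.acmCategory, Literal("I.5.4 Text processing")))

print(g.serialize(format="turtle"))
```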
|
7 |
Using Machine Learning to Categorize Documents in a Construction Project
Björkendal, Nicklas January 2019 (has links)
Automation of document handling in the construction industry could save large amounts of time, effort and money, and classifying a document is an important step in that automation. In the field of machine learning, a lot of research has been done on perfecting the algorithms and techniques, but there are many areas where those techniques could be applied that have not yet been studied. In this study I looked at how effectively the machine learning algorithm multinomial Naïve Bayes could classify 1427 documents from a construction project, split across 19 categories. The experiment achieved an accuracy of 92.7%, and the paper discusses some of the ways that accuracy could be improved. However, data extraction proved to be a bottleneck, and only 66% of the original documents could be used for testing the classifier.
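An experiment of the kind described, multinomial Naïve Bayes over bag-of-words counts with a held-out test split, can be sketched as follows; the toy documents and categories are illustrative, not the study's 1427-document corpus.

```python
# A minimal sketch of a multinomial naive Bayes document classifier with a
# train/test split (toy data only; the thesis's corpus and 19 categories
# are not reproduced here).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

docs = ["structural steel beam load calculation",
        "electrical wiring diagram floor two",
        "steel column welding inspection report",
        "circuit breaker panel schedule"]
labels = ["structural", "electrical", "structural", "electrical"]

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, random_state=0, stratify=labels)

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```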
|
8 |
A Mixed Approach for Multi-Label Document Classification
Tsai, Shian-Chi 10 August 2010 (has links)
Unlike single-label document classification, where each document belongs to exactly one category, multi-label classification deals with documents that belong to two or more categories; how to classify such documents accurately has become a hot research topic in recent years. In this paper, we propose an algorithm named fuzzy similarity measure multi-label K nearest neighbors (FSMLKNN), which combines a fuzzy similarity measure with the multi-label K nearest neighbors (MLKNN) algorithm for multi-label document classification. The algorithm uses an improved fuzzy similarity measure to calculate the similarity between a document and a cluster center, and it can significantly improve the performance and accuracy of multi-label document classification. In experiments, we compare FSMLKNN with existing classification methods, including the C4.5 decision tree, support vector machines (SVM) and the MLKNN algorithm, and the results show that FSMLKNN performs better than the others.
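A bare-bones version of k-nearest-neighbor multi-label voting, the MLKNN-style core that FSMLKNN builds on, is sketched below with plain cosine similarity; the fuzzy similarity measure and MLKNN's probabilistic estimation are not reproduced, so this is illustrative only.

```python
# A minimal sketch of k-nearest-neighbor multi-label voting in the spirit of
# MLKNN, using cosine similarity (FSMLKNN's fuzzy similarity measure is not
# reproduced here; all values are illustrative).
import numpy as np

def knn_multilabel(x, X_train, Y_train, k=3, threshold=0.5):
    """Assign every label held by at least `threshold` of the k neighbors."""
    sims = X_train @ x / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x) + 1e-12)
    nn = np.argsort(-sims)[:k]
    votes = Y_train[nn].mean(axis=0)          # fraction of neighbors per label
    return np.where(votes >= threshold)[0]

X_train = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]], dtype=float)
Y_train = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])  # rows: docs, cols: labels
print(knn_multilabel(np.array([1.0, 0.0, 1.0]), X_train, Y_train))
```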
|
9 |
Feature Reduction and Multi-label Classification Approaches for Document Data
Jiang, Jung-Yi 08 August 2011 (has links)
This thesis proposes novel approaches for feature reduction and multi-label classification for text datasets. In text processing, the bag-of-words model is commonly used, with each document modeled as a vector in a high-dimensional space; this model is often called the vector-space model. Usually, the dimensionality of the document vector is huge, and such high dimensionality can be a severe obstacle for text processing algorithms. To improve the performance of text processing algorithms, we propose a feature clustering approach to reduce the dimensionality of document vectors. We also propose an efficient algorithm for text classification.
Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. We propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test; words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with a statistical mean and deviation. When all the words have been fed in, a desired number of clusters is formed automatically. We then have one extracted feature for each cluster, which is a weighted combination of the words contained in the cluster. With this algorithm, the derived membership functions match closely and properly describe the real distribution of the training data. Moreover, the user need not specify the number of extracted features in advance, so trial-and-error for determining the appropriate number of extracted features can be avoided. Experimental results show that our method runs faster and obtains better extracted features than other methods.
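A minimal sketch of such a self-constructing clustering loop is shown below: each word vector joins the best-matching cluster if its membership is high enough, otherwise it seeds a new cluster, so the number of clusters emerges automatically. The membership threshold and the mean/variance update rules are assumptions, not the thesis's exact formulation.

```python
# A sketch of self-constructing fuzzy feature clustering: Gaussian-style
# membership functions with incrementally updated mean and variance
# (illustrative parameters; not the thesis's exact algorithm).
import numpy as np

def self_constructing_clusters(word_vectors, rho=0.5):
    clusters = []  # each: {"mean", "var", "n"}
    for v in word_vectors:
        memberships = [np.exp(-np.sum((v - c["mean"]) ** 2 / (2 * c["var"])))
                       for c in clusters]
        if memberships and max(memberships) >= rho:
            c = clusters[int(np.argmax(memberships))]
            c["n"] += 1                                # incremental updates
            delta = v - c["mean"]
            c["mean"] += delta / c["n"]
            c["var"] = 0.9 * c["var"] + 0.1 * delta ** 2
        else:
            clusters.append({"mean": v.astype(float),
                             "var": np.ones_like(v, dtype=float), "n": 1})
    return clusters

rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [5, 5, 5, 5], [-5, 5, -5, 5]], dtype=float)
words = np.vstack([c + rng.normal(scale=0.3, size=(10, 4)) for c in centers])
print(len(self_constructing_clusters(words)), "clusters formed automatically")
```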
We also propose a fuzzy similarity clustering scheme for multi-label text categorization, in which a document can belong to one or more categories. First, feature transformation is performed: an input document is transformed into a fuzzy-similarity vector. Next, the relevance degrees of the input document to a collection of clusters are calculated, and these are combined to obtain the relevance degree of the input document to each participating category. Finally, the input document is classified into a category if the associated relevance degree exceeds a threshold. In text categorization, the number of terms involved is usually huge, so an automatic classification system may suffer from large memory requirements and poor efficiency; our scheme avoids these difficulties. Moreover, we allow the region a category covers to be a combination of several sub-regions that are not necessarily connected. The effectiveness of our proposed scheme is demonstrated by the results of several experiments.
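The final thresholded decision can be pictured as follows; the cluster-to-category weights and the threshold value are illustrative assumptions.

```python
# A minimal sketch of the thresholded relevance-degree decision described
# above: per-cluster relevances are combined into per-category degrees, and
# every category above the threshold is assigned (all values illustrative).
import numpy as np

cluster_relevance = np.array([0.9, 0.2, 0.6])      # document vs. 3 clusters
# hypothetical cluster-to-category association weights (3 clusters x 2 categories)
assoc = np.array([[0.8, 0.1],
                  [0.2, 0.7],
                  [0.5, 0.5]])

category_degree = cluster_relevance @ assoc        # combine cluster relevances
THRESHOLD = 0.6
assigned = np.where(category_degree >= THRESHOLD)[0]
print("relevance degrees:", category_degree, "-> categories:", assigned)
```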
|