Global ETD Search

1	[en] A MULTI-AGENT FRAMEWORK FOR SEARCH AND FLEXIBILIZATION OF DOCUMENT CLASSIFICATION ALGORITHMS / [pt] UM FRAMEWORK MULTI-AGENTES PARA BUSCA E FLEXIBILIZAÇÃO DE ALGORITMOS DE CLASSIFICAÇÃO DE DOCUMENTOS
JOAO ALFREDO PINTO DE MAGALHAES
18 June 2003

[en] We live in the information age, in which knowledge is created at a rate never seen before. This is mainly due to the Internet, which changed the existing paradigms of information exchange between people. Through the net, entire works can be published and can reach an audience that could not be reached through earlier means. However, an excess of information can also work against us: too much information can be equivalent to no information at all. Our work was to build a multi-agent framework for search and flexibilization of classification algorithms for textual documents of a specific domain. We built an infrastructure that separates the concerns of document search and selection (the platform) from those of the classification algorithm used (an application of the separation of concerns concept). This makes it possible not only to plug in existing algorithms, but also to create new ones that take into account domain-specific characteristics of the documents. We generated four instances of the framework: a webclipping application, a knowledge management support component, a search engine for websites, and an application for the semantic web.

[en] FRAMEWORK [en] MULTI-AGENT SYSTEMS [en] DOCUMENT CLASSIFICATION [en] SEPARATION OF CONCERNS [en] SEMANTIC WEB
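The separation between the search platform and the classification algorithm described in this abstract can be illustrated with a minimal sketch. It is not the thesis's framework: the names `Classifier`, `KeywordClassifier`, `Platform` and `fake_fetch` are hypothetical, and the agents are reduced to plain Python objects. It only shows the idea that the platform handles search and selection while any object exposing a `classify` method can be plugged in.

```python
from __future__ import annotations

from typing import Callable, Iterable, Protocol


class Classifier(Protocol):
    """Hypothetical plug-in interface: anything with a classify() method."""

    def classify(self, text: str) -> str: ...


class KeywordClassifier:
    """Toy classifier: returns the label of the first keyword found in the text."""

    def __init__(self, keywords: dict[str, str], default: str = "other") -> None:
        self.keywords = keywords
        self.default = default

    def classify(self, text: str) -> str:
        lowered = text.lower()
        for keyword, label in self.keywords.items():
            if keyword in lowered:
                return label
        return self.default


class Platform:
    """Handles search and selection only; classification is delegated."""

    def __init__(
        self,
        fetch: Callable[[str], Iterable[str]],
        classifier: Classifier,
    ) -> None:
        self.fetch = fetch
        self.classifier = classifier

    def run(self, query: str) -> list[tuple[str, str]]:
        # The platform never inspects how the classifier reaches its label.
        return [(doc, self.classifier.classify(doc)) for doc in self.fetch(query)]


def fake_fetch(query: str) -> list[str]:
    """Stand-in for a real search agent: filters a tiny in-memory corpus."""
    corpus = [
        "A multi-agent framework for document search",
        "Patent application abstracts in Portuguese",
        "A framework for patent categorization",
    ]
    return [doc for doc in corpus if query.lower() in doc.lower()]


platform = Platform(
    fake_fetch,
    KeywordClassifier({"patent": "patents", "agent": "agents"}),
)
results = platform.run("framework")
```

Swapping in a different algorithm only requires passing another object with a `classify` method to `Platform`; the search side stays unchanged.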
2	[en] TEXT CATEGORIZATION: CASE STUDY: PATENT'S APPLICATION DOCUMENTS IN PORTUGUESE / [pt] CATEGORIZAÇÃO DE TEXTOS: ESTUDO DE CASO: DOCUMENTOS DE PEDIDOS DE PATENTE NO IDIOMA PORTUGUÊS
NEIDE DE OLIVEIRA GOMES
08 January 2015

[en] Text categorizers built with machine learning techniques now achieve good results, making automatic text categorization viable. The purpose of this study was to define several models for categorizing patent applications written in Portuguese. For this setting, a committee of 6 (six) models was proposed, using various techniques. The data set consisted of 1157 (one thousand one hundred fifty-seven) abstracts of patent applications filed at INPI by national applicants, distributed across several categories. Among the models proposed for the processing step of text categorization, we highlight the one developed for Method 01, the k-Nearest-Neighbor (k-NN), a model also used for patents in English. For the other models, methods other than those traditionally applied to patents were selected. Four models use algorithms in which the categories are represented by centroid vectors. One model explores the High Order Bit (HOB) technique together with the k-NN algorithm, with k equal to the total number of training documents. Two techniques were implemented for the pre-processing step: Porter's stemming algorithm and the StemmerPortuguese algorithm, both modified from the original. Pre-processing also included stopword removal and the treatment of compound terms. For the indexing step, the main term-weighting technique was modified term frequency versus inverse document frequency (TF-IDF). The similarity or distance measures used were: cosine; Jaccard; DICE; Similarity Measure; HOB. Results were obtained using the relevance and rank prediction techniques. Among the methods implemented in this work, we highlight the traditional k-NN, which gave good results although it demands considerable computational time.

[en] TEXT CATEGORIZATION [en] TEXT CLASSIFICATION [en] STEMMING [en] CENTROID OR PROTOTYPE ALGORITHM
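Several of the techniques named in this abstract can be sketched in a few lines. The example below is a minimal illustration, not the thesis's implementation. It uses a standard TF-IDF weighting (the abstract mentions a modified variant that is not specified here), an unweighted majority vote for k-NN, and an invented four-document training set. All function names are hypothetical, and stemming, stopword removal and the HOB measure are left out.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (no stemming or stopword removal)."""
    return text.lower().split()


def build_idf(token_lists):
    """Standard smoothed IDF: log(N / df) + 1 for every term seen in training."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def weigh(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unknown to the IDF table are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard coefficient on term sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if a | b else 0.0


def dice(a, b):
    """Dice coefficient on term sets: 2 |A & B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


def knn_predict(query, vectors, labels, k):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(
        zip(vectors, labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def build_centroids(vectors, labels):
    """One centroid per category: the mean of that category's vectors."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for vector, label in zip(vectors, labels):
        for term, weight in vector.items():
            sums[label][term] += weight
    return {
        label: {term: total / counts[label] for term, total in terms.items()}
        for label, terms in sums.items()
    }


def centroid_predict(query, centroids):
    """Assign the category whose centroid is most similar to the query."""
    return max(centroids, key=lambda label: cosine(query, centroids[label]))


# Invented toy data, for illustration only.
train = [
    ("engine piston cylinder valve", "mechanics"),
    ("engine gear shaft valve", "mechanics"),
    ("polymer resin compound solvent", "chemistry"),
    ("polymer catalyst solvent reaction", "chemistry"),
]
token_lists = [tokenize(text) for text, _ in train]
labels = [label for _, label in train]
idf = build_idf(token_lists)
vectors = [weigh(tokens, idf) for tokens in token_lists]

query = weigh(tokenize("valve for an engine cylinder"), idf)
knn_label = knn_predict(query, vectors, labels, k=3)
centroid_label = centroid_predict(query, build_centroids(vectors, labels))
```

The sketch also hints at the cost trade-off the abstract mentions: `knn_predict` compares the query against every training document, whereas `centroid_predict` compares it against one vector per category.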
