Global ETD Search

1	Automatic Document Classification Applied to Swedish News Blein, Florent January 2005 (has links) <p>The first part of this paper presents briefly the ELIN[1] system, an electronic newspaper project. ELIN is a framework that stores news and displays them to the end-user. Such news are formatted using the xml[2] format. The project partner Corren[3] provided ELIN with xml articles, however the format used was not the same. My first task has been to develop a software that converts the news from one xml format (Corren) to another (ELIN).</p><p>The second and main part addresses the problem of automatic document classification and tries to find a solution for a specific issue. The goal is to automatically classify news articles from a Swedish newspaper company (Corren) into the IPTC[4] news categories.</p><p>This work has been carried out by implementing several classification algorithms, testing them and comparing their accuracy with existing software. The training and test documents were 3 weeks of the Corren newspaper that had to be classified into 2 categories.</p><p>The last tests were run with only one algorithm (Naïve Bayes) over a larger amount of data (7, then 10 weeks) and categories (12) to simulate a more real environment.</p><p>The results show that the Naïve Bayes algorithm, although the oldest, was the most accurate in this particular case. An issue raised by the results is that feature selection improves speed but can seldom reduce accuracy by removing too many features.</p> ELIN automatic text classification Naïve Bayes network Rocchio
2	Automatic Document Classification Applied to Swedish News Blein, Florent January 2005 (has links) The first part of this paper presents briefly the ELIN[1] system, an electronic newspaper project. ELIN is a framework that stores news and displays them to the end-user. Such news are formatted using the xml[2] format. The project partner Corren[3] provided ELIN with xml articles, however the format used was not the same. My first task has been to develop a software that converts the news from one xml format (Corren) to another (ELIN). The second and main part addresses the problem of automatic document classification and tries to find a solution for a specific issue. The goal is to automatically classify news articles from a Swedish newspaper company (Corren) into the IPTC[4] news categories. This work has been carried out by implementing several classification algorithms, testing them and comparing their accuracy with existing software. The training and test documents were 3 weeks of the Corren newspaper that had to be classified into 2 categories. The last tests were run with only one algorithm (Naïve Bayes) over a larger amount of data (7, then 10 weeks) and categories (12) to simulate a more real environment. The results show that the Naïve Bayes algorithm, although the oldest, was the most accurate in this particular case. An issue raised by the results is that feature selection improves speed but can seldom reduce accuracy by removing too many features. ELIN automatic text classification Naïve Bayes network Rocchio
3	Busca guiada de patentes de Bioinformática / Guided Search of Bioinformatics Patents Dutra, Marcio Branquinho 17 October 2013 (has links) As patentes são licenças públicas temporárias outorgadas pelo Estado e que garantem aos inventores e concessionários a exploração econômica de suas invenções. Escritórios de marcas e patentes recomendam aos interessados na concessão que, antes do pedido formal de uma patente, efetuem buscas em diversas bases de dados utilizando sistemas clássicos de busca de patentes e outras ferramentas de busca específicas, com o objetivo de certificar que a criação a ser depositada ainda não foi publicada, seja na sua área de origem ou em outras áreas. Pesquisas demonstram que a utilização de informações de classificação nas buscas por patentes melhoram a eficiência dos resultados das consultas. A pesquisa associada ao trabalho aqui reportado tem como objetivo explorar artefatos linguísticos, técnicas de Recuperação de Informação e técnicas de Classificação Textual para guiar a busca por patentes de Bioinformática. O resultado dessa investigação é o Sistema de Busca Guiada de Patentes de Bioinformática (BPS), o qual utiliza um classificador automático para guiar as buscas por patentes de Bioinformática. A utilização do BPS é demonstrada em comparações com ferramentas de busca de patentes atuais para uma coleção específica de patentes de Bioinformática. No futuro, deve-se experimentar o BPS em coleções diferentes e mais robustas. / Patents are temporary public licenses granted by the State to ensure to inventors and assignees economical exploration rights. Trademark and patent offices recommend to perform wide searches in different databases using classic patent search systems and specific tools before a patent\'s application. The goal of these searches is to ensure the invention has not been published yet, either in its original field or in other fields. Researches have shown the use of classification information improves the efficiency on searches for patents. The objetive of the research related to this work is to explore linguistic artifacts, Information Retrieval techniques and Automatic Classification techniques, to guide searches for Bioinformatics patents. The result of this work is the Bioinformatics Patent Search System (BPS), that uses automatic classification to guide searches for Bioinformatics patents. The utility of BPS is illustrated by a comparison with other patent search tools. In the future, BPS system must be experimented with more robust collections. automatic text classification bioinformática bioinformatics classificação automática de textos information retrieval ontologia ontology patent patentes recuperação de informação
4	Classificação automática de texto por meio de similaridade de palavras: um algoritmo mais eficiente. / Automatic text classification using word similarities: a more efficient algorithm. Catae, Fabricio Shigueru 08 January 2013 (has links) A análise da semântica latente é uma técnica de processamento de linguagem natural, que busca simplificar a tarefa de encontrar palavras e sentenças por similaridade. Através da representação de texto em um espaço multidimensional, selecionam-se os valores mais significativos para sua reconstrução em uma dimensão reduzida. Essa simplificação lhe confere a capacidade de generalizar modelos, movendo as palavras e os textos para uma representação semântica. Dessa forma, essa técnica identifica um conjunto de significados ou conceitos ocultos sem a necessidade do conhecimento prévio da gramática. O objetivo desse trabalho foi determinar a dimensionalidade ideal do espaço semântico em uma tarefa de classificação de texto. A solução proposta corresponde a um algoritmo semi-supervisionado que, a partir de exemplos conhecidos, aplica o método de classificação pelo vizinho mais próximo e determina uma curva estimada da taxa de acerto. Como esse processamento é demorado, os vetores são projetados em um espaço no qual o cálculo se torna incremental. Devido à isometria dos espaços, a similaridade entre documentos se mantém equivalente. Esta proposta permite determinar a dimensão ideal do espaço semântico com pouco esforço além do tempo requerido pela análise da semântica latente tradicional. Os resultados mostraram ganhos significativos em adotar o número correto de dimensões. / The latent semantic analysis is a technique in natural language processing, which aims to simplify the task of finding words and sentences similarity. Using a vector space model for the text representation, it selects the most significant values for the space reconstruction into a smaller dimension. This simplification allows it to generalize models, moving words and texts towards a semantic representation. Thus, it identifies a set of underlying meanings or hidden concepts without prior knowledge of grammar. The goal of this study was to determine the optimal dimensionality of the semantic space in a text classification task. The proposed solution corresponds to a semi-supervised algorithm that applies the method of the nearest neighbor classification on known examples, and plots the estimated accuracy on a graph. Because it is a very time consuming process, the vectors are projected on a space in such a way the calculation becomes incremental. Since the spaces are isometric, the similarity between documents remains equivalent. This proposal determines the optimal dimension of the semantic space with little effort, not much beyond the time required by traditional latent semantic analysis. The results showed significant gains in adopting the correct number of dimensions. Algorithms Algoritmos Automatic text classification Classificação automática de texto Pattern recognition Reconhecimento de padrões
5	Knowledge-enhanced text classification : descriptive modelling and new approaches Martinez-Alvarez, Miguel January 2014 (has links) The knowledge available to be exploited by text classification and information retrieval systems has significantly changed, both in nature and quantity, in the last years. Nowadays, there are several sources of information that can potentially improve the classification process, and systems should be able to adapt to incorporate multiple sources of available data in different formats. This fact is specially important in environments where the required information changes rapidly, and its utility may be contingent on timely implementation. For these reasons, the importance of adaptability and flexibility in information systems is rapidly growing. Current systems are usually developed for specific scenarios. As a result, significant engineering effort is needed to adapt them when new knowledge appears or there are changes in the information needs. This research investigates the usage of knowledge within text classification from two different perspectives. On one hand, the application of descriptive approaches for the seamless modelling of text classification, focusing on knowledge integration and complex data representation. The main goal is to achieve a scalable and efficient approach for rapid prototyping for Text Classification that can incorporate different sources and types of knowledge, and to minimise the gap between the mathematical definition and the modelling of a solution. On the other hand, the improvement of different steps of the classification process where knowledge exploitation has traditionally not been applied. In particular, this thesis introduces two classification sub-tasks, namely Semi-Automatic Text Classification (SATC) and Document Performance Prediction (DPP), and several methods to address them. SATC focuses on selecting the documents that are more likely to be wrongly assigned by the system to be manually classified, while automatically labelling the rest. Document performance prediction estimates the classification quality that will be achieved for a document, given a classifier. In addition, we also propose a family of evaluation metrics to measure degrees of misclassification, and an adaptive variation of k-NN.
6	Classificação automática de texto por meio de similaridade de palavras: um algoritmo mais eficiente. / Automatic text classification using word similarities: a more efficient algorithm. Fabricio Shigueru Catae 08 January 2013 (has links) A análise da semântica latente é uma técnica de processamento de linguagem natural, que busca simplificar a tarefa de encontrar palavras e sentenças por similaridade. Através da representação de texto em um espaço multidimensional, selecionam-se os valores mais significativos para sua reconstrução em uma dimensão reduzida. Essa simplificação lhe confere a capacidade de generalizar modelos, movendo as palavras e os textos para uma representação semântica. Dessa forma, essa técnica identifica um conjunto de significados ou conceitos ocultos sem a necessidade do conhecimento prévio da gramática. O objetivo desse trabalho foi determinar a dimensionalidade ideal do espaço semântico em uma tarefa de classificação de texto. A solução proposta corresponde a um algoritmo semi-supervisionado que, a partir de exemplos conhecidos, aplica o método de classificação pelo vizinho mais próximo e determina uma curva estimada da taxa de acerto. Como esse processamento é demorado, os vetores são projetados em um espaço no qual o cálculo se torna incremental. Devido à isometria dos espaços, a similaridade entre documentos se mantém equivalente. Esta proposta permite determinar a dimensão ideal do espaço semântico com pouco esforço além do tempo requerido pela análise da semântica latente tradicional. Os resultados mostraram ganhos significativos em adotar o número correto de dimensões. / The latent semantic analysis is a technique in natural language processing, which aims to simplify the task of finding words and sentences similarity. Using a vector space model for the text representation, it selects the most significant values for the space reconstruction into a smaller dimension. This simplification allows it to generalize models, moving words and texts towards a semantic representation. Thus, it identifies a set of underlying meanings or hidden concepts without prior knowledge of grammar. The goal of this study was to determine the optimal dimensionality of the semantic space in a text classification task. The proposed solution corresponds to a semi-supervised algorithm that applies the method of the nearest neighbor classification on known examples, and plots the estimated accuracy on a graph. Because it is a very time consuming process, the vectors are projected on a space in such a way the calculation becomes incremental. Since the spaces are isometric, the similarity between documents remains equivalent. This proposal determines the optimal dimension of the semantic space with little effort, not much beyond the time required by traditional latent semantic analysis. The results showed significant gains in adopting the correct number of dimensions. Algoritmos Classificação automática de texto Reconhecimento de padrões Algorithms Automatic text classification Pattern recognition
7	Busca guiada de patentes de Bioinformática / Guided Search of Bioinformatics Patents Marcio Branquinho Dutra 17 October 2013 (has links) As patentes são licenças públicas temporárias outorgadas pelo Estado e que garantem aos inventores e concessionários a exploração econômica de suas invenções. Escritórios de marcas e patentes recomendam aos interessados na concessão que, antes do pedido formal de uma patente, efetuem buscas em diversas bases de dados utilizando sistemas clássicos de busca de patentes e outras ferramentas de busca específicas, com o objetivo de certificar que a criação a ser depositada ainda não foi publicada, seja na sua área de origem ou em outras áreas. Pesquisas demonstram que a utilização de informações de classificação nas buscas por patentes melhoram a eficiência dos resultados das consultas. A pesquisa associada ao trabalho aqui reportado tem como objetivo explorar artefatos linguísticos, técnicas de Recuperação de Informação e técnicas de Classificação Textual para guiar a busca por patentes de Bioinformática. O resultado dessa investigação é o Sistema de Busca Guiada de Patentes de Bioinformática (BPS), o qual utiliza um classificador automático para guiar as buscas por patentes de Bioinformática. A utilização do BPS é demonstrada em comparações com ferramentas de busca de patentes atuais para uma coleção específica de patentes de Bioinformática. No futuro, deve-se experimentar o BPS em coleções diferentes e mais robustas. / Patents are temporary public licenses granted by the State to ensure to inventors and assignees economical exploration rights. Trademark and patent offices recommend to perform wide searches in different databases using classic patent search systems and specific tools before a patent\'s application. The goal of these searches is to ensure the invention has not been published yet, either in its original field or in other fields. Researches have shown the use of classification information improves the efficiency on searches for patents. The objetive of the research related to this work is to explore linguistic artifacts, Information Retrieval techniques and Automatic Classification techniques, to guide searches for Bioinformatics patents. The result of this work is the Bioinformatics Patent Search System (BPS), that uses automatic classification to guide searches for Bioinformatics patents. The utility of BPS is illustrated by a comparison with other patent search tools. In the future, BPS system must be experimented with more robust collections. bioinformática classificação automática de textos ontologia patentes recuperação de informação automatic text classification bioinformatics information retrieval ontology patent
8	Classification automatique de commentaires synchrones dans les vidéos de danmaku Peng, Youyang 01 1900 (has links) Le danmaku désigne les commentaires synchronisés qui s’affichent et défilent directement en surimpression sur des vidéos au fil du visionnement. Bien que les danmakus proposent à l’audience une manière originale de partager leur sentiments, connaissances, compréhensions et prédictions sur l’histoire d’une série, etc., et d’interagir entre eux, la façon dont les commentaires s’affichent peut nuire à l’expérience de visionnement, lorsqu’une densité excessive de commentaires dissimule complètement les images de la vidéo ou distrait l’audience. Actuellement, les sites de vidéo chinois emploient principalement des méthodes par mots-clés s’appuyant sur des expressions régulières pour éliminer les commentaires non désirés. Ces approches risquent fortement de surgénéraliser en supprimant involontairement des commentaires intéressants contenant certains mots-clés ou, au contraire, de sous-généraliser en étant incapables de détecter ces mots lorsqu’ils sont camouflés sous forme d’homophones. Par ailleurs, les recherches existantes sur la classification automatique du danmaku se consacrent principalement à la reconnaissance de la polarité des sentiments exprimés dans les commentaires. Ainsi, nous avons cherché à regrouper les commentaires par classes fonctionnelles, à évaluer la robustesse d’une telle classification et la possibilité de l’automatiser dans la perspective de développer de meilleurs systèmes de filtrage des commentaires. Nous avons proposé une nouvelle taxonomie pour catégoriser les commentaires en nous appuyant sur la théorie des actes de parole et la théorie des gratifications dans l’usage des médias, que nous avons utilisées pour produire un corpus annoté. Un fragment de ce corpus a été co-annoté pour estimer un accord inter-annotateur sur la classification manuelle. Enfin, nous avons réalisé plusieurs expériences de classification automatique. Celles-ci comportent trois étapes : 1) des expériences de classification binaire où l’on examine si la machine est capable de faire la distinction entre la classe majoritaire et les classes minoritaires, 2) des expériences de classification multiclasses à granularité grosse cherchant à classifier les commentaires selon les catégories principales de notre taxonomie, et 3) des expériences de classification à granularité fine sur certaines sous-catégories. Nous avons expérimenté avec des méthodes d’apprentissage automatique supervisé et semi-supervisé avec différents traits. / Danmaku denotes synchronized comments which are displayed and scroll directly on top of videos as they unfold. Although danmaku offers an innovative way to share their sentiments, knowledge, predictions on the plot of a series, etc., as well as to interact with each other, the way comments display can have a negative impact on the watching experience, when the number of comments displayed in a given timespan is so high that they completely hide the pictures, or distract audience. Currently, Chinese video websites mainly ressort to keyword approaches based on regular expressions to filter undesired comments. These approaches are at high risk to overgeneralize, thus deleting interesting comments coincidentally containing some keywords, or, to the contrary, undergeneralize due to their incapacity to detect occurrences of these keywords disguised as homophones. On another note, existing research focus essentially on recognizing the polarity of sentiments expressed within comments. Hence, we have sought to regroup comments into functional classes, evaluate the robustness of such a classification and the feasibility of its automation, under an objective of developping better comments filtering systems. Building on the theory of speech acts and the theory of gratification in media usage, we have proposed a new taxonomy of danmaku comments, and applied it to produce an annotated corpus. A fragment of the corpus has been co-annotated to estimate an interannotator agreement for human classification. Finally, we performed several automatic classification experiments. These involved three steps: 1) binary classification experiments evaluating whether the machine can distinguish the most frequent class from all others, 2) coarse-grained multi-class classification experiments aiming at classifying comments within the main categories of our taxonomy, and 3) fine-grained multi-class classification experiments on specific subcategories. We experimented both with supervised and semi-supervised learning algorithms with diffrent features. Danmaku Taxonomie Annotation du corpus Traitement automatique des langues Classification automatique de textes Apprentissage automatique Taxonomy Corpus annotation Natural language processing Automatic text classification Machine learning Linguistics / Linguistique (UMI : 0290)
9	Automatic Text Classification of Research Grant Applications / Automatisk textklassificering av forskningsbidragsansökningar Lindqvist, Robin January 2024 (has links) This study aims to construct a state-of-the-art classifier model and compare it against a largelanguage model. A variation of SVM called LinearSVC was utilised and the BERT model usingbert-base-uncased was used. The data, provided by the Swedish Research Council, consisted ofresearch grant applications. The research grant applications were divided into two groups, whichwere further divided into several subgroups. The subgroups represented research fields such ascomputer science and applied physics. Significant class imbalances were present, with someclasses having only a tenth of the applications of the largest class. To address these imbalances,a new dataset was created using data that had been randomly oversampled. The models weretrained and tested on their ability to correctly assign a subgroup to a research grant application.Results indicate that the BERT model outperformed the SVM model on the original dataset,but not on the balanced dataset . Furthermore, the BERT model’s performance decreased whentransitioning from the original to the balanced dataset, due to overfitting or randomness. / Denna studie har som mål att bygga en state-of-the-art klassificerar model och sedan jämföraden mot en stor språkmodel. SVM modellen var en variation av SVM vid namn LinearSVC ochför BERT användes bert-base-uncased. Data erhölls från Vetenskapsrådet och bestod av forskn-ingsbidragsansökningar. Forskningsbidragsansökningarna var uppdelade i två grupper, som varytterligare uppdelade i ett flertal undergrupper. Dessa undergrupper representerar forsknings-fält såsom datavetenskap och tillämpad fysik. I den data som användes i studien fanns storaskillnader mellan klasserna, där somliga klasser hade en tiondel av ansökningarna som de storaklasserna hade. I syfte att lösa dessa klassbalanseringsproblem skapades en datamängd somundergått slumpmässig översampling. Modellerna tränades och testades på deras förmåga attkorrekt klassificera en forskningsbidragsansökan in i rätt undergrupp. Studiens fynd visade attBERT modellen presterade bättre än SVM modellen på både den ursprungliga datamängden,dock inte på den balanserade datamängden. Tilläggas kan, BERTs prestanda sjönk vid övergångfrån den ursprungliga datamängden till den balanserade datamängden, något som antingen berorpå överanpassning eller slump. SVM BERT automatic text classification F1-score parameter Automatisk textklassificering SVM BERT parameter F1-score

Search results