  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
71

Enhancing Text Readability Using Deep Learning Techniques

Alkaldi, Wejdan 20 July 2022 (has links)
In the information era, reading is increasingly important for keeping up with the growing amount of knowledge. How well a person can read a document depends on their skills and knowledge, and also on whether the readability level of the text matches the reader's level. In this thesis, we propose a system that uses state-of-the-art machine learning and deep learning to classify and simplify a text while taking the reader's reading level into account. The system classifies any text into its readability level; if that level is higher than the reader's, i.e. the text is too difficult to read, the system performs text simplification to meet the desired readability level. The classification and simplification models are trained on data annotated with readability levels from the Newsela corpus. The trained simplification model operates at the sentence level, simplifying a given text to match a specific readability level. Moreover, the trained classification model is used to label additional unlabelled sentences from the Wikipedia and Mechanical Turk corpora in order to enrich the text simplification dataset. The augmented dataset is then used to improve the quality of the simplified sentences. The system generates simplified versions of a text for the desired readability levels, which can help people with low literacy to read and understand the documents they need, and can also benefit educators who assist readers at different reading levels.
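A minimal sketch of the classify-then-simplify loop described in this abstract; the function names classify_level and simplify are hypothetical stand-ins for the trained classification and simplification models (e.g. trained on Newsela-style level annotations), and the numeric level convention is an assumption.

```python
def adapt_text(sentences, reader_level, classify_level, simplify):
    """Return sentences adjusted so none exceeds the reader's readability level."""
    adapted = []
    for sentence in sentences:
        level = classify_level(sentence)          # predicted readability level of the sentence
        if level > reader_level:                  # too difficult for this reader
            sentence = simplify(sentence, target_level=reader_level)
        adapted.append(sentence)
    return adapted
```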
72

Semi-supervised Sentiment Analysis for Sentence Classification

Tsakiri, Eirini January 2022 (has links)
In our work, we deploy semi-supervised learning methods to perform sentiment analysis on a corpus of sentences that are to be labeled as happy, neutral, sad, or angry. Sentence-BERT is used to obtain high-dimensional embeddings for the sentences in the training and testing sets, on which three classification methods are applied: the K-Nearest Neighbors classifier (KNN), Label Propagation, and Label Spreading. The latter two are graph-based classification methods that are expected to give better predictions than the supervised KNN, thanks to their ability to propagate labels of known data to similar (and spatially close) unknown data. In our study, we experiment with multiple combinations of labeled and unlabeled data, various hyperparameters, and four distinct classes of data, and we perform both binary and fine-grained classification tasks. A custom Radial Basis Function kernel is created for this study, in which Euclidean distance is replaced with cosine similarity, to correspond to the metric used in Sentence-BERT. We find that for two of the four tasks, namely the 3-class and 2-class classification, the two graph-based algorithms outperform the chosen baseline, although the scores are not significantly higher. The supervised KNN classifier performs better for the second 3-class classification, as well as the 4-class classification, especially when using embeddings of lower dimensionality. The conclusions drawn from the results are, firstly, that the dataset used is most likely not well suited to graph creation, and, secondly, that larger volumes of labeled data should be used for further interpretation.
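A minimal sketch of the semi-supervised setup described above, assuming scikit-learn's LabelSpreading and the sentence-transformers library; the checkpoint name, gamma value, and toy sentences are illustrative, and only the cosine-based RBF kernel follows the abstract directly.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.semi_supervised import LabelSpreading

def cosine_rbf(X, Y, gamma=20.0):
    # RBF-style kernel with cosine distance in place of squared Euclidean distance
    return np.exp(-gamma * (1.0 - cosine_similarity(X, Y)))

sentences = ["I am thrilled about tomorrow", "This is so unfair", "The meeting is at noon"]
y = np.array([0, 1, -1])                              # -1 marks the unlabeled sentence

encoder = SentenceTransformer("all-MiniLM-L6-v2")     # any Sentence-BERT checkpoint
X = encoder.encode(sentences)                         # one embedding per sentence
model = LabelSpreading(kernel=cosine_rbf, alpha=0.2).fit(X, y)
print(model.transduction_)                            # propagated labels for all sentences
```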
73

Death of the Dictionary? – The Rise of Zero-Shot Sentiment Classification

Borst, Janos, Burghardt, Manuel, Klähn, Jannis 04 July 2024 (has links)
In our study, we conduct a comparative analysis between dictionary-based sentiment analysis and entailment zero-shot text classification for German sentiment analysis. We evaluate the performance of a selection of dictionaries on eleven data sets, including four domain-specific data sets with a focus on historic German language. Our results demonstrate that, in the majority of cases, zero-shot text classification outperforms general-purpose dictionary-based approaches but falls short of the performance achieved by specifically fine-tuned models. Notably, the zero-shot approach exhibits superior performance, particularly in historic German cases, surpassing both general-purpose dictionaries and even a broadly trained sentiment model. These findings indicate that zero-shot text classification holds significant promise as an alternative, reducing the necessity for domain-specific sentiment dictionaries and narrowing the availability gap of off-the-shelf methods for German sentiment analysis. Additionally, we thoroughly discuss the inherent trade-offs associated with the application of these approaches.
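A brief sketch of entailment-based zero-shot sentiment classification with the Hugging Face pipeline; the English NLI checkpoint shown is only for illustration, and a German or multilingual NLI model would be substituted in the setting studied here.

```python
from transformers import pipeline

# The NLI model recasts each candidate label as a hypothesis and scores entailment.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "Der Film war eine herbe Enttäuschung.",
    candidate_labels=["positive", "negative", "neutral"],
    hypothesis_template="The sentiment of this text is {}.",
)
print(result["labels"][0], round(result["scores"][0], 3))
```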
74

Automatic Categorization of News Articles With Contextualized Language Models / Automatisk kategorisering av nyhetsartiklar med kontextualiserade språkmodeller

Borggren, Lukas January 2021 (has links)
This thesis investigates how pre-trained contextualized language models can be adapted for multi-label text classification of Swedish news articles. Various classifiers are built on pre-trained BERT and ELECTRA models, exploring global and local classifier approaches. Furthermore, the effects of domain specialization, using additional metadata features and model compression are investigated. Several hundred thousand news articles are gathered to create unlabeled and labeled datasets for pre-training and fine-tuning, respectively. The findings show that a local classifier approach is superior to a global classifier approach and that BERT outperforms ELECTRA significantly. Notably, a baseline classifier built on SVMs yields competitive performance. The effect of further in-domain pre-training varies; ELECTRA’s performance improves while BERT’s is largely unaffected. It is found that utilizing metadata features in combination with text representations improves performance. Both BERT and ELECTRA exhibit robustness to quantization and pruning, allowing model sizes to be cut in half without any performance loss.
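A rough sketch of the multi-label fine-tuning setup described above, using a public Swedish BERT checkpoint; the number of categories and the 0.5 decision threshold are illustrative assumptions rather than the thesis's actual configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "KB/bert-base-swedish-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=14, problem_type="multi_label_classification"
)

batch = tokenizer(["Artikeltext om svensk inrikespolitik ..."],
                  truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**batch).logits)   # one independent probability per category
predicted = (probs > 0.5).nonzero(as_tuple=False)  # every category above the threshold
```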
75

A study on plagiarism detection and plagiarism direction identification using natural language processing techniques

Chong, Man Yan Miranda January 2013 (has links)
Ever since we entered the digital communication era, the ease of information sharing through the internet has encouraged online literature searching. With this comes the potential risk of a rise in academic misconduct and intellectual property theft. As concerns over plagiarism grow, more attention has been directed towards automatic plagiarism detection. This is a computational approach which assists humans in judging whether pieces of texts are plagiarised. However, most existing plagiarism detection approaches are limited to superficial, brute-force string-matching techniques. If the text has undergone substantial semantic and syntactic changes, string-matching approaches do not perform well. In order to identify such changes, linguistic techniques which are able to perform a deeper analysis of the text are needed. To date, very limited research has been conducted on the topic of utilising linguistic techniques in plagiarism detection. This thesis provides novel perspectives on plagiarism detection and plagiarism direction identification tasks. The hypothesis is that original texts and rewritten texts exhibit significant but measurable differences, and that these differences can be captured through statistical and linguistic indicators. To investigate this hypothesis, four main research objectives are defined. First, a novel framework for plagiarism detection is proposed. It involves the use of Natural Language Processing techniques, rather than only relying on the traditional string-matching approaches. The objective is to investigate and evaluate the influence of text pre-processing, and statistical, shallow and deep linguistic techniques using a corpus-based approach. This is achieved by evaluating the techniques in two main experimental settings. Second, the role of machine learning in this novel framework is investigated. The objective is to determine whether the application of machine learning in the plagiarism detection task is helpful. This is achieved by comparing a threshold-setting approach against a supervised machine learning classifier. Third, the prospect of applying the proposed framework in a large-scale scenario is explored. The objective is to investigate the scalability of the proposed framework and algorithms. This is achieved by experimenting with a large-scale corpus in three stages. The first two stages are based on longer text lengths and the final stage is based on segments of texts. Finally, the plagiarism direction identification problem is explored as supervised machine learning classification and ranking tasks. Statistical and linguistic features are investigated individually or in various combinations. The objective is to introduce a new perspective on the traditional brute-force pair-wise comparison of texts. Instead of comparing original texts against rewritten texts, features are drawn based on traits of texts to build a pattern for original and rewritten texts. Thus, the classification or ranking task is to fit a piece of text into a pattern. The framework is tested by empirical experiments, and the results from initial experiments show that deep linguistic analysis contributes to solving the problems we address in this thesis. Further experiments show that combining shallow and deep techniques helps improve the classification of plagiarised texts by reducing the number of false negatives. In addition, the experiment on plagiarism direction detection shows that rewritten texts can be identified by statistical and linguistic traits.
The conclusions of this study offer ideas for further research directions and potential applications to tackle the challenges that lie ahead in detecting text reuse.
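For context, a minimal illustration of the brute-force n-gram containment measure that the thesis moves beyond; the trigram size and whitespace tokenisation are simplifying assumptions.

```python
def ngrams(tokens, n=3):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def containment(suspicious_text, source_text, n=3):
    # Share of the suspicious text's word n-grams that also occur in the source;
    # high values flag verbatim reuse, but heavy paraphrase evades the measure.
    susp = ngrams(suspicious_text.lower().split(), n)
    src = ngrams(source_text.lower().split(), n)
    return len(susp & src) / len(susp) if susp else 0.0
```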
76

應用情感分析於輿情之研究-以台灣2016總統選舉為例 / A Study of using sentiment analysis for emotion in Taiwan's presidential election of 2016

陳昭元, Chen, Chao-Yuan Unknown Date (has links)
從2014年九合一選舉到今年總統大選，網路在選戰的影響度越來越大，後選人可透過網路上之熱門討論議題即時掌握民眾需求。 文字情感分析通常使用監督式或非監督式的方法來分析文件，監督式透過文件量化可達很高的正確率，但無法預期未知趨勢，耗費人力標注文章。 本研究針對網路上之政治新聞輿情，提出一個混合非監督式與監督式學習的中文情感分析方法，先透過非監督式方法標注新聞，再用監督式方法建立分類模型，驗證分類準確率。 在實驗結果中，主題標注方面，本研究發現因文本數量遠大於議題詞數量造成TFIDF矩陣過於稀疏，使得TFIDF-Kmeans主題模型分類效果不佳；而NPMI-Concor主題模型分類效果較佳但是所分出的議題詞數量不均衡，然而LDA主題模型基於所有主題被所有文章共享的特性，使得在字詞分群與主題分類準確度都優於TFIDF-Kmeans和NPMI-Concor主題模型，分類準確度高達97%，故後續採用LDA主題模型進行主題標注。 情緒傾向標注方面，證實本研究擴充後的情感詞集比起NTUSD有更好的字詞極性判斷效果，並且進一步使用ChineseWordnet 和 SentiWordNet，找出詞彙的情緒強度，使得在網友評論的情緒計算更加準確。亦發現所有文本的情緒指數皆具皆能反應民調指數，故本研究用文本的情緒指數來建立民調趨勢分類模型。 在關注議題分類結果的實驗，整體正確率達到95%，而在民調趨勢分類結果的實驗，整體正確率達到85%。另外建立全面性的視覺化報告以瞭解民眾的正反意見，提供候選人在選戰上之競爭智慧。 / From the 2014 Taiwanese local elections to the 2016 presidential election, the internet has had a growing influence on election campaigns; candidates can grasp the public's needs in real time by following the topics discussed most actively online. Sentiment analysis research encompasses supervised and unsupervised methods for analyzing text. Supervised learning achieves high accuracy, but it cannot anticipate unseen trends and requires labor-intensive manual labeling. In this study, we propose a Chinese sentiment analysis method that combines supervised and unsupervised learning: unsupervised learning is first used to label the articles, and supervised learning is then used to build a classification model and verify its accuracy. For topic labeling, we found that the TFIDF-Kmeans model performed poorly because the number of documents far exceeds the number of topic terms, making the TFIDF matrix too sparse. The NPMI-Concor model performed better, but its topic terms were unbalanced. The LDA model, in which every topic is shared across all documents, outperformed both in word clustering and topic classification, reaching 97% accuracy, so it was adopted to assign article topics. For sentiment labeling, the expanded sentiment lexicon we built judged word polarity more accurately than NTUSD, and we further used the Chinese Wordnet and SentiWordNet to estimate word-level sentiment strength, making the computed sentiment of readers' comments more accurate. We also found that the sentiment indices of the texts track the polling numbers, so these indices were used to build a poll-trend classification model. Topic classification reached 95% overall accuracy, and poll-trend classification reached 85%. We also built a comprehensive visualization report so that candidates can see both positive and negative public opinion, providing competitive intelligence for the campaign.
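A small sketch of LDA-based topic labeling of the kind described above, using gensim; the toy documents, number of topics, and upstream Chinese word segmentation (e.g. with jieba) are all assumptions.

```python
from gensim import corpora
from gensim.models import LdaModel

docs = [["候選人", "政見", "經濟"], ["民調", "支持度", "下滑"], ["候選人", "民調", "辯論"]]
dictionary = corpora.Dictionary(docs)                 # vocabulary built from the tokenised comments
bow = [dictionary.doc2bow(d) for d in docs]           # bag-of-words representation
lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10, random_state=0)
# assign each document the topic with the highest probability
topic_of_doc = [max(lda.get_document_topics(b), key=lambda t: t[1])[0] for b in bow]
```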
77

Coh-Metrix-Dementia: análise automática de distúrbios de linguagem nas demências utilizando Processamento de Línguas Naturais / Coh-Metrix-Dementia: automatic analysis of language impairment in dementia using Natural Language Processing

Cunha, Andre Luiz Verucci da 27 October 2015 (has links)
(Contexto) Segundo a Organização Mundial da Saúde, as demências são um problema de custo social elevado, cujo manejo é um desafio para as próximas décadas. Demências comuns incluem a Doença de Alzheimer (DA), bastante conhecida. Outra síndrome menos conhecida, o Comprometimento Cognitivo Leve (CCL), é relevante por ser o estágio inicial clinicamente definido da DA. Embora o CCL não seja tão conhecido do público, pessoas com um tipo especial dessa síndrome, o CCL amnéstico, evoluem para a DA a uma taxa bastante maior que a da população em geral. O diagnóstico das demências e síndromes relacionadas é feito com base na análise de aspectos linguísticos e cognitivos do paciente. Testes clássicos incluem testes de fluência, nomeação, e repetição. Entretanto, pesquisas recentes têm reconhecido cada vez mais a importância da análise da produção discursiva, especialmente de narrativas, como uma alternativa mais adequada, principalmente para a detecção do CCL. (Lacuna) Enquanto uma análise qualitativa do discurso pode revelar o tipo da doença apresentada pelo paciente, uma análise quantitativa é capaz de revelar a intensidade do dano cerebral existente. A grande dificuldade de análises quantitativas de discurso é sua exigência de esforços: o processo de análise rigorosa e detalhada da produção oral é bastante laborioso, o que dificulta sua adoção em larga escala. Nesse cenário, análises computadorizadas despontam como uma solução de interesse. Ferramentas de análise automática de discurso com vistas ao diagnóstico de demências de linguagem já existem para o inglês, mas nenhum trabalho nesse sentido foi feito para o português até o presente momento. (Objetivo) Este projeto visa criar um ambiente unificado, intitulado Coh-Metrix-Dementia, que se valerá de recursos e ferramentas de Processamento de Línguas Naturais (PLN) e de Aprendizado de Máquina para possibilitar a análise e o reconhecimento automatizados de demências, com foco inicial na DA e no CCL. (Hipótese) Tendo como base o ambiente Coh-Metrix adaptado para o português do Brasil, denominado Coh-Metrix-Port, e incluindo a adaptação para o português e inserção de vinte e cinco novas métricas para calcular a complexidade sintática, a densidade de ideias, e a coerência textual, via semântica latente, é possível classificar narrativas de sujeitos normais, com DA, e com CCL, em uma abordagem de aprendizado de máquina, com precisão comparável a dos testes clássicos. (Conclusão) Nos resultados experimentais, foi possível separar os pacientes entre controles, CCL, e DA com medida F de 81,7%, e separar controles e CCL com medida F de 90%. Os resultados indicam que o uso das métricas da ferramenta Coh-Metrix-Dementia é bastante promissor como recurso na detecção precoce de declínio nas habilidades de linguagem. / (Background) According to the World Health Organization, dementia is a costly social issue whose management will be a challenge in the coming decades. One common form of dementia is Alzheimer's Disease (AD). Another less known syndrome, Mild Cognitive Impairment (MCI), is relevant for being the initial clinically defined stage of AD. Even though MCI is less known by the public, patients with a particular variant of this syndrome, amnestic MCI, evolve to AD in a considerably larger proportion than that of the general population. The diagnosis of dementia and related syndromes is based on the analysis of linguistic and cognitive aspects. Classical exams include fluency, naming, and repetition tests. However, recent research has increasingly recognized the importance of discourse analysis, especially narrative-based analysis, as a more suitable alternative, particularly for MCI detection. (Gap) While qualitative discourse analyses can determine the nature of the patient's disease, quantitative analyses can reveal the extent of the existing brain damage. The greatest challenge in quantitative discourse analyses is that a rigorous and thorough evaluation of oral production is very labor-intensive, which hinders its large-scale adoption. In this scenario, computerized analyses become of increasing interest. Automated discourse analysis tools aiming at the diagnosis of language-impairing dementias already exist for the English language, but no such work has been done for Brazilian Portuguese so far. (Goal) This project aims to create a unified environment, entitled Coh-Metrix-Dementia, that will make use of Natural Language Processing and Machine Learning resources and tools to enable automated dementia analysis and classification, initially focusing on AD and MCI. (Hypothesis) Basing our work on Coh-Metrix-Port, the Brazilian Portuguese adaptation of Coh-Metrix, and including the adaptation and addition of twenty-five new metrics for measuring syntactic complexity, idea density, and text cohesion through latent semantics, it is possible to classify narratives of healthy, AD, and MCI patients, in a machine learning approach, with a precision comparable to classical tests. (Conclusion) In our experiments, it was possible to separate patients into controls, MCI, and AD with an F-measure of 81.7%, and to separate controls from MCI with an F-measure of 90%. These results indicate that Coh-Metrix-Dementia is a very promising resource for the early detection of language impairment.
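A rough proxy for one of the metrics mentioned above, idea density, counted as propositions per ten words with a spaCy part-of-speech tagger; Coh-Metrix-Dementia's actual metric is more elaborate, and the Portuguese pipeline name assumes spaCy's pt_core_news_sm model is installed.

```python
import spacy

nlp = spacy.load("pt_core_news_sm")
PROPOSITION_POS = {"VERB", "AUX", "ADJ", "ADV", "ADP", "CCONJ", "SCONJ"}

def idea_density(text):
    # Propositions (verbs, adjectives, adverbs, adpositions, conjunctions) per ten words.
    doc = nlp(text)
    words = [t for t in doc if t.is_alpha]
    propositions = [t for t in words if t.pos_ in PROPOSITION_POS]
    return 10.0 * len(propositions) / max(len(words), 1)

print(idea_density("A menina correu rapidamente para a escola porque estava atrasada."))
```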
79

The Virtual Landscape of Geological Information: Topics, Methods, and Rhetoric in Modern Geology

Fratesi, Sarah Elizabeth 03 November 2008 (has links)
Geology is undergoing changes that could influence its knowledge claims, reportedly becoming more laboratory-based, technology-driven, quantitative, and physics- and chemistry-rich over time. This dissertation uses techniques from information science and linguistics to examine the geologic literature of 1945-2005. It consists of two studies: an examination of the geological literature as an expanding network of related subdisciplines, and an investigation of the linguistic and graphical argumentation strategies within geological journal articles. The first investigation is a large-scale study of topics within articles from 67 geologic journals. Clustering of subdiscipline journals based on titles and keywords reveals six major areas of geology: sedimentology/stratigraphy, oceans/climate, solid-earth, earth-surface, hard-rock, and paleontology. Citation maps reveal similar relationships. Text classification of titles and keywords from general-geology journals reveals that geological research has shifted away from economic geology towards physics- and chemistry-based topics. Geological literature has grown and fragmented ("twigged") over time, sustained in its extreme specialization by the scientific collaborations characteristic of "big science." The second investigation is a survey of linguistic and graphic features within geological journal articles that signal certain types of scientific activity and reasoning. Longitudinal studies show that "classical geology" articles within Geological Society of America Bulletin became shorter and more graphically dense from 1945 to 2005. Maps and graphs replace less-efficient text, photographs, and sketches. Linguistic markers reveal increases in formal scientific discourse, specialized vocabulary, and reading difficulty. Studies comparing GSA Bulletin to five subdiscipline journals reveal that, in 2005, GSA Bulletin, AAPG Bulletin, and Journal of Sedimentary Research had similar graphic profiles and presented both field and laboratory data. Ground Water, Journal of Geophysical Research - Solid Earth, and Geochimica et Cosmochimica Acta had more equations, graphs, and numerical-modeling results than the other journals. The dissertation concludes that geology evolves by spawning physics- and chemistry-rich subdisciplines with distinct methodologies. Publishing geologists accommodate increased theoretical rigor not by using classic hallmarks of hard science (e.g., equations), but by mobilizing spatial arguments within an increasingly dense web of linguistic and graphical signs. Substantial differences in topic, methodology, and argumentation between subdisciplines manifest the multifaceted and complex constitution of geology and geologic philosophy.
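A compact sketch of the title-and-keyword clustering step described above, assuming TF-IDF features and k-means; the toy journal profiles and vectoriser settings are illustrative (the study itself groups 67 journals into six areas).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# One string per journal, concatenating its article titles and keywords.
journal_texts = [
    "carbonate facies sequence stratigraphy basin subsidence",
    "mantle geochemistry isotope basalt petrology",
    "paleoclimate ocean circulation foraminifera proxy records",
]
X = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(journal_texts)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(clusters)   # cluster id per journal
```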
80

Enrichissement des Modèles de Classification de Textes Représentés par des Concepts / Improving text-classification models using the bag-of-concept paradigm

Risch, Jean-Charles 27 June 2017 (has links)
La majorité des méthodes de classification de textes utilisent le paradigme du sac de mots pour représenter les textes. Pourtant cette technique pose différents problèmes sémantiques : certains mots sont polysémiques, d'autres peuvent être des synonymes et être malgré tout différenciés, d'autres encore sont liés sémantiquement sans que cela soit pris en compte et enfin, certains mots perdent leur sens s'ils sont extraits de leur groupe nominal. Pour pallier ces problèmes, certaines méthodes ne représentent plus les textes par des mots mais par des concepts extraits d'une ontologie de domaine, intégrant ainsi la notion de sens au modèle. Les modèles intégrant la représentation des textes par des concepts restent peu utilisés à cause des résultats peu satisfaisants. Afin d'améliorer les performances de ces modèles, plusieurs méthodes ont été proposées pour enrichir les caractéristiques des textes à l'aide de nouveaux concepts extraits de bases de connaissances. Mes travaux donnent suite à ces approches en proposant une étape d'enrichissement des modèles à l'aide d'une ontologie de domaine associée. J'ai proposé deux mesures permettant d'estimer l'appartenance aux catégories de ces nouveaux concepts. A l'aide de l'algorithme du classifieur naïf Bayésien, j'ai testé et comparé mes contributions sur le corpus de textes labéllisés Ohsumed et l'ontologie de domaine Disease Ontology. Les résultats satisfaisants m'ont amené à analyser plus précisément le rôle des relations sémantiques dans l'enrichissement des modèles. Ces nouveaux travaux ont été le sujet d'une seconde expérience où il est question d'évaluer les apports des relations hiérarchiques d'hyperonymie et d'hyponymie. / Most text-classification methods use the "bag of words" paradigm to represent texts. However, Bloehdorn and Hotho have identified four limits of this representation: (1) some words are polysemous, (2) others can be synonyms and yet are treated as distinct in the analysis, (3) some words are strongly semantically linked without this being taken into account in the representation, and (4) certain words lose their meaning if they are extracted from their nominal group. To overcome these problems, some methods no longer represent texts with words but with concepts extracted from a domain ontology (bag of concepts), integrating the notion of meaning into the model. Models based on the bag-of-concepts representation remain little used because of their unsatisfactory results, so several methods have been proposed to enrich text features with new concepts extracted from knowledge bases. My work follows these approaches by proposing a model-enrichment step that uses an associated domain ontology, together with two measures for estimating how strongly these new concepts belong to each category. Using the naive Bayes classifier algorithm, I tested and compared my contributions on the Ohsumed labeled corpus with the Disease Ontology as domain ontology. The satisfactory results led me to analyse more precisely the role of semantic relations in the enrichment step. This new work was the subject of a second experiment evaluating the contributions of the hierarchical relations of hypernymy and hyponymy.
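A toy sketch of the bag-of-concepts idea described above: surface forms are mapped to ontology concept identifiers before a naive Bayes classifier is trained. The lookup table stands in for a real Disease Ontology matcher, and the underscored concept ids simply keep each identifier as a single token for the default vectoriser.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical lexicon: synonyms map to the same concept identifier.
concept_lexicon = {"flu": "DOID_8469", "influenza": "DOID_8469", "diabetes": "DOID_9351"}

def to_concepts(text):
    return " ".join(concept_lexicon[w] for w in text.lower().split() if w in concept_lexicon)

raw_docs = ["Seasonal influenza outbreak reported",
            "Managing diabetes with diet",
            "Flu vaccine coverage this winter"]
labels = [0, 1, 0]
X = CountVectorizer().fit_transform(to_concepts(d) for d in raw_docs)
clf = MultinomialNB().fit(X, labels)
```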
