11

Extração automática de termos simples baseada em aprendizado de máquina / Automatic simple term extraction based on machine learning

Laguna, Merley da Silva Conrado, 06 May 2014
Text Mining (TM) aims at discovering innovative knowledge in unstructured texts. The extraction of the terms that represent the texts of a specific domain is one of the most important steps of TM, since the results of the overall TM process largely depend on the quality of these terms. In this thesis, we consider terms to be the lexical units used to designate concepts in a thematically restricted scenario. The term extraction task may use statistical, linguistic, or hybrid approaches. Statistical methods are the most common in Text Mining: they are computationally less expensive than linguistic ones, but their results tend to be less interpretable. Moreover, neither kind of method is always capable of telling terms and non-terms apart. For example, statistical methods may miss rare terms or terms that have the same frequency as non-terms, and linguistic methods may not distinguish terms from non-terms that follow the same linguistic patterns. One solution to this problem is to use hybrid methods, which combine the strategies of linguistic and statistical methods in order to attenuate the problems inherent to each of them. Considering these characteristics, in this thesis we investigated statistical methods, ways of obtaining linguistic knowledge, and hybrid methods for extracting simple terms (those consisting of a single stem, with or without affixes) in Brazilian Portuguese. Four statistical measures (tvq, tv, tc, and comGram), originally used in other tasks, were evaluated for simple term extraction; two of them (tvq and tv) proved relevant for this task. Four new hybrid measures (n_subst., n_adj., n_po, and n_verb) were proposed; three of them (n_subst., n_adj., and n_po) helped in the term extraction task. Typically, term extraction methods select term candidates based on some linguistic knowledge and then apply measures, or combinations of measures and/or heuristics, to rank these candidates: the higher a candidate sits in the ranking, the greater the chance that it is a term. The threshold position in this ranking is usually chosen manually or semi-automatically by domain experts and/or terminologists. The first motivation of this research is to automate the selection of term candidates. The second motivation is to reduce the high number of term candidates produced during extraction; this number, caused by the large amount of words in a corpus, can increase the time complexity and the computational resources needed to extract the terms. The third motivation is to improve the state of the art of automatic simple term extraction for Brazilian Portuguese, whose results (F-measure = 16%) are still low compared with languages such as English (F-measure = 92%) and Spanish (F-measure = 68%). Given these motivations, we proposed the MATE-ML method (Automatic Term Extraction based on Machine Learning), which aims to automatically extract simple terms using machine learning techniques. MATE-ML suggests the use of filters to reduce the high number of term candidates without harming the representation of the domain, so that extractors may generate smaller candidate lists and experts spend less time evaluating them. The method was instantiated in two approaches: (i) ILATE (Inductive Learning for Automatic Term Extraction), which uses supervised inductive classification to label candidates as terms or non-terms, and (ii) TLATE (Transductive Learning for Automatic Term Extraction), which uses transductive semi-supervised classification to propagate labels from labeled candidates to unlabeled ones. Applying transductive learning to term extraction, together with a rich set of candidate features drawn from different levels of knowledge (linguistic, statistical, and hybrid), is also a contribution of this thesis. We discuss the advantages and limitations of both approaches and show that they usually achieve higher precision (above 81% in the best cases), high recall (above 87% in the best cases), and good F-measure values (a maximum of 41%) compared with the methods and measures evaluated on three corpora from different domains in Brazilian Portuguese.
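As a rough illustration of the candidate-labeling idea behind ILATE, the sketch below trains a supervised classifier to label term candidates directly, removing the need for a manually chosen ranking threshold. The features, values, and labels are invented placeholders; the thesis's actual feature set (tvq, tv, n_subst., n_adj., n_po, etc.) and corpora are not reproduced here.

```python
# Minimal sketch of ILATE-style candidate labeling: a supervised classifier
# decides term / non-term directly, so no manual ranking threshold is needed.
# Features, values, and labels below are invented placeholders, not the
# thesis's actual measures (tvq, tv, n_subst., n_adj., n_po) or corpora.
from sklearn.linear_model import LogisticRegression

# Per-candidate features: [relative frequency in the domain corpus,
# relative frequency in a general reference corpus,
# proportion of occurrences tagged as a noun]
X_train = [
    [0.0040, 0.0001, 0.95],  # domain-specific noun          -> term
    [0.0035, 0.0030, 0.10],  # frequent everywhere, not noun -> non-term
    [0.0008, 0.0000, 0.90],  # rare but domain-specific      -> term
    [0.0050, 0.0045, 0.40],  # common general word           -> non-term
]
y_train = [1, 0, 1, 0]       # expert labels: 1 = term, 0 = non-term

clf = LogisticRegression().fit(X_train, y_train)

# New candidates are labeled directly instead of being cut off at a
# manually chosen ranking position.
X_new = [[0.0030, 0.0002, 0.88], [0.0025, 0.0026, 0.15]]
print(clf.predict(X_new))        # expected: [1 0] on this toy data
print(clf.predict_proba(X_new))  # class probabilities, useful for inspection
```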
13

O efeito do uso de diferentes formas de extração de termos na compreensibilidade e representatividade dos termos em coleções textuais na língua portuguesa / The effect of using different forms of terms extraction on its comprehensibility and representability in Portuguese textual domains

Merley da Silva Conrado, 10 September 2009
The task of term extraction in text collections, a subtask of text pre-processing in Text Mining, can serve many purposes in knowledge extraction processes. The terms must be carefully extracted, since their quality will have a high impact on the results of the whole process. In this work, the quality of a term covers both its representativity in the specific domain and its comprehensibility. Given this importance, this work evaluated the effect of different term simplification techniques on the comprehensibility and representativity of terms extracted from text collections in Portuguese. Terms were extracted following the methodology presented in this work, using three simplification techniques: stemming (radicalization), lemmatization, and nominalization (substantivation). To support this methodology, a term extraction tool, ExtraT, was developed. To guarantee the quality of the extracted terms, they were evaluated both objectively and subjectively. The subjective evaluations, assisted by domain specialists, cover the representativity of the terms in their respective documents, the comprehensibility of the terms obtained with each technique, and the specialists' overall preference among the techniques. The objective evaluations, assisted by TaxEM (Taxonomia em XML da Embrapa) and by Thesagro (National Agricultural Thesaurus), consider the number of terms extracted by each technique as well as the representativity of those terms in their respective documents, using the CTW (Context Term Weight) measure as support. Eight real text collections from the agribusiness domain were used in the experimental evaluation. As a result, positive and negative characteristics of the simplification techniques were identified, showing that the choice of technique for this domain depends on the main pre-established goal, which may range from the need for terms that users can understand to the need to work with a smaller number of terms.
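To make the trade-off concrete, here is a minimal sketch of how one simplification technique (stemming) reduces the number of distinct candidate forms at the cost of readability. It uses NLTK's RSLP stemmer for Portuguese; the toy token list is invented, and this is not the ExtraT implementation.

```python
# Sketch of one term-simplification technique (stemming) and its main
# trade-off: fewer distinct candidate forms, but less readable ones.
# Uses NLTK's RSLP stemmer for Portuguese; the token list is invented.
import nltk
nltk.download("rslp", quiet=True)
from nltk.stem import RSLPStemmer

tokens = ["extração", "extrações", "extrator", "termo", "termos", "terminologia"]

stemmer = RSLPStemmer()
surface_forms = set(tokens)
stemmed_forms = {stemmer.stem(t) for t in tokens}

print(len(surface_forms), sorted(surface_forms))  # 6 distinct surface forms
print(len(stemmed_forms), sorted(stemmed_forms))  # fewer forms, harder to read
```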
14

Computational Terminology: Exploring Bilingual and Monolingual Term Extraction

Foo, Jody, January 2012
Terminologies are becoming more important to modern day society as technology and science continue to grow at an accelerating rate in a globalized environment. Agreeing upon which terms should be used to represent which concepts, and how those terms should be translated into different languages, is important if we wish to communicate with as little confusion and misunderstanding as possible. Since the 1990s, an increasing amount of terminology research has been devoted to facilitating and augmenting terminology-related tasks by using computers and computational methods. One focus for this research is Automatic Term Extraction (ATE). In this compilation thesis, studies on both bilingual and monolingual ATE are presented. First, two publications report on how bilingual ATE using the align-extract approach can be used to extract patent terms. The result in this case was 181,000 manually validated English-Swedish patent terms, which were to be used in a machine translation system for patent documents. A critical component of the method is the Q-value metric, presented in the third paper, which can be used to rank extracted term candidates (TCs) in an order that correlates with TC precision. The use of Machine Learning (ML) in monolingual ATE is the topic of the two final contributions. The first ML-related publication shows that rule-induction-based ML can be used to generate linguistic term selection patterns, and in the second ML-related publication, contrastive n-gram language models are used in conjunction with SVM-based ML to improve the precision of term candidates selected using linguistic patterns.
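The Q-value metric itself is defined in the thesis's third paper. As a stand-in for the general idea of ranking term candidates so that rank correlates with precision, the sketch below scores bilingual candidate pairs by frequency penalized by translation ambiguity; the formula and data are illustrative assumptions, not Foo's metric.

```python
# Stand-in illustration (NOT the thesis's Q-value): rank bilingual term-pair
# candidates so that frequent, unambiguous pairs come first, the property a
# good ranking metric should have. Data are invented.
from collections import defaultdict

pairs = [  # (English term, Swedish term, extraction frequency)
    ("claim", "patentkrav", 50),
    ("claim", "krav", 20),
    ("embodiment", "utföringsform", 40),
    ("device", "anordning", 35),
    ("device", "apparat", 5),
]

src_translations = defaultdict(set)
tgt_translations = defaultdict(set)
for en, sv, _ in pairs:
    src_translations[en].add(sv)
    tgt_translations[sv].add(en)

def score(en, sv, freq):
    # Frequency, penalized when either side has many competing translations.
    return freq / (len(src_translations[en]) + len(tgt_translations[sv]))

for en, sv, freq in sorted(pairs, key=lambda p: -score(*p)):
    print(f"{score(en, sv, freq):6.2f}  {en} -> {sv}")
```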
15

Étude de l’évolution dans la terminologie de l’informatique en anglais avant et après 2006 / A study of the evolution of English computer science terminology before and after 2006

Lafrance, Angélique, 09 1900
In this study, we propose a method for observing lexical changes (neology and necrology) in English in the field of computer science over a short diachronic period. Since computer science evolves quickly, we believe a short-period diachronic approach (10 years) lends itself well to studying the terminology of this field. For this purpose, we built an English corpus of articles from the general-public computer magazines PC Magazine and PC World, covering the years 2001 to 2010. The corpus was divided into two subcorpora: 2001-2005 and 2006-2010. We chose 2006 as the pivot year because Facebook (the most popular social network) has been open to the public since that year, and we expected this would cause rapid changes in computer science terminology. For each magazine, we selected one issue per year from 2001 to 2010, for a total of about 540,000 words for the 2001-2005 subcorpus and about 390,000 words for the 2006-2010 subcorpus. Each subcorpus was submitted to the term extractor TermoStat to extract nominal, verbal, and adjectival term candidates. We carried out three groups of experiments, according to the reference corpus used. In the first group (Exp1), we compared each subcorpus to TermoStat's default reference corpus for English, an extract of the British National Corpus (BNC). In the second group (Exp2), we compared each subcorpus to the whole computer science corpus we created. In the third group (Exp3), we compared the two subcorpora with each other. After cleaning up the resulting term-candidate lists to retain only terms of the computer science domain, and generating data on the variation in frequency and relative specificity of the terms between subcorpora, we validated the novelty and obsolescence of the top terms of each list to determine whether the proposed method works better for a particular type of lexical change (novelty or obsolescence), part of speech (nominal, verbal, or adjectival terms), or experiment group. The validation results show that the method seems better suited to extracting neologisms than necrologisms. We also obtained better results for nominal and adjectival terms than for verbal terms. Finally, we obtained many more results with Exp1 than with Exp2 and Exp3.
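A corpus-comparison measure of the kind such term extractors rely on can be sketched as follows, here with a simplified log-likelihood ratio after Dunning (1993); TermoStat's exact specificity computation may differ, and the counts below are invented.

```python
# Sketch of a subcorpus-vs-reference comparison of the kind used above,
# here with a simplified two-term log-likelihood ratio (after Dunning 1993).
# TermoStat's exact specificity computation may differ; counts are invented.
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Score how unexpected a term's study-corpus frequency is."""
    total = freq_study + freq_ref
    e_study = size_study * total / (size_study + size_ref)
    e_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    if freq_study:
        ll += freq_study * math.log(freq_study / e_study)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / e_ref)
    return 2 * ll

# A word frequent only in the 2006-2010 subcorpus scores high (neology
# candidate); a word with stable usage across subcorpora scores near zero.
print(log_likelihood(120, 390_000, 3, 540_000))  # high: candidate neologism
print(log_likelihood(40, 390_000, 55, 540_000))  # near zero: stable usage
```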
16

Analyse comparative de l'équivalence terminologique en corpus parallèle et en corpus comparable : application au domaine du changement climatique / A comparative analysis of terminological equivalence in parallel and comparable corpora: an application to the domain of climate change

Le Serrec, Annaïch, 04 1900
The research undertaken for this thesis concerns the analysis of terminological equivalence in a parallel corpus and a comparable corpus. More specifically, we focus on specialized texts related to the domain of climate change. A unique aspect of this study is its analysis of the equivalents of single-word terms. The theoretical frameworks on which we rely are terminologie textuelle (Bourigault and Slodzian 1999) and the lexico-semantic approach (L'Homme 2005). This study has two objectives. The first is to perform a comparative analysis of terminological equivalents in the two types of corpora in order to verify whether the equivalents found in the parallel corpus differ from those observed in the comparable corpus. The second is to compare in detail the equivalents associated with the same English term, in order to describe them, catalogue them, and derive a typology. A detailed analysis of the French equivalents of 343 English terms was carried out with the help of computer tools (a term extractor, a text aligner, etc.) and a rigorous methodology divided into three parts. The first part, common to both objectives, concerns the elaboration of the corpora, the validation of the English terms, and the identification of the French equivalents in the two corpora. The second part describes the criteria used to compare the equivalents of the two types of corpora. The third part sets up the typology of equivalents associated with the same English term. The results for the first objective show that, of the 343 English terms analyzed, relatively few (12) have questionable equivalents in both corpora, while the number of terms with similar equivalence across the corpora is very high (272 identical equivalents and 55 unobjectionable ones). This comparative analysis confirms our hypothesis that the terminology used in parallel corpora does not differ from that used in comparable corpora. The results for the second objective show that many English terms are rendered by several equivalents (70% of the analyzed terms). We also found that the largest group of equivalents is formed not by synonyms but by near-synonyms, and that equivalents belonging to another part of speech constitute an important share of the equivalents analyzed. The typology developed in this thesis thus describes mechanisms of terminological equivalence seldom covered so systematically in previous work.
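For the parallel-corpus side, here is a toy sketch of how candidate French equivalents of an English term can be spotted in sentence-aligned text, using the Dice coefficient. The thesis itself relies on dedicated tools (a term extractor and a text aligner), so this heuristic and its sentences are only illustrative.

```python
# Toy sketch of spotting candidate French equivalents of an English term in
# sentence-aligned parallel text with the Dice coefficient. The thesis uses
# dedicated tools (term extractor, text aligner); sentences are invented.
from collections import Counter

aligned = [  # (English sentence, French sentence)
    ("warming increases sea level", "le réchauffement augmente le niveau de la mer"),
    ("global warming affects climate", "le réchauffement climatique affecte le climat"),
    ("emissions cause warming", "les émissions causent le réchauffement"),
    ("ice melts quickly", "la glace fond rapidement"),
]

term = "warming"
n_term = 0          # sentence pairs whose English side contains the term
n_word = Counter()  # French word -> number of sentences containing it
cooc = Counter()    # French word -> co-occurrences with the term

for en, fr in aligned:
    fr_words = {w for w in fr.split() if len(w) > 3}  # skip short function words
    for w in fr_words:
        n_word[w] += 1
    if term in en.split():
        n_term += 1
        cooc.update(fr_words)

def dice(w):
    return 2 * cooc[w] / (n_term + n_word[w])

for w in sorted(cooc, key=dice, reverse=True)[:3]:
    print(f"{dice(w):.2f}  {w}")  # 'réchauffement' ranks first here
```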
18

Modeling the socio-semantic dynamics of scientific communities / Modélisation des dynamiques socio-sémantiques dans les communautés scientifiques

Omodei, Elisa, 19 December 2014
How do the social and semantic structures of a scientific community drive future research dynamics? In this thesis we combine natural language processing techniques and network theory methods to analyze a very large dataset of scientific publications in the field of computational linguistics: the ACL Anthology. Our goal is to understand the role of collaborations among researchers in building and shaping the landscape of scientific knowledge, and, symmetrically, to understand how the configuration of this landscape influences the individual trajectories of researchers and their interactions. We use natural language processing tools to extract the terms corresponding to scientific concepts from the texts of the publications. We then reconstruct a socio-semantic network connecting researchers and scientific concepts, and model the dynamics of its evolution at different scales. To achieve this, we first build a statistical model, based on multivariate logistic regression, that quantifies the role social and semantic features play in the evolution of the socio-semantic network, namely in the emergence of new links. We then reconstruct the evolution of the field through different visualizations of the knowledge produced therein and of the flow of researchers across the different subfields of the domain. In summary, our work shows that combining natural language processing techniques with complex network analysis makes it possible to investigate the evolution of scientific fields in a novel way.
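A minimal sketch of the kind of model described, multivariate logistic regression over social and semantic features of candidate links, is shown below; the features and observations are invented, not those of the ACL Anthology study.

```python
# Minimal sketch of a multivariate logistic regression predicting whether a
# candidate link in a socio-semantic network will appear in the next period.
# Features and observations are invented, not those of the ACL Anthology study.
from sklearn.linear_model import LogisticRegression

# One row per candidate collaboration at time t:
# [common coauthors, semantic overlap of the two vocabularies (0-1),
#  difference in academic age (years)]
X = [
    [3, 0.80, 1],
    [0, 0.10, 9],
    [2, 0.60, 2],
    [0, 0.70, 5],
    [1, 0.05, 7],
    [4, 0.90, 0],
]
y = [1, 0, 1, 1, 0, 1]  # 1 = the collaboration appeared in the next period

model = LogisticRegression().fit(X, y)
# Coefficient signs and magnitudes indicate the relative weight of social
# versus semantic proximity in the emergence of new links.
print(model.coef_, model.intercept_)
```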
19

Odvození slovníku pro nástroj Process Inspector na platformě SharePoint / Derivation of Dictionary for Process Inspector Tool on SharePoint Platform

Pavlín, Václav, January 2012
This master's thesis presents methods for mining important pieces of information from text. It analyzes the problem of term extraction from a large document collection and describes an implementation in C# using Microsoft SQL Server. The system uses stemming and a number of statistical methods for term extraction. The project also compares the methods used and proposes a process for deriving the dictionary.
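The thesis implements its pipeline in C# with Microsoft SQL Server; purely as an illustration of the stemming-plus-frequency idea, here is a compact Python sketch with a placeholder stemmer and invented documents.

```python
# Illustrative Python sketch of the stemming-plus-frequency idea; the thesis
# itself implements this in C# with Microsoft SQL Server. The stemmer is a
# crude placeholder and the documents are invented.
from collections import Counter
import re

docs = [
    "Process approval requires a review of each approved process step.",
    "The reviewer approves the process after reviewing every step.",
]
stopwords = {"the", "a", "of", "each", "every", "after"}

def crude_stem(word):
    # Placeholder stemmer: strips a few common English suffixes.
    return re.sub(r"(ing|ed|es|s|al)$", "", word)

counts = Counter(
    crude_stem(w)
    for doc in docs
    for w in re.findall(r"[a-z]+", doc.lower())
    if w not in stopwords
)
# Stems frequent enough across the collection become dictionary candidates.
print([stem for stem, n in counts.most_common() if n >= 2])
# -> ['proces', 'approv', 'review', 'step'] (inflected variants collapsed)
```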
20

Extrakce sémantických vztahů z textu / Extraction of Semantic Relations from Text

Schmidt, Marek, January 2008
This thesis addresses the extraction of semantic relations from English text, focusing on the exploitation of a dependency parser. A method based on syntactic patterns is proposed and evaluated, in addition to an evaluation of several statistical methods over syntactic features. The methods are applied to the extraction of the hypernymy relation and evaluated against the WordNet thesaurus. Based on these methods, a system for the extraction of semantic relations from text was designed and implemented.
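As a simplified illustration of pattern-based relation extraction, the sketch below applies a classic surface Hearst pattern ("X such as Y") with regular expressions; the thesis instead matches patterns over dependency-parser output, which is more robust. The example text is invented.

```python
# Simplified illustration of pattern-based hypernymy extraction using a
# surface Hearst pattern ("X such as Y"). The thesis matches patterns over
# dependency-parser output instead, which is more robust; text is invented.
import re

PATTERN = re.compile(r"(\w+) such as ((?:\w+, )*\w+(?: and \w+)?)")

text = ("Relations such as hypernymy are studied. "
        "Parsers such as MaltParser and MiniPar produce dependency trees.")

for match in PATTERN.finditer(text):
    hypernym = match.group(1)
    for hyponym in re.split(r", | and ", match.group(2)):
        print(f"{hyponym} is-a {hypernym}")
# -> hypernymy is-a Relations
#    MaltParser is-a Parsers
#    MiniPar is-a Parsers
```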
