Global ETD Search

1	Extração de termos de manuais técnicos de produtos tecnológicos: uma aplicação em Sistemas de Adaptação Textual / Term extraction from technological products instruction manuals: an application in textual adaptation systems Muniz, Fernando Aurélio Martins 28 April 2011
In Brazil, about 68% of the population is classified as low-literacy readers, that is, people at the rudimentary (21%) or basic (47%) literacy level, according to the National Indicator of Functional Literacy (INAF, 2009). The PorSimples project used the two approaches to Textual Adaptation, Simplification and Elaboration, to help low-literacy readers understand Brazilian Portuguese documents available on the Web, mainly news articles. This master's research also addressed both approaches, but focused on the genre of instructional texts. In tasks that require technical documentation, the quality of the documentation is critical: if it is inaccurate, incomplete, or too complex, the cost of the task, or even the risk of accidents, increases greatly. Instruction manuals have two basic procedural relations: generation (gera), in which performing one action automatically brings about another, and enablement (habilita), in which performing one action makes another possible, but the agent must do something more to guarantee that the second action occurs. The project described here, named NorMan, studied how the generation and enablement relations are realized in instruction manuals, providing the basis for the NorMan Extractor system, which implements a term extraction method dedicated to instructional texts, specifically technical manuals. We also proposed adapting SIMPLIFICA, the authoring system for simplified texts created in the PorSimples project, to handle the instructional genre.
The adapted SIMPLIFICA uses the list of term candidates generated by NorMan Extractor for two purposes: (a) to help identify words that should not be simplified by the synonym-based lexical simplification method, and (b) to generate lexical elaborations that make the text easier to understand.
Keywords: Automatic term extraction; Technical documentation; Text simplification
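The two uses of the term list described in entry 1, (a) protecting terms from synonym replacement and (b) attaching a lexical elaboration to them, can be illustrated with a minimal Python sketch. This is not the SIMPLIFICA or NorMan Extractor implementation; the function name `adapt_text`, the toy synonym table, the term list, and the glossary are all hypothetical and exist only to show the idea.

```python
import re

def adapt_text(text, synonyms, terms, glossary):
    """Replace hard words with simpler synonyms, but leave domain terms
    untouched and append a short explanation to each one instead.
    All four arguments are hypothetical toy inputs, not SIMPLIFICA data."""
    out = []
    # Split into word and non-word tokens so punctuation and spacing survive.
    for token in re.findall(r"\w+|\W+", text):
        key = token.lower()
        if key in terms:
            # (a) protected term: never simplified; (b) elaborated if a gloss exists.
            gloss = glossary.get(key)
            out.append(f"{token} ({gloss})" if gloss else token)
        elif key in synonyms:
            repl = synonyms[key]
            # Keep the capitalization of the word being replaced.
            out.append(repl.capitalize() if token[0].isupper() else repl)
        else:
            out.append(token)
    return "".join(out)

terms = {"router"}  # stand-in for an extracted term-candidate list
glossary = {"router": "the device that shares the internet connection"}
synonyms = {"utilize": "use", "router": "box"}  # "router" has a synonym but is protected

result = adapt_text("Utilize the router.", synonyms, terms, glossary)
print(result)
```

In this toy run the ordinary word "Utilize" is simplified, while "router" is kept and explained even though the synonym table offers a replacement for it.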
3	Extração automática de termos simples baseada em aprendizado de máquina / Automatic simple term extraction based on machine learning Laguna, Merley da Silva Conrado 06 May 2014
Text Mining (TM) aims to discover new knowledge in unstructured texts. Extracting the terms that represent the texts of a specific domain is one of the most important steps of TM, since the results of the whole process depend largely on the quality of these terms. In this thesis, terms are the lexical units used to designate concepts in a thematically restricted scenario.
Term extraction may follow a statistical, linguistic, or hybrid approach. Text Mining typically relies on statistical methods, which are computationally cheaper than linguistic ones but tend to produce less interpretable results. Both kinds of method often fail to distinguish terms from non-terms: statistical methods may miss rare terms or terms that have the same frequency as non-terms, and linguistic methods may not separate terms from non-terms that follow the same linguistic patterns. One solution is to use hybrid methods, which combine linguistic and statistical strategies to attenuate the problems inherent in each. With these characteristics in mind, we investigated statistical methods, ways of obtaining linguistic knowledge, and hybrid methods for extracting simple terms (those consisting of a single radical, with or without affixes) in Brazilian Portuguese. We evaluated four statistical measures (tvq, tv, tc, and comGram), originally used in other tasks, for simple term extraction; two of them (tvq and tv) proved relevant. We also proposed four new hybrid measures (n_subst., n_adj., n_po, and n_verbo), three of which (n_subst., n_adj., and n_po) helped term extraction.
Typically, extraction methods select term candidates based on some linguistic knowledge and then apply measures, or combinations of measures and/or heuristics, to rank them. The higher a candidate appears in the ranking, the more likely it is to be a term. The cutoff point in this ranking is usually chosen manually or semi-automatically by domain experts and/or terminologists. The first motivation of this thesis is to automate the choice of candidates. The second is to reduce the high number of term candidates; this number, caused by the large amount of words in a corpus, can increase the time complexity and the computational resources needed to extract terms. The third is to improve the state of the art in automatic simple term extraction for Brazilian Portuguese, whose results (F-measure = 16%) are still low compared with languages such as English (F-measure = 92%) and Spanish (F-measure = 68%).
Given these motivations, we proposed MATE-ML (Automatic Term Extraction based on Machine Learning), a method that extracts terms automatically using machine learning techniques. MATE-ML suggests filters that reduce the number of term candidates without harming the representation of the domain. We therefore believe that extractors can generate smaller candidate lists, which take experts less time to evaluate. MATE-ML was instantiated in two approaches: (i) ILATE (Inductive Learning for Automatic Term Extraction), which uses supervised inductive classification to label candidates as terms or non-terms, and (ii) TLATE (Transductive Learning for Automatic Term Extraction), which uses transductive semi-supervised classification to propagate labels from labeled to unlabeled candidates. Applying transductive learning to term extraction, and using at the same time a rich set of candidate features from different levels of knowledge (linguistic, statistical, and hybrid), are also contributions of this thesis. We discuss the advantages and limitations of ILATE and TLATE.
Compared with the methods and measures evaluated in experiments on three corpora from different domains in Brazilian Portuguese, these approaches generally achieve higher precision (above 81% in the best cases), high coverage (above 87% in the best cases), and good F-measure values (up to 41%).
Keywords: Automatic term extraction; Machine learning; Linguistic, statistical, and hybrid knowledge
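The conventional pipeline that entry 3 sets out to automate (score the candidates, rank them, cut the ranking at some position, and compare the result with a reference list) can be sketched in a few lines of Python. This is a generic illustration, not MATE-ML, ILATE, or TLATE; the candidate scores, the gold list, and the cutoff `k` are invented toy values, and the F-measure is the usual harmonic mean of precision and coverage (recall).

```python
def rank_and_cut(scores, k):
    """Sort candidates by score (highest first, ties alphabetical) and keep the top k."""
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return ranked[:k]

def evaluate(extracted, gold):
    """Return precision, recall, and F-measure of an extracted list against a gold set."""
    hits = len(set(extracted) & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

# Hypothetical scores from some statistical measure; higher means "more term-like".
scores = {"corpus": 9.0, "termo": 7.5, "texto": 6.0, "coisa": 5.0, "radical": 2.0}
gold = {"corpus", "termo", "radical"}  # hypothetical reference term list

top3 = rank_and_cut(scores, 3)
precision, recall, f_measure = evaluate(top3, gold)
print(top3, round(precision, 2), round(recall, 2), round(f_measure, 2))
```

The toy data also shows the weakness mentioned in the abstract: "radical" is a true term with a low score, so a purely score-based cutoff misses it, and moving the cutoff changes precision and coverage in opposite directions.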
5	Computational Terminology : Exploring Bilingual and Monolingual Term Extraction Foo, Jody January 2012
Terminologies are becoming more important to modern society as technology and science continue to grow at an accelerating rate in a globalized environment. Agreeing on which terms should represent which concepts, and on how those terms should be translated into different languages, is important if we wish to communicate with as little confusion and as few misunderstandings as possible. Since the 1990s, an increasing amount of terminology research has been devoted to facilitating and augmenting terminology-related tasks by using computers and computational methods. One focus of this research is Automatic Term Extraction (ATE). This compilation thesis presents studies on both bilingual and monolingual ATE. The first two publications report on how bilingual ATE using the align-extract approach can be used to extract patent terms. The result in this case was 181,000 manually validated English-Swedish patent terms, which were to be used in a machine translation system for patent documents. A critical component of the method is the Q-value metric, presented in the third paper, which can be used to rank extracted term candidates (TC) in an order that correlates with TC precision. The use of Machine Learning (ML) in monolingual ATE is the topic of the two final contributions. The first ML-related publication shows that rule-induction-based ML can be used to generate linguistic term selection patterns; in the second, contrastive n-gram language models are used in conjunction with SVM ML to improve the precision of term candidates selected using linguistic patterns.
Keywords: terminology; automatic term extraction; automatic term recognition; computational terminology; terminology management
6	Étude de l’évolution dans la terminologie de l’informatique en anglais avant et après 2006 Lafrance, Angélique 09 1900 (has links) Dans la présente étude, nous proposons une méthode pour observer les changements lexicaux (néologie et nécrologie) en anglais dans le domaine de l’informatique en diachronie courte. Comme l’informatique évolue rapidement, nous croyons qu’une approche en diachronie courte (sur une période de 10 ans) se prête bien à l’étude de la terminologie de ce domaine. Pour ce faire, nous avons construit un corpus anglais constitué d’articles de revues d’informatique grand public, PC Magazine et PC World, couvrant les années 2001 à 2010. Le corpus a été divisé en deux sous-corpus : 2001-2005 et 2006-2010. Nous avons choisi l'année 2006 comme pivot, car c’est depuis cette année-là que Facebook (le réseau social le plus populaire) est ouvert au public, et nous croyions que cela donnerait lieu à des changements rapides dans la terminologie de l’informatique. Pour chacune des deux revues, nous avons sélectionné un numéro par année de 2001 à 2010, pour un total d’environ 540 000 mots pour le sous-corpus de 2001 à 2005 et environ 390 000 mots pour le sous-corpus de 2006 à 2010. Chaque sous-corpus a été soumis à l’extracteur de termes TermoStat pour en extraire les candidats-termes nominaux, verbaux et adjectivaux. Nous avons procédé à trois groupes d’expérimentations, selon le corpus de référence utilisé. Dans le premier groupe d’expérimentations (Exp1), nous avons comparé chaque sous-corpus au corpus de référence par défaut de TermoStat pour l’anglais, un extrait du British National Corpus (BNC). Dans le deuxième groupe d’expérimentations (Exp2), nous avons comparé chacun des sous-corpus à l’ensemble du corpus informatique que nous avons créé. Dans le troisième groupe d’expérimentations (Exp3), nous avons comparé chacun des sous-corpus entre eux. 
Après avoir nettoyé les listes de candidats-termes ainsi obtenues pour ne retenir que les termes du domaine de l’informatique, et généré des données sur la variation de la fréquence et de la spécificité relative des termes entre les sous-corpus, nous avons procédé à la validation de la nouveauté et de l’obsolescence des premiers termes de chaque liste pour déterminer si la méthode proposée fonctionne mieux avec un type de changement lexical (nouveauté ou obsolescence), une partie du discours (termes nominaux, termes verbaux et termes adjectivaux) ou un groupe d’expérimentations. Les résultats de la validation montrent que la méthode semble mieux convenir à l’extraction des néologismes qu’à l’extraction des nécrologismes. De plus, nous avons obtenu de meilleurs résultats pour les termes nominaux et adjectivaux que pour les termes verbaux. Enfin, nous avons obtenu beaucoup plus de résultats avec l’Exp1 qu’avec l’Exp2 et l’Exp3. / In this study, we propose a method to observe lexical changes (neology and necrology) in English in the field of computer science in short-period diachrony. Since computer science evolves quickly, we believe that a short-period diachronic approach (over a period of 10 years) lends itself to studying the terminology of that field. For this purpose, we built a corpus in English with articles taken from computer science magazines for the general public, PC Magazine and PC World, covering the years 2001 to 2010. The corpus was divided into two subcorpora: 2001-2005 and 2006-2010. We chose year 2006 as a pivot, because Facebook (the most popular social network) has been open to the public since that year, and we believed that would cause quick changes in computer science terminology. For each of the magazines, we selected one issue per year from 2001 to 2010, for a total of about 540,000 words for the 2001-2005 subcorpus and about 390,000 words for the 2006-2010 subcorpus. 
Each subcorpus was submitted to the term extractor TermoStat to extract nominal, verbal and adjectival term candidates. We carried out three groups of experiments, according to the reference corpus used. In the first experiment group (Exp1), we compared each subcorpus to the default reference corpus in TermoStat for English, an extract of the British National Corpus (BNC). In the second experiment group (Exp2), we compared each subcorpus to the whole computer science corpus we created. In the third experiment group (Exp3), we compared the two subcorpora with each other. After cleaning up the resulting lists of term candidates to retain only the terms in the field of computer science, and generating data on the variation in frequency and relative specificity of the terms between subcorpora, we validated the novelty and obsolescence of the first terms of each list to determine whether the proposed method works better with a particular type of lexical change (novelty or obsolescence), part of speech (nominal, verbal or adjectival term), or experiment group. The validation results show that the method seems to work better for extracting neologisms than for extracting necrologisms. Also, we obtained better results with nominal and adjectival terms than with verbal terms. Finally, we obtained many more results with Exp1 than with Exp2 and Exp3. Terminologie Diachronie courte Extraction semi-automatique de termes Néologie Nécrologie Informatique Terminology Short-period diachrony Semi-automatic term extraction Neology Necrology Computer science
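The comparison of the two subcorpora with each other (Exp3) can be illustrated with a minimal sketch. It assumes a smoothed log ratio of relative frequencies as a stand-in score; the abstract does not describe TermoStat's actual specificity measure, and the function names and toy tokens below are hypothetical.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each token to its share of all tokens in the subcorpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def frequency_shift(early_tokens, late_tokens, smoothing=1e-6):
    """Return {term: log2(late share / early share)} with additive smoothing.

    Positive scores point to novelty candidates, negative scores to
    obsolescence candidates. This is an assumed stand-in measure, not
    the score TermoStat computes.
    """
    early = relative_frequencies(early_tokens)
    late = relative_frequencies(late_tokens)
    shift = {}
    for term in set(early) | set(late):
        early_share = early.get(term, 0.0) + smoothing
        late_share = late.get(term, 0.0) + smoothing
        shift[term] = math.log2(late_share / early_share)
    return shift


# Toy tokens invented for illustration; they are not taken from the thesis corpus.
early = "floppy modem floppy desktop modem desktop".split()
late = "tablet desktop tablet cloud desktop cloud".split()

scores = frequency_shift(early, late)
# Candidates ordered from most "novel" to most "obsolete".
ranked = sorted(scores, key=scores.get, reverse=True)
```

Any candidate list produced this way would still need the manual clean-up and validation steps the abstract describes.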
7	Analyse comparative de l'équivalence terminologique en corpus parallèle et en corpus comparable : application au domaine du changement climatique Le Serrec, Annaïch 04 1900 (has links) Les travaux entrepris dans le cadre de la présente thèse portent sur l’analyse de l’équivalence terminologique en corpus parallèle et en corpus comparable. Plus spécifiquement, nous nous intéressons aux corpus de textes spécialisés appartenant au domaine du changement climatique. Une des originalités de cette étude réside dans l’analyse des équivalents de termes simples. Les bases théoriques sur lesquelles nous nous appuyons sont la terminologie textuelle (Bourigault et Slodzian 1999) et l’approche lexico-sémantique (L’Homme 2005). Cette étude poursuit deux objectifs. Le premier est d’effectuer une analyse comparative de l’équivalence dans les deux types de corpus afin de vérifier si l’équivalence terminologique observable dans les corpus parallèles se distingue de celle que l’on trouve dans les corpus comparables. Le deuxième consiste à comparer dans le détail les équivalents associés à un même terme anglais, afin de les décrire et de les répertorier pour en dégager une typologie. L’analyse détaillée des équivalents français de 343 termes anglais est menée à bien grâce à l’exploitation d’outils informatiques (extracteur de termes, aligneur de textes, etc.) et à la mise en place d’une méthodologie rigoureuse divisée en trois parties. La première partie qui est commune aux deux objectifs de la recherche concerne l’élaboration des corpus, la validation des termes anglais et le repérage des équivalents français dans les deux corpus. La deuxième partie décrit les critères sur lesquels nous nous appuyons pour comparer les équivalents des deux types de corpus. La troisième partie met en place la typologie des équivalents associés à un même terme anglais. 
Les résultats pour le premier objectif montrent que sur les 343 termes anglais analysés, les termes présentant des équivalents critiquables dans les deux corpus sont relativement peu élevés (12), tandis que le nombre de termes présentant des similitudes d’équivalence entre les corpus est très élevé (272 équivalents identiques et 55 équivalents non critiquables). L’analyse comparative décrite dans ce chapitre confirme notre hypothèse selon laquelle la terminologie employée dans les corpus parallèles ne se démarque pas de celle des corpus comparables. Les résultats pour le deuxième objectif montrent que de nombreux termes anglais sont rendus par plusieurs équivalents (70 % des termes analysés). Il est aussi constaté que ce ne sont pas les synonymes qui forment le groupe le plus important des équivalents, mais les quasi-synonymes. En outre, les équivalents appartenant à une autre partie du discours constituent une part importante des équivalents. Ainsi, la typologie élaborée dans cette thèse présente des mécanismes de l’équivalence terminologique peu décrits aussi systématiquement dans les travaux antérieurs. / The research undertaken for this thesis concerns the analysis of terminological equivalence in a parallel corpus and a comparable corpus. More specifically, we focus on specialized texts related to the domain of climate change. A unique aspect of this study is based on the analysis of the equivalents of single word terms. The theoretical frameworks on which we rely are the terminologie textuelle (Bourigault et Slodzian 1999) and the lexico-sémantique approaches (L’Homme 2005). This study has two objectives. The first is to perform a comparative analysis of terminological equivalents in the two types of corpora in order to verify if the equivalents found in the parallel corpus are different from the ones observed in the comparable corpora. 
The second is to compare in detail the equivalents associated with the same English term, in order to describe them and define a typology. A detailed analysis of the French equivalents of 343 English terms is carried out with the help of computer tools (term extractor, text aligner, etc.) and a rigorous methodology divided into three parts. The first part, common to both objectives of the research, concerns the building of the corpora, the validation of the English terms and the identification of the French equivalents in the two corpora. The second part describes the criteria on which we rely to compare the equivalents of the two types of corpora. The third part sets up the typology of equivalents associated with the same English term. The results for the first objective show that, of the 343 English terms analyzed, terms with objectionable equivalents in both corpora are relatively few (12), while the number of terms with similar equivalents in the two corpora is very high (272 identical equivalents and 55 unobjectionable equivalents). The analysis described in this chapter confirms our hypothesis that terminology used in parallel corpora does not differ from that used in comparable corpora. The results for the second objective show that many English terms are rendered by several equivalents (70% of analyzed terms). It is also noted that the largest group of equivalents consists not of synonyms but of near-synonyms. Also, equivalents from another part of speech constitute an important part of the equivalents analyzed. Thus, the typology developed in this thesis presents mechanisms of terminological equivalence that have rarely been described as systematically in previous work. Terminologie Équivalence Corpus parallèle Corpus comparable Extraction automatique de termes Changement climatique Terminology Equivalence Aligned corpora Comparable corpora Automatic term extraction Climate change
9	Modeling the socio-semantic dynamics of scientific communities / Modélisation des dynamiques socio-sémantiques dans les communautés scientifiques Omodei, Elisa 19 December 2014 (has links) Comment les structures sociales et sémantiques d'une communauté scientifique guident-elles les dynamiques de collaboration à venir ? Dans cette thèse, nous combinons des techniques de traitement automatique des langues et des méthodes provenant de l'analyse de réseaux complexes pour analyser une base de données de publications scientifiques dans le domaine de la linguistique computationnelle : l'ACL Anthology. Notre objectif est de comprendre le rôle des collaborations entre les chercheurs dans la construction du paysage sémantique du domaine, et, symétriquement, de saisir combien ce même paysage influence les trajectoires individuelles des chercheurs et leurs interactions. Nous employons des outils d’analyse du contenu textuel pour extraire des textes des publications les termes correspondant à des concepts scientifiques. Ces termes sont ensuite connectés aux chercheurs pour former un réseau socio-sémantique, dont nous modélisons la dynamique à différentes échelles. Nous construisons d’abord un modèle statistique, à base de régressions logistiques multivariées, qui permet de quantifier le rôle respectif des propriétés sociales et sémantiques de la communauté sur la dynamique microscopique du réseau socio-sémantique. Nous reconstruisons par la suite l’évolution du champ de la linguistique computationelle en créant différentes cartographies du réseau sémantique, représentant les connaissances produites dans le domaine, mais aussi le flux d’auteurs entre les différents champs de recherche du domaine. En résumé, nos travaux ont montré que la combinaison des méthodes issues du traitement automatique des langues et de l'analyse des réseaux complexes permet d'étudier d'une manière nouvelle l'évolution des domaines scientifiques. 
/ How are the social and semantic structures of a scientific community driving future research dynamics? In this thesis we combine natural language processing techniques and network theory methods to analyze a very large dataset of scientific publications in the field of computational linguistics, i.e. the ACL Anthology. Ultimately, our goal is to understand the role of collaborations among researchers in building and shaping the landscape of scientific knowledge, and, symmetrically, to understand how the configuration of this landscape influences individual trajectories of researchers and their interactions. We use natural language processing tools to extract the terms corresponding to scientific concepts from the texts of the publications. Then we reconstruct a socio-semantic network connecting researchers and scientific concepts, and model the dynamics of its evolution at different scales. To achieve this, we first build a statistical model, based on multivariate logistic regression, that quantifies the role that social and semantic features play in the evolution of the socio-semantic network, namely in the emergence of new links. Then, we reconstruct the evolution of the field through different visualizations of the knowledge produced therein, and of the flow of researchers across the different subfields of the domain. To summarize, we have shown through our work that the combination of natural language processing techniques with complex network analysis makes it possible to investigate in a novel way the evolution of scientific fields. Dynamiques socio-semantiques Reseaux de collaboration Reseaux semantiques Extraction lexicale Socio-semantic dynamics Co-authorship networks Semantic networks Automatic term extraction
10	Analyse comparative de la terminologie des médias sociaux : contribution des domaines de la communication et de l'informatique à la néologie Charlebois, Julien-Claude 08 1900 (has links) L’objectif de cette étude est de repérer des néologismes à partir de corpus de textes français au moyen d’une méthode semi-automatique. Plus précisément, nous extrayons les néologismes de corpus associés à deux domaines différents, mais traitant du même thème, nous examinons leur répartition et nous les classons selon leur type. L’étude s’appuie sur l’analyse de corpus traitant des médias sociaux. Le premier aborde les médias sociaux du point de vue de la communication, l’autre le fait du point de vue de l’informatique. Ces points de vue ont été privilégiés, car la communication considère ce qui a trait l’utilisation des médias sociaux et l’informatique aborde leur cartographie. La méthode fait appel à l’extracteur de termes TermoStat pour recenser la terminologie des médias sociaux pour chaque point de vue. Ensuite, nous soumettons les 150 termes les plus spécifiques de chaque point de vue à une méthode de validation divisée en trois tests destinés à valider leur statut néologique : des dictionnaires spécialisés, des dictionnaires de langue générale et un outil de visualisation de n-grammes. Finalement, nous étiquetons les néologismes selon la typologie de Dubuc (2002). L’analyse des résultats de la communication et de l’informatique est comparative. La comparaison des deux corpus révèle les contributions respectives de la communication et de l'informatique à la terminologie des médias sociaux en plus de montrer les termes communs aux deux disciplines. L’étude a également permis de repérer 60 néologismes, dont 28 sont exclusifs au corpus de la communication, 28 exclusifs à celui de l’informatique et 4 communs aux deux corpus. La recherche révèle également que les composés par subordination sont les types de néologismes les plus présents dans nos résultats. 
/ The objective of this study is to identify neologisms in corpora of French texts by means of a semi-automatic method. More precisely, we extract the neologisms from corpora associated with two different fields that deal with the same topic, examine their distribution, and classify them according to their type. The study is based on an analysis of two corpora dealing with social media. The first approaches social media from the point of view of communication, and the other from the point of view of computer science. We chose these two points of view because communication deals with how social media are used, while computer science addresses what is involved in making social media functional. The method uses the TermoStat term extractor to inventory the social media terminology for each point of view. We then submit the 150 most specific terms for each point of view to a validation method, relying on an exclusion corpus and divided into three tests meant to validate their neological status: specialized dictionaries, general language dictionaries, and an n-gram visualization tool. Lastly, we label the neologisms according to Dubuc’s (2002) typology. The results obtained for communication and computer science are analyzed comparatively. The comparison of the two corpora reveals the respective contributions of communication and computer science to the terminology of social media and shows the terms common to the two disciplines. The study also identified 60 neologisms, of which 28 are exclusive to the communication corpus, 28 are exclusive to the computer science corpus, and 4 are common to both corpora. The research also reveals that subordinate compounds are the most frequent type of neologism in our results. 
terminologie néologie médias sociaux communication informatique extraction semi-automatique de termes corpus d'exclusion terminology neology social medias computer science semi-automatic term extraction exclusion corpus
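The validation and comparison steps in this abstract can be sketched as a filter against an exclusion list followed by a set partition. This is only an illustrative sketch: the function names and placeholder terms are hypothetical, and a single exclusion list stands in for the three validation tests the study actually applies.

```python
def validate_neologisms(candidates, exclusion_terms):
    """Keep only the candidates that do not appear in the exclusion list."""
    excluded = {term.lower() for term in exclusion_terms}
    return {c for c in candidates if c.lower() not in excluded}


def partition_terms(communication_terms, computing_terms):
    """Split two term sets into corpus-exclusive and shared groups."""
    comm = set(communication_terms)
    comp = set(computing_terms)
    return {
        "communication_only": comm - comp,
        "computing_only": comp - comm,
        "common": comm & comp,
    }


# Placeholder terms, invented for illustration only.
exclusion = ["term_a", "term_d"]
communication_candidates = ["term_a", "term_b", "term_c"]
computing_candidates = ["term_c", "term_d", "term_e"]

comm_neologisms = validate_neologisms(communication_candidates, exclusion)
comp_neologisms = validate_neologisms(computing_candidates, exclusion)

groups = partition_terms(comm_neologisms, comp_neologisms)
# The three groups are disjoint, so their sizes add up to the overall count,
# in the same way that 28 + 28 + 4 gives the 60 neologisms reported above.
total = sum(len(group) for group in groups.values())
```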
