1 |
Computational Terminology: Exploring Bilingual and Monolingual Term Extraction
Foo, Jody, January 2012
Terminologies are becoming more important to modern-day society as technology and science continue to grow at an accelerating rate in a globalized environment. Agreeing on which terms should represent which concepts, and on how those terms should be translated into different languages, is important if we wish to communicate with as little confusion and misunderstanding as possible. Since the 1990s, an increasing amount of terminology research has been devoted to facilitating and augmenting terminology-related tasks by using computers and computational methods. One focus of this research is Automatic Term Extraction (ATE). In this compilation thesis, studies on both bilingual and monolingual ATE are presented. The first two publications report on how bilingual ATE using the align-extract approach can be used to extract patent terms. The result in this case was 181,000 manually validated English-Swedish patent terms, which were to be used in a machine translation system for patent documents. A critical component of the method is the Q-value metric, presented in the third paper, which can be used to rank extracted term candidates (TCs) in an order that correlates with TC precision. The use of Machine Learning (ML) in monolingual ATE is the topic of the two final contributions. The first ML-related publication shows that rule-induction-based ML can be used to generate linguistic term selection patterns. In the second, contrastive n-gram language models are used in conjunction with SVM ML to improve the precision of term candidates selected using linguistic patterns.
|
2 |
Élaboration d'un corpus étalon pour l'évaluation d'extracteurs de termes [Building a gold-standard corpus for the evaluation of term extractors]
Bernier-Colborne, Gabriel, 05 1900
We describe a methodology for constructing a gold standard for the automatic evaluation of term extractors. These programs, designed to automatically extract specialized terms from a corpus, are used in various settings, including terminology work, translation, information retrieval and indexing. Thus, the evaluation of term extractors must be carried out in accordance with a specific application.
One way of evaluating term extractors is to construct a corpus in which all term occurrences have been annotated. This involves establishing a protocol for term selection and term boundary identification. To our knowledge, no well-documented annotated corpus is available for the evaluation of term extractors. This contribution aims to build such a corpus and describe what issues must be dealt with in the process.
The gold standard we propose is a fully annotated corpus, constructed in accordance with a specific terminological setting, namely the compilation of a specialized dictionary of automotive mechanics. This annotated corpus accounts for the wide variety of realizations of terms in context. Terms are selected in accordance with specific criteria pertaining to the terminological setting as well as formal, linguistic and conceptual properties of terms and term variations.
To evaluate a term extractor, a list of all the terminological units in the corpus is extracted and compared to the output of the term extractor, using a set of metrics to assess its performance. Subsets of terminological units may also be extracted, providing a level of customization. This allows an automatic and application-driven evaluation of term extractors. Due to its reusability, it can serve not only to assess the performance of a particular extractor, but also to compare different extractors and fine-tune extraction techniques.
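The comparison step described above can be illustrated with a minimal sketch. It assumes that the reference list and the extractor output are plain lists of term strings, that matching is exact after lowercasing and trimming, and that the metrics are precision, recall and F1; the abstract does not name its metrics or matching rules, and the function name `evaluate_extractor` and the sample terms are hypothetical.

```python
def evaluate_extractor(gold_terms, extracted_terms):
    """Compare extractor output with a reference term list.

    Assumption: terms match when they are equal after lowercasing and
    trimming whitespace. The thesis may use a different matching rule.
    """
    gold = {t.strip().lower() for t in gold_terms}
    found = {t.strip().lower() for t in extracted_terms}
    true_positives = len(gold & found)

    # Guard against empty lists to avoid division by zero.
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference list and extractor output (illustration only).
gold = ["brake pad", "camshaft", "fuel injector", "spark plug"]
output = ["Brake pad", "camshaft", "engine", "spark plug", "car"]
scores = evaluate_extractor(gold, output)
print(scores)
```

With these sample lists, three of the five extracted candidates are in the reference list, giving a precision of 0.6 and a recall of 0.75. A customized evaluation of the kind mentioned above would pass a filtered subset of the reference terms as `gold_terms`.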
|
3 |
e-Termos: Um ambiente colaborativo web de gestão terminológica / e-Termos: a web collaborative environment of terminology management
Oliveira, Leandro Henrique Mendonça de, 22 September 2009
In one of its definitions, Terminology represents the set of principles and methods adopted in the creation and management of terminological products such as glossaries and dictionaries of terms. Systematizing these methods involves applying specific computational tools, compatible with terminological tasks, which contribute to developing such products and disseminating expert knowledge.
However, especially in Brazil, the combination of Terminology and Computer Science is still incipient, and terminological work is typically performed with several non-specialized tools. This makes terminologists' work very time-consuming, since it is usually carried out by a multidisciplinary team that needs access at all times to the latest versions of the various stages in the generation of a terminological product. It also makes data management more complex, because there is no input/output standard defined for the programs. Based on the presuppositions of the Communicative Theory of Terminology (CTT), this thesis proposes the development and evaluation of e-Termos, a Web Collaborative Environment composed of six well-defined working modules, whose purpose is to automate the tasks of creating and managing terminological products. Each module in e-Termos is responsible for tasks inherent to the process of creating terminologies. Linked to these modules are different linguistic support tools that assist the Natural Language Processing activities included in the process. There are also collaborative tools that support the communication and interaction needs of team members. One feature of the proposed evaluation process is that it can be run in a short time, which makes a controlled evaluation of several groups viable while still running it in the work environment of the target audience. The main contributions of this research are the collaborative aspect instantiated in terminological practice, the flexible creation of Terminological Records, the possibility of use in teaching terminology, lexicography and translation, and the evaluation process for collaborative systems developed for e-Termos, which combines Use Scenarios and a Survey Questionnaire.
Using Web technologies and Computer Supported Collaborative Work (CSCW) to develop its collaborative computational architecture, e-Termos is an innovative environment for computer-assisted terminological research: it automates a practical method that embodies the postulates of descriptive terminology and makes every stage of the process of creating terminological products visible, with the added and previously unavailable collaborative dimension. As evidence of its success, e-Termos has been receiving a growing number of new project proposals, and as of August 2009 it had more than 130 registered users in 68 different terminological projects.
|