  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Uso de informação linguística e análise de conceitos formais no aprendizado de ontologias / Use of linguistic information and formal concept analysis for ontology learning.

Torres, Carlos Eduardo Atencio 08 October 2012 (has links)
Interest in the use of ontologies has grown in recent years; however, building an ontology can be very time consuming and requires a domain expert who is also familiar with an ontology editor. To reduce the time the expert must invest, we propose and analyse a supervised ontology learning (OL) method that combines several techniques. First, we use a statistical technique, C/NC-value, together with the Cogroo tool, to extract the most representative terms from the text; these terms are then treated as concepts. We also design a constraint grammar (CG), based on linguistic information about Portuguese, to recognize and establish relations between concepts. To enrich the ontology, we apply formal concept analysis (FCA) to identify possible superconcepts shared by pairs of concepts. Finally, we extracted ontologies from texts in three domains and submitted them to evaluation by experts in each area. A web site was built to make the evaluation process friendlier for the evaluators, and we used the evaluation questionnaire proposed by the OntoMetrics method. The results show that our method provides an acceptable starting point for the construction of ontologies.
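The abstract does not spell out the C/NC-value computation; the sketch below is a minimal Python illustration of its statistical core, the C-value score for nested multiword candidates. The candidate terms and frequencies are invented for the example and are not taken from the thesis.

```python
import math
from collections import defaultdict

def c_value(candidates):
    """candidates: dict mapping a candidate term (tuple of tokens) to its corpus frequency.
    Returns C-value scores, the statistical core of the C/NC-value method."""
    # For every candidate, collect the longer candidates that contain it.
    containers = defaultdict(list)
    for longer in candidates:
        for shorter in candidates:
            if shorter != longer and len(shorter) < len(longer):
                # shorter must be a contiguous sub-sequence of longer
                if any(longer[i:i + len(shorter)] == shorter
                       for i in range(len(longer) - len(shorter) + 1)):
                    containers[shorter].append(longer)

    scores = {}
    for term, freq in candidates.items():
        length_weight = math.log2(len(term)) if len(term) > 1 else 0.1  # smooth single words
        nests = containers[term]
        if not nests:
            scores[term] = length_weight * freq
        else:
            nested_freq = sum(candidates[t] for t in nests) / len(nests)
            scores[term] = length_weight * (freq - nested_freq)
    return scores

# Toy usage: frequencies are made up for illustration.
cands = {
    ("formal", "concept", "analysis"): 12,
    ("concept", "analysis"): 20,
    ("constraint", "grammar"): 9,
}
for term, score in sorted(c_value(cands).items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```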
22

Méthode d’extraction d’informations géographiques à des fins d’enrichissement d’une ontologie de domaine / Geographical information extraction method in order to enrich a domain ontology

Nguyen, Van Tien 15 November 2012 (has links)
This thesis was carried out within the ANR project GEONTO, which covers the construction, alignment, comparison and exploitation of heterogeneous geographic ontologies. In this context, our goal is to automatically extract topographic terms from travelogues in order to enrich a geographical ontology originally designed by IGN. The proposed method identifies and extracts terms with a topographical connotation contained in a text. It is based on the automatic detection of certain linguistic relations used to annotate these terms; its implementation relies on the notion of n-ary relations and on NLP (Natural Language Processing) techniques. These n-ary relations hold between the terms to be extracted and other elements of the text that can be identified using predefined external resources, such as specific lexicons (travelogue verbs: verbs of movement, verbs of perception and topographical verbs; pre-posed elements: prepositions of place, adverbs, adjectives), toponyms, generic thesauri and domain ontologies (here, the geographical ontology originally designed by IGN). Once marked by linguistic patterns, the proposed relations allow us to annotate and automatically extract terms whose combined cues indicate that they evoke topographical concepts. The reasoning rules behind these deductions draw on intrinsic knowledge (how space is evoked in the language), on external knowledge contained in the resources mentioned above, or on their combination. The strength of our approach is that it extracts not only terms directly attached to toponyms but also terms occurring in sentence structures where other words intervene. Experiments on a corpus of 12 travelogues (2,419 pages, provided by the Pau media library) showed that the method is robust: it extracted 2,173 distinct terms, of which 1,191 were valid, i.e. a precision of 0.55. This is more effective than using simple (term, toponym) pairs, which yield 733 valid distinct terms with a precision of 0.38. The method can also be used for other applications such as geographic named-entity recognition and spatial indexing of textual documents.
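As a rough illustration of the difference between matching a direct (term, toponym) pair and matching an n-ary pattern in which other words intervene, here is a small Python sketch. The trigger verbs, the toponym list and the example sentence are all invented; only the precision arithmetic (1,191 valid terms out of 2,173 extracted) comes from the abstract.

```python
import re

# Toy resources standing in for the thesis lexicons and the IGN toponym list.
TRIGGERS = r"(?:traversé|longé|aperçu|au pied de|le long de)"
TOPONYM = r"(?:Gavarnie|l'Ossau|Pau)"

# Pattern: trigger ... candidate noun ... toponym, with other words allowed in
# between, i.e. an n-ary relation rather than a direct (term, toponym) pair.
PATTERN = re.compile(
    TRIGGERS + r"\s+(?:\w+\s+){0,3}?(la|le|les)\s+(\w+)\s+(?:\w+\s+){0,4}?de\s+" + TOPONYM
)

text = "Nous avons longé la moraine du glacier de Gavarnie avant la nuit."
for m in PATTERN.finditer(text):
    print("candidate topographic term:", m.group(2))   # -> moraine

# Evaluation arithmetic reported in the abstract: precision = valid / extracted.
print(f"precision = {1191 / 2173:.2f}")                # -> 0.55
```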
23

La terminologie bilingue (Arabe-Français) de la surdité : analyse du discours textuelle et socioterminologique / The bilingual terminology (Arabic-French) of deafness: a textual and socioterminological discourse analysis

Tajo, Kinda 18 December 2013 (has links)
Specialized text in the domain of deafness is a complex phenomenon in which terms carry important semantic functions. Discourse updates the meaning of terms and gives rise to new, dynamic significations. The bilingual corpus (French, Arabic) is representative of different types of discourse and levels of specialization, especially when comparing the terminology of deafness across three Arab countries (Lebanon, Syria, Jordan). Terms, which are responsible for transmitting the knowledge of a specialized field, are today a central object of study for terminology; they can be extracted manually but also with new automatic term extraction software. This doctoral thesis takes into account the linguistic needs of language users, who are now the real consumers of terminology. It develops a socioterminological and textual approach to the domain of deafness, shedding light on phenomena such as synonymy, terminological variation, popularization, metaphor and translation. Its practical outcome is a trilingual terminological database that meets the requirements of both specialists and non-specialists.
24

Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed

Eisinger, Daniel 08 September 2014 (has links) (PDF)
The patent domain is a very important source of scientific information that is currently not used to its full potential. Searching for relevant patents is a complex task because the number of existing patents is very high and grows quickly, patent text is extremely complicated, and standard vocabulary is not used consistently or doesn’t even exist. As a consequence, pure keyword searches often fail to return satisfying results in the patent domain. Major companies employ patent professionals who are able to search patents effectively, but even they have to invest a lot of time and effort into their search. Academic scientists on the other hand do not have access to such resources and therefore often do not search patents at all, and thus risk missing up-to-date information that will not be published in scientific publications until much later, if it is published at all. Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Similarly, professional patent searches expand beyond keywords by including class codes from various patent classification systems. However, classification-based searches can only be performed effectively if the user has very detailed knowledge of the system, which is usually not the case for academic scientists. Consequently, we investigated methods to automatically identify relevant classes that can then be suggested to the user to expand their query. Since every patent is assigned at least one class code, it should be possible for these assignments to be used in a similar way to the MeSH annotations in PubMed. To develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. To gain such knowledge, we perform an in-depth comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms/classes respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows that the hierarchies are structurally similar, but terms and annotations differ significantly. The most important differences concern the considerably higher complexity of the IPC class definitions compared to MeSH terms and the far lower number of class assignments to the average patent compared to the number of MeSH terms assigned to PubMed documents. These differences cause problems for both inexperienced patent searchers and professionals. On the one hand, the complex term system makes it very difficult for members of the former group to find any IPC classes that are relevant for their search task. On the other hand, the low number of IPC classes per patent points to incomplete class assignments by the patent office, thereby limiting the recall of the classification-based searches that are frequently performed by the latter group. We approach these problems from two directions: first, by automatically assigning additional patent classes to make up for the missing assignments, and second, by automatically retrieving relevant keywords and classes that are proposed to the user so they can expand their initial search.
For the automated assignment of additional patent classes, we adapt to the patent domain an approach that was successfully used for the assignment of MeSH terms to PubMed abstracts. Each document is assigned a set of IPC classes by a large set of binary Maximum-Entropy classifiers. Our evaluation shows good performance by individual classifiers (precision/recall between 0.84 and 0.90), making the retrieval of additional relevant documents for specific IPC classes feasible. The assignment of additional classes to specific documents is more problematic, since the precision of our classifiers is not high enough to avoid false positives. However, we propose filtering methods that can help solve this problem. For the guided patent search, we demonstrate various methods to expand a user’s initial query. Our methods use both keywords and class codes that the user enters to retrieve additional relevant keywords and classes that are then suggested to the user. These additional query components are extracted from different sources such as patent text, IPC definitions, external vocabularies and co-occurrence data. The suggested expansions can help inexperienced users refine their queries with relevant IPC classes, and professionals can compose their complete query faster and more easily. We also present GoPatents, a patent retrieval prototype that incorporates some of our proposals and makes faceted browsing of a patent corpus possible.
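The abstract mentions a large set of binary Maximum-Entropy classifiers, one per IPC class, but gives no implementation details. The sketch below shows the general one-vs-rest setup in Python with scikit-learn's LogisticRegression (the binary maximum-entropy model); the patent snippets, labels and IPC codes are toy examples, not data from the thesis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy patent snippets and single IPC labels, invented for illustration.
docs = [
    "a pharmaceutical composition comprising an antibody",
    "method for encoding a video bitstream",
    "antibody binding to a tumour antigen",
    "apparatus for decoding compressed video frames",
]
labels = ["A61K", "H04N", "A61K", "H04N"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# One binary maximum-entropy (logistic regression) classifier per IPC class.
classifiers = {}
for ipc in set(labels):
    y = [1 if label == ipc else 0 for label in labels]
    classifiers[ipc] = LogisticRegression().fit(X, y)

# Score an unseen document against every class; in practice, classes whose
# probability clears a threshold would be suggested as additional assignments.
query = vec.transform(["decoding of an encoded video signal"])
for ipc, clf in classifiers.items():
    print(ipc, round(clf.predict_proba(query)[0, 1], 2))
```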
25

Uso de informação linguística e análise de conceitos formais no aprendizado de ontologias / Use of linguistic information and formal concept analysis for ontology learning.

Carlos Eduardo Atencio Torres 08 October 2012 (has links)
Interest in the use of ontologies has grown in recent years; however, building an ontology can be very time consuming and requires a domain expert who is also familiar with an ontology editor. To reduce the time the expert must invest, we propose and analyse a supervised ontology learning (OL) method that combines several techniques. First, we use a statistical technique, C/NC-value, together with the Cogroo tool, to extract the most representative terms from the text; these terms are then treated as concepts. We also design a constraint grammar (CG), based on linguistic information about Portuguese, to recognize and establish relations between concepts. To enrich the ontology, we apply formal concept analysis (FCA) to identify possible superconcepts shared by pairs of concepts. Finally, we extracted ontologies from texts in three domains and submitted them to evaluation by experts in each area. A web site was built to make the evaluation process friendlier for the evaluators, and we used the evaluation questionnaire proposed by the OntoMetrics method. The results show that our method provides an acceptable starting point for the construction of ontologies.
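Since the abstract only names formal concept analysis as the source of superconcepts, here is a minimal Python sketch of the underlying operation: intersecting the intents of two concepts and closing the result to obtain their common parent. The object/attribute context is a toy example, not the thesis data.

```python
# Minimal formal-concept-analysis step over a binary object/attribute context.
context = {
    "thermometer": {"instrument", "measures-temperature"},
    "barometer":   {"instrument", "measures-pressure"},
    "temperature": {"physical-quantity"},
}

def closure(attrs):
    """Objects sharing all attrs, then the attributes common to those objects."""
    extent = {obj for obj, a in context.items() if attrs <= a}
    intent = set.intersection(*(context[obj] for obj in extent)) if extent else set()
    return extent, intent

# Superconcept of two concepts: intersect their intents and close the result.
intent_a = context["thermometer"]
intent_b = context["barometer"]
parent_extent, parent_intent = closure(intent_a & intent_b)
print(parent_extent, parent_intent)   # -> {'thermometer', 'barometer'} {'instrument'}
```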
26

Analyse comparative de la terminologie des médias sociaux : contribution des domaines de la communication et de l'informatique à la néologie / A comparative analysis of social media terminology: the contribution of the fields of communication and computer science to neology

Charlebois, Julien-Claude 08 1900 (has links)
The objective of this study is to identify neologisms in corpora of French texts by means of a semi-automatic method. More precisely, we extract neologisms from corpora associated with two different fields that deal with the same topic, examine their distribution and classify them by type. The study is based on the analysis of two corpora about social media: the first approaches social media from the point of view of communication, the other from the point of view of computer science. These points of view were chosen because communication deals with how social media are used, while computer science deals with how they are mapped. The method uses the TermoStat term extractor to survey the social media terminology of each point of view. We then submit the 150 most specific terms of each point of view to a validation procedure divided into three tests intended to confirm their neological status: specialized dictionaries, general language dictionaries and an n-gram visualization tool. Finally, we label the neologisms according to Dubuc's (2002) typology. The results for communication and computer science are analysed comparatively. The comparison of the two corpora reveals the respective contributions of communication and computer science to social media terminology and shows the terms common to both disciplines. The study also identified 60 neologisms, of which 28 are exclusive to the communication corpus, 28 exclusive to the computer science corpus, and 4 common to both. The research also reveals that compounds formed by subordination are the most frequent type of neologism in our results.
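A minimal sketch of the three validation tests described above, assuming toy word lists in place of the specialized dictionaries, the general language dictionaries and the n-gram viewer; the candidate terms and frequencies are invented.

```python
# Toy resources standing in for the three validation tests.
specialized_dict = {"hashtag", "algorithme"}
general_dict = {"partage", "réseau", "algorithme"}

def ngram_frequency(term):
    # Placeholder for a query to an n-gram viewer; frequencies invented.
    return {"infonuagique": 0.0000001}.get(term, 0.0)

candidates = ["hashtag", "infonuagique", "partage", "gazouillis"]

neologism_candidates = [
    t for t in candidates
    if t not in specialized_dict          # test 1: absent from specialized dictionaries
    and t not in general_dict             # test 2: absent from general dictionaries
    and ngram_frequency(t) < 0.000001     # test 3: rare or recent in the n-gram data
]
print(neologism_candidates)   # -> ['infonuagique', 'gazouillis']
```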
27

Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed

Eisinger, Daniel 07 October 2013 (has links)
The patent domain is a very important source of scientific information that is currently not used to its full potential. Searching for relevant patents is a complex task because the number of existing patents is very high and grows quickly, patent text is extremely complicated, and standard vocabulary is not used consistently or doesn’t even exist. As a consequence, pure keyword searches often fail to return satisfying results in the patent domain. Major companies employ patent professionals who are able to search patents effectively, but even they have to invest a lot of time and effort into their search. Academic scientists on the other hand do not have access to such resources and therefore often do not search patents at all, and thus risk missing up-to-date information that will not be published in scientific publications until much later, if it is published at all. Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Similarly, professional patent searches expand beyond keywords by including class codes from various patent classification systems. However, classification-based searches can only be performed effectively if the user has very detailed knowledge of the system, which is usually not the case for academic scientists. Consequently, we investigated methods to automatically identify relevant classes that can then be suggested to the user to expand their query. Since every patent is assigned at least one class code, it should be possible for these assignments to be used in a similar way to the MeSH annotations in PubMed. To develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. To gain such knowledge, we perform an in-depth comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms/classes respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows that the hierarchies are structurally similar, but terms and annotations differ significantly. The most important differences concern the considerably higher complexity of the IPC class definitions compared to MeSH terms and the far lower number of class assignments to the average patent compared to the number of MeSH terms assigned to PubMed documents. These differences cause problems for both inexperienced patent searchers and professionals. On the one hand, the complex term system makes it very difficult for members of the former group to find any IPC classes that are relevant for their search task. On the other hand, the low number of IPC classes per patent points to incomplete class assignments by the patent office, thereby limiting the recall of the classification-based searches that are frequently performed by the latter group. We approach these problems from two directions: first, by automatically assigning additional patent classes to make up for the missing assignments, and second, by automatically retrieving relevant keywords and classes that are proposed to the user so they can expand their initial search.
For the automated assignment of additional patent classes, we adapt to the patent domain an approach that was successfully used for the assignment of MeSH terms to PubMed abstracts. Each document is assigned a set of IPC classes by a large set of binary Maximum-Entropy classifiers. Our evaluation shows good performance by individual classifiers (precision/recall between 0.84 and 0.90), making the retrieval of additional relevant documents for specific IPC classes feasible. The assignment of additional classes to specific documents is more problematic, since the precision of our classifiers is not high enough to avoid false positives. However, we propose filtering methods that can help solve this problem. For the guided patent search, we demonstrate various methods to expand a user’s initial query. Our methods use both keywords and class codes that the user enters to retrieve additional relevant keywords and classes that are then suggested to the user. These additional query components are extracted from different sources such as patent text, IPC definitions, external vocabularies and co-occurrence data. The suggested expansions can help inexperienced users refine their queries with relevant IPC classes, and professionals can compose their complete query faster and more easily. We also present GoPatents, a patent retrieval prototype that incorporates some of our proposals and makes faceted browsing of a patent corpus possible.
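As one possible reading of the query-expansion step, the sketch below ranks IPC classes by how often they co-occur with the user's keywords. The co-occurrence counts and keywords are invented for illustration, and the real system described above also draws on patent text, IPC definitions and external vocabularies.

```python
from collections import Counter

# Invented co-occurrence data: IPC classes seen alongside each keyword
# in an indexed patent collection.
cooccurrence = {
    "antibody":  Counter({"C07K16": 40, "A61K39": 25}),
    "bitstream": Counter({"H04N19": 55, "H03M7": 12}),
    "vaccine":   Counter({"A61K39": 60}),
}

def suggest_classes(keywords, top_n=3):
    """Rank IPC classes by how often they co-occur with the query keywords."""
    scores = Counter()
    for kw in keywords:
        scores.update(cooccurrence.get(kw, Counter()))
    return scores.most_common(top_n)

print(suggest_classes(["antibody", "vaccine"]))
# -> [('A61K39', 85), ('C07K16', 40)]
```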
28

Uma análise comparativa entre as abordagens linguística e estatística para extração automática de termos relevantes de corpora / A comparative analysis of the linguistic and statistical approaches to the automatic extraction of relevant terms from corpora

Santos, Carlos Alberto dos 27 April 2018 (has links)
Linguistic processing of corpora is known to demand considerable computational effort because of the complexity of its algorithms; despite this, its results are usually better than those produced by statistical processing, whose computational cost is lower. This dissertation describes a comparative analysis of linguistic and statistical term extraction. Experiments were carried out on four English-language corpora built from scientific papers, from which terms were extracted using both approaches. The resulting term lists were refined with relevance metrics and a stop list, and then compared with reference lists for the corpora using recall. The reference lists, in turn, were built from the context of the corpora with the help of Internet searches. The results show that statistical extraction combined with a stop list and relevance metrics can outperform linguistic extraction refined with the same metrics. We conclude that a statistical approach using these techniques can be the best option for extracting relevant terms, since it requires few computational resources and yields better results than those found with linguistic processing.
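A minimal sketch of the statistical pipeline compared in the dissertation, assuming a toy corpus, stop list and reference list: frequency-based ranking after stop-word removal, then recall against the reference list.

```python
import re
from collections import Counter

STOP = {"the", "of", "and", "a", "in", "to", "is", "for"}

def statistical_terms(text, top_n=5):
    """Frequency-based term extraction after stop-word filtering,
    a crude stand-in for the statistical approach."""
    tokens = re.findall(r"[a-z]+", text.lower())
    freq = Counter(t for t in tokens if t not in STOP)
    return [term for term, _ in freq.most_common(top_n)]

def recall(extracted, reference):
    return len(set(extracted) & set(reference)) / len(reference)

corpus = ("Term extraction identifies candidate terms in a corpus. "
          "Statistical term extraction ranks candidates by frequency, "
          "while linguistic extraction parses the corpus.")
reference_list = ["term", "extraction", "corpus", "parser"]

terms = statistical_terms(corpus)
print(terms)
print("recall =", round(recall(terms, reference_list), 2))   # 0.75 with this toy input
```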
29

[en] ESPÍRITO DE CORPUS: CREATION OF A MARINE CORPS BILINGUAL LEXICON / [pt] ESPÍRITO DE CORPUS: CRIAÇÃO DE UM LÉXICO BILÍNGUE DO CORPO DE FUZILEIROS NAVAIS

MARIANA LEMOS MULLER 07 June 2022 (has links)
This study presents thematic research on the Marine Corps domain involving Terminology, Corpus-Based Translation Studies, Computational Terminology and Lexical Semantics. Its objective was to create a terminological resource through a hybrid term-extraction methodology developed from tests with Automatic Term Extraction (ATE) tools, thereby addressing both translation problems related to the subarea of study and the detection and validation of term candidates in a corpus. First, a pilot study was conducted to evaluate two tools, TermoStat Web 3.0 and AntConc 3.5.7. After tests on a bilingual parallel corpus, the best conditions identified were selected to obtain an effective methodology that combines automatic term extraction with human analysis. This methodology was then applied to a comparable bilingual corpus. The automatically extracted term candidates were validated using the Lexical Semantics criteria proposed by L'Homme (2020), and their translation equivalents were detected. The study resulted in the creation of the bilingual lexicon Espírito de Corpus.
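The hybrid methodology is not detailed in the abstract; the sketch below illustrates one plausible step, merging the candidate lists produced by two extraction tools before human validation. The candidate terms and scores are invented, and merge_candidates is a hypothetical helper rather than part of the study.

```python
# Invented candidate lists standing in for the raw output of two ATE tools
# (the study used TermoStat Web 3.0 and AntConc 3.5.7, with human validation afterwards).
tool_a = {"landing craft": 0.91, "amphibious assault": 0.88, "rifle": 0.40}
tool_b = {"landing craft": 120, "naval gunfire": 75, "rifle": 60}

def merge_candidates(a, b):
    """Keep candidates proposed by both tools first (strong candidates),
    then those proposed by only one (to be checked by a human validator)."""
    both = sorted(set(a) & set(b))
    only_one = sorted(set(a) ^ set(b))
    return both, only_one

strong, to_review = merge_candidates(tool_a, tool_b)
print("proposed by both tools:", strong)      # -> ['landing craft', 'rifle']
print("needs human review:", to_review)       # -> ['amphibious assault', 'naval gunfire']
```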
