About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations, provided by the Networked Digital Library of Theses and Dissertations (NDLTD). Our metadata is collected from universities around the world; if you manage a university, consortium, or country archive and want to be added, details can be found on the NDLTD website.
1

Recycling Translations: Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing

Tiedemann, Jörg January 2003 (has links)
The focus of this thesis is on re-using translations in natural language processing. It involves the collection of documents and their translations in an appropriate format, the automatic extraction of translation data, and the application of the extracted data to different tasks in natural language processing.

Five parallel corpora containing more than 35 million words in 60 languages have been collected within co-operative projects. All corpora are sentence aligned, and parts of them have been analyzed automatically and annotated with linguistic markup.

Lexical data are extracted from the corpora by means of word alignment. Two automatic word alignment systems have been developed: the Uppsala Word Aligner (UWA) and the Clue Aligner. UWA implements an iterative "knowledge-poor" word alignment approach using association measures and alignment heuristics. The Clue Aligner provides an innovative framework for combining statistical and linguistic resources when aligning single words and multi-word units. Both aligners have been applied to several corpora, and detailed evaluations of the alignment results have been carried out for three of them using fine-grained evaluation techniques.

A corpus processing toolbox, Uplug, has been developed. It includes the implementation of UWA and is freely available for research purposes. A new version, Uplug II, includes the Clue Aligner and can be used via an experimental web interface (UplugWeb).

Lexical data extracted by the word aligners have been applied to different tasks in computational lexicography and machine translation. The use of word alignment in monolingual lexicography has been investigated in two studies; a third study demonstrated the feasibility of using the extracted data in interactive machine translation. Finally, extracted lexical data have been used to enhance the lexical components of two machine translation systems.
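As a rough illustration of the kind of "knowledge-poor", association-based alignment the abstract describes, the sketch below scores candidate word pairs across sentence-aligned text with the Dice coefficient and greedily links the highest-scoring pairs. This is a minimal toy, not UWA's actual algorithm; the bitext and threshold are made up for the example.

```python
from collections import Counter
from itertools import product

def dice_alignments(bitext, threshold=0.3):
    """Greedily link word pairs whose Dice association score
    exceeds a threshold, given sentence-aligned text.

    bitext: list of (source_tokens, target_tokens) pairs.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        src_freq.update(set(src))
        tgt_freq.update(set(tgt))
        pair_freq.update(product(set(src), set(tgt)))

    def dice(s, t):
        return 2 * pair_freq[s, t] / (src_freq[s] + tgt_freq[t])

    alignments = []
    for src, tgt in bitext:
        candidates = sorted(
            ((dice(s, t), s, t) for s, t in product(set(src), set(tgt))),
            reverse=True,
        )
        linked_src, linked_tgt, links = set(), set(), []
        for score, s, t in candidates:  # one-to-one greedy linking
            if score < threshold:
                break
            if s not in linked_src and t not in linked_tgt:
                links.append((s, t))
                linked_src.add(s)
                linked_tgt.add(t)
        alignments.append(links)
    return alignments

# Toy example: three sentence pairs from a hypothetical English-German bitext.
# With a corpus this small many scores tie at 1.0 (e.g. function words that
# always co-occur); real use needs far more data, and an iterative approach
# like UWA's removes confidently linked pairs between passes.
bitext = [
    ("the house is red".split(), "das haus ist rot".split()),
    ("the house is big".split(), "das haus ist gross".split()),
    ("the car is red".split(), "das auto ist rot".split()),
]
print(dice_alignments(bitext))
```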
2

Création automatique d'un dictionnaire des régimes des verbes du français [Automatic creation of a valency dictionary of French verbs]

Hassert, Naïma 06 1900 (has links)
Valency dictionaries are useful for many tasks in natural language processing. However, quality dictionaries of this type are created at least partly by hand; they are therefore resource-intensive and difficult to update. In addition, many of these resources do not take the different senses of lemmas into account, even though the arguments a verb selects tend to vary with its sense. In this thesis, we automatically create a French verb valency dictionary that takes polysemy into account. We extract 20,000 example sentences for each of the 2,000 most frequent French verbs. We then obtain contextual embeddings for these verbs using one monolingual and two multilingual language models, use clustering algorithms to induce the different senses of each verb, and automatically parse the sentences with several syntactic parsers to find the verbs' arguments. We find that the combination of the French language model CamemBERT and an agglomerative clustering algorithm gives the best results on the sense induction task (58.19% B³ F1), and that Stanza is the best-performing parser (83.29% F1). By filtering the resulting syntactic frames with maximum likelihood estimation, a simple statistical method for finding the parameters of a probability model that best explain the data, we build a valency dictionary that requires almost no human intervention. The procedure is applied here to French, but it can be used for any other language with sufficient written data.
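A minimal sketch of the sense-induction step described above: embed occurrences of a verb in context with CamemBERT, then cluster the embeddings agglomeratively. The sentences, cluster count, and pooling strategy are illustrative assumptions, not the thesis's actual pipeline or hyperparameters.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import AgglomerativeClustering

# Hypothetical occurrences of the French verb "jouer" in context.
sentences = [
    "Les enfants jouent dans le parc.",
    "Nous jouons aux cartes tous les soirs.",
    "Elle joue du piano depuis dix ans.",
    "Il joue de la guitare dans un groupe.",
]

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModel.from_pretrained("camembert-base")

embeddings = []
for sentence in sentences:
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]  # (seq_len, dim)
    # Simplification: mean-pool the whole sentence; a real pipeline
    # would pool only the subword tokens of the target verb.
    embeddings.append(hidden.mean(dim=0).numpy())

# Induce verb senses by clustering the contextual embeddings.
labels = AgglomerativeClustering(
    n_clusters=2, metric="cosine", linkage="average"
).fit_predict(embeddings)

for label, sentence in sorted(zip(labels, sentences)):
    print(label, sentence)
```

In its simplest form, the maximum-likelihood filtering step then amounts to keeping, for each induced sense, the syntactic frames with the highest relative frequency among that sense's parsed occurrences.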
3

A Computational Study of Lexicalized Noun Phrases in English

Godby, Carol Jean 02 July 2002 (has links)
No description available.
4

Jogada de letra: um estudo sobre colocações à luz da semântica de frames [A study of collocations in the light of frame semantics]

Souza, Diego Spader de 30 March 2015 (has links)
This thesis discusses the relation between the linguistic phenomenon of collocations and the concepts of Frame Semantics (FILLMORE, 1982; 1985). The study arose in the context of two research projects developed by the SemanTec group: Field – Football Expressions Dictionary (CHISHMAN, 2014), already available on the web, and the Olympic Modalities Electronic Dictionary (CHISHMAN, 2014), still at an early stage. Both dictionaries are organized around the notion of semantic frame proposed by Fillmore (1982; 1985), and the thesis seeks to show how this concept (and the concepts surrounding it) bears on the lexicographic treatment given to collocations. The literature review, presented in chapters 2 and 3, discusses the theoretical basis for the study of collocations and Frame Semantics. The research method consists of the analysis of 74 collocations from the language of football, chosen from a set of 500 lexical combinations extracted from a Brazilian Portuguese corpus of football discourse with the Sketch Engine software. The analysis proceeds in two steps: the first investigates the quantitative aspects of the data set and the structural characteristics of football-language collocations; the second focuses on how these combinations relate to the theoretical assumptions of Frame Semantics and its computational counterpart, FrameNet, in order to see how this framework supports the treatment of collocations in lexicographic contexts. Among the main results of the first phase is the finding that most football collocations are verbal, such as score a goal and send the ball, showing that sports language is marked by the dynamics of the actions and events that take place during a match; nominal collocations, in turn, are strongly tied to the materials, participants, and places of the football context. The second phase showed that collocations, in the scope of frame-based dictionaries, act as lexical units, a concept taken from FrameNet. Because they are treated as lexical units, collocations are frame evokers, which characterizes them as terms that belong in the main list of entries. It was also noted, however, that the frames evoked by collocations often do not follow the traditional FrameNet model, especially in the case of nominal collocations, which evoke not events but static entities, such as red card and league table. The thesis thus evidences the relevance of Frame Semantics and FrameNet to the study of complex units such as collocations in lexicographic contexts, as well as the importance of the methodological resources of Corpus Linguistics to the field.
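For illustration, collocation candidates like those studied above can be ranked with the logDice association measure popularized by the Sketch Engine, logDice(x, y) = 14 + log2(2·f(x, y) / (f(x) + f(y))). The sketch below is a toy over adjacent-word bigrams with a made-up corpus, not the Sketch Engine's actual word-sketch machinery.

```python
import math
from collections import Counter

def rank_collocations(tokens):
    """Score adjacent word pairs with the logDice association measure:
    logDice(x, y) = 14 + log2(2 * f(x, y) / (f(x) + f(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = [
        (14 + math.log2(2 * fxy / (unigrams[x] + unigrams[y])), (x, y))
        for (x, y), fxy in bigrams.items()
    ]
    return sorted(scored, reverse=True)

# Toy Brazilian Portuguese football "corpus" (hypothetical, lemmatized).
tokens = ("o atacante fazer gol o zagueiro fazer falta "
          "o time fazer gol o goleiro defender chute").split()
for score, pair in rank_collocations(tokens)[:3]:
    print(round(score, 2), pair)
```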
