Spelling suggestions: "subject:"multiwall expressions"" "subject:"multiway expressions""
11 |
Alinhamento léxico utilizando técnicas híbridas discriminativas e de pós-processamento / Text alignmentSchreiner, Paulo January 2010 (has links)
O alinhamento léxico automático é uma tarefa essencial para as técnicas de tradução de máquina empíricas modernas. A abordagem gerativa não-supervisionado têm sido substituída recentemente por uma abordagem discriminativa supervisionada que facilite inclusão de conhecimento linguístico de uma diversidade de fontes. Dentro deste contexto, este trabalho descreve uma série alinhadores léxicos discriminativos que incorporam heurísticas de pós-processamento com o objetivo de melhorar o desempenho dos mesmos para expressões multi-palavra, que constituem um dos desafios da área de processamento de linguagens naturais atualmente. A avaliação é realizada utilizando um gold-standard obtido a partir da anotação de um corpus paralelo de legendas de filmes. Os alinhadores propostos apresentam um desempenho superior tanto ao obtido por uma baseline quanto ao obtido por um alinhador gerativo do estado-da-arte (Giza++), tanto no caso geral quanto para as expressões foco do trabalho. / Lexical alignment is an essential task for modern empirical machine translation techniques. The unsupervised generative approach is being replaced by a supervised, discriminative one that considerably facilitates the inclusion of linguistic knowledge from several sources. Given this context, the present work describes a series of discriminative lexical aligners that incorporate post-processing heuristics with the goal of improving the quality of the alignments of multiword expressions, which is one of the major challanges in natural language processing today. The evaluation is conducted using a gold-standard obtained from a movie subtitle parallel corpus. The aligners proposed show an alignment quality that is superior both to our baseline and to a state-of-the-art generative aligner (Giza++), for the general case as well as for the expressions that are the focus of this work.
|
12 |
A generic and open framework for multiword expressions treatment : from acquisition to applicationsRamisch, Carlos Eduardo January 2012 (has links)
The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in the recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent, integrated and contains a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology proposal in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation about the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of the past, ongoing and future work.
|
13 |
Alinhamento léxico utilizando técnicas híbridas discriminativas e de pós-processamento / Text alignmentSchreiner, Paulo January 2010 (has links)
O alinhamento léxico automático é uma tarefa essencial para as técnicas de tradução de máquina empíricas modernas. A abordagem gerativa não-supervisionado têm sido substituída recentemente por uma abordagem discriminativa supervisionada que facilite inclusão de conhecimento linguístico de uma diversidade de fontes. Dentro deste contexto, este trabalho descreve uma série alinhadores léxicos discriminativos que incorporam heurísticas de pós-processamento com o objetivo de melhorar o desempenho dos mesmos para expressões multi-palavra, que constituem um dos desafios da área de processamento de linguagens naturais atualmente. A avaliação é realizada utilizando um gold-standard obtido a partir da anotação de um corpus paralelo de legendas de filmes. Os alinhadores propostos apresentam um desempenho superior tanto ao obtido por uma baseline quanto ao obtido por um alinhador gerativo do estado-da-arte (Giza++), tanto no caso geral quanto para as expressões foco do trabalho. / Lexical alignment is an essential task for modern empirical machine translation techniques. The unsupervised generative approach is being replaced by a supervised, discriminative one that considerably facilitates the inclusion of linguistic knowledge from several sources. Given this context, the present work describes a series of discriminative lexical aligners that incorporate post-processing heuristics with the goal of improving the quality of the alignments of multiword expressions, which is one of the major challanges in natural language processing today. The evaluation is conducted using a gold-standard obtained from a movie subtitle parallel corpus. The aligners proposed show an alignment quality that is superior both to our baseline and to a state-of-the-art generative aligner (Giza++), for the general case as well as for the expressions that are the focus of this work.
|
14 |
A generic and open framework for multiword expressions treatment : from acquisition to applicationsRamisch, Carlos Eduardo January 2012 (has links)
The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in the recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent, integrated and contains a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology proposal in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation about the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of the past, ongoing and future work.
|
15 |
[en] THE CORPUS NEVER LIES: ON THE IDENTIFICATION AND USE OF MULTIWORD EXPRESSIONS / [pt] O CÓRPUS NÃO MENTE JAMAIS: SOBRE A IDENTIFICAÇÃO E USO DE COMBINAÇÕES MULTIVOCABULARES DO TIPO VERBO MAIS SINTAGMA NOMINALMILENA DE UZEDA GARRAO 22 August 2006 (has links)
[pt] Muitos estudos recentes sobre a identificação e uso de
combinações
multivocabulares (CMs) adotam uma perspectiva
representacionista do
significado da palavra. Este estudo propõe que é muito
mais interessante
identificar as CMs por um olhar não-representacionista. A
metodologia proposta
foi testada em CMs do tipo V+SN, um padrão bastante
freqüente no português do
Brasil (PB). Trata-se de uma análise estatística com base
em córpus que pode ser
resumida em três etapas: 1) córpus robusto do PB como base
de análise, 2)
aplicação de um teste estatístico ao córpus, a saber,
teste de Logaritmo de
Verossimilhança (Banerjee e Pedersen, 2003), para detecção
das CMs mais
freqüentes com padrão V+SN (como tomar café) e exclusão de
co-ocorrências
sintáticas aleatórias dos mesmos itens lexicais, 3)
aplicação de Medidas de
Similaridade (Baeza-Yates e Ribeiro-Neto, 1999) entre
todos os parágrafos
contendo uma certa CM (por exemplo, fazer campanha) e
todos os parágrafos
contendo o substantivo fora da CM (campanha). Esta última
etapa foi utilizada
para avaliar o grau de composicionalidade da CM. Pôde-se
concluir que quanto
maior a similaridade entre os parágrafos contendo a CM e
os parágrafos contendo
o substantivo fora da expressão, maior será o grau de
composicionalidade da CM.
Por essa razão, este estudo tem um impacto tanto teórico
quanto prático para a
semântica. / [en] A considerable amount of recent researches on defining
multi-word
expressions´ (MWE) phenomenon has an underlying
representational framework
of word meaning. In this study we claim that it is much
more interesting to view
MWE from a non-representational perspective. By choosing
this path, we avoid
the time-consuming and controversial human intuitions to
MWE identification
and definition. Our methodology was tested on Brazilian
Portuguese verbal
phrases of V+NP pattern. It is a statistically-based
corpus analysis which could be
summed up as the following three sequent steps: 1) robust
linguistic corpora as
output, 2) application of a probabilistic test to the
corpora, namely Log Likelihood
test (Banerjee and Pedersen, 2003), in order to spot the
Portuguese MWEs of V+NP
pattern (such as tomar café) and disregard casual
syntactic and not otherwise
motivated co-occurrences of the same lexical items, 3)
application of Similarity
Measures (Baeza-Yates and Ribeiro-Neto, 1999) between all
the paragraphs
containing a certain MWE and all the paragraphs containing
its separate noun.
This latter step is crucial to assess the MWE
compositionality level. We conclude
that the higher are the similarity measures between the
MWE (such as fazer
campanha) and its separate noun (campanha), the more
compositional will be the
MWE. Therefore, we believe that this work has both a
practical and a theoretical
impact to semantics.
|
16 |
Un environnement générique et ouvert pour le traitement des expressions polylexicales : de l'acquisition aux applications / A generic and open framework for multiword expressions treatment : from acquisition to applicationsRamisch, Carlos Eduardo 11 September 2012 (has links)
Cette thèse présente un environnement ouvert et souple pour l'acquisition automatique d'expressions multimots (MWE) à partir de corpus textuels monolingues. Cette recherche est motivée par l'importance des MWE pour les applications du TALN. Après avoir brièvement présenté les modules de l'environnement, le mémoire présente des résultats d'évaluation intrinsèque en utilisant deux applications: la lexicographie assistée par ordinateur et la traduction automatique statistique. Ces deux applications peuvent bénéficier de l'acquisition automatique de MWE, et les expressions acquises automatiquement à partir de corpus peuvent à la fois les accélérer et améliorer leur qualité. Les résultats prometteurs de nos expériences nous encouragent à mener des recherches ultérieures sur la façon optimale d'intégrer le traitement des MWE dans ces applications et dans bien d'autres / This thesis presents an open and flexible methodological framework for the automatic acquisition of multiword expressions (MWEs) from monolingual textual corpora. This research is motivated by the importance of MWEs for NLP applications. After briefly presenting the modules of the framework, the work reports extrinsic evaluation results considering two applications: computer-aided lexicography and statistical machine translation. Both applications can benefit from automatic MWE acquisition and the expressions acquired automatically from corpora can both speed up and improve their quality. The promising results of our experiments encourage further investigation about the optimal way to integrate MWE treatment into these and many other applications.
|
17 |
Le traitement des locutions en génération automatique de texte multilingueDubé, Michaelle 08 1900 (has links)
La locution est peu étudiée en génération automatique de texte (GAT). Syntaxiquement, elle forme un syntagme, alors que sémantiquement, elle ne constitue qu’une seule unité. Le présent mémoire propose un traitement des locutions en GAT multilingue qui permet d’isoler les constituants de la locution tout en conservant le sens global de celle-ci. Pour ce faire, nous avons élaboré une solution flexible à base de patrons universels d’arbres de dépendances syntaxiques vers lesquels pointent des patrons de locutions propres au français (Pausé, 2017). Notre traitement a été effectué dans le réalisateur de texte profond multilingue GenDR à l’aide des données du Réseau lexical du français (RL-fr). Ce travail a abouti à la création de 36 règles de lexicalisation par patron (indépendantes de la langue) et à un dictionnaire lexical pour les locutions du français. Notre implémentation couvre 2 846 locutions du RL-fr (soit 97,5 %), avec une précision de 97,7 %.
Le mémoire se divise en cinq chapitres, qui décrivent : 1) l’architecture classique en GAT et le traitement des locutions par différents systèmes symboliques ; 2) l’architecture de GenDR, (principalement sa grammaire, ses dictionnaires, son interface sémantique-syntaxe et ses stratégies de lexicalisations) ; 3) la place des locutions dans la phraséologie selon la théorie Sens-Texte, ainsi que le RL-fr et ses patrons syntaxiques linéarisés ; 4) notre implémentation de la lexicalisation par patron des locutions dans GenDR, et 5) notre évaluation de la couverture de la précision de notre implémentation. / Idioms are rarely studied in natural language generation (NLG). Syntactically, they form a phrase, while semantically, they correspond to a single unit. In this master’s thesis, we propose a treatment of idioms in multilingual NLG that enables us to isolate their constituents while preserving their global meaning. To do so, we developed a flexible solution based on universal templates of syntactic dependency trees, onto which we map French-specific idiom patterns (Pausé, 2017). Our work was implemented in Generic Deep Realizer (GenDR) using data from the Réseau lexical du français (RL-fr). This resulted in the creation of 36 template-based lexicalization rules (independent of language) and of a lexical dictionary for French idioms. Our implementation covers 2846 idioms of the RL-fr (i.e., 97.5%), with an accuracy of 97.7%.
We divided our analysis into five chapters, which describe: 1) the classical NLG architecture and the handling of idioms by different symbolic systems; 2) the architecture of GenDR (mainly its grammar, its dictionaries, its semantic-syntactic interface, and its lexicalization strategies); 3) the place of idioms in phraseology according to Meaning-Text Theory (théorie Sens-Texte), the RL-fr and its linearized syntactic patterns; 4) our implementation of the template lexicalization of idioms in GenDR; and 5) our evaluation of the coverage and the precision of our implementation.
|
Page generated in 0.0867 seconds