Global ETD Search

1	Distributional models of multiword expression compositionality prediction / Modelos distribucionais para a predição de composicionalidade de expressões multipalavras Cordeiro, Silvio Ricardo January 2018
Natural language processing systems often rely on the idea that language is compositional, that is, that the meaning of a linguistic entity can be inferred from the meaning of its parts. This expectation fails in the case of multiword expressions (MWEs). For example, a person who is a sitting duck is neither a duck nor necessarily sitting. Modern computational techniques for inferring word meaning based on the distribution of words in the text have been quite successful at multiple tasks, especially since the rise of word embedding approaches. However, the representation of MWEs remains an open problem in the field. In particular, it is unclear how one could predict from corpora whether a given MWE should be treated as an indivisible unit (e.g. nut case) or as some combination of the meaning of its parts (e.g. engine room). This thesis proposes a framework of MWE compositionality prediction based on representations of distributional semantics, which we instantiate under a variety of parameters. We present a thorough evaluation of the impact of these parameters on three new datasets of MWE compositionality, encompassing English, French and Portuguese MWEs. Finally, we present an extrinsic evaluation of the predicted levels of MWE compositionality on the task of MWE identification. Our results suggest that the proper choice of distributional model and corpus parameters can produce compositionality predictions that are comparable to the state of the art. Computational linguistics, Natural language, Distributional semantics, Idiomaticity, Compositionality, Multiword expressions
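The idea of predicting compositionality from distributional representations can be illustrated with a small sketch. One common approach, assumed here for illustration and not taken from the abstract, is to compare the vector of the whole expression with a vector composed from its parts (for example, the sum of the normalised part vectors) using cosine similarity; a high score suggests a compositional expression. The function names and the three-dimensional toy vectors below are hypothetical, and real vectors would come from a distributional model trained on a corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def compositionality_score(mwe_vec, part_vecs):
    """Compare the MWE vector with the sum of its unit-normalised part vectors.

    Additive composition is an assumption of this sketch; other
    composition functions (weighted sums, products) are possible.
    """
    composed = [0.0] * len(mwe_vec)
    for vec in part_vecs:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            continue  # skip parts with no distributional evidence
        for i, x in enumerate(vec):
            composed[i] += x / norm
    return cosine(mwe_vec, composed)


# Made-up toy vectors, chosen only to illustrate the two cases in the abstract.
TOY_VECTORS = {
    "engine": [1.0, 0.2, 0.0],
    "room": [0.2, 1.0, 0.0],
    "engine_room": [0.7, 0.7, 0.1],  # close to the combination of its parts
    "nut": [1.0, 0.0, 0.1],
    "case": [0.0, 1.0, 0.1],
    "nut_case": [0.1, 0.1, 1.0],  # far from the combination of its parts
}

engine_room_score = compositionality_score(
    TOY_VECTORS["engine_room"], [TOY_VECTORS["engine"], TOY_VECTORS["room"]]
)
nut_case_score = compositionality_score(
    TOY_VECTORS["nut_case"], [TOY_VECTORS["nut"], TOY_VECTORS["case"]]
)

print(f"engine room: {engine_room_score:.2f}")
print(f"nut case:    {nut_case_score:.2f}")
```

With these toy values the compositional example scores much higher than the idiomatic one. In practice, such scores are typically evaluated by correlating them with human compositionality judgements.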
2	Statistical modeling of multiword expressions Su, Kim Nam January 2008
In natural languages, words can occur in single units called simplex words or in a group of simplex words that function as a single unit, called multiword expressions (MWEs). Although MWEs are similar to simplex words in their syntax and semantics, they pose their own sets of challenges (Sag et al. 2002). MWEs are arguably one of the biggest roadblocks in computational linguistics due to the bewildering range of syntactic, semantic, pragmatic and statistical idiomaticity they are associated with, and their high productivity. In addition, the large numbers in which they occur demand specialized handling. Moreover, dealing with MWEs has a broad range of applications, from syntactic disambiguation to semantic analysis in natural language processing (NLP) (Wacholder and Song 2003; Piao et al. 2003; Baldwin et al. 2004; Venkatapathy and Joshi 2006).
Our goals in this research are: to use computational techniques to shed light on the underlying linguistic processes giving rise to MWEs across constructions and languages; to generalize existing techniques by abstracting away from individual MWE types; and finally to exemplify the utility of MWE interpretation within general NLP tasks.
In this thesis, we target English MWEs due to resource availability. In particular, we focus on noun compounds (NCs) and verb-particle constructions (VPCs) due to their high productivity and frequency.
Challenges in processing noun compounds are: (1) interpreting the semantic relation (SR) that represents the underlying connection between the head noun and modifier(s); (2) resolving syntactic ambiguity in NCs comprising three or more terms; and (3) analyzing the impact of word sense on noun compound interpretation. Our basic approach to interpreting NCs relies on the semantic similarity of the NC components using firstly a nearest-neighbor method (Chapter 5), then verb semantics based on the observation that it is often an underlying verb that relates the nouns in NCs (Chapter 6), and finally semantic variation within NC sense collocations, in combination with bootstrapping (Chapter 7).
Challenges in dealing with verb-particle constructions are: (1) identifying VPCs in raw text data (Chapter 8); and (2) modeling the semantic compositionality of VPCs (Chapter 5). We place particular focus on identifying VPCs in context, and measuring the compositionality of unseen VPCs in order to predict their meaning. Our primary approach to the identification task is to adapt localized context information derived from linguistic features of VPCs to distinguish between VPCs and simple verb-PP combinations. To measure the compositionality of VPCs, we use semantic similarity among VPCs by testing the semantic contribution of each component.
Finally, we conclude the thesis with a chapter-by-chapter summary and outline of the findings of our work, suggestions of potential NLP applications, and a presentation of further research directions (Chapter 9).
3	Automatické propojování lexikografických zdrojů a korpusových dat. / Automatic linking of lexicographic sources and corpus data Bejček, Eduard January 2015
As language resources (new lexicons, lexical databases, corpora and treebanks) continue to develop, the need to interlink them efficiently is growing. With such linking, one can easily benefit from all their properties and information. Given the convergence of resources, universal lexicographic formats are frequently discussed. In the present thesis, we investigate and analyse methods of interlinking language resources automatically. We introduce a system for interlinking lexicons (such as VALLEX, PDT-Vallex, FrameNet or SemLex) that offer information on the syntactic properties of their entries. The system is automated and can be used repeatedly with newer versions of lexicons under development. We also design a method for identifying multiword expressions in a parsed text based on syntactic information from the SemLex lexicon. One output that demonstrates the feasibility of these methods is the mapping between the VALLEX and PDT-Vallex lexicons, which resulted in tens of thousands of annotated treebank sentences from the PDT and PCEDT treebanks being added to VALLEX.
6	Distributional models of multiword expression compositionality prediction / Modèles distributionnels pour la prédiction de compositionnalité d’expressions polylexicales Cordeiro, Silvio Ricardo 18 December 2017
Natural language processing systems often rely on the idea that language is compositional, that is, that the meaning of a linguistic entity can be inferred from the meaning of its parts. This expectation fails in the case of multiword expressions (MWEs). For example, a person who is a "sitting duck" is neither a duck nor necessarily sitting. Modern computational techniques for inferring word meaning based on the distribution of words in the text have been quite successful at multiple tasks, especially since the rise of word embedding approaches. However, the representation of MWEs remains an open problem in the field. In particular, it is unclear how one could predict from corpora whether a given MWE should be treated as an indivisible unit (e.g. "nut case") or as some combination of the meaning of its parts (e.g. "engine room"). This thesis proposes a framework of MWE compositionality prediction based on representations of distributional semantics, which we instantiate under a variety of parameters. We present a thorough evaluation of the impact of these parameters on three new datasets of MWE compositionality, encompassing English, French and Portuguese MWEs. Finally, we present an extrinsic evaluation of the predicted levels of MWE compositionality on the task of MWE identification. Our results suggest that the proper choice of distributional model and corpus parameters can produce compositionality predictions that are comparable to the state of the art. Distributional semantics, Multiword expressions, Compositionality, Idiomaticity, 004
7	Indirect Influence of English on Kiswahili: The Case of Multiword Duplicates between Kiswahili and English Ochieng, Dunlop 22 October 2015
Some proverbs, idioms, nominal compounds, and slogans duplicate in form and meaning between several languages. An example between German and English is Liebe auf den ersten Blick and “love at first sight” (Flippo, 2009), whereas an example between Kiswahili and English is uchaguzi ulio huru na haki and “free and fair election.” Duplication of such strings of words between languages as different in descent and typology as Kiswahili and English is irregular. On this ground, Kiswahili academies and a number of Kiswahili experts assumed, prior to the present study, that the Kiswahili versions of the expressions are derived from their congruent English counterparts. The assumption nonetheless lacked empirical evidence and discounted other potential causes of the phenomenon, namely analogical extension, nativism and cognitive metaphoricalization (Makkai, 1972; Land, 1974; Lakoff & Johnson, 1980b; Ruhlen, 1987; Lakoff, 1987; Gleitman and Newport, 1995). Against this background, we took on the task of empirically investigating what causes this formal and semantic duplication of strings of words (multiword expressions) between English and Kiswahili to a degree beyond chance expectations. To this end, we administered a checklist to 24 respondents, an interview to 43, an online questionnaire to 102, a translation test to 47 and a translationality test to 8. The online questionnaire respondents came from 21 regions of Tanzania, whereas those for the other instruments came from Zanzibar, Dar es Salaam, Pwani, Lindi, Dodoma and Kigoma.
Complementarily, we analysed the Chemnitz Corpus of Swahili (CCS), the Helsinki Swahili Corpus (HSC), and the Corpus of Contemporary American English (COCA) for clues on the sources and trends of expressions exhibiting this characteristic between Kiswahili and English. Furthermore, we reviewed the Bible, dictionaries, encyclopaedias, books, articles, expression lists, wikis, and phrase books in pursuit of the etymologies and histories of the concepts underlying the focus expressions. Our analysis shows that most of the Kiswahili versions of the focus expressions result from loan translation and loan rendition from English. We found that economic, political and technological changes, mostly induced by the liberalization policy of the 1990s in Tanzania, created lexical gaps in Kiswahili that needed to be filled. We discovered that Kiswahili, among other means, fills such gaps through loan translation and loan rendition of English phrases. Prototypical examples of notions whose English labels Kiswahili has translated word for word include “human rights”, “free and fair election”, “the World Cup” and “multiparty democracy”. We conclude that Kiswahili finds it easier and more economical to translate the existing English labels for imported notions than to coin original labels for the concepts. Even so, our analysis revealed that a few of the Kiswahili duplicate multiword expressions might result from nativism, cognitive metaphoricalization and analogy. We observed, for instance, that the formulation of figurative meanings follows a more or less similar pattern across human languages, with the secondary meanings deriving from source domains. As long as the source domains are common to many human environments, we found it plausible for certain multiword expressions to duplicate spontaneously between several human languages.
Academically, our study has demonstrated how multiword expressions that duplicate between several languages can be studied using primary data, corpora, documentary review and observation. In particular, the study has designed a framework for studying the sources of the expressions, and even terminology for describing the phenomenon. What's more, the study has collected a number of expressions that duplicate between Kiswahili and English, which other researchers can use in similar studies. English, Kiswahili, Duplicate multiword expressions, English on Kiswahili, Loan expressions, Indirect loan influence, Widespread expressions, Anglicism, Englishzitation, Borrowing of expressions, Tanzania, borrowings, morphosyntax, multiword expressions, Swahili, loanwords, ddc:820
8	Indirect Influence of English on Kiswahili: The Case of Multiword Duplicates between Kiswahili and English Ochieng, Dunlop 04 February 2015
Some proverbs, idioms, nominal compounds, and slogans duplicate in form and meaning between several languages. An example between German and English is Liebe auf den ersten Blick and “love at first sight” (Flippo, 2009), whereas an example between Kiswahili and English is uchaguzi ulio huru na haki and “free and fair election.” Duplication of such strings of words between languages as different in descent and typology as Kiswahili and English is irregular. On this ground, Kiswahili academies and a number of Kiswahili experts assumed, prior to the present study, that the Kiswahili versions of the expressions are derived from their congruent English counterparts. The assumption nonetheless lacked empirical evidence and discounted other potential causes of the phenomenon, namely analogical extension, nativism and cognitive metaphoricalization (Makkai, 1972; Land, 1974; Lakoff & Johnson, 1980b; Ruhlen, 1987; Lakoff, 1987; Gleitman and Newport, 1995). Against this background, we took on the task of empirically investigating what causes this formal and semantic duplication of strings of words (multiword expressions) between English and Kiswahili to a degree beyond chance expectations. To this end, we administered a checklist to 24 respondents, an interview to 43, an online questionnaire to 102, a translation test to 47 and a translationality test to 8. The online questionnaire respondents came from 21 regions of Tanzania, whereas those for the other instruments came from Zanzibar, Dar es Salaam, Pwani, Lindi, Dodoma and Kigoma.
Complementarily, we analysed the Chemnitz Corpus of Swahili (CCS), the Helsinki Swahili Corpus (HSC), and the Corpus of Contemporary American English (COCA) for clues on the sources and trends of expressions exhibiting this characteristic between Kiswahili and English. Furthermore, we reviewed the Bible, dictionaries, encyclopaedias, books, articles, expression lists, wikis, and phrase books in pursuit of the etymologies and histories of the concepts underlying the focus expressions. Our analysis shows that most of the Kiswahili versions of the focus expressions result from loan translation and loan rendition from English. We found that economic, political and technological changes, mostly induced by the liberalization policy of the 1990s in Tanzania, created lexical gaps in Kiswahili that needed to be filled. We discovered that Kiswahili, among other means, fills such gaps through loan translation and loan rendition of English phrases. Prototypical examples of notions whose English labels Kiswahili has translated word for word include “human rights”, “free and fair election”, “the World Cup” and “multiparty democracy”. We conclude that Kiswahili finds it easier and more economical to translate the existing English labels for imported notions than to coin original labels for the concepts. Even so, our analysis revealed that a few of the Kiswahili duplicate multiword expressions might result from nativism, cognitive metaphoricalization and analogy. We observed, for instance, that the formulation of figurative meanings follows a more or less similar pattern across human languages, with the secondary meanings deriving from source domains. As long as the source domains are common to many human environments, we found it plausible for certain multiword expressions to duplicate spontaneously between several human languages.
Academically, our study has demonstrated how multiword expressions that duplicate between several languages can be studied using primary data, corpora, documentary review and observation. In particular, the study has designed a framework for studying the sources of the expressions, and even terminology for describing the phenomenon. What's more, the study has collected a number of expressions that duplicate between Kiswahili and English, which other researchers can use in similar studies. info:eu-repo/classification/ddc/820 ddc:820
9	Alinhamento léxico utilizando técnicas híbridas discriminativas e de pós-processamento / Text alignment Schreiner, Paulo January 2010
Lexical alignment is an essential task for modern empirical machine translation techniques. The unsupervised generative approach is being replaced by a supervised, discriminative one that considerably facilitates the inclusion of linguistic knowledge from several sources. In this context, the present work describes a series of discriminative lexical aligners that incorporate post-processing heuristics with the goal of improving alignment quality for multiword expressions, which are one of the major challenges in natural language processing today. The evaluation is conducted using a gold standard obtained by annotating a parallel corpus of movie subtitles. The proposed aligners achieve an alignment quality superior both to our baseline and to a state-of-the-art generative aligner (Giza++), in the general case as well as for the expressions that are the focus of this work. Computational linguistics, Natural language processing, Lexical alignment, Machine learning, Parallel corpora, Multiword expressions, UFRGS
10	A generic and open framework for multiword expressions treatment : from acquisition to applications Ramisch, Carlos Eduardo January 2012
The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent and integrated, and includes a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources.
The second main contribution of this thesis is the application-oriented evaluation of our proposed methodology in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up the work and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation into the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of past, ongoing and future work. Natural language, Computational linguistics, Natural language processing, Multiword expressions, Lexical acquisition, Machine translation, Lexicography, Corpus linguistics

Search results