21

Intégration de ressources lexicales riches dans un analyseur syntaxique probabiliste / Integration of lexical resources in a probabilistic parser

Sigogne, Anthony 03 December 2012 (has links)
Cette thèse porte sur l'intégration de ressources lexicales et syntaxiques du français dans deux tâches fondamentales du Traitement Automatique des Langues [TAL] que sont l'étiquetage morpho-syntaxique probabiliste et l'analyse syntaxique probabiliste. Dans ce mémoire, nous utilisons des données lexicales et syntaxiques créées par des processus automatiques ou par des linguistes afin de donner une réponse à deux problématiques que nous décrivons succinctement ci-dessous : la dispersion des données et la segmentation automatique des textes. Grâce à des algorithmes d'analyse syntaxique de plus en plus évolués, les performances actuelles des analyseurs sont de plus en plus élevées, et ce pour de nombreuses langues dont le français. Cependant, il existe plusieurs problèmes inhérents aux formalismes mathématiques permettant de modéliser statistiquement cette tâche (grammaire, modèles discriminants,...). La dispersion des données est l'un de ces problèmes, et est causée principalement par la faible taille des corpus annotés disponibles pour la langue. La dispersion représente la difficulté d'estimer la probabilité de phénomènes syntaxiques apparaissant dans les textes à analyser mais qui sont rares ou absents du corpus ayant servi à l'apprentissage des analyseurs. De plus, il est prouvé que la dispersion est en partie un problème lexical, car plus la flexion d'une langue est importante, moins les phénomènes lexicaux sont représentés dans les corpus annotés. Notre première problématique repose donc sur l'atténuation de l'effet négatif de la dispersion lexicale des données sur les performances des analyseurs. Dans cette optique, nous nous sommes intéressé à une méthode appelée regroupement lexical, et qui consiste à regrouper les mots du corpus et des textes en classes. Ces classes réduisent le nombre de mots inconnus et donc le nombre de phénomènes syntaxiques rares ou inconnus, liés au lexique, des textes à analyser. Notre objectif est donc de proposer des regroupements lexicaux à partir d'informations tirées des lexiques syntaxiques du français, et d'observer leur impact sur les performances d'analyseurs syntaxiques. Par ailleurs, la plupart des évaluations concernant l'étiquetage morpho-syntaxique probabiliste et l'analyse syntaxique probabiliste ont été réalisées avec une segmentation parfaite du texte, car identique à celle du corpus évalué. Or, dans les cas réels d'application, la segmentation d'un texte est très rarement disponible et les segmenteurs automatiques actuels sont loin de proposer une segmentation de bonne qualité, et ce, à cause de la présence de nombreuses unités multi-mots (mots composés, entités nommées,...). Dans ce mémoire, nous nous focalisons sur les unités multi-mots dites continues qui forment des unités lexicales auxquelles on peut associer une étiquette morpho-syntaxique, et que nous appelons mots composés. Par exemple, cordon bleu est un nom composé, et tout à fait un adverbe composé. Nous pouvons assimiler la tâche de repérage des mots composés à celle de la segmentation du texte. Notre deuxième problématique portera donc sur la segmentation automatique des textes français et son impact sur les performances des processus automatiques. Pour ce faire, nous nous sommes penché sur une approche consistant à coupler, dans un même modèle probabiliste, la reconnaissance des mots composés et une autre tâche automatique. Dans notre cas, il peut s'agir de l'analyse syntaxique ou de l'étiquetage morpho-syntaxique. 
La reconnaissance des mots composés est donc réalisée au sein du processus probabiliste et non plus dans une phase préalable. Notre objectif est donc de proposer des stratégies innovantes permettant d'intégrer des ressources de mots composés dans deux processus probabilistes combinant l'étiquetage ou l'analyse à la segmentation du texte / This thesis focuses on the integration of lexical and syntactic resources of French into two fundamental tasks of Natural Language Processing [NLP], namely probabilistic part-of-speech tagging and probabilistic parsing. In the case of French, a large amount of lexical and syntactic data has been created by automatic processes or by linguists. In addition, a number of experiments have shown the benefit of using such resources in processes such as tagging or parsing, since they can significantly improve system performance. In this thesis, we use these resources to address two problems that we describe briefly below: data sparseness and automatic segmentation of texts. Thanks to increasingly sophisticated parsing algorithms, parsing accuracy is becoming higher for many languages, including French. However, there are several problems inherent in the mathematical formalisms that statistically model the task (grammars, discriminative models, ...). Data sparseness is one of those problems, and is mainly caused by the small size of the annotated corpora available for the language. Data sparseness is the difficulty of estimating the probability of syntactic phenomena, appearing in the texts to be analyzed, that are rare or absent from the corpus used for training parsers. Moreover, it has been shown that sparseness is partly a lexical problem, because the richer the morphology of a language is, the sparser the lexicons built from a treebank will be for that language. Our first problem therefore consists in mitigating the negative impact of lexical data sparseness on parsing performance. To this end, we were interested in a method called word clustering, which consists in grouping the words of the corpus and of the texts into clusters. These clusters reduce the number of unknown words, and therefore the number of rare or unknown syntactic phenomena, related to the lexicon, in the texts to be analyzed. Our goal is to propose word clustering methods based on syntactic information from French lexicons, and to observe their impact on parser accuracy. Furthermore, most evaluations of probabilistic tagging and parsing have been performed with a perfect segmentation of the text, identical to that of the evaluated corpus. But in real applications, the segmentation of a text is rarely available, and current automatic segmentation tools fall short of producing a high-quality segmentation because of the presence of many multi-word units (compound words, named entities, ...). In this thesis, we focus on continuous multi-word units, called compound words, that form lexical units to which a part-of-speech tag can be assigned. The task of identifying compound words can thus be seen as a text segmentation task. Our second issue therefore concerns the automatic segmentation of French texts and its impact on the performance of automatic processes. To do this, we focused on an approach that couples, in a single probabilistic model, the recognition of compound words and another automatic task; in our case, it may be parsing or tagging. The recognition of compound words is thus performed within the probabilistic process rather than in a preliminary phase. 
Our goal is therefore to propose innovative strategies for integrating compound-word resources into two probabilistic processes that combine tagging, or parsing, with text segmentation.
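To make the clustering idea above concrete, here is a minimal Python sketch of lexical back-off before parser training. It assumes a generic cluster table (for example, one derived from a syntactic lexicon); the names cluster_of, min_count and <UNK> are illustrative and not taken from the thesis.

```python
from collections import Counter

def cluster_tokens(sentences, cluster_of, min_count=2):
    """Replace rare word forms with a cluster label before parser training.

    sentences  : list of token lists
    cluster_of : dict mapping a word form to a cluster id (e.g. derived
                 from a syntactic lexicon or Brown-style clustering)
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    clustered = []
    for sent in sentences:
        new_sent = []
        for tok in sent:
            if counts[tok] >= min_count:
                new_sent.append(tok)                           # frequent form: keep it
            else:
                new_sent.append(cluster_of.get(tok, "<UNK>"))  # rare form: back off to its cluster
        clustered.append(new_sent)
    return clustered

# Illustrative use with two toy sentences and a toy cluster table.
sents = [["le", "cordon", "bleu", "cuisine"], ["une", "recette", "bleue"]]
clusters = {"cordon": "C_NOUN12", "recette": "C_NOUN07", "bleue": "C_ADJ03"}
print(cluster_tokens(sents, clusters, min_count=2))
```

The parser is then trained on the clustered tokens, so a rare inflected form and a frequent one that share a cluster contribute to the same statistics.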
22

Text Augmentation: Inserting markup into natural language text with PPM Models

Yeates, Stuart Andrew January 2006 (has links)
This thesis describes a new optimisation and new heuristics for automatically marking up XML documents using PPM models, together with CEM, a Java implementation. CEM is significantly more general than previous systems, marking up large numbers of hierarchical tags, using n-gram models for large n and a variety of escape methods. Four corpora are discussed, including the bibliography corpus of 14,682 bibliographies laid out in seven standard styles using the BibTeX system and marked up in XML with every field from the original BibTeX. Other corpora include the ROCLING Chinese text segmentation corpus, the Computists' Communique corpus and the Reuters corpus. A detailed examination is presented of the methods for evaluating markup algorithms, including computational complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning and information theory. A new taxonomy of markup complexities is established and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and of the optimisation is examined using the four corpora.
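As a rough illustration of the markup idea, the sketch below scores a text segment against per-tag character models and picks the cheapest one. It uses a plain add-one-smoothed character n-gram in place of a real PPM model (no escape methods, no hierarchical tags, no search over segmentations), so it conveys only the flavour of the approach, not CEM's actual algorithm; all training strings are invented.

```python
import math
from collections import defaultdict

class CharNGram:
    """Tiny character n-gram model with add-one smoothing.

    A stand-in for a true PPM model: no escape mechanism and no blending
    of orders -- just enough to show the per-tag scoring idea.
    """
    def __init__(self, order=3):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            ctx, ch = padded[i - self.order:i], padded[i]
            self.counts[ctx][ch] += 1

    def bits(self, text):
        """Approximate code length (in bits) of text under the model."""
        padded = " " * self.order + text
        total = 0.0
        for i in range(self.order, len(padded)):
            ctx, ch = padded[i - self.order:i], padded[i]
            seen = self.counts.get(ctx, {})
            prob = (seen.get(ch, 0) + 1) / (sum(seen.values()) + 256)
            total += -math.log2(prob)
        return total

def best_tag(segment, models):
    """Pick the tag whose character model encodes the segment most cheaply."""
    return min(models, key=lambda tag: models[tag].bits(segment))

# Illustrative use with two hypothetical bibliography-field models.
author, year = CharNGram(), CharNGram()
author.train("Yeates, S. A. and Witten, I. H. and Bainbridge, D.")
year.train("1999 2000 2001 2002 2003 2004 2005 2006")
models = {"author": author, "year": year}
print(best_tag("2006", models), best_tag("Bainbridge, D.", models))
```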
23

Bidirectional LSTM-CNNs-CRF Models for POS Tagging

Tang, Hao January 2018 (has links)
In order to achieve state-of-the-art performance in part-of-speech (POS) tagging, traditional systems require a significant amount of hand-crafted features and data pre-processing. In this thesis, we present a discriminative hybrid neural network architecture combining word embeddings, character embeddings and byte pair encoding (BPE) to implement a true end-to-end system without feature engineering or data pre-processing. The architecture is a combination of a bidirectional LSTM, CNNs, and a CRF, which can achieve state-of-the-art performance for a wide range of sequence labeling tasks. We evaluate our model on the Universal Dependencies (UD) dataset for English, Spanish, and German POS tagging. It outperforms other models with 95.1%, 98.15%, and 93.43% accuracy on the respective test sets. Moreover, the largest improvements of our model appear on the out-of-vocabulary corpora for Spanish and German. According to statistical significance testing, the improvements for English on the test and out-of-vocabulary corpora are not statistically significant. However, the improvements for the other, more morphologically rich languages are statistically significant on their corresponding corpora.
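A minimal PyTorch sketch of the word-plus-character hybrid architecture described above follows. It is a deliberate simplification under stated assumptions: the CRF output layer is replaced by plain per-token scores, the BPE branch is omitted, the framework choice is mine, and all dimensions are invented rather than taken from the thesis.

```python
import torch
import torch.nn as nn

class BiLSTMCNNTagger(nn.Module):
    """Skeleton of a word + character hybrid tagger (softmax head instead of a CRF)."""
    def __init__(self, n_words, n_chars, n_tags,
                 word_dim=100, char_dim=30, char_filters=30, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Character-level CNN: convolve over each word's characters, then max-pool.
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim + char_filters, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, words, chars):
        # words: (batch, seq)    chars: (batch, seq, max_word_len)
        b, s, l = chars.shape
        c = self.char_emb(chars).view(b * s, l, -1).transpose(1, 2)   # (b*s, char_dim, l)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values.view(b, s, -1)
        x = torch.cat([self.word_emb(words), c], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)  # per-token emission scores (a CRF would decode over these)

# Dummy forward pass: 2 sentences, 5 tokens each, 8 characters per token.
model = BiLSTMCNNTagger(n_words=1000, n_chars=60, n_tags=17)
words = torch.randint(0, 1000, (2, 5))
chars = torch.randint(0, 60, (2, 5, 8))
print(model(words, chars).shape)  # torch.Size([2, 5, 17])
```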
24

[pt] CLASSES DE PALAVRAS - DA GRÉCIA ANTIGA AO GOOGLE: UM ESTUDO MOTIVADO PELA CONVERSÃO DE TAGSETS / [en] PART OF SPEECH - FROM ANCIENT GREECE TO GOOGLE: A STUDY MOTIVATED BY TAGSET CONVERSION

LUIZA FRIZZO TRUGO 10 November 2016 (has links)
[pt] A dissertação Classes de palavras — da Grécia Antiga ao Google: um estudo motivado pela conversão de tagsets consiste em um estudo linguístico sobre classes gramaticais. A pesquisa tem como motivação uma tarefa específica da Linguística Computacional: a anotação de classes gramaticais (POS, do inglês part of speech). Especificamente, a dissertação relata desafios e opções linguísticas decorrentes da tarefa de alinhamento entre dois tagsets: o tagset utilizado na anotação do corpus Mac-Morpho, um corpus brasileiro de 1.1 milhão de palavras, e o tagset proposto por uma equipe dos laboratórios Google e que vem sendo utilizado no âmbito do projeto Universal Dependencies (UD). A dissertação tem como metodologia a investigação por meio da anotação de grandes corpora e tematiza sobretudo o alinhamento entre as formas participiais. Como resultado, além do estudo e da documentação das opções linguísticas, a presente pesquisa também propiciou um cenário que viabiliza o estudo do impacto de diferentes tagsets em sistemas de Processamento de Linguagem Natural (PLN) e possibilitou a criação e a disponibilização de mais um recurso para a área de processamento de linguagem natural do português: o corpus Mac-Morpho anotado com o tagset e a filosofia de anotação do projeto UD, viabilizando assim estudos futuros sobre o impacto de diferentes tagsets no processamento automático de uma língua. / [en] The present dissertation, Part of speech — from Ancient Greece to Google: a study motivated by tagset conversion, is a linguistic study regarding grammatical word classes. This research is motivated by a specific task from Computational Linguistics: the annotation of part of speech (POS). Specifically, this dissertation reports the challenges and linguistic options arising from the task of aligning two tagsets: the first used in the annotation of the Mac-Morpho corpus — a Brazilian corpus with 1.1 million words — and the second proposed by a Google research team, which has been used in the context of the Universal Dependencies (UD) project. The present work adopts the annotation of large corpora as methodology and focuses mainly on the alignment of the past participle forms. As a result, in addition to the study and the documentation of the linguistic choices, this research provides a scenario which enables the study of the impact different tagsets have on Natural Language Processing (NLP) systems and presents another Portuguese NLP resource: the Mac-Morpho corpus annotated with the UD project's tagset and consistent with its annotation philosophy, thus enabling future studies regarding the impact of different tagsets on the automatic processing of a language.
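The kind of tagset conversion discussed above can be pictured as a mapping table plus special handling for the hard cases (here, the participle). The Python sketch below is purely illustrative: the Mac-Morpho tag names and the mapping decisions are assumptions made for the example, not the dissertation's documented table.

```python
# Illustrative (assumed) subset of a Mac-Morpho -> UD UPOS mapping table.
MACMORPHO_TO_UD = {
    "N": "NOUN", "NPROP": "PROPN", "ADJ": "ADJ", "ADV": "ADV",
    "ART": "DET", "NUM": "NUM", "PREP": "ADP", "V": "VERB",
    "VAUX": "AUX", "KC": "CCONJ", "KS": "SCONJ", "IN": "INTJ",
    "PROPESS": "PRON",
}

def convert(token, tag):
    """Map one Mac-Morpho tag to a UD UPOS tag.

    The participle ('PCP' here, assumed) is the ambiguous case discussed above:
    it may correspond to VERB or ADJ in UD, so it is flagged for a contextual
    decision rather than silently converted.
    """
    if tag == "PCP":
        return "VERB_OR_ADJ?"             # needs inspection of the context
    return MACMORPHO_TO_UD.get(tag, "X")  # UD uses X for unanalysable material

for tok, tag in [("casa", "N"), ("pintada", "PCP"), ("e", "KC")]:
    print(tok, tag, "->", convert(tok, tag))
```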
25

[en] SUPPORT NOUNS: OPERATIONAL CRITERIA FOR CHARACTERIZATION / [pt] O SUBSTANTIVO-SUPORTE: CRITÉRIOS OPERACIONAIS DE CARACTERIZAÇÃO

CLAUDIA MARIA GARCIA MEDEIROS DE OLIVEIRA 06 March 2007 (has links)
[pt] Este trabalho tem por objetivo prover um critério operacional para caracterizar substantivos em combinações de substantivo seguido de adjetivo, em que o substantivo apresenta situação análoga à dos chamados verbos leves ou verbos-suporte, largamente estudados em Lingüística e Processamento de Linguagem Natural nos últimos anos. O trabalho se situa na confluência entre estudos lingüísticos, lexicográficos e computacionais e pretende explorar a potencialidade da análise automática de corpora e instrumentos quantitativos em busca de uma maior objetividade na fundamentação de conceitos que norteiam a atividade de análise lingüística. O desenvolvimento da pesquisa alia a pesquisa em corpus ao dicionário tradicional para realizar o levantamento das principais propriedades das combinações S - Adj, particularizado para o caso de ocorrência de adjetivos denominais. A partir das informações lexicográficas e contextuais demonstra-se a existência de um conjunto de substantivos que participam das construções estudadas de maneira semelhante aos verbos-suporte em combinações V - SN. Um método automático de reconhecimento dos substantivos-suporte em textos é elaborado, com o objetivo de fornecer aos estudiosos um instrumento capaz de produzir evidências convincentes, dada a insuficiência de julgamentos intuitivos para justificar a delimitação de expressões de aparente irregularidade. / [en] The main goal of this work is to provide operational criteria for characterizing nouns in Noun - Adjective combinations, in which the noun occurs in an analogous way to so-called light verbs or support verbs, widely studied in recent years in both Linguistics and Natural Language Processing. In the work, linguistic, lexicographic and computational studies converge in order to explore the potential for automatic analysis of corpora, whose aim is to provide quantitative tools and methods which would lead to a more objective way of establishing concepts which underlie linguistic analysis. The work unites corpus-based research with traditional lexicography in order to elicit the main properties of the N-Adj combinations occurring with denominal adjectives. The lexicographic and contextual data reveal the existence of a set of nouns that occur in the studied constructions in a way similar to light verbs in V-Noun phrasal combinations. An automatic method for recognizing support nouns in texts is developed, which will provide language specialists with an instrument capable of bringing solid evidence to add to intuitive judgments in the task of justifying the delimitation of expressions that are apparently irregular.
26

A comparative analysis of word use in popular science and research articles in the natural sciences: A corpus linguistic investigation

Nilsson, Fredrik January 2019 (has links)
Within the realm of the natural sciences there are different written genres for interested readers to explore. Popular science articles aim to explain advanced scientific research to a non-expert audience, while research articles target the science experts themselves. This study explores these genres in some detail in order to identify linguistic differences between them. Using two corpora consisting of over 200,000 words each, a corpus linguistic analysis was used to perform both quantitative and qualitative examinations of the two genres. The methods of analysis included word frequency, keyword, concordance, cluster and collocation analyses. In addition, part-of-speech tagging was used as a complement to distinguish word class use between the two genres. The results show that popular science articles feature personal pronouns to a much greater extent than research articles, which contain more noun repetition and specific terminology overall. In addition, the keywords proved to be significant for the respective genres, both in and out of their original context as well as in word clusters, forming word constructions typical of each genre. Overall, the study showed that while both genres are very much related through their roots in natural science research, they accomplish the task of disseminating scientific information using different linguistic approaches.
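A small sketch of the kind of corpus-linguistic pipeline described above (frequency lists, collocations, POS-tag distribution), written here with NLTK. The study does not state which toolkit was used, so the library choice, the resource downloads, and the toy sentences are all assumptions; the real corpora contained over 200,000 words each.

```python
import nltk
from nltk import FreqDist, word_tokenize, pos_tag
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # POS tagger model

# Toy stand-ins for the popular-science and research-article corpora.
popular = word_tokenize("We found that the cells talk to each other constantly.")
research = word_tokenize("The results indicate significant intercellular signalling activity.")

# Word-frequency comparison (relative frequencies in each corpus).
pop_freq, res_freq = FreqDist(popular), FreqDist(research)
for w in sorted(set(popular)):
    print(w, pop_freq.freq(w), res_freq.freq(w))

# Collocations and POS distribution in the research-article sample.
finder = BigramCollocationFinder.from_words(research)
print(finder.nbest(BigramAssocMeasures().likelihood_ratio, 3))
print(FreqDist(tag for _, tag in pos_tag(research)).most_common(5))
```

Keyword analysis proper would compare the two frequency lists with a statistic such as log-likelihood; the ratio of the two freq() values above is only a crude stand-in.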
27

[en] SEMANTIC TYPOLOGIES OF ADVERBS: A COMPARATIVE STUDY / [pt] TIPOLOGIAS SEMÂNTICAS DE ADVÉRBIOS: UM ESTUDO COMPARATIVO

ZENAIDE DIAS TEIXEIRA 02 June 2008 (has links)
[pt] Este trabalho teve por objetivo descrever, analisar e discutir comparativamente tipologias semânticas de advérbios propostas em duas vertentes dos estudos da linguagem: a Gramática Tradicional, de um lado, e a Lingüística de orientação funcionalista, de outro. Para tal, mapeamos tipologias encontradas em um conjunto representativo de gramáticas tradicionais do português e em uma amostra não menos representativa de trabalhos de lingüistas brasileiros que se debruçaram sobre o tema adotando uma abordagem funcionalista. Propusemos dois quadros tipológicos resumitivos das duas vertentes de classificação, nos quais buscamos identificar as principais classes semânticas estabelecidas em cada uma das duas vertentes. Aplicamos, então, esses dois instrumentos de classificação a um mesmo corpus de frases autênticas do português (extraído do centro de recursos distribuídos Linguateca/Frases PB), e analisamos os resultados comparando as duas classificações quanto aos seguintes critérios: (a) abrangência; (b) explicitude; e (c) adequação aos propósitos norteadores (normativo-didáticos ou teórico-descritivos). Tendo em vista tais critérios, apontamos vantagens e desvantagens relativas dos dois tipos de classificação e destacamos alguns problemas enfrentados igualmente na tradição gramatical e na lingüística funcionalista no que tange a caracterização do comportamento semântico dos advérbios. / [en] This work aims to describe, analyze and discuss comparatively semantic typologies of adverbs proposed in two strands of linguistic studies: Traditional Grammar on the one hand and Functional Linguistics on the other. For that, we analyzed and mapped typologies found in a representative set of Portuguese traditional grammars as well as in an equally significant sample of functionally-oriented Brazilian linguistic studies. Two typological schemes were proposed to represent the two approaches, identifying the main semantic categories established in each one of them. These two classification instruments were then applied to the analysis of a corpus of authentic Portuguese sentences (an excerpt from the corpus made available by Linguateca/Frases PB). The classifications were analyzed and compared according to the following criteria: a) range; b) explicitude; c) suitability to purposes (didactic-normative and/or descriptive-theoretical). Using these criteria, it was possible to point out relative advantages and disadvantages in both types of approach and to expose some problems faced equally by traditional grammar and functional linguistics concerning the semantic behaviour of adverbs.
28

New Chinese Words in 2014 – A Study of Word-formation Processes

Warell, Peter January 2016 (has links)
随着社会的发展,尤其是互联网的发展,很多语言每年都涌现出了不少新词汇。词语是每个语言最基本也是最重要的组成部分,因此分析这些新词汇的结构特点以及构词法是很有意义的。这篇文章分析了2014年出现在中文里的新词汇和它们的构词方式,论文的目的是为了更好地了解中文词汇的发展和特点。本文以《2014汉语新词语》中公布的2014年出现的新词汇作为语料进行分析,发现了以下两个主要特点:第一,合成法,派生法,缩略法是2014年产生的新词汇的主要构词方式;第二, 百分之七十二的新词汇是多音节词(包含三个或者三个以上音节),而百分之八十的是名词。这些特点说明中文词汇现阶段的特点和发展趋势,跟传统的中文词汇有不同之处。 / The aim of this thesis was to investigate how new Chinese words are formed and to examine the linguistic patterns among them. This thesis focused on the analysis of Chinese words formed in 2014. The quantitative data for the analysis included a collection of 423 new Chinese words from the book 2014 汉语新词语 (hànyǔxīn cíyǔ) by Hou and Zhou. Parts of speech and number of syllables in the new words were investigated, although the focus was on word-formation processes. A discussion of derivation, blending, abbreviation, analogy, borrowing, change of meaning, compounding and inventions is also included. The share of each word-formation process used for each of the new words was presented statistically in order to reveal the significance of each word-formation process. The analysis showed that compounding, derivation and abbreviation were the major word-formation processes in 2014. The study also suggests that words formed by derivation and analogy were much more frequent in 2014, in comparison to previous studies. Furthermore, the ways words are formed in Chinese are changing and evolving, as some word-formation processes are becoming more frequently used in the formation of new words.
29

Traitement automatique du dialecte tunisien à l'aide d'outils et de ressources de l'arabe standard : application à l'étiquetage morphosyntaxique / Natural Language Processing of Tunisian Dialect Using Standard Arabic Tools and Resources: Application to Part-of-Speech Tagging

Hamdi, Ahmed 04 December 2015 (has links)
Le développement d’outils de traitement automatique pour les dialectes de l’arabe se heurte à l’absence de ressources pour ces derniers. Comme conséquence d’une situation de diglossie, il existe une variante de l’arabe, l’arabe moderne standard, pour laquelle de nombreuses ressources ont été développées et ont permis de construire des outils de traitement automatique de la langue. Étant donné la proximité des dialectes de l’arabe avec l’arabe moderne standard, une voie consiste à réaliser une conversion surfacique du dialecte vers l’arabe moderne standard afin de pouvoir utiliser les outils existants pour l’arabe standard. Dans ce travail, nous nous intéressons particulièrement au traitement du dialecte tunisien. Nous proposons un système de conversion du tunisien vers une forme approximative de l’arabe standard pour laquelle l’application des outils conçus pour ce dernier permet d’obtenir de bons résultats. Afin de valider cette approche, nous avons eu recours à un étiqueteur morphosyntaxique conçu pour l’étiquetage de l’arabe standard. Ce dernier permet d’assigner des étiquettes morphosyntaxiques à la sortie de notre système de conversion. Ces étiquettes sont finalement projetées sur le tunisien. Notre système atteint une précision de 89% suite à la conversion, qui représente une augmentation absolue de ∼20% par rapport à l’étiquetage d’avant la conversion. / Developing natural language processing tools usually requires a large number of resources (lexica, annotated corpora, ...), which often do not exist for less-resourced languages. One way to overcome the problem of lack of resources is to devote substantial effort to building new ones from scratch. Another approach is to exploit existing resources of closely related languages. Taking advantage of the closeness of standard Arabic and its dialects, one way to solve the problem of limited resources consists in performing a conversion of Arabic dialects into standard Arabic in order to use the tools developed to handle the latter. In this work, we focus especially on processing the Tunisian Arabic dialect. We propose a system that converts Tunisian into an approximate form of standard Arabic, for which the application of natural language processing tools designed for the latter provides good results. In order to validate our approach, we focused on part-of-speech tagging. Our system achieves an accuracy of 89%, which represents an absolute improvement of ∼20% over a standard Arabic tagger baseline.
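The pivot strategy described above (convert to standard Arabic, tag, project the tags back) can be sketched as follows. The conversion function, the tagger, and the tag names are all placeholders invented for illustration; only the projection logic is shown, not the thesis's actual system.

```python
def tag_tunisian(sentence, convert, msa_tagger):
    """Tag a Tunisian-dialect sentence via a pivot through standard Arabic.

    convert    : maps each Tunisian token to a (possibly multi-token)
                 approximation in standard Arabic -- placeholder for the
                 thesis's conversion system
    msa_tagger : any POS tagger trained on standard Arabic (placeholder)
    """
    converted, alignment = [], []   # alignment[i] = indices of MSA tokens for source token i
    for tok in sentence:
        msa_tokens = convert(tok)
        alignment.append(range(len(converted), len(converted) + len(msa_tokens)))
        converted.extend(msa_tokens)
    msa_tags = msa_tagger(converted)
    # Project tags back: take the tag of the first aligned MSA token.
    return [(sentence[i], msa_tags[idx[0]] if idx else "UNK")
            for i, idx in enumerate(alignment)]

# Toy example with dummy components (transliterated forms and invented tags).
dummy_convert = lambda t: {"besh": ["sawfa"], "nemshi": ["adhhabu"]}.get(t, [t])
dummy_tagger = lambda toks: ["FUT" if t == "sawfa" else "VERB" for t in toks]
print(tag_tunisian(["besh", "nemshi"], dummy_convert, dummy_tagger))
```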
30

Toward an on-line preprocessor for Swedish / Mot en on-line preprocessor för svenska

Wemmert, Oscar January 2017 (has links)
This bachelor thesis presents OPT (Open Parse Tool), a Java program that allows independent parsers and taggers to be run in sequence. For this thesis, the existing Java versions of Stagger and MaltParser have been adapted for use as modules in this program, and OPT's performance has then been compared to an existing alternative that is in active use (Språkbanken's Korp Corpus Pipeline, henceforth KCP). Execution speed has been compared, and OPT's accuracy has been coarsely tested as either comparable to or divergent from that of KCP. The same collection of documents containing natural text has been fed through OPT and KCP in sequence, and execution time was recorded. The tagged output of OPT and KCP was then run through SCREAM (Sjöholm, 2012), and if SCREAM produced comparable results for the two, the accuracy of OPT was considered comparable to that of KCP. The results show that OPT completes its tagging and parsing of the documents in around 35 minutes, while KCP takes over four hours to complete. SCREAM performed almost exactly the same using the outputs of either program, except for one case in which OPT's output gave better results than KCP's. The accuracy of OPT was thus considered comparable to that of KCP. The one divergent example cannot be fully understood or explained in this thesis, given that the thesis treats SCREAM's internals largely as a black box.
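OPT itself is a Java program; the sketch below merely illustrates the sequencing idea (each module's output feeds the next, as with Stagger feeding MaltParser) in Python. The class and method names are invented for the sketch and are not OPT's real interface.

```python
class Module:
    """Abstract pipeline stage (corresponding to a tagger or a parser)."""
    def process(self, sentences):
        raise NotImplementedError

class Pipeline:
    """Run independent modules in sequence, feeding each one's output to the next."""
    def __init__(self, *modules):
        self.modules = modules

    def run(self, sentences):
        for module in self.modules:
            sentences = module.process(sentences)
        return sentences

class ToyTagger(Module):
    def process(self, sentences):
        # Assign a dummy tag to every token.
        return [[(tok, "NN") for tok in sent] for sent in sentences]

class ToyParser(Module):
    def process(self, sentences):
        # Attach a token index as a placeholder "dependency" on the tagged input.
        return [[(tok, tag, i) for i, (tok, tag) in enumerate(sent)] for sent in sentences]

print(Pipeline(ToyTagger(), ToyParser()).run([["Det", "här", "fungerar"]]))
```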
