Global ETD Search

31	Tvorba závislostního korpusu pro jorubštinu s využitím paralelních dat / Creating a dependency treebank for Yorùbá using parallel data Oluokun, Adedayo January 2018 The goal of this thesis is to create a dependency treebank for Yorùbá, a language with very few pre-existing machine-readable resources. The treebank follows the Universal Dependencies (UD) annotation standard; certain language-specific guidelines for Yorùbá were also specified. Known techniques for porting resources from resource-rich languages were tested, in particular projection of annotation across parallel bilingual data. Manual annotation is not the main focus of this thesis; nevertheless, a small portion of the data was verified manually in order to evaluate the annotation quality. In addition, a model was trained on the manual annotation using UDPipe.
32	Bidirectional LSTM-CNNs-CRF Models for POS Tagging Tang, Hao January 2018 To achieve state-of-the-art performance in part-of-speech (POS) tagging, traditional systems require a significant amount of hand-crafted features and data pre-processing. In this thesis, we present a hybrid neural network architecture that combines discriminative word embedding, character embedding and byte pair encoding (BPE) to implement a true end-to-end system without feature engineering or data pre-processing. The architecture is a combination of bidirectional LSTM, CNNs, and CRF, which can achieve state-of-the-art performance on a wide range of sequence labeling tasks. We evaluate our model on the Universal Dependencies (UD) dataset for English, Spanish, and German POS tagging. It outperforms other models, with 95.1%, 98.15%, and 93.43% accuracy on the respective test sets. Moreover, the largest improvements of our model appear on the out-of-vocabulary corpora for Spanish and German. According to statistical significance testing, the improvements for English on the test and out-of-vocabulary corpora are not statistically significant. However, the improvements for the other, morphologically richer languages are statistically significant on their corresponding corpora. Keywords: bidirectional LSTM; part of speech; CNNs; CRF; byte pair encoding (BPE)
33	[pt] CLASSES DE PALAVRAS - DA GRÉCIA ANTIGA AO GOOGLE: UM ESTUDO MOTIVADO PELA CONVERSÃO DE TAGSETS / [en] PART OF SPEECH - FROM ANCIENT GREECE TO GOOGLE: A STUDY MOTIVATED BY TAGSET CONVERSION LUIZA FRIZZO TRUGO 10 November 2016 [pt] A dissertação Classes de palavras - da Grécia Antiga ao Google: um estudo motivado pela conversão de tagsets consiste em um estudo linguístico sobre classes gramaticais. A pesquisa tem como motivação uma tarefa específica da Linguística Computacional: a anotação de classes gramaticais (POS, do inglês part of speech). Especificamente, a dissertação relata desafios e opções linguísticas decorrentes da tarefa de alinhamento entre dois tagsets: o tagset utilizado na anotação do corpus Mac-Morpho, um corpus brasileiro de 1.1 milhão de palavras, e o tagset proposto por uma equipe dos laboratórios Google e que vem sendo utilizado no âmbito do projeto Universal Dependencies (UD). A dissertação tem como metodologia a investigação por meio da anotação de grandes corpora e tematiza sobretudo o alinhamento entre as formas participiais. Como resultado, além do estudo e da documentação das opções linguísticas, a presente pesquisa também propiciou um cenário que viabiliza o estudo do impacto de diferentes tagsets em sistemas de Processamento de Linguagem Natural (PLN) e possibilitou a criação e a disponibilização de mais um recurso para a área de processamento de linguagem natural do português: o corpus Mac-Morpho anotado com o tagset e a filosofia de anotação do projeto UD, viabilizando assim estudos futuros sobre o impacto de diferentes tagsets no processamento automático de uma língua. / [en] The present dissertation, Part of speech - from Ancient Greece to Google: a study motivated by tagset conversion, is a linguistic study of grammatical word classes. This research is motivated by a specific task in Computational Linguistics: part-of-speech (POS) annotation.
Specifically, this dissertation reports the challenges and linguistic choices arising from the task of aligning two tagsets: the first used in the annotation of the Mac-Morpho corpus (a Brazilian corpus of 1.1 million words) and the second proposed by a team from Google's research labs, which has been used in the context of the Universal Dependencies (UD) project. The present work adopts the annotation of large corpora as its methodology and focuses mainly on the alignment of the past participle forms. As a result, in addition to the study and the documentation of the linguistic choices, this research provides a scenario that enables the study of the impact different tagsets have on Natural Language Processing (NLP) systems and presents another Portuguese NLP resource: the Mac-Morpho corpus annotated with the UD project's tagset and consistent with its annotation philosophy, thus enabling future studies of the impact of different tagsets on the automatic processing of a language. [pt] LINGUISTICA COMPUTACIONAL [pt] PARTICIPIO [pt] ANOTACAO [pt] CORPUS [pt] CLASSE DE PALAVRAS [en] COMPUTATIONAL LINGUISTICS [en] PARTICIPLE [en] ANNOTATION [en] CORPORA [en] PART OF SPEECH
34	[en] SUPPORT NOUNS: OPERATIONAL CRITERIA FOR CHARACTERIZATION / [pt] O SUBSTANTIVO-SUPORTE: CRITÉRIOS OPERACIONAIS DE CARACTERIZAÇÃO CLAUDIA MARIA GARCIA MEDEIROS DE OLIVEIRA 06 March 2007 [pt] Este trabalho tem por objetivo prover um critério operacional para caracterizar substantivos em combinações de substantivo seguido de adjetivo, em que o substantivo apresenta situação análoga à dos chamados verbos leves ou verbos-suporte, largamente estudados em Lingüística e Processamento de Linguagem Natural nos últimos anos. O trabalho se situa na confluência entre estudos lingüísticos, lexicográficos e computacionais e pretende explorar a potencialidade da análise automática de corpora e instrumentos quantitativos em busca de uma maior objetividade na fundamentação de conceitos que norteiam a atividade de análise lingüística. O desenvolvimento da pesquisa alia a pesquisa em corpus ao dicionário tradicional para realizar o levantamento das principais propriedades das combinações S - Adj, particularizado para o caso de ocorrência de adjetivos denominais. A partir das informações lexicográficas e contextuais demonstra-se a existência de um conjunto de substantivos que participam das construções estudadas de maneira semelhante aos verbos-suporte em combinações V - SN. Um método automático de reconhecimento dos substantivos-suporte em textos é elaborado, com o objetivo de fornecer aos estudiosos um instrumento capaz de produzir evidências convincentes, dada a insuficiência de julgamentos intuitivos para justificar a delimitação de expressões de aparente irregularidade. / [en] The main goal of this work is to provide operational criteria for characterizing nouns in Noun - Adjective combinations, in which the noun behaves analogously to so-called light verbs or support verbs, widely studied in recent years in both Linguistics and Natural Language Processing.
In this work, linguistic, lexicographic and computational studies converge to explore the potential of automatic corpus analysis and quantitative tools for establishing, more objectively, the concepts that underlie linguistic analysis. The work combines corpus-based research with traditional lexicography in order to elicit the main properties of N-Adj combinations, in particular those occurring with denominal adjectives. The lexicographic and contextual data reveal the existence of a set of nouns that occur in the studied constructions in a way similar to light verbs in V - Noun Phrase combinations. An automatic method for recognizing support nouns in texts is developed, providing language specialists with an instrument capable of producing solid evidence to supplement intuitive judgments when justifying the delimitation of apparently irregular expressions. [pt] LINGUISTICA [en] LINGUISTICS [pt] LEXICOGRAFIA DE CORPUS [en] CORPUS LEXICOGRAPHY [pt] SUBSTANTIVO-SUPORTE [en] SUPPORT NOUN [pt] ADJETIVO DENOMINAL [en] DENOMINAL ADJECTIVE [pt] CLASSE DE PALAVRAS [en] PART OF SPEECH
35	A comparative analysis of word use in popular science and research articles in the natural sciences: A corpus linguistic investigation Nilsson, Fredrik January 2019 Within the realm of the natural sciences there are different written genres for interested readers to explore. Popular science articles aim to explain advanced scientific research to a non-expert audience, while research articles target the science experts themselves. This study explores these genres in some detail in order to identify linguistic differences between them. Using two corpora of over 200 000 words each, a corpus linguistic analysis was carried out to examine the two genres both quantitatively and qualitatively. The methods of analysis included word frequency, keyword, concordance, cluster and collocation analyses. In addition, part-of-speech tagging was used as a complement to distinguish word class use between the two genres. The results show that popular science articles feature personal pronouns to a much greater extent than research articles, which contain more noun repetition and specific terminology overall. In addition, the keywords proved to be significant for the respective genres, both in and out of their original context as well as in word clusters, forming word constructions typical of each genre. Overall, the study showed that while both genres are closely related through their roots in natural science research, they accomplish the task of disseminating scientific information using different linguistic approaches. Keywords: corpus linguistics; popular science; popularization; research articles; word frequency; concordance; keyword-in-context; word clusters; part-of-speech. General Language Studies and Linguistics
36	[en] SEMANTIC TYPOLOGIES OF ADVERBS: A COMPARATIVE STUDY / [pt] TIPOLOGIAS SEMÂNTICAS DE ADVÉRBIOS: UM ESTUDO COMPARATIVO ZENAIDE DIAS TEIXEIRA 02 June 2008 [pt] Este trabalho teve por objetivo descrever, analisar e discutir comparativamente tipologias semânticas de advérbios propostas em duas vertentes dos estudos da linguagem: a Gramática Tradicional, de um lado, e a Lingüística de orientação funcionalista, de outro. Para tal, mapeamos tipologias encontradas em um conjunto representativo de gramáticas tradicionais do português e em uma amostra não menos representativa de trabalhos de lingüistas brasileiros que se debruçaram sobre o tema adotando uma abordagem funcionalista. Propusemos dois quadros tipológicos resumitivos das duas vertentes de classificação, nos quais buscamos identificar as principais classes semânticas estabelecidas em cada uma das duas vertentes. Aplicamos, então esses dois instrumentos de classificação a um mesmo corpus de frases autênticas do português (extraído do centro de recursos distribuídos Linguateca/Frases PB), e analisamos os resultados comparando as duas classificações quanto aos seguintes critérios: (a) abrangência; (b) explicitude; e (c) adequação aos propósitos norteadores (normativo-didáticos ou teórico-descritivos). Tendo em vista tais critérios, apontamos vantagens e desvantagens relativas dos dois tipos de classificação e destacamos alguns problemas enfrentados igualmente na tradição gramatical e na lingüística funcionalista no que tange a caracterização do comportamento semântico dos advérbios. / [en] This work aims to describe, analyze and comparatively discuss semantic typologies of adverbs proposed in two strands of linguistic studies: Traditional Grammar on the one hand and Functional Linguistics on the other.
To that end, we analyzed and mapped typologies found in a representative set of traditional Portuguese grammars as well as in an equally representative sample of functionally oriented Brazilian linguistic studies. Two typological schemes were proposed to represent the two approaches, identifying the main semantic categories established in each of them. These two classification instruments were then applied to the same corpus of authentic Portuguese sentences (an excerpt from the corpus made available by Linguateca/Frases PB). The classifications were analyzed and compared according to the following criteria: (a) range; (b) explicitness; (c) suitability to purposes (didactic-normative and/or descriptive-theoretical). Using these criteria, it was possible to point out relative advantages and disadvantages of both types of approach and to expose some problems faced equally by traditional grammar and functional linguistics concerning the semantic behaviour of adverbs. [pt] FUNCIONALISMO [en] FUNCTIONALISM [pt] CLASSE DE PALAVRAS [en] PART OF SPEECH [pt] TIPOLOGIAS SEMANTICAS [en] SEMANTICS TYPES [pt] ADVERBIOS [en] ADVERBS [pt] GRAMATICA TRADICIONAL [en] TRADITIONAL GRAMMAR
37	Hledání struktury vět přirozeného jazyka pomocí částečně řízených metod / Discovering the structure of natural language sentences by semi-supervised methods Rosa, Rudolf January 2018 In this thesis, we focus on the problem of automatically syntactically analyzing a language for which there is no syntactically annotated training data. We explore several methods for cross-lingual transfer of syntactic as well as morphological annotation, ultimately based on the use of bilingual or multilingual sentence-aligned corpora and machine translation approaches. We pay particular attention to automatically estimating the appropriateness of a source language for the analysis of a given target language, devising a novel measure based on the similarity of part-of-speech sequences frequent in the languages. The effectiveness of the presented methods has been confirmed by experiments conducted both by us and, independently, by other respected researchers.
38	New Chinese Words in 2014 – A Study of Word-formation Processes Warell, Peter January 2016 随着社会的发展，尤其是互联网的发展，很多语言每年都涌现出了不少新词汇。词语是每个语言最基本也是最重要的组成部分，因此分析这些新词汇的结构特点以及构词法是很有意义的。这篇文章分析了2014年出现在中文里的新词汇和它们的构词方式，论文的目的是为了更好地了解中文词汇的发展和特点。本文以《2014汉语新词语》中公布的2014年出现的新词汇作为语料进行分析，发现了以下两个主要特点：第一，合成法，派生法，缩略法是2014年产生的新词汇的主要构词方式；第二，百分之七十二的新词汇是多音节词（包含三个或者三个以上音节），而百分之八十的是名词。这些特点说明中文词汇现阶段的特点和发展趋势，跟传统的中文词汇有不同之处。 / The aim of this thesis was to investigate how new Chinese words are formed and to examine the linguistic patterns among them. The thesis focused on the analysis of Chinese words formed in 2014. The quantitative data for the analysis comprised a collection of 423 new Chinese words from the book 2014 汉语新词语 (hànyǔ xīn cíyǔ) by Hou and Zhou. Parts of speech and the number of syllables in the new words were investigated, although the focus was on word-formation processes. A discussion of derivation, blending, abbreviation, analogy, borrowing, change of meaning, compounding and inventions is also included. The share of new words formed by each word-formation process was presented statistically in order to reveal the significance of each process. The analysis showed that compounding, derivation and abbreviation were the major word-formation processes in 2014. The study also suggests that words formed by derivation and analogy were much more frequent in 2014 than previous studies had found. Furthermore, the ways words are formed in Chinese are changing and evolving, as some word-formation processes are becoming more frequently used in the formation of new words. Keywords: words; new words; word-formation processes; contemporary Chinese; part of speech; syllables; blending; abbreviation; borrowing; analogy; derivation; compounds; language change; quasi-affixes. Specific Languages / Studier av enskilda språk
39	Traitement automatique du dialecte tunisien à l'aide d'outils et de ressources de l'arabe standard : application à l'étiquetage morphosyntaxique / Natural Language Processing Of Tunisian Dialect using Standard Arabic Tools and Resources : application to Part-Of-Speech Tagging Hamdi, Ahmed 04 December 2015 Le développement d’outils de traitement automatique pour les dialectes de l’arabe se heurte à l’absence de ressources pour ces derniers. Comme conséquence d’une situation de diglossie, il existe une variante de l’arabe, l’arabe moderne standard, pour laquelle de nombreuses ressources ont été développées et ont permis de construire des outils de traitement automatique de la langue. Étant donné la proximité des dialectes de l’arabe, avec l’arabe moderne standard, une voie consiste à réaliser une conversion surfacique du dialecte vers l’arabe moderne standard afin de pouvoir utiliser les outils existants pour l’arabe standard. Dans ce travail, nous nous intéressons particulièrement au traitement du dialecte tunisien. Nous proposons un système de conversion du tunisien vers une forme approximative de l’arabe standard pour laquelle l’application des outils conçus pour ce dernier permet d’obtenir de bons résultats. Afin de valider cette approche, nous avons eu recours à un étiqueteur morphosyntaxique conçu pour l’étiquetage de l’arabe standard. Ce dernier permet d’assigner des étiquettes morphosyntaxiques à la sortie de notre système de conversion. Ces étiquettes sont finalement projetées sur le tunisien. Notre système atteint une précision de 89% suite à la conversion qui représente une augmentation absolue de ∼20% par rapport à l’étiquetage d’avant la conversion. / Developing natural language processing tools usually requires a large number of resources (lexica, annotated corpora, ...), which often do not exist for less-resourced languages.
One way to overcome the lack of resources is to devote substantial effort to building new ones from scratch. Another approach is to exploit existing resources of closely related languages. Taking advantage of the closeness of standard Arabic and its dialects, one way to address the problem of limited resources is to convert Arabic dialects into standard Arabic in order to use the tools developed to handle the latter. In this work, we focus especially on processing the Tunisian Arabic dialect. We propose a system that converts Tunisian into an approximate form of standard Arabic, for which the application of natural language processing tools designed for the latter provides good results. In order to validate our approach, we focused on part-of-speech tagging. Our system achieved an accuracy of 89%, which represents an absolute improvement of ∼20% over a standard Arabic tagger baseline. Keywords: Traitement automatique; Étiquetage morphosyntaxique; Outils; Ressources; Conversion; Arabe standard; Dialecte tunisien; Natural language Processing; Part-Of-Speech Tagging; Resources; Tools; Conversion; Modern standard Arabic; Tunisian dialect. 004
40	Toward an on-line preprocessor for Swedish / Mot en on-line preprocessor för svenska Wemmert, Oscar January 2017 This bachelor thesis presents OPT (Open Parse Tool), a Java program that allows independent parsers/taggers to be run in sequence. For this thesis, the existing Java versions of Stagger and Maltparser have been adapted for use as modules in this program, and OPT's performance has then been compared to an existing alternative already in use (Språkbanken's Korp Corpus Pipeline, henceforth KCP). Execution speed has been compared, and OPT's accuracy has been coarsely classified as either comparable to or divergent from that of KCP. The same collection of documents containing natural text was fed through OPT and KCP in sequence, and execution time was recorded. The tagged output of OPT and KCP was then run through SCREAM (Sjöholm, 2012), and if SCREAM produced comparable results for the two, the accuracy of OPT was considered comparable to that of KCP. The results show that OPT completes its tagging and parsing of the documents in around 35 minutes, while KCP took over four hours to complete. SCREAM performed almost exactly the same using the outputs of either program, except for one case in which OPT's output gave better results than KCP's. The accuracy of OPT was thus considered comparable to that of KCP. The one divergent example cannot be fully understood or explained in this thesis, given that the thesis treats SCREAM's internals mostly as a black box. Keywords: Natural Language Preprocessing; Part-of-Speech-Tagging; Dependency Parsing; Readability. Human Computer Interaction