31 |
Tvorba závislostního korpusu pro jorubštinu s využitím paralelních dat
Oluokun, Adedayo (January 2018)
The goal of this thesis is to create a dependency treebank for Yorùbá, a language with very few pre-existing machine-readable resources. The treebank follows the Universal Dependencies (UD) annotation standard, and certain language-specific guidelines for Yorùbá were specified. Known techniques for porting resources from resource-rich languages were tested, in particular the projection of annotation across parallel bilingual data. Manual annotation is not the main focus of this thesis; nevertheless, a small portion of the data was verified manually in order to evaluate the annotation quality. In addition, a model was trained on the manual annotation using UDPipe.
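The projection of annotation across parallel data can be illustrated with a minimal sketch. It assumes a sentence pair whose source side is already POS-tagged and a set of word-alignment links between the two sides; the function name `project_tags`, the fallback tag and the toy data are hypothetical and are not taken from the thesis.

```python
from collections import Counter

def project_tags(source_tags, alignment, target_length, fallback="X"):
    """Project POS tags from source tokens onto target tokens.

    source_tags: one tag per source token.
    alignment: iterable of (source_index, target_index) links.
    target_length: number of target tokens.
    Target tokens without any alignment link receive the fallback tag.
    """
    votes = [Counter() for _ in range(target_length)]
    for src, tgt in alignment:
        votes[tgt][source_tags[src]] += 1
    projected = []
    for counter in votes:
        if counter:
            # Majority vote among all source tokens aligned to this target token.
            projected.append(counter.most_common(1)[0][0])
        else:
            projected.append(fallback)
    return projected

# Toy example (made-up data): three source tokens, four target tokens.
source_tags = ["PRON", "VERB", "NOUN"]
alignment = [(0, 0), (1, 1), (2, 3)]
projected = project_tags(source_tags, alignment, target_length=4)
```

Here the third target token has no alignment link, so it keeps the fallback tag; in practice such gaps are one reason projected annotation needs manual verification.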
|
32 |
Bidirectional LSTM-CNNs-CRF Models for POS Tagging
Tang, Hao (January 2018)
To achieve state-of-the-art performance in part-of-speech (POS) tagging, traditional systems require a significant amount of hand-crafted features and data pre-processing. In this thesis, we present a discriminative hybrid neural network architecture that combines word embeddings, character embeddings and byte pair encoding (BPE) to implement a true end-to-end system without feature engineering or data pre-processing. The architecture is a combination of bidirectional LSTM, CNNs, and CRF, which can achieve state-of-the-art performance on a wide range of sequence labeling tasks. We evaluate our model on the Universal Dependencies (UD) datasets for English, Spanish, and German POS tagging. It outperforms other models, with 95.1%, 98.15%, and 93.43% accuracy on the respective test sets. Moreover, the largest improvements of our model appear on the out-of-vocabulary corpora for Spanish and German. According to statistical significance testing, the improvements for English on the test and out-of-vocabulary corpora are not statistically significant. However, the improvements for the other two, morphologically richer languages are statistically significant on their corresponding corpora.
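The role of the CRF layer in such an architecture can be sketched as follows: it chooses the tag sequence that maximizes the sum of per-token scores (which in the thesis would come from the BiLSTM and CNN layers) and tag-to-tag transition scores. This is a minimal Viterbi decoder in plain Python; the function name and all scores are made up for illustration and do not come from the thesis.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_path, best_score) for a linear-chain CRF.

    emissions[t][k]: score of tag k at position t.
    transitions[j][k]: score of moving from tag j to tag k.
    """
    num_tags = len(transitions)
    # best[k] = best score of any path ending in tag k at the current position
    best = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for k in range(num_tags):
            prev = max(range(num_tags), key=lambda j: best[j] + transitions[j][k])
            new_best.append(best[prev] + transitions[prev][k] + emissions[t][k])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Start from the best final tag and follow the back-pointers.
    last = max(range(num_tags), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, max(best)

# Toy scores for three tokens and three tags (0=DET, 1=NOUN, 2=VERB).
emissions = [[2.0, 0.5, 0.1], [0.3, 1.0, 0.9], [0.1, 0.8, 1.0]]
transitions = [[-1.0, 1.0, -0.5], [-0.5, 0.0, 1.0], [0.5, 0.2, -1.0]]
path, score = viterbi_decode(emissions, transitions)
```

With these toy scores the decoder prefers the sequence DET, NOUN, VERB, because the transition scores reward that ordering even where the per-token scores are close.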
|
33 |
[pt] CLASSES DE PALAVRAS - DA GRÉCIA ANTIGA AO GOOGLE: UM ESTUDO MOTIVADO PELA CONVERSÃO DE TAGSETS / [en] PART OF SPEECH - FROM ANCIENT GREECE TO GOOGLE: A STUDY MOTIVATED BY TAGSET CONVERSION
LUIZA FRIZZO TRUGO (10 November 2016)
The present dissertation, Part of speech - from Ancient Greece to Google: a study motivated by tagset conversion, is a linguistic study of grammatical word classes. The research is motivated by a specific task in Computational Linguistics: part-of-speech (POS) annotation. Specifically, the dissertation reports the challenges and linguistic choices arising from the task of aligning two tagsets: the one used in the annotation of the Mac-Morpho corpus (a Brazilian corpus of 1.1 million words) and the one proposed by a team at Google's research labs, which has been used in the context of the Universal Dependencies (UD) project. The work adopts the annotation of large corpora as its methodology and focuses mainly on the alignment of participial forms. As a result, in addition to the study and documentation of the linguistic choices, this research provides a setting that enables the study of the impact different tagsets have on Natural Language Processing (NLP) systems, and it makes available another NLP resource for Portuguese: the Mac-Morpho corpus annotated with the UD project's tagset and consistent with its annotation philosophy, thus enabling future studies of the impact of different tagsets on the automatic processing of a language.
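The alignment task described above can be sketched as a lookup table plus a flag for tags that have no single counterpart. The tag names follow the Mac-Morpho and UD inventories as commonly documented, but the mapping itself, the function names and the default chosen for participles (PCP) are assumptions made for illustration; they are not the conversion table from the dissertation.

```python
# Illustrative one-to-one part of a Mac-Morpho -> UD mapping (an assumption,
# not the table from the dissertation).
SIMPLE_MAP = {
    "N": "NOUN",
    "NPROP": "PROPN",
    "ADJ": "ADJ",
    "ADV": "ADV",
    "V": "VERB",
    "VAUX": "AUX",
    "ART": "DET",
    "PREP": "ADP",
    "KC": "CCONJ",
    "KS": "SCONJ",
    "NUM": "NUM",
    "IN": "INTJ",
    "PU": "PUNCT",
}

# Tags with no single UD counterpart; resolving them needs context.
AMBIGUOUS = {"PCP": ("VERB", "ADJ")}

def convert_tag(tag):
    """Return (ud_tag, needs_review) for one source tag."""
    if tag in SIMPLE_MAP:
        return SIMPLE_MAP[tag], False
    if tag in AMBIGUOUS:
        # Default to the first candidate but flag the token for review.
        return AMBIGUOUS[tag][0], True
    # Unknown source tag: fall back to UD's "other" category and flag it.
    return "X", True

def convert_sentence(tagged_tokens):
    """Convert a list of (word, source_tag) pairs to (word, ud_tag, needs_review)."""
    return [(word, *convert_tag(tag)) for word, tag in tagged_tokens]

converted = convert_sentence([("a", "ART"), ("porta", "N"), ("fechada", "PCP")])
```

Flagging the participle rather than silently mapping it reflects the point made in the abstract: participial forms are where the two tagsets diverge most, so a purely mechanical conversion is not enough there.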
|
34 |
[en] SUPPORT NOUNS: OPERATIONAL CRITERIA FOR CHARACTERIZATION / [pt] O SUBSTANTIVO-SUPORTE: CRITÉRIOS OPERACIONAIS DE CARACTERIZAÇÃO
CLAUDIA MARIA GARCIA MEDEIROS DE OLIVEIRA (06 March 2007)
The main goal of this work is to provide operational criteria for characterizing nouns in Noun-Adjective combinations in which the noun behaves analogously to so-called light verbs or support verbs, widely studied in recent years in both Linguistics and Natural Language Processing. The work brings together linguistic, lexicographic and computational studies in order to explore the potential of automatic corpus analysis and quantitative tools for establishing, more objectively, the concepts that underlie linguistic analysis. It unites corpus-based research with traditional lexicography in order to elicit the main properties of N-Adj combinations occurring with denominal adjectives. The lexicographic and contextual data reveal the existence of a set of nouns that occur in the studied constructions in a way similar to light verbs in verb + noun phrase (V-NP) combinations. An automatic method for recognizing support nouns in texts is developed, which will provide language specialists with an instrument capable of bringing solid evidence to supplement intuitive judgments in the task of justifying the delimitation of expressions that are apparently irregular.
|
35 |
A comparative analysis of word use in popular science and research articles in the natural sciences: A corpus linguistic investigation
Nilsson, Fredrik (January 2019)
Within the realm of the natural sciences there are different written genres for interested readers to explore. Popular science articles aim to explain advanced scientific research to a non-expert audience, while research articles target the science experts themselves. This study explores these genres in some detail in order to identify linguistic differences between them. Using two corpora of over 200 000 words each, a corpus linguistic analysis was carried out to examine the two genres both quantitatively and qualitatively. The methods of analysis included word frequency, keyword, concordance, cluster and collocation analyses. In addition, part-of-speech tagging was used as a complement to distinguish word class use between the two genres. The results show that popular science articles feature personal pronouns to a much greater extent than research articles, which contain more noun repetition and specific terminology overall. In addition, the keywords proved to be significant for the respective genres, both in and out of their original context as well as in word clusters, forming word constructions typical of each genre. Overall, the study showed that while both genres are closely related through their roots in natural science research, they accomplish the task of disseminating scientific information using different linguistic approaches.
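A keyword analysis of the kind listed above compares how often each word occurs in one corpus relative to another. The sketch below uses the log-likelihood keyness statistic, which is one common choice in corpus linguistics; the abstract does not say which statistic the study used, so this choice, the function names and the toy sentences are all assumptions.

```python
import math
from collections import Counter

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Log-likelihood keyness of one word across two corpora."""
    expected_a = total_a * (freq_a + freq_b) / (total_a + total_b)
    expected_b = total_b * (freq_a + freq_b) / (total_a + total_b)
    ll = 0.0
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2.0 * ll

def keywords(tokens_a, tokens_b, top_n=10):
    """Rank words that are overused in corpus A relative to corpus B."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scored = []
    for word, freq_a in counts_a.items():
        freq_b = counts_b.get(word, 0)
        # Keep only words that are relatively more frequent in A.
        if freq_a / total_a > freq_b / total_b:
            scored.append((log_likelihood(freq_a, total_a, freq_b, total_b), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]

# Toy corpora (made-up sentences standing in for the two genres).
popular = "we show you how the cell works and we explain why you care".split()
research = "the cell the protein the assay the cell results show the binding".split()
top_popular = keywords(popular, research, top_n=2)
```

On this toy data the two top keywords for the popular-science side are the personal pronouns, which mirrors the direction of the finding reported in the abstract but is of course not evidence for it.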
|
36 |
[en] SEMANTIC TYPOLOGIES OF ADVERBS: A COMPARATIVE STUDY / [pt] TIPOLOGIAS SEMÂNTICAS DE ADVÉRBIOS: UM ESTUDO COMPARATIVO
ZENAIDE DIAS TEIXEIRA (02 June 2008)
This work aims to describe, analyze and compare semantic typologies of adverbs proposed in two strands of linguistic studies: Traditional Grammar on the one hand and Functional Linguistics on the other. To that end, we analyzed and mapped the typologies found in a representative set of traditional Portuguese grammars as well as in an equally representative sample of functionally oriented studies by Brazilian linguists. Two summary typological schemes were proposed to represent the two approaches, identifying the main semantic categories established in each of them. These two classification instruments were then applied to the same corpus of authentic Portuguese sentences (an excerpt from the corpus made available by Linguateca/Frases PB). The classifications were analyzed and compared according to the following criteria: (a) coverage; (b) explicitness; and (c) suitability to their guiding purposes (didactic-normative or descriptive-theoretical). Using these criteria, it was possible to point out relative advantages and disadvantages of both types of classification and to expose some problems faced equally by traditional grammar and functional linguistics concerning the semantic behaviour of adverbs.
|
37 |
Hledání struktury vět přirozeného jazyka pomocí částečně řízených metod / Discovering the structure of natural language sentences by semi-supervised methods
Rosa, Rudolf (January 2018)
In this thesis, we focus on the problem of automatic syntactic analysis of a language for which there is no syntactically annotated training data. We explore several methods for cross-lingual transfer of syntactic as well as morphological annotation, ultimately based on the use of bilingual or multilingual sentence-aligned corpora and machine translation approaches. We pay particular attention to automatically estimating how appropriate a source language is for the analysis of a given target language, devising a novel measure based on the similarity of part-of-speech sequences frequent in the languages. The effectiveness of the presented methods has been confirmed by experiments conducted both by us and independently by other respectable researchers.
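A source-language appropriateness measure based on part-of-speech sequences could look roughly like the sketch below, which compares POS trigram distributions with a KL divergence. The abstract only says that the measure rests on the similarity of frequent POS sequences, so the trigram length, the divergence, the floor value for unseen trigrams and all names and data here are assumptions, not the definition used in the thesis.

```python
import math
from collections import Counter

def trigram_distribution(tagged_sentences):
    """Relative frequencies of POS trigrams over a list of tag sequences."""
    counts = Counter()
    for tags in tagged_sentences:
        padded = ["<s>"] + list(tags) + ["</s>"]
        for i in range(len(padded) - 2):
            counts[tuple(padded[i:i + 3])] += 1
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}

def kl_divergence(target, source, floor=1e-6):
    """KL(target || source); trigrams unseen in the source get a small floor."""
    return sum(p * math.log(p / source.get(gram, floor)) for gram, p in target.items())

# Toy tag sequences (made up): one target language and two candidate sources.
target = trigram_distribution([["DET", "NOUN", "VERB"], ["PRON", "VERB", "NOUN"]])
source_a = trigram_distribution(
    [["DET", "NOUN", "VERB"], ["PRON", "VERB", "NOUN"], ["DET", "NOUN", "VERB"]]
)
source_b = trigram_distribution([["NOUN", "VERB", "DET"], ["VERB", "NOUN", "PRON"]])

scores = {"a": kl_divergence(target, source_a), "b": kl_divergence(target, source_b)}
best_source = min(scores, key=scores.get)
```

A lower divergence means the candidate source language orders its word classes more like the target, so under this sketch the candidate with the smallest score would be chosen as the transfer source.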
|
38 |
New Chinese Words in 2014 – A Study of Word-formation Processes
Warell, Peter (January 2016)
With the development of society, and of the Internet in particular, many languages see a large number of new words emerge every year. Words are the most basic and most important component of any language, so analyzing the structural features and formation of these new words is worthwhile. This paper analyzes the new words that appeared in Chinese in 2014 and the ways in which they were formed, with the aim of better understanding the development and characteristics of the Chinese lexicon. Taking as its data the new words of 2014 published in 《2014汉语新词语》, the analysis found two main features: first, compounding, derivation and abbreviation were the main word-formation processes for the new words of 2014; second, 72 percent of the new words are polysyllabic (three or more syllables), and 80 percent are nouns. These features illustrate the current characteristics and development trends of the Chinese lexicon, which differ from those of traditional Chinese vocabulary. / The aim of this thesis was to investigate how new Chinese words are formed and to examine the linguistic patterns among them. The thesis focused on the analysis of Chinese words formed in 2014. The quantitative data for the analysis consisted of a collection of 423 new Chinese words from the book 2014 汉语新词语 (Hànyǔ xīn cíyǔ) by Hou and Zhou. Parts of speech and the number of syllables in the new words were investigated, although the focus was on word-formation processes. A discussion of derivation, blending, abbreviation, analogy, borrowing, change of meaning, compounding and inventions is also included. The share of new words formed by each word-formation process was presented statistically in order to reveal the significance of each process. The analysis showed that compounding, derivation and abbreviation were the major word-formation processes in 2014. The study also suggests that words formed by derivation and analogy were much more frequent in 2014 than in previous studies. Furthermore, the ways words are formed in Chinese are changing and evolving, as some word-formation processes are becoming more frequently used in the formation of new words.
|
39 |
Traitement automatique du dialecte tunisien à l'aide d'outils et de ressources de l'arabe standard : application à l'étiquetage morphosyntaxique / Natural Language Processing Of Tunisian Dialect using Standard Arabic Tools and Resources : application to Part-Of-Speech Tagging
Hamdi, Ahmed (04 December 2015)
The development of natural language processing tools for Arabic dialects is hampered by the lack of resources for them. As a consequence of diglossia, there is one variety of Arabic, Modern Standard Arabic, for which many resources have been developed and have made it possible to build natural language processing tools. Given the closeness of the Arabic dialects to Modern Standard Arabic, one approach is to perform a surface conversion of the dialect into Modern Standard Arabic so that the existing tools for standard Arabic can be used. In this work, we are particularly interested in processing the Tunisian dialect. We propose a system that converts Tunisian into an approximate form of standard Arabic on which tools designed for the latter give good results. To validate this approach, we used a part-of-speech tagger designed for standard Arabic, which assigns part-of-speech tags to the output of our conversion system. These tags are finally projected onto the Tunisian text. Our system reaches an accuracy of 89% after conversion, an absolute increase of ∼20% over tagging before conversion. / Developing natural language processing tools usually requires a large number of resources (lexica, annotated corpora, ...), which often do not exist for less-resourced languages. One way to overcome the lack of resources is to devote substantial effort to building new ones from scratch. Another approach is to exploit existing resources of closely related languages. Taking advantage of the closeness of standard Arabic and its dialects, one way to solve the problem of limited resources consists in converting Arabic dialects into standard Arabic in order to use the tools developed to handle the latter. In this work, we focus especially on processing the Tunisian Arabic dialect. We propose a system that converts Tunisian into a form close to standard Arabic, for which natural language processing tools designed for the latter provide good results. In order to validate our approach, we focused on part-of-speech tagging. Our system achieved an accuracy of 89%, which represents an absolute improvement of ∼20% over a standard Arabic tagger baseline.
|
40 |
Toward an on-line preprocessor for Swedish / Mot en on-line preprocessor för svenska
Wemmert, Oscar (January 2017)
This bachelor thesis presents OPT (Open Parse Tool), a Java program that allows independent parsers and taggers to be run in sequence. For this thesis, the existing Java versions of Stagger and MaltParser have been adapted for use as modules in the program, and OPT's performance has then been compared to an existing alternative already in use (Språkbanken's Korp Corpus Pipeline, henceforth KCP). Execution speed has been compared, and OPT's accuracy has been coarsely tested as either comparable to or divergent from that of KCP. The same collection of documents containing natural text was fed through OPT and KCP in turn, and the execution time was recorded. The tagged output of OPT and KCP was then run through SCREAM (Sjöholm, 2012), and if SCREAM produced comparable results for the two, the accuracy of OPT was considered comparable to that of KCP. The results show that OPT completes its tagging and parsing of the documents in around 35 minutes, while KCP took over four hours. SCREAM performed almost exactly the same using the output of either program, except for one case in which OPT's output gave better results than KCP's. The accuracy of OPT was thus considered comparable to that of KCP. The one divergent example cannot be fully understood or explained in this thesis, given that the thesis treats SCREAM's internals mostly as a black box.
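The idea of running independent taggers and parsers in sequence and timing them can be sketched as below. The sketch is written in Python for brevity although OPT itself is a Java program, and every name in it (`run_pipeline` and the three stand-in modules) is hypothetical; it does not reproduce OPT's actual interfaces or call Stagger or MaltParser.

```python
import time

def run_pipeline(modules, document):
    """Run (name, callable) modules in order, feeding each output to the next."""
    timings = {}
    data = document
    for name, module in modules:
        start = time.perf_counter()
        data = module(data)
        timings[name] = time.perf_counter() - start
    return data, timings

# Hypothetical stand-ins for a tokenizer, a tagger and a parser.
def tokenize(text):
    return [{"form": word} for word in text.split()]

def tag(tokens):
    # Assign a placeholder tag to every token.
    return [dict(token, pos="X") for token in tokens]

def parse(tokens):
    # Attach every token to an artificial root (index 0).
    return [dict(token, head=0) for token in tokens]

result, timings = run_pipeline(
    [("tokenize", tokenize), ("tag", tag), ("parse", parse)],
    "Det här är ett test",
)
```

Because each module only sees the previous module's output, a tagger or parser can be swapped without touching the others, and the per-module timings make it easy to see where the execution time goes.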
|