1 |
Tagging a Morphologically Complex Language Using an Averaged Perceptron Tagger: The Case of Icelandic. Östling, Robert, January 2013.
In this paper, we experiment with using Stagger, an open-source implementation of an Averaged Perceptron tagger, to tag Icelandic, a morphologically complex language. By adding language-specific linguistic features and using IceMorphy, an unknown word guesser, we obtain state-of-the-art tagging accuracy of 92.82%. Furthermore, by adding data from a morphological database and word embeddings induced from an unannotated corpus, the accuracy increases to 93.84%. This is equivalent to an error reduction of 5.5% compared to the previously best tagger for Icelandic, which consists of linguistic rules and a Hidden Markov Model.
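The averaged perceptron at the core of a tagger like Stagger is a simple structured learner: on each training error it shifts weight from the predicted tag to the gold tag, and at the end it averages the weights over all update steps to reduce overfitting. Below is a minimal sketch of that idea in Python; the class and feature names are illustrative and not taken from Stagger.

```python
from collections import defaultdict

class AveragedPerceptronTagger:
    """Minimal greedy averaged perceptron POS tagger (illustrative sketch)."""

    def __init__(self):
        self.weights = defaultdict(float)    # (feature, tag) -> current weight
        self.totals = defaultdict(float)     # accumulated weight mass for averaging
        self.timestamps = defaultdict(int)   # step at which each weight last changed
        self.step = 0

    def features(self, words, i, prev_tag):
        # A real tagger would add many more features (prefixes, case, context words).
        w = words[i]
        return [f"word={w}", f"suffix3={w[-3:]}", f"prev_tag={prev_tag}"]

    def predict(self, feats, tagset):
        scores = {t: sum(self.weights[(f, t)] for f in feats) for t in tagset}
        return max(scores, key=scores.get)

    def _update(self, feat, tag, delta):
        key = (feat, tag)
        # Lazy averaging: credit the old weight for every step it was in force.
        self.totals[key] += (self.step - self.timestamps[key]) * self.weights[key]
        self.timestamps[key] = self.step
        self.weights[key] += delta

    def train(self, sentences, tagset, epochs=5):
        for _ in range(epochs):
            for words, gold_tags in sentences:
                prev = "<s>"
                for i, gold in enumerate(gold_tags):
                    self.step += 1
                    feats = self.features(words, i, prev)
                    guess = self.predict(feats, tagset)
                    if guess != gold:  # standard perceptron update, errors only
                        for f in feats:
                            self._update(f, gold, +1.0)
                            self._update(f, guess, -1.0)
                    prev = gold  # condition on gold history during training

    def finalize(self):
        # Replace each weight by its average over all training steps.
        for key, w in list(self.weights.items()):
            self.totals[key] += (self.step - self.timestamps[key]) * w
            self.timestamps[key] = self.step
            self.weights[key] = self.totals[key] / max(self.step, 1)
```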
|
2 |
Automatic Lexicon Generation for Unsupervised Part-of-Speech Tagging Using Only Unannotated Text. Pereira, Dennis V., 02 September 2004.
With the growing number of textual resources available, the ability to understand them becomes critical. An essential first step in understanding these sources is the ability to identify the parts of speech in each sentence. The goal of this research is to propose, improve, and implement an algorithm capable of finding terms (words in a corpus) that are used in similar ways: a term categorizer. Such a term categorizer can be used to find a particular part of speech, e.g. the nouns in a corpus, and generate a lexicon. The proposed work is not dependent on any external sources of information, such as dictionaries, and it shows a significant improvement (~30%) over an existing method of categorization. More importantly, the proposed algorithm can be applied as a component of an unsupervised part-of-speech tagger, making it truly unsupervised and requiring only unannotated text. The algorithm is discussed in detail, along with its background and its performance. Experimentation shows that the proposed algorithm performs within 3% of the baseline, the Penn Treebank lexicon. / Master of Science
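The core intuition behind such a term categorizer is distributional: words used in similar ways appear in similar contexts. A minimal illustration of that general idea, not Pereira's algorithm itself, is to represent each word by counts of its neighbouring words and rank candidates by cosine similarity:

```python
import math
from collections import Counter, defaultdict

def context_vectors(corpus, window=1):
    """Map each word to a count vector over the words appearing near it."""
    vectors = defaultdict(Counter)
    for sentence in corpus:
        for i, w in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][sentence[j]] += 1
    return vectors

def cosine(a, b):
    dot = sum(n * b[k] for k, n in a.items() if k in b)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(word, vectors, n=5):
    """Rank other words by similarity of their contexts to `word`'s context."""
    target = vectors[word]
    scored = [(cosine(target, v), w) for w, v in vectors.items() if w != word]
    return sorted(scored, reverse=True)[:n]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
print(most_similar("cat", context_vectors(corpus)))  # "dog" ranks highest
```

Seeding such a categorizer with a handful of known nouns and collecting their nearest neighbours is one way a lexicon could be grown without annotated data.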
|
3 |
Functional Shift in English. Ousley, Emma Gene, 02 1900.
The purpose of this study is to investigate the shifting of words from one part of speech to another, to determine whether this linguistic process existed in Old English and Middle English, and to note the prevalence of functional shift among present-day writers.
|
4 |
Assessing the impact of manual corrections in the Groningen Meaning Bank. Weck, Benno, January 2016.
The Groningen Meaning Bank (GMB) project develops a corpus with rich syntactic and semantic annotations. Annotations in the GMB are generated semi-automatically and stem from two sources: (i) initial annotations from a set of standard NLP tools, and (ii) corrections/refinements by human annotators. For example, on the part-of-speech level of annotation there are currently 18,000 of these corrections, so-called Bits of Wisdom (BOWs). To apply this information to boost the NLP processing, we experiment with using the BOWs to retrain the part-of-speech tagger, and find that it can be improved to correct up to 70% of identified errors within held-out data. Moreover, an improved tagger helps to raise the performance of the parser. Preferring sentences with a high rate of verified tags in retraining proved to be the most reliable approach. With a simulated active learning experiment using Query-by-Uncertainty (QBU) and Query-by-Committee (QBC), we show that selectively sampling sentences for retraining yields better results with less data than random selection. In an additional pilot study we found that a standard maximum-entropy part-of-speech tagger can be augmented so that it uses already known tags to enhance its tagging decisions on an entire sequence without first retraining a new model.
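Query-by-Uncertainty selects for retraining the sentences the current model is least confident about. Assuming a tagger that returns a per-token probability distribution over tags (a hypothetical interface, not the thesis code), a minimal selection step could look like this:

```python
import math

def sentence_uncertainty(token_distributions):
    """Average per-token entropy; higher means the tagger is less certain."""
    def entropy(dist):
        return -sum(p * math.log(p) for p in dist.values() if p > 0)
    return sum(entropy(d) for d in token_distributions) / len(token_distributions)

def query_by_uncertainty(pool, tagger, batch_size=10):
    """Pick the `batch_size` most uncertain sentences from the unlabelled pool.

    `tagger(sentence)` is assumed to return one {tag: probability} dict
    per token of the sentence.
    """
    scored = sorted(pool, key=lambda s: sentence_uncertainty(tagger(s)), reverse=True)
    return scored[:batch_size]
```

Query-by-Committee follows the same pattern but replaces the entropy score with the disagreement among several taggers trained on different samples of the data.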
|
5 |
Modelagem de contextos para aprendizado automático aplicado à análise morfossintática [Modeling contexts for automatic learning applied to morphosyntactic analysis]. Kepler, Fábio Natanael, 28 May 2010.
Part-of-speech tagging involves assigning to the words in a sentence their part-of-speech classes based on the contexts they appear in. Variable-Length Markov Chains (VLMCs) offer a way of modeling contexts longer than trigrams without suffering too much from data sparsity and state-space complexity. Even so, two words in Portuguese show a high degree of ambiguity: 'que' and 'a'. The errors made tagging these words account for a quarter of the total errors made by a VLMC-based tagger. Moreover, these words seem to show two different types of ambiguity: one depending on non-local context and one on right context. We explored ways of expanding the VLMC-based model with a number of different models and methods in order to tackle these issues. The approaches showed varying degrees of success, with one particular method (guided learning) resolving much of the ambiguity of 'a'; we discuss reasons why this happened. Regarding 'que', throughout this thesis we propose and test various methods for learning contextual information in order to try to disambiguate it, and we show how, in all of them, the level of ambiguity shown by 'que' remains practically constant.
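A VLMC conditions each tag on a history whose length varies with the data: long histories are kept only where they were seen often enough, with backoff to shorter ones elsewhere. The sketch below shows that backoff lookup over a table of suffix counts; the pruning criteria and smoothing used in real VLMC taggers are omitted, and all names are illustrative.

```python
from collections import defaultdict, Counter

class VLMCTagModel:
    """Tag distributions conditioned on variable-length tag histories (sketch)."""

    def __init__(self, max_order=4, min_count=5):
        self.max_order = max_order   # longest history considered
        self.min_count = min_count   # evidence needed to trust a context
        # context (tuple of preceding tags, most recent last) -> tag counts
        self.counts = defaultdict(Counter)

    def train(self, tag_sequences):
        for tags in tag_sequences:
            for i, tag in enumerate(tags):
                # Record the tag under every history length up to max_order.
                for k in range(min(i, self.max_order) + 1):
                    context = tuple(tags[i - k:i])
                    self.counts[context][tag] += 1

    def distribution(self, history):
        """Back off from the longest history to the first well-attested one."""
        history = tuple(history[-self.max_order:])
        for start in range(len(history) + 1):
            context = history[start:]  # drop the oldest tag at each step
            c = self.counts.get(context)
            if c is not None and sum(c.values()) >= self.min_count:
                total = sum(c.values())
                return {t: n / total for t, n in c.items()}
        return {}  # only reachable with (almost) no training data
```

With enough data the empty context always passes the threshold, so the lookup degrades gracefully from long contexts down to the unigram tag distribution.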
|
6 |
Avaliando um rotulador estatístico de categorias morfo-sintáticas para a língua portuguesa [Evaluating a stochastic part-of-speech tagger for the Portuguese language]. Villavicencio, Aline, January 1995.
Natural Language Processing (NLP) is an area of computer science that has long sought to improve communication between human beings and computers. A number of techniques have been used to this end, among them stochastic methods, which NLP researchers have applied with growing success; one of their most remarkable advantages is the ability to deal with unrestricted text. In particular, the use of stochastic methods for part-of-speech tagging has achieved extremely good results. This work describes the process of part-of-speech tagging. First, we present and compare the main tagging methods: rule-based and stochastic. We describe the main stochastic formalisms and techniques for part-of-speech tagging, and we introduce part-of-speech tagging for the Portuguese language, which had not been done before. The main purpose of this work is to study and evaluate a part-of-speech tagger in order to establish the configuration in which it achieves the highest possible accuracy. For this evaluation, several parameters are examined: the quality of the training corpus, its size, and the influence of unknown words. The results obtained will be used to improve the tagger, so as to make the best possible use of the available Portuguese language resources.
|
7 |
Grammatical Functions and Possibilistic Reasoning for the Extraction and Representation of Semantic Knowledge in Text Documents. Khoury, Richard, January 2007.
This study seeks to explore and develop innovative methods for the extraction of semantic knowledge from unlabelled written English documents and the representation of this knowledge using a formal mathematical expression to facilitate its use in practical applications.
The first method developed in this research focuses on semantic information extraction. To perform this task, the study introduces a natural language processing (NLP) method designed to extract information-rich keywords from English sentences. The method involves initially learning a set of rules that guide the extraction of keywords from parts of sentences. Once this learning stage is completed, the method can be used to extract the keywords from complete sentences by pairing these sentences to the most similar sequence of rules. The key innovation in this method is the use of a part-of-speech hierarchy. By raising words to increasingly general grammatical categories in this hierarchy, the system can compare rules, compute the degree of similarity between them, and learn new rules.
The second method developed in this study addresses the problem of knowledge representation. This method processes triplets of keywords through several successive steps to represent information contained in the triplets using possibility distributions. These distributions represent the possibility of a topic given a particular triplet of keywords. Using this methodology, the information contained in the natural language triplets can be quantified and represented in a mathematical format, which can be easily used in a number of applications, such as document classifiers.
In further extensions to the research, a theoretical justification and mathematical development for both methods are provided, and examples are given to illustrate these notions. Sample applications are also developed based on these methods, and the experimental results generated through these implementations are expounded and thoroughly analyzed to confirm that the methods are reliable in practice.
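A possibility distribution differs from a probability distribution in that it is normalized by the maximum rather than the sum, so at least one alternative is fully possible (value 1.0) and the values need not add up to one. One simple way to obtain such a distribution over topics for a keyword triplet is sketched below from co-occurrence counts; it illustrates the general notion and is not Khoury's exact construction.

```python
from collections import Counter

def possibility_distribution(topic_counts):
    """Turn triplet/topic co-occurrence counts into a possibility distribution.

    `topic_counts[topic]` is assumed to hold how many training documents
    of that topic contain the keyword triplet. Dividing by the maximum
    (instead of the sum) yields possibility degrees in [0, 1].
    """
    peak = max(topic_counts.values(), default=0)
    if peak == 0:
        # Total ignorance: every topic remains fully possible.
        return {topic: 1.0 for topic in topic_counts}
    return {topic: count / peak for topic, count in topic_counts.items()}

# Counts for the hypothetical triplet ("team", "win", "match"):
counts = Counter({"sports": 8, "politics": 2, "science": 0})
print(possibility_distribution(counts))
# {'sports': 1.0, 'politics': 0.25, 'science': 0.0}
```

A document classifier can then combine the distributions of all triplets found in a document, for instance by a minimum or a weighted aggregation over them.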
|
8 |
Weakly supervised part-of-speech tagging for Chinese using label propagation. Ding, Weiwei, 02 February 2012.
Part-of-speech (POS) tagging is one of the most fundamental and crucial tasks in Natural Language Processing. Chinese POS tagging is challenging because it also involves word segmentation. In this report, research is focused on how to improve unsupervised part-of-speech tagging using Hidden Markov Models and the Expectation Maximization parameter estimation approach (EM-HMM). The traditional EM-HMM system uses a dictionary to constrain possible tag sequences and initialize the model parameters. This is a very crude initialization: the emission parameters are set uniformly in accordance with the tag dictionary. To improve on this, word alignments can be used. Word alignments are the word-level translation correspondences generated from parallel text between two languages; in this report, Chinese-English word alignments are used. The performance is expected to be better, as the two resources are complementary: the dictionary provides information on word types, while word alignments provide information on word tokens. However, this is found to be of limited benefit.
In this report, another method is proposed. To improve the dictionary coverage and obtain better POS distributions, Modified Adsorption, a label propagation algorithm, is used. We construct a graph connecting word tokens to feature types (such as word unigrams and bigrams) and connecting those tokens to information from knowledge sources, such as a small tag dictionary, Wiktionary, and word alignments. The core idea is to use a small amount of supervision, in the form of a tag dictionary, to acquire POS distributions for each word (both known and unknown) and to provide these as an improved initialization for EM learning of the HMM. We find this strategy works very well, especially when the tag dictionary is small. Label propagation provides a better initialization for the EM-HMM method because it greatly increases the coverage of the dictionary. In addition, label propagation is flexible enough to incorporate many kinds of knowledge. However, results also show that some resources, such as the word alignments, are not easily exploited with label propagation.
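To make the contrast concrete, the "crude initialization" gives every word a uniform emission probability over exactly the tags its dictionary entry allows; label propagation then replaces these uniform rows with propagated POS distributions. A minimal sketch of the dictionary-constrained step, with hypothetical names rather than the report's code:

```python
def init_emissions(vocabulary, tag_dict, all_tags):
    """Uniform, dictionary-constrained emission initialization for EM-HMM.

    Words listed in `tag_dict` spread their probability mass uniformly
    over their allowed tags; unknown words are left open to every tag.
    """
    emissions = {}
    for word in vocabulary:
        allowed = tag_dict.get(word, all_tags)
        p = 1.0 / len(allowed)
        emissions[word] = {t: (p if t in allowed else 0.0) for t in all_tags}
    return emissions

tag_dict = {"the": {"DT"}, "dog": {"NN"}, "runs": {"VBZ", "NNS"}}
all_tags = {"DT", "NN", "VBZ", "NNS"}
# "gou" is out-of-dictionary, so it stays uniformly open to all four tags.
print(init_emissions(["the", "runs", "gou"], tag_dict, all_tags))
```

Swapping the uniform rows for unknown words with distributions propagated over the token-feature graph is exactly where the improved coverage described above comes from.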
|
9 |
Outomatiese Afrikaanse woordsoortetikettering [Automatic Afrikaans part-of-speech tagging]. Pilon, Suléne, January 2005.
Any community that wants to be part of technological progress has to ensure that the language(s) of that community has/have the necessary human language technology resources. Part of these resources are so-called "core technologies", including part-of-speech taggers. The first part-of-speech tagger for Afrikaans is developed in this research project.
It is indicated that three resources (a tag set, a tagging algorithm and annotated training data) are necessary for the development of such a part-of-speech tagger. Since none of these resources exist for Afrikaans, three objectives are formulated for this project: (a) to develop a linguistically accurate tag set for Afrikaans; (b) to determine which algorithm is the most effective one to use; and (c) to find an effective method for generating annotated Afrikaans training data.
To reach the first objective, a unique, language-specific tag set was developed for Afrikaans. The resulting tag set is relatively large, consisting of 139 tags; its level of specificity can easily be adjusted to make it smaller and less specific.
After the development of the tag set, research is done on the different approaches to, and techniques that can be used in, the development of a part-of-speech tagger. The available algorithms are evaluated against a set of prerequisites, and in doing so the most effective algorithm for the purposes of this project, TnT, is identified.
Bootstrapping is then used to generate training data with the help of the TnT algorithm. This process results in 20,000 correctly annotated words, and thus the annotated training data needed for the development of a part-of-speech tagger. The tagger trained on these 20,000 words reaches an accuracy of 85.87% when evaluated. The tag set is then simplified to thirteen tags in order to determine the effect that tag set size has on the accuracy of the tagger; with the reduced tag set, the tagger is 93.69% accurate.
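TnT is a trigram HMM tagger, and the bootstrapping cycle described above alternates between tagging raw text with the current model and folding manually corrected output back into the training set. A schematic of that loop, where `train`, `tag`, and `correct` are placeholders for TnT training, automatic tagging, and human post-editing:

```python
def bootstrap(seed_corpus, raw_sentences, train, tag, correct,
              rounds=5, batch_size=1000):
    """Grow an annotated corpus by repeated tag-and-correct cycles (sketch).

    train(corpus) -> model            # e.g. estimate a trigram tagger
    tag(model, sentence) -> tags      # automatic annotation
    correct(sentence, tags) -> tags   # manual correction by an annotator
    """
    corpus = list(seed_corpus)
    for _ in range(rounds):
        if not raw_sentences:
            break
        model = train(corpus)
        batch, raw_sentences = raw_sentences[:batch_size], raw_sentences[batch_size:]
        for sentence in batch:
            corpus.append((sentence, correct(sentence, tag(model, sentence))))
    return corpus
```

Each round starts from a stronger model, so the manual correction effort per sentence shrinks as the corpus grows.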
The main conclusion of this study is that training data of 20,000 words is not enough for the Afrikaans TnT tagger to compete with other state-of-the-art taggers. The tagger and the data that are developed in this project can be used to generate even more training data in order to develop an optimally accurate Afrikaans TnT tagger. Different techniques might also lead to better results, and therefore other algorithms should be tested. / Thesis (M.A.)--North-West University, Potchefstroom Campus, 2005.
|