1 |
Tagging a Morphologically Complex Language Using an Averaged Perceptron Tagger: The Case of Icelandic. Östling, Robert, January 2013.
In this paper, we experiment with using Stagger, an open-source implementation of an Averaged Perceptron tagger, to tag Icelandic, a morphologically complex language. By adding language-specific linguistic features and using IceMorphy, an unknown word guesser, we obtain state-of-the-art tagging accuracy of 92.82%. Furthermore, by adding data from a morphological database, and word embeddings induced from an unannotated corpus, the accuracy increases to 93.84%. This is equivalent to an error reduction of 5.5%, compared to the previously best tagger for Icelandic, consisting of linguistic rules and a Hidden Markov Model.
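As an illustration of the approach, the averaged-perceptron update can be sketched in a few lines. This is a toy reconstruction, not Stagger's actual code: the feature set, the miniature training corpus, and the greedy left-to-right decoding with gold previous tags are all simplifying assumptions.

```python
from collections import defaultdict

def features(words, i, prev_tag):
    """Context features: word form, 3-char suffix, previous tag, next word."""
    w = words[i]
    nxt = words[i + 1] if i + 1 < len(words) else "</s>"
    return [f"w={w}", f"suf3={w[-3:]}", f"prev={prev_tag}", f"next={nxt}"]

class AveragedPerceptron:
    def __init__(self, tags):
        self.tags = list(tags)
        self.w = defaultdict(float)   # current weights
        self.u = defaultdict(float)   # update-time accumulator for averaging
        self.t = 1                    # instance counter

    def predict(self, feats, weights=None):
        weights = self.w if weights is None else weights
        return max(self.tags, key=lambda tag: sum(weights[(f, tag)] for f in feats))

    def update(self, feats, gold):
        guess = self.predict(feats)
        if guess != gold:
            for f in feats:
                self.w[(f, gold)] += 1.0; self.u[(f, gold)] += self.t
                self.w[(f, guess)] -= 1.0; self.u[(f, guess)] -= self.t
        self.t += 1

    def averaged(self):
        # w_avg ~= w - u/t: updates made early (in effect longest) count most.
        return defaultdict(float, {k: v - self.u[k] / self.t for k, v in self.w.items()})

def tag(model, words, weights=None):
    """Greedy left-to-right tagging using predicted previous tags."""
    out, prev = [], "<s>"
    for i in range(len(words)):
        t = model.predict(features(words, i, prev), weights)
        out.append(t); prev = t
    return out

corpus = [[("the", "DT"), ("dog", "NN"), ("runs", "VB")],
          [("a", "DT"), ("cat", "NN"), ("sleeps", "VB")]]
model = AveragedPerceptron(sorted({t for sent in corpus for _, t in sent}))
for _ in range(5):
    for sent in corpus:
        words, prev = [w for w, _ in sent], "<s>"
        for i, (_, gold) in enumerate(sent):
            model.update(features(words, i, prev), gold)
            prev = gold
```

At test time the averaged weights from `model.averaged()` are used instead of the raw weights; averaging is what makes the perceptron stable on held-out data.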
|
2 |
Modelagem de contextos para aprendizado automático aplicado à análise morfossintática / Modeling contexts for automatic learning applied to morphosyntactic analysis. Kepler, Fábio Natanael, 28 May 2010.
A etiquetagem morfossintática envolve atribuir às palavras de uma sentença suas classes morfossintáticas de acordo com os contextos em que elas aparecem. Cadeias de Markov de Tamanho Variável (VLMCs, do inglês "Variable-Length Markov Chains") oferecem uma forma de modelar contextos maiores que trigramas sem sofrer demais com a esparsidade de dados e a complexidade do espaço de estados. Mesmo assim, duas palavras do português apresentam um alto grau de ambiguidade: 'que' e 'a'. O número de erros na etiquetagem dessas palavras corresponde a um quarto do total de erros cometidos por um etiquetador baseado em VLMCs. Além disso, essas palavras parecem apresentar dois diferentes tipos de ambiguidade: um dependendo de contexto não local e outro de contexto direito. Exploramos maneiras de expandir o modelo baseado em VLMCs através do uso de diferentes modelos e métodos, a fim de atacar esses problemas. As abordagens mostraram variado grau de sucesso, com um método em particular (aprendizado guiado) se mostrando capaz de resolver boa parte da ambiguidade de 'a'. Discutimos razões para isso acontecer. Com relação a 'que', ao longo desta tese propusemos e testamos diversos métodos de aprendizado de informação contextual para tentar desambiguá-lo. Mostramos como, em todos eles, o nível de ambiguidade de 'que' permanece praticamente constante. / Part-of-speech tagging involves assigning to words in a sentence their part-of-speech class based on the contexts they appear in. Variable-Length Markov Chains (VLMCs) offer a way of modeling contexts longer than trigrams without suffering too much from data sparsity and state space complexity. Even so, two words in Portuguese show a high degree of ambiguity: 'que' and 'a'. The number of errors tagging these words corresponds to a quarter of the total errors made by a VLMC-based tagger. Moreover, these words seem to show two different types of ambiguity: one depending on non-local context and one on right context.
We explored ways of expanding the VLMC-based model with a number of different models and methods in order to tackle these issues. The approaches showed varying degrees of success, with one particular method (Guided Learning) solving much of the ambiguity of 'a'. We discuss reasons why this happened. Regarding 'que', throughout this thesis we propose and test various methods for learning contextual information in order to try to disambiguate it. We show how, in all of them, the level of ambiguity shown by 'que' remains practically constant.
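The variable-length context idea behind VLMCs can be sketched as a back-off over tag histories: use the longest observed history with enough evidence, otherwise fall back to shorter ones. The corpus, maximum order, and count threshold below are illustrative assumptions, not the tagger's actual settings.

```python
from collections import defaultdict

class VLMC:
    def __init__(self, max_order=4, min_count=2):
        self.max_order = max_order
        self.min_count = min_count
        self.ctx = defaultdict(lambda: defaultdict(int))  # history -> {next tag: count}

    def train(self, tag_sequences):
        """Count each tag under every history suffix up to max_order."""
        for tags in tag_sequences:
            for i, t in enumerate(tags):
                for k in range(self.max_order + 1):
                    if i - k < 0:
                        break
                    self.ctx[tuple(tags[i - k:i])][t] += 1

    def predict(self, history):
        """Back off from the longest suffix of `history` to shorter ones,
        answering with the first context that has enough evidence."""
        for k in range(min(self.max_order, len(history)), -1, -1):
            dist = self.ctx.get(tuple(history[len(history) - k:]))
            if dist and sum(dist.values()) >= self.min_count:
                return max(dist, key=dist.get)
        return None

m = VLMC()
m.train([["PRON", "VB", "DT", "NN"]] * 2 + [["DT", "NN", "VB"]] * 2)
```

A real VLMC prunes the context tree statistically instead of using a raw count threshold, but the back-off structure is the same.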
|
3 |
Weakly supervised part-of-speech tagging for Chinese using label propagation. Ding, Weiwei, 02 February 2012.
Part-of-speech (POS) tagging is one of the most fundamental and crucial tasks in Natural Language Processing. Chinese POS tagging is challenging because it also involves word segmentation. In this report, research focuses on how to improve unsupervised POS tagging using Hidden Markov Models and the Expectation Maximization parameter estimation approach (EM-HMM). The traditional EM-HMM system uses a dictionary to constrain possible tag sequences and initialize the model parameters. This is a very crude initialization: the emission parameters are set uniformly in accordance with the tag dictionary. To improve this, word alignments can be used. Word alignments are the word-level translation correspondence pairs generated from parallel text between two languages. In this report, Chinese-English word alignment is used. The performance is expected to be better, as these two tasks are complementary to each other: the dictionary provides information on word types, while word alignment provides information on word tokens. However, word alignment is found to be of limited benefit.
In this report, another method is proposed. To improve the dictionary coverage and get a better POS distribution, Modified Adsorption, a label propagation algorithm, is used. We construct a graph connecting word tokens to feature types (such as word unigrams and bigrams) and connecting those tokens to information from knowledge sources, such as a small tag dictionary, Wiktionary, and word alignments. The core idea is to use a small amount of supervision, in the form of a tag dictionary, to acquire POS distributions for each word (both known and unknown) and provide these as an improved initialization for EM-HMM learning. We find this strategy works very well, especially when we have a small tag dictionary. Label propagation provides a better initialization for the EM-HMM method because it greatly increases the coverage of the dictionary. In addition, label propagation is flexible enough to incorporate many kinds of knowledge. However, results also show that some resources, such as the word alignments, are not easily exploited with label propagation.
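The propagation step can be sketched with a much simpler algorithm than Modified Adsorption: plain iterative label propagation on a token-feature graph, with dictionary-seeded nodes clamped. The toy graph and seed below are illustrative assumptions.

```python
from collections import defaultdict

def propagate(graph, seeds, iters=10):
    """Plain iterative label propagation (a simplification of Modified
    Adsorption): non-seed nodes average their neighbours' distributions,
    while seed nodes stay clamped to their dictionary labels."""
    labels = {n: dict(d) for n, d in seeds.items()}
    for _ in range(iters):
        new = {}
        for node, nbrs in graph.items():
            if node in seeds:
                new[node] = dict(seeds[node])
                continue
            agg = defaultdict(float)
            for nb in nbrs:
                for tag, p in labels.get(nb, {}).items():
                    agg[tag] += p / len(nbrs)
            new[node] = dict(agg)
        labels = new
    return labels

# Toy graph: word tokens connected to a shared suffix feature; "dog" is
# seeded from a small tag dictionary, "fog" is unknown.
graph = {"dog": ["suf:og"], "fog": ["suf:og"], "suf:og": ["dog", "fog"]}
dist = propagate(graph, seeds={"dog": {"NN": 1.0}})
```

After a few iterations the unknown word inherits most of the seed's POS mass through the shared feature node, which is exactly how propagation extends dictionary coverage.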
|
4 |
Outomatiese Afrikaanse woordsoortetikettering [Automatic Afrikaans part-of-speech tagging] / deur Suléne Pilon. Pilon, Suléne, January 2005.
Any community that wants to be part of technological progress has to ensure that the language(s) of that community has/have the necessary human language technology resources. Part of these resources are so-called "core technologies", including part-of-speech taggers. The first part-of-speech tagger for Afrikaans is developed in this research project.
It is indicated that three resources (a tag set, a tagging algorithm and annotated training data) are necessary for the development of such a part-of-speech tagger. Since none of these resources exist for Afrikaans, three objectives are formulated for this project, i.e. (a) to develop a linguistically accurate tag set for Afrikaans; (b) to determine which algorithm is the most effective one to use; and (c) to find an effective method for generating annotated Afrikaans training data.
To reach the first objective, a unique and language-specific tag set was developed for Afrikaans. The resulting tag set is relatively big and consists of 139 tags. The level of specificity of the tag set can easily be adjusted to make the tag set smaller and less specific.
After the development of the tag set, research is done on different approaches to, and techniques that can be used in, the development of a part-of-speech tagger. The available algorithms are evaluated by means of prerequisites that were set and in doing so, the most effective algorithm for the purposes of this project, TnT, is identified.
Bootstrapping is then used to generate training data with the help of the TnT algorithm. This process results in 20,000 correctly annotated words, thereby producing the annotated training data needed for the development of a part-of-speech tagger. The tagger that is trained on these 20,000 words reaches an accuracy of 85.87% when evaluated. The tag set is then simplified to thirteen tags in order to determine the effect that the size of the tag set has on the accuracy of the tagger. The tagger is 93.69% accurate when using the reduced tag set.
The main conclusion of this study is that training data of 20,000 words is not enough for the Afrikaans TnT tagger to compete with other state-of-the-art taggers. The tagger and the data that is developed in this project can be used to generate even more training data in order to develop an optimally accurate Afrikaans TnT tagger. Different techniques might also lead to better results; therefore other algorithms should be tested. / Thesis (M.A.)--North-West University, Potchefstroom Campus, 2005.
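The decoding at the heart of a TnT-style tagger can be sketched with a bigram Viterbi search (TnT itself uses smoothed trigrams and suffix-based unknown-word handling, which are omitted here); the toy probabilities below are illustrative assumptions.

```python
import math

def viterbi(words, tags, trans, emit, floor=1e-12):
    """Bigram Viterbi decoding. Each cell stores (log score, backpointer)."""
    lp = lambda p: math.log(p if p > 0 else floor)
    V = [{t: (lp(trans["<s>"].get(t, 0)) + lp(emit[t].get(words[0], 0)), None)
          for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            prev = max(tags, key=lambda p: V[-1][p][0] + lp(trans[p].get(t, 0)))
            row[t] = (V[-1][prev][0] + lp(trans[prev].get(t, 0))
                      + lp(emit[t].get(w, 0)), prev)
        V.append(row)
    # Backtrace from the best final state.
    best = max(tags, key=lambda t: V[-1][t][0])
    path = [best]
    for row in reversed(V[1:]):
        path.append(row[path[-1]][1])
    return list(reversed(path))

tags = ["DT", "NN", "VB"]
trans = {"<s>": {"DT": 0.8, "NN": 0.1, "VB": 0.1},
         "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
         "NN": {"DT": 0.1, "NN": 0.1, "VB": 0.8},
         "VB": {"DT": 0.5, "NN": 0.3, "VB": 0.2}}
emit = {"DT": {"the": 0.9}, "NN": {"dog": 0.5}, "VB": {"barks": 0.5}}
```

In a real TnT setup the transition and emission tables are estimated from the annotated training data, which is why the amount of such data directly bounds the tagger's accuracy.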
|
5 |
Generalized Probabilistic Topic and Syntax Models for Natural Language Processing. Darling, William Michael, 14 September 2012.
This thesis proposes a generalized probabilistic approach to modelling document collections along the combined axes of both semantics and syntax. Probabilistic topic (or semantic) models view documents as random mixtures of unobserved latent topics which are themselves represented as probability distributions over words. They have grown immensely in popularity since the introduction of the original topic model, Latent Dirichlet Allocation (LDA), in 2003, and have seen successes in computational linguistics, bioinformatics, political science, and many other fields. Furthermore, the modular nature of topic models allows them to be extended and adapted to specific tasks with relative ease. Despite the recorded successes, however, there remains a gap in combining axes of information from different sources and in developing models that are as useful as possible for specific applications, particularly in Natural Language Processing (NLP). The main contributions of this thesis are two-fold. First, we present generalized probabilistic models (both parametric and nonparametric) that are semantically and syntactically coherent and contain many simpler probabilistic models as special cases. Our models are consistent along both axes of word information in that an LDA-like component sorts words that are semantically related into distinct topics and a Hidden Markov Model (HMM)-like component determines the syntactic parts of speech of words, so that we can group words that are both semantically and syntactically affiliated in an unsupervised manner, leading to such groups as verbs about health care and nouns about sports. Second, we apply our generalized probabilistic models to two NLP tasks. Specifically, we present new approaches to automatic text summarization and unsupervised part-of-speech (POS) tagging using our models and report results commensurate with the state of the art in these two sub-fields.
Our successes demonstrate the general applicability of our modelling techniques to important areas in computational linguistics and NLP.
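The combined topic-and-syntax idea can be sketched as a generative process in the spirit of the HMM-LDA family: an HMM over word classes decides at each position whether the next word comes from a syntactic class or from the document's topic mixture. All distributions below are toy values, not the models or parameters of this thesis.

```python
import random

def generate(n_words, doc_topics, topic_words, hmm_trans, class_words, rng):
    """Generate one document. HMM state 0 is the 'semantic' state that defers
    to the document's topic mixture; other states emit function-word classes."""
    words, state = [], 0
    for _ in range(n_words):
        state = rng.choices(range(len(hmm_trans)), weights=hmm_trans[state])[0]
        if state == 0:   # semantic state: draw a topic, then a word from it
            topic = rng.choices(range(len(doc_topics)), weights=doc_topics)[0]
            words.append(rng.choice(topic_words[topic]))
        else:            # syntactic state: draw from that class's word list
            words.append(rng.choice(class_words[state]))
    return words

topic_words = [["health", "care", "doctor"], ["game", "team", "score"]]
class_words = {1: ["the", "a"], 2: ["is", "has"]}
hmm_trans = [[0.3, 0.4, 0.3],   # from the semantic state
             [0.6, 0.1, 0.3],   # from the determiner-like class
             [0.5, 0.4, 0.1]]   # from the verb-like class
doc = generate(20, [0.7, 0.3], topic_words, hmm_trans, class_words, random.Random(1))
```

Inference in the actual models runs this process in reverse, assigning each token to a topic or a syntactic class; the sketch only shows why such a model can separate "verbs about health care" from "nouns about sports".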
|
6 |
Morphosyntactic Corpora and Tools for Persian. Seraji, Mojgan, January 2015.
This thesis presents open source resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, the resources consist of an improved part-of-speech tagged corpus and a dependency treebank, as well as tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian. In developing these resources and tools, two key requirements are observed: compatibility and reuse. The compatibility requirement encompasses two parts. First, the tools in the pipeline should be compatible with each other in such a way that the output of one tool satisfies the input requirements of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis that is found in them. The reuse requirement means that all the components in the pipeline are developed by reusing resources, standard methods, and open source state-of-the-art tools. This is necessary to make the project feasible. Given these requirements, the thesis investigates two main research questions. The first is how we can develop morphologically and syntactically annotated corpora and tools while satisfying the requirements of compatibility and reuse. The approach taken is to accept the tokenization variations in the corpora to achieve robustness. The tokenization variations in Persian texts are related to the orthographic variations of writing fixed expressions, as well as various types of affixes and clitics. Since these variations are inherent properties of Persian texts, it is important that the tools in the pipeline can handle them; therefore, they should not be trained on idealized data. The second question concerns how accurately we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora.
The experimental evaluation of the tools shows that the sentence segmenter and tokenizer achieve an F-score close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves a best labeled accuracy of over 82% (with unlabeled accuracy close to 87%).
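The compatibility requirement can be sketched as a pipeline whose stages consume exactly what the previous stage produces. The stage bodies below are deliberately trivial placeholders; only the interfaces mirror the thesis's normalization, segmentation, tokenization, and tagging chain.

```python
def normalize(text):            # raw text -> cleaned text
    return " ".join(text.split())

def segment(text):              # text -> list of sentences (toy: split on '.')
    return [s.strip() for s in text.split(".") if s.strip()]

def tokenize(sentence):         # sentence -> list of tokens
    return sentence.split()

def pos_tag(tokens):            # tokens -> (token, tag) pairs; placeholder tagger
    return [(t, "X") for t in tokens]

def pipeline(text):
    """Each stage's output type matches the next stage's input type."""
    return [pos_tag(tokenize(s)) for s in segment(normalize(text))]

analysis = pipeline("The cat  sat.  The dog ran.")
```

The point of the sketch is the typing discipline, not the toy implementations: when every stage agrees on its input and output shapes, any stage can be swapped for a real tool without touching the others.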
|
8 |
The effects of part-of-speech tagging on text-to-speech synthesis for resource-scarce languages / G.I. Schlünz. Schlünz, Georg Isaac, January 2010.
In the world of human language technology, resource-scarce languages (RSLs) suffer from the problem of little available electronic data and linguistic expertise. The Lwazi project in South Africa is a large-scale endeavour to collect and apply such resources for all eleven of the official South African languages. One of the deliverables of the project is more natural text-to-speech (TTS) voices. Naturalness is primarily determined by prosody, and it is shown that many aspects of prosodic modelling are, in turn, dependent on part-of-speech (POS) information. Solving the POS problem is, therefore, a prudent first step towards meeting the goal of natural TTS voices.
In a resource-scarce environment, obtaining and applying the POS information are not trivial. Firstly, an automatic tagger is required to tag the text to be synthesised with POS categories, but state-of-the-art POS taggers are data-driven and thus require large amounts of labelled training data. Secondly, the subsequent processes in TTS that apply the POS information towards prosodic modelling are resource-intensive themselves: some require non-trivial linguistic knowledge; others require labelled data as well.
The first problem asks which available POS tagging algorithm will be the most accurate on little training data. This research sets out to answer the question by reviewing the most popular supervised data-driven algorithms. Since the literature to date consists mostly of isolated papers discussing one algorithm each, the aim of the review is to consolidate the research into a single point of reference. A subsequent experimental investigation compares the tagging algorithms on small training data sets of English and Afrikaans, and it is shown that the hidden Markov model (HMM) tagger outperforms the rest when using both a comprehensive and a reduced POS tagset.
Regarding the second problem, the question arises whether it is possible to circumvent the traditional approaches to prosodic modelling by learning the latter directly from the speech data using POS information. In other words, does the addition of POS features to the HTS context labels improve the naturalness of a TTS voice? Towards answering this question, HTS voices are trained from English and Afrikaans prosodically rich speech. The voices are compared with and without POS features incorporated into the HTS context labels, both analytically and perceptually. For the analytical experiments, measures of prosody to quantify the comparisons are explored. It is then also noted whether the results of the perceptual experiments correlate with their analytical counterparts. It is found that, when a minimal feature set is used for the HTS context labels, the addition of POS tags does improve the naturalness of the voice. However, the same effect can be accomplished by including segmental counting and positional information instead of the POS tags. / Thesis (M.Sc. Engineering Sciences (Electrical and Electronic Engineering))--North-West University, Potchefstroom Campus, 2011.
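The experimental variable, adding a POS feature to each context label, can be sketched as follows. The label format here is a simplified triphone illustration, not the actual HTS full-context label scheme, which encodes many more features with its own punctuation conventions.

```python
def context_label(prev_ph, ph, next_ph, pos=None):
    """Triphone context label, optionally extended with a POS feature."""
    label = f"{prev_ph}-{ph}+{next_ph}"
    return label if pos is None else label + f"@POS={pos}"

def utterance_labels(phones, pos_tags=None):
    """Build one label per phone, padding the utterance with silence; the
    per-phone POS tags are an assumption of this sketch (real systems project
    the word-level POS tag down to each of the word's phones)."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [context_label(padded[i], padded[i + 1], padded[i + 2],
                          pos_tags[i] if pos_tags else None)
            for i in range(len(phones))]
```

Training the same voice from labels built with and without the `pos` field is the with/without comparison the experiments describe.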
|
10 |
Neural Networks for Part-of-Speech Tagging. Strandqvist, Wiktor, January 2016.
The aim of this thesis is to explore the viability of artificial neural networks using a purely contextual word representation as a solution for part-of-speech tagging. Furthermore, the effects of deeper networks and of increased contextual information are explored. This was achieved by creating an artificial neural network written in Python, with input vectors created by Word2Vec. The system was compared, with respect to accuracy and precision, to a baseline tagger using handcrafted features. The results show that artificial neural networks using a purely contextual word representation show promise, but ultimately fall roughly two percentage points short of the baseline. The suspected reason for this is a suboptimal representation for rare words. Deeper network architectures yield an insignificant improvement, indicating that the data sets used might be too small. Additional context information provided higher accuracy, but accuracy began to decline beyond a context size of one.
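The setup can be sketched as a softmax classifier over a window of word vectors. To keep the sketch dependency-free and deterministic, toy one-hot vectors stand in for the Word2Vec embeddings, and the corpus, window size, and hyperparameters are illustrative assumptions.

```python
import math

vocab = ["<pad>", "the", "dog", "cat", "runs", "sleeps"]
TAGS = ["DT", "NN", "VB"]

def one_hot(w):                 # toy stand-in for a Word2Vec embedding
    return [1.0 if v == w else 0.0 for v in vocab]

def window_vec(words, i, size=1):
    """Concatenate the vectors of a (2*size + 1)-word window around position i."""
    ctx = ["<pad>"] * size + list(words) + ["<pad>"] * size
    return [x for w in ctx[i:i + 2 * size + 1] for x in one_hot(w)]

train = [(["the", "dog", "runs"], ["DT", "NN", "VB"]),
         (["the", "cat", "sleeps"], ["DT", "NN", "VB"])]
data = [(window_vec(ws, i), TAGS.index(ts[i]))
        for ws, ts in train for i in range(len(ws))]

dim = len(data[0][0])
W = [[0.0] * len(TAGS) for _ in range(dim)]     # input-dim x n_tags weights

def scores(x):
    return [sum(x[j] * W[j][k] for j in range(dim)) for k in range(len(TAGS))]

for _ in range(100):                            # full-batch softmax regression
    grad = [[0.0] * len(TAGS) for _ in range(dim)]
    for x, gold in data:
        s = scores(x)
        m = max(s)
        e = [math.exp(v - m) for v in s]
        z = sum(e)
        for k in range(len(TAGS)):
            g = e[k] / z - (1.0 if k == gold else 0.0)   # dCE/dlogit_k
            for j in range(dim):
                if x[j]:
                    grad[j][k] += g * x[j]
    for j in range(dim):
        for k in range(len(TAGS)):
            W[j][k] -= 0.5 * grad[j][k] / len(data)

def predict(words, i):
    s = scores(window_vec(words, i))
    return TAGS[s.index(max(s))]
```

Growing the window (the `size` parameter) is the "increased contextual information" experiment; real Word2Vec vectors would replace `one_hot`, and a hidden layer would make the model the deeper network the thesis tests.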
|