11 |
Publicistinio stiliaus diferenciacija / The typology of the publicistic style. Stukaitė, Alina, 23 June 2005 (has links)
In the system of functional styles, the publicistic style occupies an intermediate position between subject-specific styles and the belles-lettres style. Presently, it is possible to speak about a separate publicistic style, since its varieties have been fully formed. It has been observed that the issue of analysing publicistic-style varieties is very topical and results from the growing popularity of this style.
The typology of the publicistic style has not been widely investigated in works on Lithuanian language stylistics. This style is not divided into clear-cut varieties: more emphasis is placed on its genre differentiation. On the basis of linguistic and non-linguistic factors, the following varieties of the publicistic style (sub-styles) may be distinguished: informational, analytical and the sub-style of expressive information (artistic publicistics). So far, the place of artistic publicistics between the publicistic and other functional styles has not been defined; therefore, description of texts of this sub-style poses problems.
The publicistic style is defined by the linguistic characteristics of different levels of language: the length of the sentence, the frequency of different parts of speech in a sentence, etc. The syntactical and morphological structure of this style is also influenced by non-linguistic attributes: the addressant, the addressee, the content of texts, functions of the act of speech, etc.
The average length of sentences in publicistic-style... [to full text]
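Quantities like those just mentioned, average sentence length and the frequency of different parts of speech, are simple to compute once a text has been tagged. A minimal sketch, assuming the text is already available as (word, tag) pairs grouped into sentences (the tags and sentences below are invented for illustration):

```python
from collections import Counter

def style_profile(tagged_sentences):
    """Compute average sentence length (in words) and relative
    part-of-speech frequencies from pre-tagged sentences."""
    lengths = [len(sent) for sent in tagged_sentences]
    avg_len = sum(lengths) / len(lengths)
    tag_counts = Counter(tag for sent in tagged_sentences for _, tag in sent)
    total = sum(tag_counts.values())
    freqs = {tag: count / total for tag, count in tag_counts.items()}
    return avg_len, freqs

# Two toy tagged sentences (hypothetical tags).
sents = [
    [("The", "DET"), ("report", "NOUN"), ("appeared", "VERB")],
    [("Readers", "NOUN"), ("responded", "VERB"), ("quickly", "ADV"), ("online", "ADV")],
]
avg_len, freqs = style_profile(sents)
print(avg_len)        # 3.5
print(freqs["NOUN"])  # 2 of 7 tokens
```

Comparing such profiles across text collections is one simple way to operationalize the stylistic differences between sub-styles.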
|
12 |
Generalized Probabilistic Topic and Syntax Models for Natural Language Processing. Darling, William Michael, 14 September 2012 (has links)
This thesis proposes a generalized probabilistic approach to modelling document collections along the combined axes of semantics and syntax. Probabilistic topic (or semantic) models view documents as random mixtures of unobserved latent topics, which are themselves represented as probability distributions over words. They have grown immensely in popularity since the introduction of the original topic model, Latent Dirichlet Allocation (LDA), in 2003, and have seen successes in computational linguistics, bioinformatics, political science, and many other fields. Furthermore, the modular nature of topic models allows them to be extended and adapted to specific tasks with relative ease. Despite these recorded successes, however, there remains a gap in combining axes of information from different sources and in developing models that are as useful as possible for specific applications, particularly in Natural Language Processing (NLP).
The main contributions of this thesis are two-fold. First, we present generalized probabilistic models (both parametric and nonparametric) that are semantically and syntactically coherent and contain many simpler probabilistic models as special cases. Our models are consistent along both axes of word information: an LDA-like component sorts semantically related words into distinct topics, while a Hidden Markov Model (HMM)-like component determines the syntactic parts of speech of words, so that we can group words that are both semantically and syntactically affiliated in an unsupervised manner, yielding groups such as verbs about health care and nouns about sports. Second, we apply our generalized probabilistic models to two NLP tasks. Specifically, we present new approaches to automatic text summarization and unsupervised part-of-speech (POS) tagging using our models, and report results commensurate with the state of the art in these two sub-fields.
Our successes demonstrate the general applicability of our modelling techniques to important areas in computational linguistics and NLP.
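The intuition behind coupling an LDA-like component with an HMM-like component can be illustrated with a toy generative sketch: at each word position a syntactic class is drawn from a Markov chain, and one designated class emits content words from a document-specific topic while the other emits function words. All distributions and vocabularies below are invented for illustration and bear no relation to the models actually developed in the thesis.

```python
import random

random.seed(0)

# Hypothetical, hand-picked distributions for illustration only.
TRANSITIONS = {"FUNC": {"FUNC": 0.2, "CONTENT": 0.8},
               "CONTENT": {"FUNC": 0.7, "CONTENT": 0.3}}
FUNC_WORDS = ["the", "of", "is"]
TOPICS = {"sports": ["goal", "team", "score"],
          "health": ["clinic", "vaccine", "dose"]}

def sample(dist):
    """Draw one key from a {item: probability} distribution."""
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item

def generate(topic, length):
    """An HMM-like chain picks the syntactic class at each position;
    the CONTENT class emits words from the document's topic."""
    state, words = "FUNC", []
    for _ in range(length):
        state = sample(TRANSITIONS[state])
        if state == "CONTENT":
            words.append(random.choice(TOPICS[topic]))
        else:
            words.append(random.choice(FUNC_WORDS))
    return words

doc = generate("sports", 6)
print(doc)  # a mix of function words and sports words
```

Inference in the real models runs this story in reverse, recovering topic and class assignments from observed text.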
|
13 |
The Effect of Natural Language Processing in Bioinspired Design. Burns, Madison Suzann, 1987-, 14 March 2013 (has links)
Bioinspired design methods are a new and evolving collection of techniques used to extract biological principles from nature to solve engineering problems. The application of bioinspired design methods is typically confined to existing problems encountered in new product design or redesign. A primary goal of this research is to utilize existing bioinspired design methods to solve a complex engineering problem to examine the versatility of the method in solving new problems. Here, current bioinspired design methods are applied to seek a biologically inspired solution to geoengineering. Bioinspired solutions developed in the case study include droplet density shields, phosphorescent mineral injection, and reflective orbiting satellites. The success of the methods in the case study indicates that bioinspired design methods have the potential to solve new problems and provide a platform of innovation for old problems.
A secondary goal of this research is to help engineers use bioinspired design methods more efficiently by reducing post-processing time and eliminating the need for extensive knowledge of biological terminology through natural language processing techniques. Using the complex problem of geoengineering, a hypothesis is developed that asserts the usefulness of nouns in creating higher quality solutions. A distinction is drawn between two types of nouns in a sentence, primary and spatial, and the hypothesis is refined to state that primary nouns are the most influential part of speech in providing biological inspiration for high quality ideas. Through three design experiments, the author determines that engineers are more likely to develop a higher quality solution using the primary noun in a given passage of biological text.
The identification of primary nouns through part-of-speech tagging will provide engineers with an analogous biological system without extensive analysis of the results. The use of noun identification to improve the efficiency of bioinspired design method applications is a new concept and is the primary contribution of this research.
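The noun-identification step described above amounts to filtering a POS-tagged sentence for noun tags and selecting one as primary. The heuristic below (take the first noun, treating the rest as candidates) is my own simplification for illustration, not the author's actual procedure; the tags follow the Penn Treebank convention, where noun tags start with "NN".

```python
def find_nouns(tagged):
    """Return the nouns from a list of (word, tag) pairs."""
    return [w for w, t in tagged if t.startswith("NN")]

def primary_noun(tagged):
    """Simplistic heuristic: treat the first noun as the primary noun."""
    nouns = find_nouns(tagged)
    return nouns[0] if nouns else None

# A toy pre-tagged sentence about a biological system.
sentence = [("The", "DT"), ("lotus", "NN"), ("repels", "VBZ"),
            ("water", "NN"), ("on", "IN"), ("its", "PRP$"), ("surface", "NN")]
print(find_nouns(sentence))    # ['lotus', 'water', 'surface']
print(primary_noun(sentence))  # 'lotus'
```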
|
14 |
Morphosyntactic Corpora and Tools for Persian. Seraji, Mojgan, January 2015 (has links)
This thesis presents open source resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, the resources consist of an improved part-of-speech tagged corpus and a dependency treebank, as well as tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian. In developing these resources and tools, two key requirements are observed: compatibility and reuse. The compatibility requirement has two parts. First, the tools in the pipeline should be compatible with each other, in the sense that the output of one tool meets the input requirements of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis that is found in them. The reuse requirement means that all the components in the pipeline are developed by reusing resources, standard methods, and open source state-of-the-art tools; this is necessary to make the project feasible. Given these requirements, the thesis investigates two main research questions. The first is how we can develop morphologically and syntactically annotated corpora and tools while satisfying the requirements of compatibility and reuse. The approach taken is to accept the tokenization variations in the corpora in order to achieve robustness. Tokenization variations in Persian texts stem from orthographic variations in the writing of fixed expressions, as well as from various types of affixes and clitics. Since these variations are inherent properties of Persian texts, it is important that the tools in the pipeline can handle them; therefore, they should not be trained on idealized data. The second question concerns how accurately we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora.
The experimental evaluation of the tools shows that the sentence segmenter and tokenizer achieve an F-score close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves a best labeled accuracy of over 82% (with unlabeled accuracy close to 87%).
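The figures quoted above rest on standard definitions: token-level accuracy for tagging, and F-score (the harmonic mean of precision and recall) for segmentation and tokenization. A minimal sketch of both metrics on invented toy data:

```python
def accuracy(gold, predicted):
    """Fraction of positions where the predicted tag matches the gold tag."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

def f_score(gold_set, predicted_set):
    """F1 over sets of predicted items, e.g. sentence-boundary offsets."""
    tp = len(gold_set & predicted_set)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)

# Toy tagging: 4 of 5 tags correct.
print(accuracy(["N", "V", "N", "ADJ", "N"],
               ["N", "V", "N", "ADV", "N"]))  # 0.8

# Toy segmentation: boundaries as character offsets.
print(f_score({3, 7, 12}, {3, 7, 10}))  # precision = recall = 2/3, F1 = 2/3
```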
|
15 |
Outomatiese Afrikaanse woordsoortetikettering (Automatic Afrikaans part-of-speech tagging) / by Suléne Pilon. Pilon, Suléne, January 2005 (has links)
Any community that wants to be part of technological progress has to ensure that the language(s) of that community has/have the necessary human language technology resources. Part of these resources are so-called "core technologies", including part-of-speech taggers. The first part-of-speech tagger for Afrikaans is developed in this research project.
It is indicated that three resources (a tag set, a tagging algorithm and annotated training data) are necessary for the development of such a part-of-speech tagger. Since none of these resources exist for Afrikaans, three objectives are formulated for this project, i.e. (a) to develop a linguistically accurate tag set for Afrikaans; (b) to determine which algorithm is the most effective one to use; and (c) to find an effective method for generating annotated Afrikaans training data.
To reach the first objective, a unique and language-specific tag set was developed for Afrikaans. The resulting tag set is relatively big and consists of 139 tags. The level of specificity of the tag set can easily be adjusted to make the tag set smaller and less specific.
After the development of the tag set, research is done on different approaches to, and techniques that can be used in, the development of a part-of-speech tagger. The available algorithms are evaluated against a set of prerequisites, and in doing so the most effective algorithm for the purposes of this project, TnT, is identified.
Bootstrapping is then used to generate training data with the help of the TnT algorithm. This process results in 20,000 correctly annotated words, and thus the third resource necessary for the development of a part-of-speech tagger, annotated training data, is created. The tagger trained on these 20,000 words reaches an accuracy of 85.87% when evaluated. The tag set is then simplified to thirteen tags in order to determine the effect that the size of the tag set has on the accuracy of the tagger. The tagger is 93.69% accurate when using the reduced tag set.
The main conclusion of this study is that training data of 20,000 words is not enough for the Afrikaans TnT tagger to compete with other state-of-the-art taggers. The tagger and the data that is developed in this project can be used to generate even more training data in order to develop an optimally accurate Afrikaans TnT tagger. Different techniques might also lead to better results; therefore other algorithms should be tested. / Thesis (M.A.)--North-West University, Potchefstroom Campus, 2005.
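The effect of tag-set granularity on measured accuracy, seen above in the jump from 85.87% to 93.69%, can be reproduced in miniature: collapse a fine-grained tag set onto a coarse one and re-score the same predictions. The baseline tagger and the tag mapping below are invented for illustration (the thesis used the TnT tagger, not this most-frequent-tag baseline):

```python
from collections import Counter, defaultdict

def train_unigram(tagged_words):
    """Most-frequent-tag baseline: remember the commonest tag per word."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

# Tiny invented Afrikaans-flavoured training data with fine-grained tags.
train = [("die", "DET"), ("kat", "N-SG"), ("loop", "V-PRES"),
         ("die", "DET"), ("katte", "N-PL"), ("loop", "V-PRES")]
model = train_unigram(train)

# Test set: 'loop' is an infinitive here, so its fine-grained tag differs.
gold = [("die", "DET"), ("katte", "N-PL"), ("loop", "V-INF")]
pred = [model.get(w, "N-SG") for w, _ in gold]

fine_acc = sum(p == g for p, (_, g) in zip(pred, gold)) / len(gold)

# Collapse fine tags (N-SG, N-PL -> N; V-PRES, V-INF -> V) and re-score.
coarse = lambda t: t.split("-")[0]
coarse_acc = sum(coarse(p) == coarse(g)
                 for p, (_, g) in zip(pred, gold)) / len(gold)

print(fine_acc, coarse_acc)  # coarse accuracy >= fine accuracy
```

A smaller tag set erases distinctions the tagger gets wrong, so the reported accuracy rises even though the underlying predictions are unchanged.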
|
16 |
The effects of part-of-speech tagging on text-to-speech synthesis for resource-scarce languages / G.I. Schlünz. Schlünz, Georg Isaac, January 2010 (has links)
In the world of human language technology, resource-scarce languages (RSLs) suffer from the problem of little available electronic data and linguistic expertise. The Lwazi project in South Africa is a large-scale endeavour to collect and apply such resources for all eleven of the official South African languages. One of the deliverables of the project is more natural text-to-speech (TTS) voices. Naturalness is primarily determined by prosody, and it is shown that many aspects of prosodic modelling are, in turn, dependent on part-of-speech (POS) information. Solving the POS problem is, therefore, a prudent first step towards meeting the goal of natural TTS voices.
In a resource-scarce environment, obtaining and applying the POS information are not trivial. Firstly, an automatic tagger is required to tag the text to be synthesised with POS categories, but state-of-the-art POS taggers are data-driven and thus require large amounts of labelled training data. Secondly, the subsequent processes in TTS that apply the POS information towards prosodic modelling are resource-intensive themselves: some require non-trivial linguistic knowledge; others require labelled data as well.
The first problem raises the question of which available POS tagging algorithm will be the most accurate on little training data. This research sets out to answer the question by reviewing the most popular supervised data-driven algorithms. Since the literature to date consists mostly of isolated papers discussing one algorithm each, the aim of the review is to consolidate the research into a single point of reference. A subsequent experimental investigation compares the tagging algorithms on small training data sets of English and Afrikaans, and it is shown that the hidden Markov model (HMM) tagger outperforms the rest when using both a comprehensive and a reduced POS tagset.
Regarding the second problem, the question arises whether it is possible to circumvent the traditional approaches to prosodic modelling by learning the latter directly from the speech data using POS information. In other words, does the addition of POS features to the HTS context labels improve the naturalness of a TTS voice? Towards answering this question, HTS voices are trained from prosodically rich English and Afrikaans speech. The voices are compared with and without POS features incorporated into the HTS context labels, both analytically and perceptually. For the analytical experiments, measures of prosody to quantify the comparisons are explored. It is also noted whether the results of the perceptual experiments correlate with their analytical counterparts. It is found that, when a minimal feature set is used for the HTS context labels, the addition of POS tags does improve the naturalness of the voice. However, the same effect can be accomplished by including segmental counting and positional information instead of the POS tags. / Thesis (M.Sc. Engineering Sciences (Electrical and Electronic Engineering))--North-West University, Potchefstroom Campus, 2011.
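The HMM tagger at the centre of the first experiment chooses the tag sequence that maximises the product of transition and emission probabilities, typically via the Viterbi algorithm. A stripped-down first-order sketch with hand-set toy probabilities (not the actual models evaluated in the thesis, which use richer training and smoothing):

```python
def viterbi(words, tags, start, trans, emit):
    """Most probable tag sequence for `words` under a first-order HMM
    defined by start, transition, and emission probability tables."""
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (start[t] * emit[t].get(words[0], 0.0), [t]) for t in tags}
    for word in words[1:]:
        new = {}
        for t in tags:
            p, path = max(
                ((best[s][0] * trans[s][t] * emit[t].get(word, 0.0),
                  best[s][1] + [t]) for s in tags),
                key=lambda x: x[0])
            new[t] = (p, path)
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

# Toy two-tag model with invented probabilities.
tags = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit = {"N": {"dogs": 0.6, "bark": 0.1},
        "V": {"dogs": 0.1, "bark": 0.7}}

print(viterbi(["dogs", "bark"], tags, start, trans, emit))  # ['N', 'V']
```

With little training data, the quality of the estimated `trans` and `emit` tables is exactly what suffers, which is why tagger comparison on small data sets is worthwhile.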
|
18 |
Avaliando um rotulador estatístico de categorias morfo-sintáticas para a língua portuguesa / Evaluating a stochastic part-of-speech tagger for the Portuguese language. Villavicencio, Aline, January 1995 (has links)
O Processamento de Linguagem Natural (PLN) é uma área da Ciência da Computação que vem tentando, ao longo dos anos, aperfeiçoar a comunicação entre o homem e o computador. Várias técnicas têm sido utilizadas para aperfeiçoar esta comunicação, entre elas a aplicação de métodos estatísticos. Estes métodos têm sido usados por pesquisadores de PLN com um crescente sucesso, e uma de suas maiores vantagens é a possibilidade do tratamento de textos irrestritos. Em particular, a aplicação dos métodos estatísticos na marcação automática de "corpus" com categorias morfo-sintáticas tem se mostrado bastante promissora, obtendo resultados surpreendentes. Assim sendo, este trabalho descreve o processo de marcação automática de categorias morfo-sintáticas. Inicialmente, são apresentados e comparados os principais métodos aplicados à marcação automática: os métodos baseados em regras e os métodos estatísticos. São descritos os principais formalismos e técnicas usadas para esta finalidade pelos métodos estatísticos. É introduzida a marcação automática para a Língua Portuguesa, algo até então inédito. O objetivo deste trabalho é fazer um estudo detalhado e uma avaliação do sistema rotulador de categorias morfo-sintáticas, a fim de que se possa definir um padrão no qual o sistema apresente a mais alta precisão possível. Para efetuar esta avaliação, são especificados alguns critérios: a qualidade do "corpus" de treinamento, o seu tamanho e a influência das palavras desconhecidas. A partir dos resultados obtidos, espera-se poder aperfeiçoar o sistema rotulador, de forma a aproveitar, da melhor maneira possível, os recursos disponíveis para a Língua Portuguesa. / Natural Language Processing (NLP) is an area of Computer Science that has been trying to improve communication between human beings and computers. A number of different techniques have been used to improve this communication, among them the use of stochastic methods. These methods have been used successfully by NLP researchers, and one of their most remarkable advantages is that they are able to deal with unrestricted texts. In particular, the use of stochastic methods for part-of-speech tagging has achieved some extremely good results. Thus, this work describes the process of part-of-speech tagging. First, we present and compare the main tagging methods: the rule-based methods and the stochastic ones. We describe the main stochastic tagging formalisms and techniques for part-of-speech tagging. We also introduce part-of-speech tagging for the Portuguese language, something not previously attempted. The main purpose of this work is to study and evaluate a part-of-speech tagger system in order to establish the setting in which it achieves the greatest accuracy. To perform this evaluation, several parameters were considered: the quality of the training corpus, its size, and the relation between unknown words and accuracy. The results obtained will be used to improve the tagger, in order to make the best possible use of the available Portuguese language resources.
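One of the evaluation criteria above, the influence of unknown words, can be quantified simply as the proportion of test tokens never seen in training, since a lexicon-based tagger must fall back on a default guess for those tokens. A toy sketch of that measurement (the data and the framing are mine, not the evaluated system's):

```python
def unknown_rate(train_words, test_words):
    """Fraction of test tokens absent from the training vocabulary."""
    vocab = set(train_words)
    unknown = [w for w in test_words if w not in vocab]
    return len(unknown) / len(test_words)

# Toy Portuguese-flavoured word lists.
train = ["o", "gato", "dorme", "o", "cachorro", "corre"]
test = ["o", "gato", "late", "e", "corre"]
print(unknown_rate(train, test))  # 0.4 ('late' and 'e' are unseen)
```

Plotting tagging accuracy against this rate, for corpora of different sizes and qualities, is one way to make the evaluation criteria of the thesis concrete.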
|
20 |
Disaster tweet classification using parts-of-speech tags: a domain adaptation approach. Robinson, Tyler, January 1900 (has links)
Master of Science / Department of Computer Science / Doina Caragea / Twitter is one of the most active social media sites today. Almost everyone uses it, as it is a medium by which people stay in touch and inform others about events in their lives. Among many other types of events, people tweet about disasters. Both man-made and natural disasters, unfortunately, occur all the time. When these tragedies transpire, people tend to cope in their own ways. One of the most popular ways people convey their feelings towards disaster events is by offering or asking for support, providing valuable information about the disaster, and voicing their disapproval towards those who may be the cause. However, not all of the tweets posted during a disaster are guaranteed to be useful or informative, either to authorities or to the general public. As the number of tweets posted during a disaster can reach the hundreds of thousands, it is necessary to automatically distinguish tweets that provide useful information from those that do not.
Manual annotation cannot scale up to the large number of tweets, as it takes significant time and effort, which makes it unsuitable for real-time disaster tweet annotation. Alternatively, supervised machine learning has traditionally been used to learn classifiers that can quickly annotate new unseen tweets. But supervised machine learning algorithms make use of labeled training data from the disaster of interest, which is presumably not available for a current target disaster. However, it is reasonable to assume that some amount of labeled data is available for a prior source disaster. Therefore, domain adaptation algorithms that make use of labeled data from a source disaster to learn classifiers for the target disaster provide a promising direction in the area of tweet classification for disaster management. In prior work, domain adaptation algorithms have been trained on tweets represented as bag-of-words. In this research, I studied the effect of Part of Speech (POS) tag unigrams and bigrams on the performance of the domain adaptation classifiers. Specifically, I used POS tag unigram and bigram features in conjunction with a Naive Bayes Domain Adaptation algorithm to learn classifiers from source labeled data together with target unlabeled data, and subsequently used the resulting classifiers to classify target disaster tweets. The main research question addressed through this work was whether POS tags can help improve the performance of classifiers learned from tweet bag-of-words representations only. Experimental results have shown that POS tags can improve the performance of the classifiers learned from words only, though not always. Furthermore, the results of the experiments show that POS tag bigrams contain more information than POS tag unigrams, as the classifiers learned from bigrams perform better than those learned from unigrams.
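The feature pipeline described above, POS tag bigrams feeding a Naive Bayes classifier, can be sketched as follows. The tag sequences, class names, and smoothing are invented toy values, and this plain multinomial Naive Bayes is a stand-in for the domain adaptation variant actually used in the thesis:

```python
from collections import Counter
import math

def pos_bigrams(tags):
    """Turn a POS tag sequence into bigram features."""
    return [f"{a}_{b}" for a, b in zip(tags, tags[1:])]

def train_nb(docs):
    """docs: list of (feature_list, label). Returns a predict function
    for a multinomial Naive Bayes model with Laplace smoothing."""
    vocab = {f for feats, _ in docs for f in feats}
    labels = {lab for _, lab in docs}
    counts = {lab: Counter() for lab in labels}
    priors = Counter(lab for _, lab in docs)
    for feats, lab in docs:
        counts[lab].update(feats)
    def predict(feats):
        def score(lab):
            total = sum(counts[lab].values())
            s = math.log(priors[lab] / len(docs))
            for f in feats:
                s += math.log((counts[lab][f] + 1) / (total + len(vocab)))
            return s
        return max(labels, key=score)
    return predict

# Toy tagged tweets: the 'informative' ones start NOUN VERB, the rest PRON.
docs = [(pos_bigrams(["NOUN", "VERB", "NOUN"]), "informative"),
        (pos_bigrams(["NOUN", "VERB", "ADJ"]), "informative"),
        (pos_bigrams(["PRON", "VERB", "ADV"]), "chatter"),
        (pos_bigrams(["PRON", "ADV", "VERB"]), "chatter")]
predict = train_nb(docs)
print(predict(pos_bigrams(["NOUN", "VERB", "NOUN"])))  # 'informative'
```

Because bigrams encode the order of adjacent tags, they carry strictly more sequence information than unigrams, which is consistent with the experimental finding reported above.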
|