271 |
Modelagem de contextos para aprendizado automático aplicado à análise morfossintática / Modeling contexts for automatic learning applied to morphosyntactic analysis. Kepler, Fábio Natanael. 28 May 2010.
Part-of-speech tagging involves assigning to the words in a sentence their part-of-speech classes based on the contexts they appear in. Variable-Length Markov Chains (VLMCs) offer a way of modeling contexts longer than trigrams without suffering too much from data sparsity and state-space complexity. Even so, two Portuguese words show a high degree of ambiguity: 'que' and 'a'. The errors made in tagging these two words account for a quarter of the total errors made by a VLMC-based tagger. Moreover, these words seem to show two different types of ambiguity: one depending on non-local context and one on right context. We explored ways of expanding the VLMC-based model with a number of different models and methods in order to tackle these issues. The approaches showed varying degrees of success, with one particular method (Guided Learning) resolving much of the ambiguity of 'a'; we discuss reasons why this happened. Regarding 'que', throughout this thesis we propose and test various methods for learning contextual information in an attempt to disambiguate it, and we show how, in all of them, the level of ambiguity shown by 'que' remains practically constant.
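As a concrete illustration of the modeling idea (not the author's implementation), the sketch below conditions each tag on the longest suffix of the tag history seen in training, which is the essence of a VLMC tagger; real implementations also prune the context tree and smooth the probability estimates. The toy corpus is invented.

```python
from collections import defaultdict

class VLMCTagger:
    """Sketch of variable-length context tagging: back off from long
    tag-history suffixes to shorter ones until a seen context is found."""

    def __init__(self, max_order=4):
        self.max_order = max_order
        # maps tag-history suffix (tuple) -> {next_tag: count}
        self.contexts = defaultdict(lambda: defaultdict(int))

    def train(self, tagged_sentences):
        for sentence in tagged_sentences:
            tags = [tag for _, tag in sentence]
            for i, tag in enumerate(tags):
                # record every history suffix up to max_order (k=0 is the
                # empty context, which serves as the unigram fallback)
                for k in range(0, self.max_order + 1):
                    if k <= i:
                        suffix = tuple(tags[i - k:i])
                        self.contexts[suffix][tag] += 1

    def predict_tag(self, history):
        # use the longest suffix of the history seen in training
        for k in range(min(self.max_order, len(history)), -1, -1):
            suffix = tuple(history[len(history) - k:])
            if suffix in self.contexts:
                dist = self.contexts[suffix]
                return max(dist, key=dist.get)
        return None

# Toy example with a hypothetical mini-corpus of (word, tag) pairs:
corpus = [[("a", "DET"), ("casa", "N"), ("caiu", "V")],
          [("a", "PRON"), ("vi", "V"), ("ontem", "ADV")]]
tagger = VLMCTagger()
tagger.train(corpus)
print(tagger.predict_tag(["DET", "N"]))  # most likely tag after DET N -> 'V'
```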
|
272 |
Evaluation of the usability of constraint diagrams as a visual modelling language: theoretical and empirical investigations. Fetais, Noora. January 2013.
This research evaluates the constraint diagram (CD) notation, a formal representation for program specification that shows promise for use by people who are not expert in software design. Multiple methods were adopted in order to provide triangulated evidence of the potential benefits of constraint diagrams compared with other notational systems. Three main approaches were adopted for this research. The first was a semantic and task analysis of the CD notation, conducted by applying the Cognitive Dimensions framework to examine the relative strengths and weaknesses of constraint diagrams and conventional notations in terms of how these different representations facilitate or impede perception. From this systematic analysis, we found that CD cognitively reduced the cost of exploratory design, modification, incrementation, searching, and transcription activities with regard to the cognitive dimensions: consistency, visibility, abstraction, closeness of mapping, secondary notation, premature commitment, role-expressiveness, progressive evaluation, diffuseness, provisionality, hidden dependency, viscosity, hard mental operations, and error-proneness. The second approach was an empirical evaluation of the comprehension of CD compared to natural language (NL), with computer science students. This experiment took the form of a web-based competition in which 33 participants were given instructions and training on either CD or the equivalent NL specification expressions; after each example, they answered three multiple-choice questions requiring the interpretation of expressions in their particular notation. Although the CD group spent more time on the training and had less confidence, they obtained interpretation scores comparable to the NL group's and took less time to answer the questions, despite having no prior experience of CD notation. The third approach was an experiment on the construction of CD. Twenty participants were given instructions and training on either CD or the equivalent NL specification expressions; after each example, they answered three questions requiring the construction of expressions in their particular notation. We built an editor that allowed construction in both notations and automatically logged participants' interactions. In general, for constructing program specifications, the CD group gave more accurate answers, spent less time in training, and returned to the training examples less often than the NL group. Overall, CD was found to be understandable, usable, intuitive, and expressive, with unambiguous semantic notation.
|
273 |
Machine translation for Chinese medical literature. Li Hoi-Ying. January 1997.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1997. Includes bibliographical references (leaves 117-120).

Contents: 1. Introduction. 2. Background: strategies in machine translation systems (direct, transfer, interlingua, AI, and statistical approaches); grammars; sublanguages; human interaction; evaluation for performance; machine translation between Chinese and English; problems and issues in MTCML (linguistic characteristics of the corpus; strategies for problems in MTCML). 3. Segmentation: strategies for segmentation; segmentation algorithm in MTCML. 4. Tagging: objective; approach (category and sub-category; tools). 5. Analysis: linguistic study of the corpus (imperative, elliptical, and inverted sentences; voice and tense; vocabulary); pattern extraction; pattern reduction (case study; syntactic rules); disambiguation (category ambiguity; structural ambiguity). 6. Transfer: principle of transfer; extraction of templates (similarity comparison; algorithm); classification of templates (classification; a class-based filter); transfer rule-base (transfer rules; rule matching). 7. Generation: sentence generation; disambiguation of homographs; sentence polishing. 8. System implementation: corpus; dictionaries and lexicons; reduction rules; transfer rules; efficiency of the system; case study (sample result and assessment; results of segmentation and tagging, analysis, transfer, and generation). 9. Conclusion. Appendices: programmer's guide; translation instances.
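The contents name a segmentation algorithm (Chapter 3) but give no detail here. As a hedged illustration, forward maximum matching is the classic dictionary-based baseline for Chinese word segmentation of this era; the thesis's actual algorithm may differ, and the mini-dictionary of medical terms below is invented.

```python
def forward_maximum_matching(text, dictionary, max_word_len=4):
    """Greedy dictionary-based Chinese segmentation: at each position,
    take the longest dictionary word that matches, else a single char."""
    words, i = [], 0
    while i < len(text):
        match = text[i]  # fall back to a single character
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words

# Hypothetical mini medical dictionary (entries invented for illustration)
dictionary = {"病人", "高血压", "血压", "治疗", "服用"}
print(forward_maximum_matching("病人服用高血压药", dictionary))
# -> ['病人', '服用', '高血压', '药']
```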
|
274 |
Painting Pictures with Words - From Theory to System. Coyne, Robert Eric. January 2017.
A picture paints a thousand words, or so we are told. But how many words does it take to paint a picture? And how can words create pictures in the first place? In this thesis we examine a new theory of linguistic meaning -- where the meaning of words and sentences is determined by the scenes they evoke. We describe how descriptive text is parsed and semantically interpreted and how the semantic interpretation is then depicted as a rendered 3D scene. In doing so, we describe WordsEye, our text-to-scene system, and touch upon many fascinating issues of lexical semantics, knowledge representation, and what we call "graphical semantics." We introduce the notion of vignettes as a way to bridge between function and form, between the semantics of language and the grounded semantics of 3D scenes. And we describe how VigNet, our lexical semantic and graphical knowledge base, mediates the whole process.
In the second part of this thesis, we describe four different ways WordsEye has been tested. We first discuss an evaluation of the system in an educational environment where WordsEye was shown to significantly improve literacy skills for sixth grade students versus a control group. We then compare WordsEye with Google Image Search on "realistic" and "imaginative" sentences in order to evaluate its performance on a sentence-by-sentence level and test its potential as a way to augment existing image search tools. Thirdly, we describe what we have learned in testing WordsEye as an online 3D authoring system where it has attracted 20,000 real-world users who have performed almost one million scene depictions. Finally, we describe tests of WordsEye as an elicitation tool for field linguists studying endangered languages. We then sum up by presenting a roadmap for enhancing the capabilities of the system and identifying key opportunities and issues to be addressed.
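To make the function-to-form idea concrete, here is a heavily simplified sketch of what a vignette lookup might look like: a semantic frame resolves to a concrete spatial layout of 3D objects. The data structures and the table-height constant are invented for illustration; VigNet's actual knowledge base is far richer.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str                        # 3D model identifier
    position: tuple = (0.0, 0.0, 0.0)

@dataclass
class Scene:
    objects: list = field(default_factory=list)

# A "vignette" bridges function and form: it maps a functional description
# to a grounded layout. These entries are invented stand-ins for VigNet.
VIGNETTES = {
    ("cat", "on", "table"): lambda: Scene([
        SceneObject("table", (0.0, 0.0, 0.0)),
        SceneObject("cat",   (0.0, 0.75, 0.0)),  # 0.75 = assumed table height
    ]),
}

def depict(subject, relation, obj):
    """Resolve a (subject, relation, object) frame to a scene via vignettes."""
    vignette = VIGNETTES.get((subject, relation, obj))
    if vignette is None:
        raise KeyError(f"no vignette for {subject} {relation} {obj}")
    return vignette()

scene = depict("cat", "on", "table")
for o in scene.objects:
    print(o.name, o.position)
```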
|
275 |
Investigação de métodos de desambiguação lexical de sentidos de verbos do português do Brasil / Research of word sense disambiguation methods for verbs in Brazilian Portuguese. Marco Antonio Sobrevilla Cabezudo. 28 August 2015.
Word sense disambiguation (WSD) consists of identifying the most appropriate sense of a word in a given context, using a pre-specified sense repository. This task is important for other applications, such as machine translation. For English, WSD has been widely explored using different approaches and techniques, yet the task remains a challenge for researchers in semantics. Analyzing the results of WSD methods by part of speech shows that not all classes perform equally, with verbs yielding the worst results. Studies point out that WSD methods rely on shallow information, while verbs need deeper information for their disambiguation, such as syntactic frames or selectional restrictions. For Portuguese, there are few works in this area, and only recently have general-purpose methods been investigated. Moreover, lexical resources focused on verbs have been developed in recent years. In this context, this master's research investigated WSD methods for verbs in texts written in Brazilian Portuguese. In particular, some traditional WSD methods were explored, and linguistic knowledge from VerbNet.Br was subsequently incorporated into them. To support this investigation, the CSTNews corpus was annotated with verb senses using WordNet-Pr as the sense repository. The results showed that the investigated WSD methods failed to beat the strongest baseline, and that incorporating VerbNet.Br knowledge improved the methods, although the improvements were not statistically significant. Contributions of this work include a corpus annotated with verb senses, a tool to support sense annotation, the investigation of WSD methods for verbs, and the use of verb-specific information (from VerbNet.Br) in the WSD of verbs.
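For context, a minimal sketch of the kind of baseline and traditional method involved (not the thesis's code): WordNet-Pr, accessible through NLTK, lists senses in frequency order, so the first synset gives the classic most-frequent-sense baseline, and simplified Lesk gloss overlap is one traditional approach of the sort explored. Whether these match the thesis's exact methods is an assumption, and the example sentence is invented.

```python
# Requires: pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def most_frequent_sense(verb):
    """WordNet orders synsets by frequency, so the first one is the
    classic most-frequent-sense (MFS) baseline prediction."""
    synsets = wn.synsets(verb, pos=wn.VERB)
    return synsets[0] if synsets else None

def simplified_lesk(verb, context_words):
    """Pick the verb sense whose gloss overlaps most with the context."""
    best, best_overlap = most_frequent_sense(verb), -1
    for synset in wn.synsets(verb, pos=wn.VERB):
        gloss = set(synset.definition().lower().split())
        overlap = len(gloss & set(w.lower() for w in context_words))
        if overlap > best_overlap:
            best, best_overlap = synset, overlap
    return best

sense = simplified_lesk("run", ["the", "program", "runs", "on", "the", "computer"])
print(sense, "-", sense.definition())
```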
|
276 |
A Study on the Efficacy of Sentiment Analysis in Author Attribution. Schneider, Michael J. 01 August 2015.
The field of authorship attribution seeks to characterize an author's writing style well enough to determine whether he or she has written a text of interest. One subfield of authorship attribution, stylometry, seeks to find the literary attributes necessary to quantify an author's writing style. The research presented here sought to determine the efficacy of sentiment analysis as a new stylometric feature by comparing its performance in attributing authorship against that of traditional stylometric features. Experimentation with a corpus of science-fiction texts found sentiment analysis to perform far worse at assigning authorship than the traditional stylometric features.
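A hedged sketch of the shape of this comparison: represent each text either by a sentiment profile or by traditional function-word frequencies, then attribute an unknown text to the author with the nearest profile. The lexicons and texts are toy placeholders; the study used real science-fiction texts and an actual sentiment analyzer, not this hand-rolled scorer.

```python
import math
from collections import Counter

# Toy stand-ins for a real sentiment lexicon and function-word list.
POSITIVE = {"good", "bright", "hope", "love"}
NEGATIVE = {"bad", "dark", "fear", "hate"}
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]

def sentiment_features(tokens):
    """Proposed new feature: the text's positive/negative sentiment profile."""
    n = len(tokens) or 1
    return [sum(t in POSITIVE for t in tokens) / n,
            sum(t in NEGATIVE for t in tokens) / n]

def function_word_features(tokens):
    """Traditional stylometric feature: function-word relative frequencies."""
    n = len(tokens) or 1
    counts = Counter(tokens)
    return [counts[w] / n for w in FUNCTION_WORDS]

def attribute(unknown, profiles, featurizer):
    """Assign the unknown text to the author with the nearest profile."""
    u = featurizer(unknown)
    return min(profiles,
               key=lambda author: math.dist(u, featurizer(profiles[author])))

profiles = {
    "author_a": "the dark fear of the void and the hate in it".split(),
    "author_b": "a bright hope and a good love of the stars".split(),
}
unknown = "the dark void and the fear that it holds".split()
print(attribute(unknown, profiles, sentiment_features))      # sentiment-based
print(attribute(unknown, profiles, function_word_features))  # traditional
```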
|
277 |
Nevertheless, She Persisted: A Linguistic Analysis of the Speech of Elizabeth Warren, 2007-2017. Jennings, Matthew. 01 May 2018.
A breakout star among American progressives in recent years, Elizabeth Warren has quickly gone from law professor to leading figure in Democratic politics. This paper analyzes Warren's speech from before her time as a political figure to the present, using the quantitative textual methodology established by Jones (2016), to test whether Warren's speech supports Jones's assertion that masculine speech is the language of power. Ratios of feminine to masculine markers ultimately indicate that, despite her increasing political sway, Warren's speech has instead become increasingly feminine. However, despite associations of feminine speech with weakness, Warren's speech scores highly for expertise and confidence as its feminine scores increase. These findings are related to the relevant political context and have implications for the presumption that masculine speech is the standard for political power.
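A minimal sketch of the ratio computation described: count feminine and masculine lexical markers in a transcript and take their ratio. The actual marker inventories come from Jones (2016); the short word lists here are invented placeholders, as is the sample sentence.

```python
import re
from collections import Counter

# Hypothetical marker lists standing in for Jones's (2016) inventories.
FEMININE_MARKERS = {"so", "really", "feel", "we", "together"}
MASCULINE_MARKERS = {"fact", "number", "percent", "must", "per"}

def marker_ratio(text):
    """Ratio of feminine to masculine marker counts in a speech transcript."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    fem = sum(counts[w] for w in FEMININE_MARKERS)
    masc = sum(counts[w] for w in MASCULINE_MARKERS)
    return fem / masc if masc else float("inf")

speech = "We really must look at this together, because the details matter."
print(f"feminine/masculine ratio: {marker_ratio(speech):.2f}")
```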
|
278 |
#HASHTAGS: A LOOK AT THE EVALUATIVE ROLES OF HASHTAGS ON TWITTER. Schaede, Leah Rose. 01 January 2018.
Social media has become a large part of today's pop culture and of keeping up with what is going on, not only in our social circles but around the world. It has given many people a platform to unite behind their causes, build fandoms, and share their commentary with the world. One tool for grouping posts together or attaching commentary to a thought is the hashtag. In this paper I explore the evaluative roles of hashtags in social media discourse, specifically on Twitter. I use a sample of randomly selected tweets from the Twitter API stream, which I collected and compiled myself: 200,000 tweets in total, with retweets filtered out. I sorted each individual hashtag into the categories outlined by Appraisal Theory, as proposed by Martin and White (2005). I explore the types of evaluation expressed in hashtags, the relationships between evaluative hashtags, and how users negotiate evaluations using meme hashtags.
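A small sketch of the collection step described: filter out retweets and pull hashtags from tweet text, leaving the Appraisal coding itself as a manual step. The payload fields below are assumptions about the Twitter API's shape, and the tweets are invented.

```python
import re

def is_retweet(tweet):
    """Retweets duplicate another user's evaluation, so filter them out."""
    return tweet.get("retweeted_status") is not None or \
           tweet["text"].startswith("RT @")

def extract_hashtags(text):
    return re.findall(r"#\w+", text)

# Toy tweets shaped loosely like Twitter API payloads (fields assumed).
tweets = [
    {"text": "Best finale ever #blessed #TeamHero", "retweeted_status": None},
    {"text": "RT @fan: Best finale ever #blessed", "retweeted_status": {}},
]
hashtags = [tag for t in tweets if not is_retweet(t)
            for tag in extract_hashtags(t["text"])]
print(hashtags)  # ['#blessed', '#TeamHero'] -- ready for manual Appraisal coding
```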
|
279 |
Is Simple Wikipedia simple? A study of readability and guidelines. Isaksson, Fabian. January 2018.
Creating easy-to-read text is a task that has traditionally been done manually, but with advancing research in natural language processing, automatic systems for text simplification are being developed. These systems often need training data that is parallel-aligned, and for several years Simple Wikipedia has been the main source of such data. In the current study, several readability measures were tested on a popular simplification corpus. A selection of guidelines from Simple Wikipedia was also operationalized and tested. The results imply that the guidelines are not followed more closely in Simple Wikipedia than in standard Wikipedia. There are, however, differences in the readability measures: the syntactic structures of Simple Wikipedia appear to be less complex than those of standard Wikipedia. A continuation of this study would be to examine other readability measures and to evaluate the guidelines not covered in the current work.
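Two standard readability measures of the kind tested can be computed as below: LIX (average sentence length plus the percentage of words longer than six characters) and Flesch Reading Ease. Whether these are the thesis's exact measures is an assumption, and the syllable counter is a rough vowel-group heuristic.

```python
import re

def lix(text):
    """LIX readability: avg sentence length + percentage of long words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

def flesch_reading_ease(text):
    """Flesch score, using a rough vowel-group syllable heuristic."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return 206.835 - 1.015 * len(words) / len(sentences) \
                   - 84.6 * syllables / len(words)

simple = "The cat sat. The dog ran."
standard = "Notwithstanding considerable complexity, the feline positioned itself."
print(lix(simple), lix(standard))                          # lower = simpler
print(flesch_reading_ease(simple), flesch_reading_ease(standard))  # higher = simpler
```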
|
280 |
Hierarchical text classification of fiction books: With Thema subject categories. Reinaudo, Alice. January 2019.
Categorizing books and literature of any genre and subject area is a vital task for publishers, who seek to distribute their books to the appropriate audiences. Different countries commonly use different subject categorization schemes, which makes international book trading more difficult, since books must be categorized from scratch once they reach another country. A solution to this problem has been proposed in the form of an international standard called Thema, which encompasses thousands of hierarchical subject categories. However, because this scheme is quite recent, many books published before its creation have yet to be assigned subject categories, and even recent books often remain uncategorized. In this work, methods for the automatic categorization of books are investigated, based on multinomial Naive Bayes and Facebook's fastText classifier. The results show some promise for both classifiers, but overall, due to data imbalance and a very long training time that made it difficult to use more data, it is not possible to determine with certainty which classifier is best.
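A minimal sketch of the multinomial Naive Bayes setup with scikit-learn (an assumption about tooling; the thesis's dataset and features are not reproduced here). The descriptions are invented, and FF and FM are used as Thema-style fiction codes for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented book blurbs standing in for real catalogue descriptions.
descriptions = [
    "a detective hunts a killer through rainy city streets",
    "the inspector questions a suspect in a quiet village",
    "dragons and wizards battle for an ancient throne",
    "a young mage discovers a hidden world of magic",
]
labels = ["FF", "FF", "FM", "FM"]  # Thema-style crime and fantasy codes

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(descriptions, labels)
print(model.predict(["an old wizard teaches forbidden spells"]))  # -> ['FM']
```

A hierarchical variant could train one such classifier per level of the Thema tree, routing each book down from top-level categories to finer ones.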
|