Global ETD Search

11	Authorship Attribution Through Words Surrounding Named Entities Jacovino, Julia Maureen 03 April 2014 In text analysis, authorship attribution occurs in a variety of ways. The field of computational linguistics becomes more important as the need for authorship attribution and text analysis becomes more widespread. For this research, pre-existing authorship attribution software, the Java Graphical Authorship Attribution Program (JGAAP), implements a named entity recognizer, specifically the Stanford Named Entity Recognizer, to probe into texts of similar genre and to aid in identifying the correct author. This research specifically examines the words authors use around named entities in order to test the ability of these words to attribute authorship. / McAnulty College and Graduate School of Liberal Arts; / Computational Mathematics / MS; / Thesis;
12	Analysing E-mail Text Authorship for Forensic Purposes Corney, Malcolm W. January 2003 E-mail has become the most popular Internet application, and with its rise in use has come an inevitable increase in the use of e-mail for criminal purposes. It is possible for an e-mail message to be sent anonymously or through spoofed servers. Computer forensics analysts need a tool that can be used to identify the author of such e-mail messages. This thesis describes the development of such a tool using techniques from the fields of stylometry and machine learning. An author's style can be reduced to a pattern by measuring various stylometric features of the text. E-mail messages also contain macro-structural features that can be measured. Together, these features can be used with the Support Vector Machine learning algorithm to classify or attribute authorship of e-mail messages to an author, provided that a suitable sample of messages is available for comparison. In an investigation, the set of authors may need to be reduced from an initial large list of possible suspects. This research has trialled authorship characterisation based on sociolinguistic cohorts, such as gender and language background, as a technique for profiling the anonymous message so that the suspect list can be reduced. E-Mail Computer Forensics Authorship Attribution Authorship Characterisation Stylistics Support Vector Machine
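The idea in entry 12 of reducing an author's style to a pattern of measurements can be illustrated with a minimal sketch. It is not the thesis's feature set: the function name `stylometric_features`, the short `FUNCTION_WORDS` list, and the three summary measurements are all assumptions chosen for illustration, and the resulting vector is only the kind of input a classifier such as a Support Vector Machine would consume.

```python
import re
from collections import Counter

# Hypothetical, very short function-word list; a real study would use a far longer one.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is")


def stylometric_features(text):
    """Map a message to a fixed-length list of simple style measurements."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    counts = Counter(words)
    features = [
        sum(len(w) for w in words) / n_words,  # average word length
        len(words) / max(len(sentences), 1),   # average sentence length in words
        len(counts) / n_words,                 # type-token ratio
    ]
    # Relative frequency of each function word, in a fixed order.
    features.extend(counts[w] / n_words for w in FUNCTION_WORDS)
    return features


sample = "The cat sat. The dog ran to the cat."
vector = stylometric_features(sample)
```

Every message maps to a vector of the same length (here 3 + 8 = 11 values), so vectors from known authors can be used to train a classifier and a vector from an anonymous message can then be scored against it.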
13	Development of new models for authorship recognition using complex networks / Desenvolvimento de novos modelos para reconhecimento de autoria com a utilização de redes complexas Vanessa Queiroz Marinho 14 July 2017 Complex networks have been successfully applied in many fields and are studied in areas that include, for example, physics and computer science. The finding that complex network methods can be used to analyze texts at different levels of complexity has led to advances in natural language processing (NLP) tasks. Examples of applications analyzed with these methods are keyword identification, automatic summarization, and authorship attribution. The latter task has been studied with some success through co-occurrence (or adjacency) networks that connect only the closest words in the text. Despite this success, only a few works have attempted to extend this representation or employ different ones. Moreover, many approaches use a similar set of measurements to characterize the networks and do not combine their techniques with those traditionally used for authorship attribution. This Master's research proposes some extensions to the traditional co-occurrence model and investigates new attributes and other representations (such as mesoscopic and named entity networks) for the task. The connectivity information of function words is used to complement the characterization of authors' writing styles, as these words are relevant for the task. Finally, the main contribution of this research is the development of hybrid classifiers, called labelled motifs, that combine traditional factors with properties obtained from the topological analysis of complex networks. The relevance of these classifiers is verified in the context of authorship attribution and translationese identification. 
With this hybrid approach, we show that it is possible to improve the performance of network-based techniques when they are combined with traditional ones usually employed in NLP. By adapting, combining and improving the model, we not only improved the performance of authorship attribution systems but also gained a better understanding of which quantitative textual factors (measured through networks) can be used in stylometry studies. The advances obtained during this project may be useful for studying related applications, such as the analysis of stylistic inconsistencies and plagiarism, and the analysis of text complexity. Furthermore, most of the methods proposed in this work can be easily applied to many natural languages. Authorship attribution Complex networks Natural language processing
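The co-occurrence (adjacency) representation described in entry 13, which connects only the closest words in a text, can be sketched as follows. This is an illustrative simplification, not the thesis's model: the names `cooccurrence_network` and `degree`, the `window` parameter, the undirected edges, and the absence of any preprocessing (such as lemmatization or stopword handling) are all assumptions.

```python
import re
from collections import defaultdict


def cooccurrence_network(text, window=1):
    """Link each word to the words that follow it within `window` positions."""
    words = re.findall(r"[a-z]+", text.lower())
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for neighbor in words[i + 1 : i + 1 + window]:
            if neighbor != word:           # skip self-loops
                graph[word].add(neighbor)
                graph[neighbor].add(word)  # undirected edge (an assumption)
    return graph


def degree(graph, word):
    """Number of distinct neighbors of `word`; 0 if the word is absent."""
    return len(graph.get(word, ()))


text = "the cat saw the dog and the dog saw the cat"
net = cooccurrence_network(text)
# Connectivity of function words, one kind of measurement the abstract mentions.
function_word_degrees = {w: degree(net, w) for w in ("the", "and")}
```

With `window=1` only adjacent words are linked. Measurements such as the degree of function-word nodes can then be collected into a feature vector and combined with traditional stylometric features.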
14	Propriedades de redes aplicadas à atribuição de autoria / Network features for authorship attribution Camilo Akimushkin Valencia 22 May 2017 Authorship attribution is an active research area with many applications, including plagiarism detection, analysis of historical texts, and identification of terrorist messages or forged documents. Theoretical models of complex networks are already used for authorship attribution, but some important aspects have been ignored. In this thesis, we explore the dynamics of co-occurrence networks and their relation to the words that the nodes represent, and find that both are clear signatures of authorship. Using optimized descriptors of network topology and machine learning algorithms, it was possible to achieve accuracy rates above 85%, with a rate of 98.75% reached in one particular case, for collections of 80 books, each compiled from 8 English-language authors with 10 books per author. This thesis also shows that there are still unexplored aspects of co-occurrence networks of texts, which should allow further advances in the near future. Natural languages Authorship attribution Complex networks Time series
15	A Study of Media Polarization with Authorship Attribution Methods Yifei Hu (9193709) 14 December 2020 Media polarization is a serious issue that can affect someone's views on matters ranging from a scientific fact to the perceived results of a presidential election. The media outlets in the United States are aligned along a political spectrum, representing different stances on various issues. Without providing any false information (but usually by omitting some facts), media outlets can report events by deliberately using words and styles that favor particular political positions. This research investigated U.S. media polarization with authorship attribution approaches, analyzing stylistic differences between left-leaning and right-leaning media and discovering specific linguistic patterns that made news articles display biased political attitudes. Several models of authorship attribution were tested while controlling for topic, stance, and style, and were applied to media companies and their identity within a political spectrum. Features that carry semantic and/or sentiment-related information, such as stance taking, were compared with features that seemingly do not capture it, such as part-of-speech tags. The results demonstrate that a successful classification of articles as left-leaning or right-leaning is possible regardless of their stance. Finally, we provide an analysis of the patterns that we found. Natural language processing media polarization Authorship attribution Stylometry
16	Atribuce autorství básnických textů / Authorship Attribution of Poetic Texts Plecháč, Petr January 2019 Title: Authorship Attribution of Poetic Texts Author: Mgr. Petr Plecháč, Ph.D. Department: Institute of Czech National Corpus Supervisor: doc. Mgr. Václav Cvrček, Ph.D. ABSTRACT Contemporary stylometry offers a number of methods for authorship recognition of poetic texts based on a variety of textual features (e.g. word frequencies, frequencies of character n-grams). However, it seems that one important aspect of these texts has been rather left aside: versification. The thesis uses four corpora of poetic texts (Czech, German, Spanish, and English) in order to analyze to what extent versification features, such as frequencies of rhythmic patterns or frequencies of various types of rhymes, may be used as an indicator of authorship. We show that (1) versification-based models significantly outperform the random baseline, (2) in some cases versification-based models even outperform the traditionally used lexical models, (3) in most of the cases a combination of both types of models outperforms the given models alone. Versification features are consequently employed for the purpose of attribution of two texts of doubted authorship: (1) the versified play The Famous History of the Life of King Henry the Eight which was originally published under the name of William Shakespeare, but where...
17	Benchmarking authorship attribution techniques using over a thousand books by fifty Victorian era novelists Gungor, Abdulmecit 03 April 2018 Indiana University-Purdue University Indianapolis (IUPUI) / Authorship attribution (AA) is the process of identifying the author of a given text; from the machine learning perspective, it can be seen as a classification problem. The literature offers many classification methods for which feature extraction techniques are applied. In this thesis, we explore information retrieval techniques such as Word2Vec, paragraph2vec, and other useful feature selection and extraction techniques for a given text with different classifiers. We have performed experiments on novels extracted from the GDELT database using different features such as bag of words, n-grams, or newly developed techniques like Word2Vec. To improve our success rate, we have combined some useful features, including a diversity measure of the text, bag of words, bigrams, and specific words that are written differently by English and American authors. Support vector machine classifiers of the nu-SVC type are observed to give the best success rates on the stacked useful feature set. The main purpose of this work is to lay the foundations of feature extraction techniques in AA. These are lexical, character-level, syntactic, semantic, and application-specific features. We have also aimed to offer a new data resource for the authorship attribution research community and to demonstrate how it can be used to extract features in any kind of AA problem. The dataset we introduce consists of works by Victorian era authors, and the main feature extraction techniques are shown with exemplary code snippets for audiences in different knowledge domains. Feature extraction approaches and their implementation with different classifiers are presented in simple ways so that the work can also serve as a first step into AA. 
Some feature extraction techniques introduced in this work are also meant to be employed in other NLP tasks, such as sentiment analysis with Word2Vec or text summarization. Readers can start applying the introduced NLP tasks and feature extraction techniques to our dataset. We have also introduced several ways to use the extracted features, such as feature stack engineering with different classifiers, or using Word2Vec to create sentence-level vectors. Authorship Attribution Word2Vec Doc2Vec Word2Vec Inversion Word Scoring
18	Document Forensics Through Textual Analysis Belvisi, Nicole Mariah Sharon January 2019 This project aims to give a brief overview of the research area called Authorship Analysis, with a main focus on Authorship Attribution and the existing methods. The second objective of this project is to test whether one of the main approaches in the field can still be applied successfully to today's new ways of communicating. The study uses multiple stylometric features, as well as a model based on TF-IDF, to establish the authorship of a text. digital forensics textual analysis similarity-based authorship attribution Twitter Övrig annan teknik
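A similarity-based attribution built on TF-IDF, as mentioned in entry 18, might look like the sketch below. The abstract does not specify the weighting or the similarity measure, so the smoothed IDF formula, the cosine similarity, the helper names, and the two toy author samples are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy, invented samples standing in for texts of known authorship.
known = {
    "alice": "honestly i just think the weather is lovely today honestly",
    "bob": "markets fell sharply as investors weighed the rate decision",
}
disputed = "honestly the weather today is just lovely"

names = list(known)
vectors = tfidf_vectors([known[n] for n in names] + [disputed])
scores = {n: cosine(vectors[i], vectors[-1]) for i, n in enumerate(names)}
best_guess = max(scores, key=scores.get)
```

The disputed text is attributed to the candidate whose known writing has the highest cosine similarity to it. With short texts such as tweets, each author's sample would normally be a concatenation of many messages.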
19	以詞性組合為基礎之中文語言特徵研究 / A Study of Part-of-Speech Pair-based Language Features in Chinese Texts 江易倫, Jiang, Yi Lun Unknown Date In authorship attribution research, the choice of language features has always been very important because it affects overall prediction performance. Most commonly used language features, such as high-frequency words, n-grams, and punctuation, perform well in classification, but the phrases within these features cannot explain the causal relationships or the differences between classes. To address this problem, this thesis proposes three linguistically meaningful language features as auxiliary verification: part-of-speech pairs, negation-degree pairs, and modal-word pairs. Using the texts of the author Lei Chen (雷震) as a baseline, it examines whether these features are applicable in two research settings: "different authors, same topic" and "same author, different topics". We use the random forest algorithm to build the classification model, evaluate its classification performance with the out-of-bag (OOB) error rate, and use feature importance values to find the weight of each phrase as a decision point. Finally, we hope to identify, from the classification rules, the distinctive phrases in the language features of different authors and different genres, and to explain them. Authorship attribution Language features Random forest
20	Modeling Alcohol Consumption Using Blog Data Koh, Kok Chuan 05 1900 How do the content and writing style of people who drink alcoholic beverages stand out from those of non-drinkers? How much information can we learn about a person's alcohol consumption behavior by reading text that they have authored? This thesis attempts to extend the methods deployed in authorship attribution and authorship profiling research into the domain of automatically identifying the human action of drinking alcoholic beverages. I examine how a psycholinguistics dictionary (the Linguistic Inquiry and Word Count lexicon, developed by James Pennebaker), together with Kenneth Burke's concept of words as symbols of human action and James Wertsch's concept of mediated action, provides a framework for analyzing meaningful data patterns from the content of blogs written by consumers of alcoholic beverages. The contributions of this thesis to the research field are twofold. First, I show that it is possible to automatically identify blog posts that have content related to the consumption of alcoholic beverages. Second, I provide a framework and tools to model human behavior through text analysis of blog data. Natural language processing blog data LIWC PBAA linguistics inquiry and word count profile based authorship attribution alcohol consumption word symbols mediated action
