1 |
Authorship Attribution of Source Code. Tennyson, Matthew Francis, 01 January 2013.
Authorship attribution of source code is the task of deciding who wrote a program, given its source code. Applications include software forensics, plagiarism detection, and determining software ownership. A number of methods for the authorship attribution of source code have been presented in the past. A review of those existing methods is presented, focusing on the two state-of-the-art methods: SCAP and Burrows.
The primary goal was to develop a new method for authorship attribution of source code that is even more effective than the current state-of-the-art methods. Toward that end, a comparative study of the methods was performed in order to determine their relative effectiveness and establish a baseline. A suitable set of test data was also established in a manner intended to support the vision of a universal data set suitable for standard use in authorship attribution experiments. A data set was chosen consisting of 7,231 open-source and textbook programs written in C++ and Java by thirty unique authors.
The baseline study showed both the Burrows and SCAP methods were indeed state-of-the-art. The Burrows method correctly attributed 89% of all documents, while the SCAP method correctly attributed 95%. The Burrows method inherently anonymizes the data by stripping all comments and string literals, while the SCAP method does not. So the methods were also compared using anonymized data. The SCAP method correctly attributed 91% of the anonymized documents, compared to 89% by Burrows.
The Burrows method was improved in two ways: the set of features used to represent programs was updated and the similarity metric was updated. As a result, the improved method successfully attributed nearly 94% of all documents, compared to 89% attributed in the baseline.
The SCAP method was also improved in two ways: the technique used to anonymize documents was changed and the amount of information retained in the source code author profiles was determined differently. As a result, the improved method successfully attributed 97% of anonymized documents and 98% of non-anonymized documents, compared to 91% and 95% that were attributed in the baseline, respectively.
The two improved methods were used to create an ensemble method based on the Bayes optimal classifier. The ensemble method successfully attributed nearly 99% of all documents in the data set.
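The SCAP method discussed above represents each author by a profile of their most frequent character n-grams and attributes a document to the author whose profile overlaps it most. A minimal sketch of that idea, assuming the n-gram length, profile size, and function names are illustrative choices rather than the tuned parameters from the study:

```python
from collections import Counter

def profile(text, n=3, top=5):
    """SCAP-style profile: the `top` most frequent character n-grams of `text`."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g for g, _ in grams.most_common(top)}

def spi(known_text, unknown_text, n=3, top=5):
    """Simplified Profile Intersection: size of the overlap between the two
    profiles; the candidate author with the largest overlap is attributed."""
    return len(profile(known_text, n, top) & profile(unknown_text, n, top))
```

In practice the profiles would be built from thousands of n-grams over full source files, and the anonymization step (stripping comments and string literals) would run first.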
|
2 |
"Art Made Tongue-tied By Authority?": The Shakespeare Authorship Question. Lindholm, Lars, January 2012.
The essay presents the scholarly controversy over the correct attribution of the works of “Shakespeare”. The main alternative author is Edward de Vere, 17th Earl of Oxford. Sixteenth-century conventions allowed noblemen to write poetry or drama only for private circulation; to appear in print, such works had to be anonymous or pseudonymous. Openly writing for the public theatre, a profitable business, would have been degrading conduct. Oxford's contemporary fame as an author is little matched by known works. Great gaps in the relevant sources indicate that documents concerning not only his person and authorship but also the life of Shakspere from Stratford, the alleged author, were deliberately eliminated in order to transfer the authorship, for which the political authority of the Elizabethan and Jacobean autocratic society had motive and resources enough. A restored identity would imply a radical redating of the plays and poems. To what extent literature is autobiographical, or was in that age, and whether restoring a lost identity from written works is legitimate at all, are basic issues of the debate, which always pits tradition without real proof against circumstantial evidence. Since such arguments are incompatible, both sides have incessantly missed their targets. The historical conditions for the sequence of events that created the fiction, and its main steps, are related. Oxford is the focus, since most old and new evidence for making a case refers to him. The views of the two parties on different points are presented through continual quotation from representative recent works by Shakespeare scholars, where the often scornful tone of the debate still echoes. It is claimed that the urge for concrete results will sway opinion toward the side that proves productive and can eventually create a new coherent picture, but better communication between the parties' scholars is called for. / Literary Degree Project
|
3 |
Using Style Markers for Detecting Plagiarism in Natural Language Documents. Kimler, Marco, January 2003.
Most of the existing plagiarism detection systems compare a text to a database of other texts. These external approaches, however, are vulnerable because texts not contained in the database cannot be detected as source texts. This paper examines an internal plagiarism detection method that uses style markers from authorship attribution studies in order to find stylistic changes in a text. These changes might pinpoint plagiarized passages. Additionally, a new style marker called specific words is introduced. A pre-study tests whether the style markers can fingerprint an author's style and whether they are stable across sample sizes. It is shown that vocabulary richness measures do not fulfil these prerequisites. The other style markers - simple ratio measures, readability scores, frequency lists, and entropy measures - do have these characteristics and are, together with the new specific words measure, used in a main study with an unsupervised approach for detecting stylistic changes in plagiarized texts at sentence and paragraph levels. It is shown that at these small levels the style markers generally cannot detect plagiarized sections because of intra-authorial stylistic variation (i.e. noise), and that at larger levels the results are strongly affected by the sliding window approach. The specific words measure, however, can pinpoint single sentences written by another author.
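The internal approach above can be illustrated with a toy sketch: two simple style markers (type-token ratio and mean word length) computed per sentence, and a crude z-score rule that flags sentences deviating from the document norm. The marker choice, threshold, and function names are illustrative assumptions, far simpler than the study's marker set:

```python
import re
import statistics

def style_vector(text):
    """Two simple style markers: type-token ratio and mean word length."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return (0.0, 0.0)
    ttr = len(set(words)) / len(words)
    mwl = sum(map(len, words)) / len(words)
    return (ttr, mwl)

def flag_outliers(sentences, z_cut=2.0):
    """Flag sentences whose mean word length deviates strongly from the
    document mean - a crude stand-in for intra-document change detection."""
    lens = [style_vector(s)[1] for s in sentences]
    mu = statistics.mean(lens)
    sd = statistics.pstdev(lens) or 1.0
    return [i for i, l in enumerate(lens) if abs(l - mu) / sd > z_cut]
```

As the abstract notes, at sentence level such markers are noisy; real detectors smooth over sliding windows of many sentences.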
|
5 |
AN ANALYSIS ON SHORT-FORM TEXT AND DERIVED ENGAGEMENT. Schwarz, Ryan J, 22 July 2024.
Short text has historically proven challenging to work with in many Natural Language Processing (NLP) applications. Traditional tasks such as authorship attribution benefit from having longer samples of work to derive features from. Even newer tasks, such as synthetic text detection, struggle to distinguish between authentic and synthetic text in the short form. Due to the widespread usage of social media and the proliferation of freely available Large Language Models (LLMs), such as the GPT series from OpenAI and Bard from Google, there has been a deluge of short-form text on the internet in recent years. Short-form text has either become or remained a staple in several ubiquitous areas such as schoolwork, entertainment, social media, and academia. This thesis analyzes short text through the lens of NLP tasks such as synthetic text detection, LLM authorship attribution, derived engagement, and predicted engagement. The first focus explores detection in the binary case of determining whether tweets are synthetically generated or not and proposes a novel feature extraction technique to improve classifier results. The second focus further explores the challenges short-form text presents for determining authorship, along with a cavalcade of related difficulties, and presents a potential workaround for those issues. The final focus attempts to predict social media engagement from NLP representations of comments, yielding new understanding of the social media environment and of the multitude of additional factors required for engagement prediction.
|
6 |
Characterization of Prose by Rhetorical Structure for Machine Learning Classification. Java, James, 01 January 2015.
Measures of classical rhetorical structure in text can improve accuracy in certain types of stylistic classification tasks such as authorship attribution. This research augments the relatively scarce work in the automated identification of rhetorical figures and uses the resulting statistics to characterize an author's rhetorical style. These characterizations of style can then become part of the feature set of various classification models.
Our Rhetorica software identifies 14 classical rhetorical figures in free English text, with generally good precision and recall, and provides summary measures to use in descriptive or classification tasks. Classification models trained on Rhetorica's rhetorical measures paired with lexical features typically performed better at authorship attribution than either set of features used individually. The rhetorical measures also provide new stylistic quantities for describing texts, authors, genres, etc.
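Rhetorica's own detection rules are not given in this abstract, but one of the classical figures such a tool must recognize, anaphora (successive sentences opening with the same word), can be sketched with a simple surface rule. The function name and the three-sentence threshold are illustrative assumptions, not Rhetorica's actual criteria:

```python
import re

def detect_anaphora(text, min_run=3):
    """Flag runs of `min_run` or more consecutive sentences that open with
    the same word - a crude surface cue for the figure of anaphora."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    openers = [s.split()[0].lower() for s in sentences]
    runs = []
    i = 0
    while i < len(openers):
        j = i
        while j < len(openers) and openers[j] == openers[i]:
            j += 1
        if j - i >= min_run:
            runs.append((i, j - i, openers[i]))  # (start index, run length, word)
        i = j
    return runs
```

Counts of such figure occurrences, normalized per text, are the kind of summary measure that can then join a lexical feature set for classification.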
|
7 |
Authorship Attribution with Function Word N-Grams. Johnson, Russell Clark, 01 January 2013.
Prior research has considered the sequential order of function words, after the contextual words of the text have been removed, as a stylistic indicator of authorship. This research describes an effort to enhance authorship attribution accuracy based on this same information source with alternate classifiers, alternate n-gram construction methods, and a genetically tuned configuration.
The approach is original in that it is the first time that probabilistic versions of Burrows's Delta have been used. Instead of using z-scores as an input for a classifier, the z-scores were converted to probabilistic equivalents (since z-scores cannot be subtracted, added, or divided without the possibility of distorting their probabilistic meaning); this adaptation enhanced accuracy. Multiple versions of Burrows's Delta were evaluated; this includes a hybrid of the Probabilistic Burrows's Delta and the version proposed by Smith & Aldridge (2011); in this case accuracy was enhanced when individual frequent words were evaluated as indicators of style. Other novel aspects include alternate n-gram construction methods; a reconciliation process that allows texts of various lengths from different authors to be compared; and a GA selection process that determines which function (or frequent) words (see Smith & Rickards, 2008; see also Shaker, Corne, & Everson, 2007) may be used in the construction of function word n-grams.
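For context, the classic (non-probabilistic) Burrows's Delta that this work builds on is the mean absolute difference of z-scores over a fixed list of frequent words. A minimal sketch, assuming relative word frequencies are precomputed; the probabilistic conversion described above is not shown:

```python
import statistics

def delta(corpus_freqs, author_freqs, test_freqs):
    """Classic Burrows's Delta: mean absolute z-score difference over a fixed
    word list. `corpus_freqs` maps each word to its relative frequencies
    across all training texts (used to estimate mean and spread); the other
    two arguments map words to relative frequencies in one profile each."""
    total = 0.0
    for word, freqs in corpus_freqs.items():
        mu = statistics.mean(freqs)
        sd = statistics.pstdev(freqs) or 1.0  # guard against zero spread
        z_author = (author_freqs.get(word, 0.0) - mu) / sd
        z_test = (test_freqs.get(word, 0.0) - mu) / sd
        total += abs(z_author - z_test)
    return total / len(corpus_freqs)
```

The candidate author with the smallest Delta to the test text is attributed; the research above replaces the raw z-scores with probabilistic equivalents before this comparison.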
|
8 |
Development of new models for authorship recognition using complex networks. Marinho, Vanessa Queiroz, 14 July 2017.
Complex networks have been successfully applied to many different fields and are the subject of study in areas that include, for example, physics and computer science. The finding that complex network methods can be used to analyze texts at their different levels of complexity has led to advances in natural language processing (NLP) tasks. Examples of applications analyzed with complex network methods are keyword identification, automatic summarization, and authorship attribution systems. The latter task has been studied with some success through the representation of co-occurrence (or adjacency) networks that connect only the closest words in the text. Despite this success, only a few works have attempted to extend this representation or employ different ones. Moreover, many approaches use a similar set of measurements to characterize the networks and do not combine their techniques with the ones traditionally used for the authorship attribution task. This Master's research proposes some extensions to the traditional co-occurrence model and investigates new attributes and other representations (such as mesoscopic and named-entity networks) for the task. The connectivity information of function words is used to complement the characterization of authors' writing styles, as these words are relevant for the task. Finally, the main contribution of this research is the development of hybrid classifiers, called labelled motifs, that combine traditional factors with properties obtained through the topological analysis of complex networks. The relevance of these classifiers is verified in the context of authorship attribution and translationese identification. With this hybrid approach, we show that it is possible to improve the performance of network-based techniques when they are combined with the ones traditionally employed in NLP.
By adapting, combining and improving the model, not only was the performance of authorship attribution systems improved, but it also became possible to better understand which quantitative textual factors (measured through networks) can be used in stylometry studies. The advances obtained during this project may be useful for studying related applications, such as the analysis of stylistic inconsistencies and plagiarism, and the analysis of text complexity. Furthermore, most of the methods proposed in this work can be easily applied to many natural languages.
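The co-occurrence (adjacency) network that this model extends connects each pair of neighbouring words in the text. A minimal sketch of its construction and of one basic topological measurement, node degree; the function names and the choice to ignore self-loops are illustrative:

```python
import re
from collections import defaultdict

def cooccurrence_network(text):
    """Word co-occurrence network: one node per distinct word, one undirected
    weighted edge for each pair of words that appear side by side."""
    words = re.findall(r"[a-z']+", text.lower())
    edges = defaultdict(int)
    for a, b in zip(words, words[1:]):
        if a != b:  # ignore self-loops from immediate repetition
            edges[frozenset((a, b))] += 1
    return edges

def degrees(edges):
    """Node degree (number of distinct neighbours) derived from the edges."""
    deg = defaultdict(int)
    for pair in edges:
        for w in pair:
            deg[w] += 1
    return dict(deg)
```

Topological measurements such as these, aggregated per author, are the kind of network properties the labelled-motifs classifiers combine with traditional stylometric features.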
|
9 |
Network features for authorship attribution. Valencia, Camilo Akimushkin, 22 May 2017.
Authorship attribution is an active research area with many applications, including plagiarism detection, analysis of historical texts, terrorist message identification, and document falsification. Theoretical models of complex networks are already used for authorship attribution, but some important aspects have been ignored. In this thesis, we explore the dynamics of co-occurrence networks and the role of the words that represent the nodes, and find that both are clear signatures of authorship. Using optimized descriptors of the network topology and machine learning algorithms, it was possible to achieve accuracy rates above 85%, with a rate of 98.75% reached in one particular case, for collections of 80 books produced by 8 English-speaking writers with 10 books per author. This thesis also shows that there are still many unexplored aspects of co-occurrence networks of texts, which should allow even greater advances in the near future.
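One way to turn per-book network descriptors into an attribution decision, sketched here as a simple nearest-centroid rule rather than the optimized machine learning pipeline the thesis actually evaluates; the feature layout and function names are illustrative:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(xs) / len(xs) for xs in zip(*vectors)]

def attribute(author_vectors, unknown):
    """Assign `unknown` (a vector of network descriptors, e.g. mean degree,
    clustering coefficient) to the author whose centroid is nearest."""
    best, best_d = None, math.inf
    for author, vecs in author_vectors.items():
        d = math.dist(centroid(vecs), unknown)
        if d < best_d:
            best, best_d = author, d
    return best
```

With 10 books per author, each author's centroid is estimated from 10 such descriptor vectors; cross-validation over the 80 books then yields the reported accuracy rates.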
|
10 |
A Study on the Efficacy of Sentiment Analysis in Author Attribution. Schneider, Michael J, 01 August 2015.
The field of authorship attribution seeks to characterize an author's writing style well enough to determine whether he or she has written a text of interest. One subfield of authorship attribution, stylometry, seeks to identify the literary attributes needed to quantify an author's writing style. The research presented here sought to determine the efficacy of sentiment analysis as a new stylometric feature by comparing its performance in attributing authorship against that of traditional stylometric features. Experimentation with a corpus of sci-fi texts found sentiment analysis to perform much worse at assigning authorship than the traditional stylometric features.
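A sentiment score used as a single stylometric feature can be sketched with a toy lexicon; a real study would rely on a full sentiment analyzer, so the word lists, scoring rule, and names here are illustrative assumptions only:

```python
# Toy lexicons - real sentiment tools use lexicons of thousands of scored words.
POSITIVE = {"good", "great", "wonderful", "love"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def sentiment_score(text):
    """Toy lexicon-based sentiment: (positives - negatives) / total words,
    yielding one number per text that can serve as a stylometric feature."""
    words = text.lower().split()
    if not words:
        return 0.0
    pos = sum(w.strip(".,!?") in POSITIVE for w in words)
    neg = sum(w.strip(".,!?") in NEGATIVE for w in words)
    return (pos - neg) / len(words)
```

Comparing classifier accuracy with this one feature against accuracy with traditional features (function word frequencies, n-grams, and the like) mirrors the comparison the study performed.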
|