About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.

Authorship Attribution of Source Code

Tennyson, Matthew Francis 01 January 2013
Authorship attribution of source code is the task of deciding who wrote a program, given its source code. Applications include software forensics, plagiarism detection, and determining software ownership. A number of methods for the authorship attribution of source code have been presented in the past. A review of those existing methods is presented, while focusing on the two state-of-the-art methods: SCAP and Burrows. The primary goal was to develop a new method for authorship attribution of source code that is even more effective than the current state-of-the-art methods. Toward that end, a comparative study of the methods was performed in order to determine their relative effectiveness and establish a baseline. A suitable set of test data was also established in a manner intended to support the vision of a universal data set suitable for standard use in authorship attribution experiments. A data set was chosen consisting of 7,231 open-source and textbook programs written in C++ and Java by thirty unique authors. The baseline study showed both the Burrows and SCAP methods were indeed state-of-the-art. The Burrows method correctly attributed 89% of all documents, while the SCAP method correctly attributed 95%. The Burrows method inherently anonymizes the data by stripping all comments and string literals, while the SCAP method does not. So the methods were also compared using anonymized data. The SCAP method correctly attributed 91% of the anonymized documents, compared to 89% by Burrows. The Burrows method was improved in two ways: the set of features used to represent programs was updated and the similarity metric was updated. As a result, the improved method successfully attributed nearly 94% of all documents, compared to 89% attributed in the baseline. The SCAP method was also improved in two ways: the technique used to anonymize documents was changed and the amount of information retained in the source code author profiles was determined differently. 
As a result, the improved method successfully attributed 97% of anonymized documents and 98% of non-anonymized documents, compared to 91% and 95% that were attributed in the baseline, respectively. The two improved methods were used to create an ensemble method based on the Bayes optimal classifier. The ensemble method successfully attributed nearly 99% of all documents in the data set.
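As a rough illustration of the profile-based approach underlying SCAP, the sketch below attributes a document to the author whose character n-gram profile overlaps most with the document's own (the simplified profile intersection idea); the corpus, parameter values, and function names are illustrative, not the thesis's implementation:

```python
from collections import Counter

def profile(text, n=3, size=500):
    # SCAP-style profile: the `size` most frequent character n-grams of a text
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g for g, _ in grams.most_common(size)}

def attribute(unknown, author_texts, n=3, size=500):
    # Simplified profile intersection: pick the author whose profile
    # shares the most n-grams with the unknown document's profile
    target = profile(unknown, n, size)
    return max(author_texts,
               key=lambda a: len(target & profile(author_texts[a], n, size)))
```

In practice the method is applied to raw source code, so the n-grams capture layout and naming habits as well as vocabulary.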

"Art Made Tongue-tied By Authority?": The Shakespeare Authorship Question

Lindholm, Lars January 2012
The essay presents the scholarly controversy over the correct attribution of the works of “Shakespeare”. The main alternative candidate is Edward de Vere, 17th Earl of Oxford. Sixteenth-century conventions allowed noblemen to write poetry or drama only for private circulation; to appear in print, such works had to be anonymous or pseudonymous. Openly writing for the public theatre, a profitable business, would have been degrading conduct. Oxford's contemporary fame as an author is little matched by known works. Great gaps in the relevant sources indicate that documents concerning not only his person and authorship but also the life of Shakspere from Stratford, the alleged author, were deliberately eliminated in order to transfer the authorship; the political authority of the Elizabethan and Jacobean autocratic society had motive and resources enough for this. A restored identity would imply a radical redating of the plays and poems. To what extent literature is autobiographical, or was in that age, and whether restoring a lost identity from written works is legitimate at all, are basic issues of the debate, which always pits tradition without real proof against circumstantial evidence. As such arguments are incompatible, both sides have incessantly missed their targets. The historical conditions for the sequence of events that created the fiction, and its main steps, are related. Oxford is in focus, since most old and new evidence for making a case refers to him. The views of the two parties on different points are presented through continual quotation from representative recent works by Shakespeare scholars, in which the often scornful tone of the debate still echoes. It is claimed that the urge for concrete results will sway opinion toward the side that proves productive and can eventually create a new coherent picture, but better communication between the parties' scholars is called for. / Literary Degree Project

Using Style Markers for Detecting Plagiarism in Natural Language Documents

Kimler, Marco January 2003
Most of the existing plagiarism detection systems compare a text to a database of other texts. These external approaches, however, are vulnerable because texts not contained in the database cannot be detected as source texts. This paper examines an internal plagiarism detection method that uses style markers from authorship attribution studies in order to find stylistic changes in a text. These changes might pinpoint plagiarized passages. Additionally, a new style marker called specific words is introduced. A pre-study tests whether the style markers can fingerprint an author's style and whether they are constant with sample size. It is shown that vocabulary richness measures do not fulfil these prerequisites. The other style markers - simple ratio measures, readability scores, frequency lists, and entropy measures - have these characteristics and are, together with the new specific words measure, used in a main study with an unsupervised approach for detecting stylistic changes in plagiarized texts at sentence and paragraph levels. It is shown that at these small levels the style markers generally cannot detect plagiarized sections because of intra-authorial stylistic variations (i.e. noise), and that at bigger levels the results are strongly affected by the sliding window approach. The specific words measure, however, can pinpoint single sentences written by another author.
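The sliding-window idea described above can be illustrated with a minimal sketch, assuming a single simple ratio measure (mean word length) stands in for the full set of style markers; the function names and threshold are hypothetical, not the paper's implementation:

```python
import statistics

def mean_word_length(words):
    # One simple ratio measure; real systems combine many style markers
    return sum(len(w) for w in words) / len(words)

def flag_outlier_windows(sentences, window=3, z=2.0):
    # Slide a window over the sentences, score each window with the
    # ratio measure, and flag windows whose score deviates strongly
    # from the document-wide mean -- a crude stylistic-change signal
    scores = []
    for i in range(len(sentences) - window + 1):
        words = " ".join(sentences[i:i + window]).split()
        scores.append(mean_word_length(words))
    mu = statistics.mean(scores)
    sd = statistics.pstdev(scores) or 1.0  # guard against zero variance
    return [i for i, s in enumerate(scores) if abs(s - mu) / sd > z]
```

Flagged windows are only candidates: as the abstract notes, intra-authorial variation produces many false alarms at small window sizes.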

Characterization of Prose by Rhetorical Structure for Machine Learning Classification

Java, James 01 January 2015
Measures of classical rhetorical structure in text can improve accuracy in certain types of stylistic classification tasks such as authorship attribution. This research augments the relatively scarce work in the automated identification of rhetorical figures and uses the resulting statistics to characterize an author's rhetorical style. These characterizations of style can then become part of the feature set of various classification models. Our Rhetorica software identifies 14 classical rhetorical figures in free English text, with generally good precision and recall, and provides summary measures to use in descriptive or classification tasks. Classification models trained on Rhetorica's rhetorical measures paired with lexical features typically performed better at authorship attribution than either set of features used individually. The rhetorical measures also provide new stylistic quantities for describing texts, authors, genres, etc.
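As a toy illustration of automated figure identification, the sketch below detects one classical figure, anaphora (successive sentences opening with the same word); it is not Rhetorica's implementation, and the function name and heuristic are assumptions:

```python
import re

def detect_anaphora(text, min_repeat=3):
    # Rough proxy for anaphora: runs of successive sentences that
    # open with the same word; returns (opening word, run length) pairs
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    openers = [s.split()[0].lower() for s in sentences]
    runs, start = [], 0
    for i in range(1, len(openers) + 1):
        if i == len(openers) or openers[i] != openers[start]:
            if i - start >= min_repeat:
                runs.append((openers[start], i - start))
            start = i
    return runs
```

Counts of such detections per 1,000 words could then serve as one rhetorical feature alongside lexical features in a classifier.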

Authorship Attribution with Function Word N-Grams

Johnson, Russell Clark 01 January 2013
Prior research has considered the sequential order of function words, after the contextual words of the text have been removed, as a stylistic indicator of authorship. This research describes an effort to enhance authorship attribution accuracy based on this same information source with alternate classifiers, alternate n-gram construction methods, and a genetically tuned configuration. The approach is original in that it is the first time that probabilistic versions of Burrows's Delta have been used. Instead of using z-scores as an input for a classifier, the z-scores were converted to probabilistic equivalents (since z-scores cannot be subtracted, added, or divided without the possibility of distorting their probabilistic meaning); this adaptation enhanced accuracy. Multiple versions of Burrows's Delta were evaluated; this includes a hybrid of the Probabilistic Burrows's Delta and the version proposed by Smith & Aldridge (2011); in this case accuracy was enhanced when individual frequent words were evaluated as indicators of style. Other novel aspects include alternate n-gram construction methods; a reconciliation process that allows texts of various lengths from different authors to be compared; and a GA selection process that determines which function (or frequent) words (see Smith & Rickards, 2008; see also Shaker, Corne, & Everson, 2007) may be used in the construction of function word n-grams.
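For context, the classic (non-probabilistic) Burrows's Delta that this work builds on can be sketched as follows, assuming a fixed list of frequent function words; the corpus and function names are illustrative, and the thesis's probabilistic variants are not shown:

```python
import statistics
from collections import Counter

def rel_freqs(text, vocab):
    # Relative frequency of each vocabulary word in the text
    words = text.lower().split()
    n = len(words) or 1
    counts = Counter(words)
    return [counts[w] / n for w in vocab]

def delta_attribution(unknown, author_texts, vocab):
    # Classic Burrows's Delta: z-score each word's relative frequency
    # against the candidate corpus, then choose the author minimising
    # the mean absolute z-score difference from the unknown document
    profiles = {a: rel_freqs(t, vocab) for a, t in author_texts.items()}
    means = [statistics.mean(p[i] for p in profiles.values())
             for i in range(len(vocab))]
    sds = [statistics.pstdev([p[i] for p in profiles.values()]) or 1e-9
           for i in range(len(vocab))]
    def z(freqs):
        return [(f - m) / s for f, m, s in zip(freqs, means, sds)]
    zu = z(rel_freqs(unknown, vocab))
    def delta(author):
        return statistics.mean(abs(a - b) for a, b in zip(z(profiles[author]), zu))
    return min(profiles, key=delta)
```

The thesis's adaptation replaces the raw z-scores with probabilistic equivalents before they reach the classifier, which is what the abstract reports as enhancing accuracy.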

Development of new models for authorship recognition using complex networks

Marinho, Vanessa Queiroz 14 July 2017
Complex networks have been successfully applied to different fields, being the subject of study in areas that include, for example, physics and computer science. The finding that methods of complex networks can be used to analyze texts at their different levels of complexity has led to advances in natural language processing (NLP) tasks. Examples of applications analyzed with the methods of complex networks are keyword identification, development of automatic summarizers, and authorship attribution systems. The latter task has been studied with some success through the representation of co-occurrence (or adjacency) networks that connect only the closest words in the text. Despite this success, only a few works have attempted to extend this representation or employ different ones. Moreover, many approaches use a similar set of measurements to characterize the networks and do not combine their techniques with the ones traditionally used for the authorship attribution task. This Master's research proposes some extensions to the traditional co-occurrence model and investigates new attributes and other representations (such as mesoscopic and named entity networks) for the task. The connectivity information of function words is used to complement the characterization of authors' writing styles, as these words are relevant for the task. Finally, the main contribution of this research is the development of hybrid classifiers, called labelled motifs, that combine traditional features with properties obtained from the topological analysis of complex networks. The relevance of these classifiers is verified in the context of authorship attribution and translationese identification. With this hybrid approach, we show that it is possible to improve the performance of network-based techniques when they are combined with the traditional ones usually employed in NLP.
By adapting, combining and improving the model, not only was the performance of authorship attribution systems improved, but it also became possible to better understand which quantitative textual factors (measured through networks) can be used in stylometry studies. The advances obtained during this project may be useful for studying related applications, such as the analysis of stylistic inconsistencies and plagiarism, and the analysis of text complexity. Furthermore, most of the methods proposed in this work can be easily applied to many natural languages.
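The base co-occurrence (adjacency) representation used in this line of work can be sketched in a few lines; the thesis's actual features (labelled motifs, mesoscopic networks) are far richer, and the function names here are illustrative:

```python
from collections import defaultdict

def cooccurrence_network(text):
    # Adjacency (co-occurrence) network: one node per distinct word,
    # an undirected edge between each pair of neighbouring words
    words = text.lower().split()
    adj = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:  # ignore self-loops from immediate repetitions
            adj[a].add(b)
            adj[b].add(a)
    return adj

def mean_degree(adj):
    # One of the simplest topological measurements used as a stylistic feature
    return sum(len(v) for v in adj.values()) / len(adj)
```

Topological statistics of such networks (degree, clustering, motif counts) then become entries in an author's feature vector.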

Network features for authorship attribution

Valencia, Camilo Akimushkin 22 May 2017
Authorship attribution is an active research area with many applications, including detection of plagiarism, analysis of historical texts, terrorist message identification and document falsification. Theoretical models of complex networks are already used for authorship attribution, but some important aspects have been ignored. In this thesis, we explore the dynamics of co-occurrence networks and the role of the words that represent the nodes, and find that both are clear signatures of authorship. Using optimized descriptors for the network topology and machine learning algorithms, it was possible to achieve accuracy rates above 85%, with a rate of 98.75% reached in a particular case, for collections of 80 books produced by 8 English-speaking writers with 10 books per author. It is also shown that there are still many unexplored aspects of co-occurrence networks of texts, which seems promising for developments in the near future.

A Study on the Efficacy of Sentiment Analysis in Author Attribution

Schneider, Michael J 01 August 2015
The field of authorship attribution seeks to characterize an author’s writing style well enough to determine whether he or she has written a text of interest. One subfield of authorship attribution, stylometry, seeks to find the literary attributes necessary to quantify an author’s writing style. The research presented here sought to determine the efficacy of sentiment analysis as a new stylometric feature by comparing its performance in attributing authorship against that of traditional stylometric features. Experimentation with a corpus of sci-fi texts found sentiment analysis to have a much lower performance in assigning authorship than the traditional stylometric features.

Effective Authorship Attribution in Large Document Collections

Zhao, Ying, ying.zhao@rmit.edu.au January 2008
Techniques that can effectively identify the authors of texts are of great importance in scenarios such as detecting plagiarism and identifying a source of information. A range of attribution approaches has been proposed in recent years, but none of these is particularly satisfactory; some are ad hoc and most have defects in terms of scalability, effectiveness, and computational cost. Good test collections are critical for the evaluation of authorship attribution (AA) techniques. However, there are no standard benchmarks available in this area; it is almost always the case that researchers use their own test collections. Furthermore, the collections that have been explored in AA are usually small, and thus whether the existing approaches are reliable or scalable is unclear. We develop several AA collections that are substantially larger than those in the literature; machine learning methods are used to establish the value of using such corpora in AA. The results, also used as baseline results in this thesis, show that the developed text collections can be used as standard benchmarks and are able to clearly distinguish between different approaches. One of the major contributions is that we propose the use of the Kullback-Leibler divergence, a measure of how different two distributions are, to identify authors based on elements of writing style. The results show that our approach is at least as effective as, if not always better than, the best existing attribution methods (that is, support vector machines) for two-class AA, and is superior for multi-class AA. Moreover, our proposed method has much lower computational cost and is cheaper to train. Style markers are the key elements of style analysis. We explore several approaches to tokenising documents to extract style markers, examining which marker type works best.
We also propose three systems that boost AA performance by combining evidence from various marker types, motivated by the observation that no one type of marker can satisfy all AA scenarios. To address the scalability of AA, we propose the novel task of authorship search (AS), inspired by document search and intended for large document collections. Our results show that AS is reasonably effective at finding documents by a particular author, even within a collection consisting of half a million documents. Beyond search, we also propose an AS-based method to identify authorship. Our method is substantially more scalable than any method published in prior AA research, in terms of both collection size and the number of candidate authors; the discrimination scales up to several hundred authors.
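A minimal sketch of KL-divergence-based attribution, assuming smoothed unigram distributions over a small style-marker vocabulary (the thesis's marker types and smoothing scheme are richer; the corpus and function names here are illustrative):

```python
import math
from collections import Counter

def smoothed_dist(text, vocab, eps=1e-6):
    # Smoothed relative-frequency distribution over the marker vocabulary;
    # eps keeps every probability non-zero so the KL divergence is finite
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + eps * len(vocab)
    return [(counts[w] + eps) / total for w in vocab]

def kl(p, q):
    # Kullback-Leibler divergence D(P || Q) in nats
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kl_attribution(unknown, author_texts, vocab):
    # Attribute to the author whose marker distribution is closest
    # to the unknown document's distribution under KL divergence
    p = smoothed_dist(unknown, vocab)
    return min(author_texts,
               key=lambda a: kl(p, smoothed_dist(author_texts[a], vocab)))
```

Because KL divergence only requires counting and a log per vocabulary entry, such a method is far cheaper to train than an SVM, which matches the computational-cost claim above.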
