Authorship attribution of source code is the task of deciding who wrote a program, given its source code. Applications include software forensics, plagiarism detection, and determining software ownership. A number of methods for the authorship attribution of source code have been presented in the past. A review of those existing methods is presented, while focusing on the two state-of-the-art methods: SCAP and Burrows.
The primary goal was to develop a new method for authorship attribution of source code that is even more effective than the current state-of-the-art methods. Toward that end, a comparative study of the methods was performed in order to determine their relative effectiveness and establish a baseline. A suitable set of test data was also established in a manner intended to support the vision of a universal data set suitable for standard use in authorship attribution experiments. A data set was chosen consisting of 7,231 open-source and textbook programs written in C++ and Java by thirty unique authors.
The baseline study showed both the Burrows and SCAP methods were indeed state-of-the-art. The Burrows method correctly attributed 89% of all documents, while the SCAP method correctly attributed 95%. The Burrows method inherently anonymizes the data by stripping all comments and string literals, while the SCAP method does not. So the methods were also compared using anonymized data. The SCAP method correctly attributed 91% of the anonymized documents, compared to 89% by Burrows.
The Burrows method was improved in two ways: the set of features used to represent programs was updated and the similarity metric was updated. As a result, the improved method successfully attributed nearly 94% of all documents, compared to 89% attributed in the baseline.
The SCAP method was also improved in two ways: the technique used to anonymize documents was changed and the amount of information retained in the source code author profiles was determined differently. As a result, the improved method successfully attributed 97% of anonymized documents and 98% of non-anonymized documents, compared to 91% and 95% that were attributed in the baseline, respectively.
The two improved methods were used to create an ensemble method based on the Bayes optimal classifier. The ensemble method successfully attributed nearly 99% of all documents in the data set.
Identifer | oai:union.ndltd.org:nova.edu/oai:nsuworks.nova.edu:gscis_etd-1321 |
Date | 01 January 2013 |
Creators | Tennyson, Matthew Francis |
Publisher | NSUWorks |
Source Sets | Nova Southeastern University |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | CEC Theses and Dissertations |
Page generated in 0.0021 seconds