This diploma thesis deals with the use of explicit semantic analysis to detect similarities in source code in the context of plagiarism. The semantic interpreter was built from 40,829 Wikipedia articles, and the analysis was tested on 25 documents specially created using plagiarism techniques and 5 downloaded documents. The dataset covered five languages: Java, JavaScript, PHP, C++ and Python. Another dataset of 15 documents was used to test for random matches. The analysis was shown to be capable, for the given dataset, of detecting similarities across different languages. The Greedy String Tiling algorithm was used to refine the results and, together with explicit semantic analysis, is implemented in the system Anton.
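The idea of explicit semantic analysis mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the three toy concept articles, the TF-IDF weighting, and all function names (`build_interpreter`, `interpret`, `cosine`) are assumptions chosen for illustration, standing in for the Wikipedia-based interpreter. Each text is projected onto a vector of concept weights, and two texts are compared by the cosine of their concept vectors, which is why code in different languages can still score as similar.

```python
import math
import re
from collections import Counter

# Hypothetical toy "concept" articles; the thesis used Wikipedia articles instead.
CONCEPTS = {
    "Sorting": "sort order compare swap array list element",
    "Recursion": "recursive function call base case stack",
    "File I/O": "file open read write close stream",
}


def tokens(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def build_interpreter(concepts):
    """Build an inverted index: term -> {concept name: TF-IDF weight}."""
    n = len(concepts)
    counts = {name: Counter(tokens(body)) for name, body in concepts.items()}
    # Document frequency: in how many concept articles each term occurs.
    df = Counter(term for c in counts.values() for term in c)
    index = {}
    for name, c in counts.items():
        for term, tf in c.items():
            idf = math.log(n / df[term]) + 1.0
            index.setdefault(term, {})[name] = tf * idf
    return index


def interpret(text, index):
    """Project a text onto the concept space as a weighted concept vector."""
    vec = Counter()
    for term, tf in Counter(tokens(text)).items():
        for concept, weight in index.get(term, {}).items():
            vec[concept] += tf * weight
    return vec


def cosine(a, b):
    """Cosine similarity of two sparse concept vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


index = build_interpreter(CONCEPTS)
java_doc = "void sort(int[] array) { compare and swap each element }"
python_doc = "def sort(list): compare element, swap, return list in order"
io_doc = "open file, read stream, close file"

same_topic = cosine(interpret(java_doc, index), interpret(python_doc, index))
other_topic = cosine(interpret(java_doc, index), interpret(io_doc, index))
print(same_topic, other_topic)
```

In this toy setting the two sorting snippets map onto the same concept and score close to 1, while the file-handling snippet shares no concept with them and scores 0. A real interpreter built from tens of thousands of articles would yield much denser and noisier vectors.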
Identifier | oai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:428881 |
Date | January 2019 |
Creators | Všianský, Richard |
Source Sets | Czech ETDs |
Language | Czech |
Detected Language | English |
Type | info:eu-repo/semantics/masterThesis |
Rights | info:eu-repo/semantics/restrictedAccess |