Global ETD Search

1	Detecting code duplications in the NPM community Liu, Hanwen 09 September 2021 (has links) In the modern software development process, it has become a very mainstream practice to build software projects on top of third-party packages to simplify the development process. In this development method, it is quite common to copy existing code or files in other libraries instead of making regular calls. Although this approach can reduce the project's dependence on other libraries and make the project more streamlined, it also causes difficulties in maintenance and understanding. The ignorance of code duplication by third-party library community can even be exploited for malicious purpose, such as typo-squatting attack. This paper serves as a starting point to analyze the growing code duplication issues surrounding third-party open source packages, and what is the root cause of code duplication. In this paper, I conducted code duplication-related research based on some popular packages in the third-party open source packages community, the NPM community, by using the tokenizer tool and the code comparison tool to compute the code similarity, quantitatively analyzed the prevalence of code duplication in the NPM community, and did some related experiments based on this similarity. In the experiments, I found that code duplication is very common in NPM community: 17.1% of all the files have 1-93 similar file in other package when the threshold of similar file is set to 0.5. 29.3% of all the packages has at least one "similar package" when the threshold of similar package is set to 0.5. In all the 951 similar package pairs, 33.9% of them, 323 package pairs comes from the same domain. The ultimate goal of this paper is to promote the awareness of the commonness and the importance of code duplication in the third-party package community and the reasonable use of code duplication by developers in the project development. / In the modern software development process, developers often call other people's completed code to build their own programs. There are generally two ways to do this: indirectly call other people's code through "import" or similar instructions in the program, or directly copy and paste other people's code and make slight modifications. The second method can make the program more independent and easy to use, but the code duplication problem caused by this method also has great security risks.This paper serves as a starting point to analyze the growing code duplication issues, and what is the root cause of code duplication. In this paper, I conducted code duplication-related research based on some popular code packages in the NPM community.I used some tools to compute a value to define how different codes are similar to each other, quantitatively analyzed the prevalence of code duplication in the NPM community, and did some related experiments based on this similarity. In the experiments, I found that code duplication is very common in the NPM community: 17.1% of all the files have 1-93 similar file in other package, and 29.3% of all the package have at least one "similar package", when the definition of similar files and packages are not that "strict".In all the 951 similar package pairs, 33.9% of them, 323 package pairs comes from the same domain. The ultimate goal of this paper is to promote the awareness of the commonness and the importance of code duplication in the third-party package community and the reasonable use of code duplication by developers in the project development. Code duplication NPM Clustering Code Similarity
2	CodeReco - A Semantic Java Method Recommender January 2017 (has links) abstract: The increasing volume and complexity of software systems and the growing demand of programming skills calls for efficient information retrieval techniques from source code documents. Programming related information seeking is often challenging for users facing constraints in knowledge and experience. Source code documents contain multi-faceted semi-structured text, having different levels of semantic information like syntax, blueprints, interfaces, flow graphs, dependencies and design patterns. Matching user queries optimally across these levels is a major challenge for information retrieval systems. Code recommendations can help information seeking and retrieval by pro-actively sampling similar examples based on the users context. These recommendations can be beneficial in improving learning via examples or improving code quality by sampling best practices or alternative implementations. In this thesis, an attempt is made to help programming related information seeking processes via pro-active code recommendations, and information retrieval processes by extracting structural-semantic information from source code. I present CodeReco, a system that recommends semantically similar Java method samples. Conventional code recommendations found in integrated development environments are primarily driven by syntactical compliance and auto-completion, whereas CodeReco is driven by similarities in use of language and structure-semantics. Methods are transformed to a vector space model and a novel metric of similarity is designed. Features in this vector space are categorized as belonging to types signature, structure, concept and language for user personalization. Offline tests show that CodeReco recommendations cover broader programming concepts and have higher conceptual similarity with their samples. A user study was conducted where users rated Java method recommendations that helped them icomplete two programming problems. 61.5% users were positive that real time method recommendations are helpful, and 50% reported this would reduce time spent in web searches. The empirical utility of CodeReco’s similarity metric on those problems was compared with a purely language based similarity metric (baseline). Baseline received higher ratings from novices, arguably due to lack of structure-semantics in their samples while seeking recommendations. / Dissertation/Thesis / Masters Thesis Computer Science 2017 Computer science Information science Code Recommender Code Similarity
3	The impact of task specification on code generated via ChatGPT Lundblad, Jonathan, Thörn, Edwin, Thörn, Linus January 2023 (has links) ChatGPT has made large language models more accessible and made it possible to code using natural language prompts. This study conducted an experiment comparing prompt engineering techniques called task specification and investigated their impacton code generation in terms of correctness and variety. The hypotheses of this study focused on whether the baseline method had a statistically significant difference in code correctness compared to the other methods. Code is evaluated using a software requirement specification that measures functional and syntactical correctness. Additionally, code variance is measured to identify patterns in code generation. The results show that there is a statistically significant difference in some code correctness criteria between the baseline and the other task specification methods, and the code variance measurements indicate a variety in the generated solutions. Future work could include using another large language model; different programming tasks andprogramming languages; and other prompt engineering techniques. Code generation task specification prompt engineering ChatGPT human evaluation code similarity Information Systems, Social aspects
4	Extracting group relationships within changing software using text analysis Green, Pamela Dilys January 2013 (has links) This research looks at identifying and classifying changes in evolving software by making simple textual comparisons between groups of source code files. The two areas investigated are software origin analysis and collusion detection. Textual comparison is attractive because it can be used in the same way for many different programming languages. The research includes the first major study using machine learning techniques in the domain of software origin analysis, which looks at the movement of code in an evolving system. The training set for this study, which focuses on restructured files, is created by analysing 89 software systems. Novel features, which capture abstract patterns in the comparisons between source code files, are used to build models which classify restructured files fromunseen systems with a mean accuracy of over 90%. The unseen code is not only in C, the language of the training set, but also in Java and Python, which helps to demonstrate the language independence of the approach. As well as generating features for the machine learning system, textual comparisons between groups of files are used in other ways throughout the system: in filtering to find potentially restructured files, in ranking the possible destinations of the code moved from the restructured files, and as the basis for a new file comparison tool. This tool helps in the demanding task of manually labelling the training data, is valuable to the end user of the system, and is applicable to other file comparison tasks. These same techniques are used to create a new text-based visualisation for use in collusion detection, and to generate a measure which focuses on the unusual similarity between submissions. This measure helps to overcome problems in detecting collusion in data where files are of uneven size, where there is high incidental similarity or where more than one programming language is used. The visualisation highlights interesting similarities between files, making the task of inspecting the texts easier for the user. 005

1

Page generated in 0.0582 seconds