Return to search

Detecting code duplications in the NPM community

In the modern software development process, it has become a very mainstream practice to build software projects on top of third-party packages to simplify the development process. In this development method, it is quite common to copy existing code or files in other libraries instead of making regular calls. Although this approach can reduce the project's dependence on other libraries and make the project more streamlined, it also causes difficulties in maintenance and understanding. The ignorance of code duplication by third-party library community can even be exploited for malicious purpose, such as typo-squatting attack. This paper serves as a starting point to analyze the growing code duplication issues surrounding third-party open source packages, and what is the root cause of code duplication. In this paper, I conducted code duplication-related research based on some popular packages in the third-party open source packages community, the NPM community, by using the tokenizer tool and the code comparison tool to compute the code similarity, quantitatively analyzed the prevalence of code duplication in the NPM community, and did some related experiments based on this similarity. In the experiments, I found that code duplication is very common in NPM community: 17.1% of all the files have 1-93 similar file in other package when the threshold of similar file is set to 0.5. 29.3% of all the packages has at least one "similar package" when the threshold of similar package is set to 0.5. In all the 951 similar package pairs, 33.9% of them, 323 package pairs comes from the same domain. The ultimate goal of this paper is to promote the awareness of the commonness and the importance of code duplication in the third-party package community and the reasonable use of code duplication by developers in the project development. / In the modern software development process, developers often call other people's completed code to build their own programs. There are generally two ways to do this: indirectly call other people's code through "import" or similar instructions in the program, or directly copy and paste other people's code and make slight modifications. The second method can make the program more independent and easy to use, but the code duplication problem caused by this method also has great security risks.This paper serves as a starting point to analyze the growing code duplication issues, and what is the root cause of code duplication. In this paper, I conducted code duplication-related research based on some popular code packages in the NPM community.I used some tools to compute a value to define how different codes are similar to each other, quantitatively analyzed the prevalence of code duplication in the NPM community, and did some related experiments based on this similarity. In the experiments, I found that code duplication is very common in the NPM community: 17.1% of all the files have 1-93 similar file in other package, and 29.3% of all the package have at least one "similar package", when the definition of similar files and packages are not that "strict".In all the 951 similar package pairs, 33.9% of them, 323 package pairs comes from the same domain. The ultimate goal of this paper is to promote the awareness of the commonness and the importance of code duplication in the third-party package community and the reasonable use of code duplication by developers in the project development.

Identiferoai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/104972
Date09 September 2021
CreatorsLiu, Hanwen
ContributorsComputer Science and Applications
PublisherVirginia Tech
Source SetsVirginia Tech Theses and Dissertation
Detected LanguageEnglish
TypeThesis
FormatETD, application/pdf
RightsIn Copyright, http://rightsstatements.org/vocab/InC/1.0/

Page generated in 0.0019 seconds