Context. Software defect prediction is essential in reducing software development costs and in helping companies protect their reputation. Defect prediction uses mathematical models to identify patterns associated with defects within code. The resources spent reviewing an entire codebase can be minimised by focusing on the parts of the code predicted to be defective. Recent findings suggest many published prediction models may not be reliable. Replication and reproduction are critical scientific methods for identifying reliable research. Replication can test the external validity of studies, while reproduction can test their internal validity.

Aims. My dissertation has two aims: first, to study the use and quality of replications and reproductions in defect prediction; second, to identify factors that aid or hinder these scientific methods.

Methods. My methodology is based on tracking the replication of 208 defect prediction studies identified in a highly cited Systematic Literature Review (SLR) [Hall et al. 2012]. I analyse how often each of these 208 studies has been replicated and determine the type of replication carried out. I use the quality, citation counts, publication venue, impact factor, and data availability of all 208 papers to see whether any of these factors is associated with how often a paper is replicated. I further reproduce the original studies that have been replicated in order to check their internal validity. Finally, I identify factors that affect reproducibility.

Results. Only 13 (6%) of the 208 studies have been replicated, and most of the replications fail a quality check. Of the 13 replicated original studies, 62% agree with their replications and 38% disagree. The main feature associated with a study being replicated is publication in the Transactions on Software Engineering (TSE) journal. The number of citations an original paper has received is also an indicator of the probability of its being replicated. In addition, studies conducted using closed source data have more replications than those based on open source data. For 4 of the 5 papers I reproduced, my results differed from those of the original by more than 5%. Four factors are likely to have caused these failures: i) the lack of a single version of the data used by the original study; ii) differences in the properties of the available dataset versions, which affect model performance; iii) unreported data preprocessing; and iv) inconsistent results from alternative versions of the same tools.

Conclusions. Very few defect prediction studies are replicated. The lack of replication and the failure of reproduction mean that it remains unclear how reliable defect prediction is. Further investigation into this failure identifies key aspects researchers need to consider when designing primary studies and when performing replication and reproduction studies. Finally, I provide practical steps for improving the likelihood of replication and the chances of validating a study by reporting key factors.
Identifier | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:762027 |
Date | January 2018 |
Creators | Mahmood, Zaheed |
Publisher | University of Hertfordshire |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | http://hdl.handle.net/2299/20826 |