Researchers use datasets of Question-Solution pairs to train machine learning models, such as source code generation models. A Question-Solution pair contains two parts: a programming question and its corresponding Solution Snippet. A Solution Snippet is a source code that solves a programming question. These datasets of Question-Solution pairs can be extracted from a number of different platforms. In this research, I study how Question-Solution pairs are extracted from Stack Overflow (SO). There are two limitations of datasets of Question-Solution pairs extracted from SO: (1) according to the authors of these datasets, some Question-Solution pairs contain Solution Snippets that do not solve the question correctly, and (2) these datasets do not contain the information on how Solution Snippets need to be reused, and such information would enhance the reusability of Solution Snippets. These limitations of datasets of pairs could adversely affect the quality of the code being generated by machine learning models. In this research, I conducted a qualitative study to categorize various presentations of Solution Snippets in SO’s answers as well as how Solution Snippets can be adapted for reuse. By doing so, I identified eight categories of how Solution Snippets are presented in SO’s answers and five categories of how Solution Snippets could be adapted. Based on these results, I concluded several potential reasons why it is not easy to create datasets of Question-Solution pairs. The first categorization informs that finding the correct location of the Solution Snippet is challenging when there are several code blocks within the answer to the question. Subsequently, the researcher must identify which code within that code block is the Solution Snippet. The second categorization informs that most Solution Snippets appear challenging to be adapted for reuse, and how Solution Snippets are potentially adapted is not explicitly stated in them. These insights shed light on creating better quality datasets from questions and answers posted on Stack Overflow. / Graduate
Identifer | oai:union.ndltd.org:uvic.ca/oai:dspace.library.uvic.ca:1828/13806 |
Date | 22 March 2022 |
Creators | Weeraddana, Nimmi Rashinika |
Contributors | German, Daniel M |
Source Sets | University of Victoria |
Language | English, English |
Detected Language | English |
Type | Thesis |
Format | application/pdf |
Rights | Available to the World Wide Web |
Page generated in 0.0017 seconds