Analysis of Security Findings and Reduction of False Positives through Large Language Models

This thesis investigates the integration of state-of-the-art (SOTA) Large Language Models (LLMs) into the process of reassessing security findings generated by Static Application Security Testing (SAST) tools. The primary objective is to determine whether LLMs can detect false positives (FPs) while maintaining a high true positive (TP) rate, thereby enhancing the efficiency and effectiveness of security assessments.

Four consecutive experiments were conducted, each addressing specific research questions. The first experiment, using a dataset of security findings extracted from the OWASP Benchmark, identified the combination of context items provided by the SAST tool SpotBugs that, when used with GPT-3.5 Turbo, best reduced FPs while minimizing the loss of TPs. The second experiment, conducted on the same dataset, demonstrated that advanced prompting techniques, particularly few-shot Chain-of-Thought (CoT) prompting combined with Self-Consistency (SC), further improved the reassessment. The third experiment compared proprietary and open-source LLMs on an OWASP Benchmark dataset about one-fourth the size of the previously used one. GPT-4o achieved the highest performance, detecting 80 of 128 FPs without missing any TPs, yielding a perfect TPR of 100% and an FPR reduction of 41.27 percentage points. Llama 3.1 70B detected 112 of the 128 FPs but missed 10 TPs, yielding a TPR of 94.94% and an FPR reduction of 56.62 percentage points. To validate these findings in a real-world context, the approach was applied to a dataset generated from the open-source project Mnestix using multiple SAST tools. GPT-4o again emerged as the top performer, detecting 26 of 68 FPs while missing only one TP, lowering the TPR by 2.22 percentage points but the FPR by 37.57 percentage points.
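To make the summarized approach concrete, the sketch below illustrates few-shot Chain-of-Thought prompting with Self-Consistency voting over a single SAST finding, together with the TPR/FPR definitions behind the percentage-point figures above. It is a minimal illustration under stated assumptions, not the thesis's implementation: the names Finding, query_llm, and build_prompt, the prompt wording, and the few-shot examples are all hypothetical, whereas the actual workflow (per Appendix C) builds on Pydantic-configured LLM clients.

    from collections import Counter
    from dataclasses import dataclass

    @dataclass
    class Finding:
        rule: str     # e.g. the SpotBugs bug pattern
        message: str  # the SAST tool's finding message
        snippet: str  # source context around the flagged line

    # Hypothetical few-shot CoT examples: each shows reasoning steps
    # that end in an explicit verdict.
    FEW_SHOT_COT = (
        "Finding: SQL_INJECTION ...\n"
        "Reasoning: user input is concatenated into the query unescaped.\n"
        "Verdict: TRUE_POSITIVE\n\n"
        "Finding: XSS ...\n"
        "Reasoning: the value is HTML-encoded before it is rendered.\n"
        "Verdict: FALSE_POSITIVE\n"
    )

    def query_llm(prompt: str, temperature: float) -> str:
        """Placeholder for a real LLM call; the thesis wraps its
        clients in a Pydantic-configured BaseClient (Appendix C)."""
        raise NotImplementedError

    def build_prompt(finding: Finding) -> str:
        # Few-shot CoT prompt: examples first, then the finding plus
        # its context items, ending with a cue to reason step by step.
        return (
            "You reassess SAST findings. Think step by step, then end "
            "with 'Verdict: TRUE_POSITIVE' or 'Verdict: FALSE_POSITIVE'.\n\n"
            + FEW_SHOT_COT
            + f"\nFinding: {finding.rule}\nMessage: {finding.message}\n"
            + f"Code:\n{finding.snippet}\nReasoning:"
        )

    def reassess(finding: Finding, samples: int = 5) -> str:
        """Self-Consistency: sample several CoT completions at a
        nonzero temperature and majority-vote over the verdicts."""
        votes = Counter(
            "FALSE_POSITIVE"
            if "FALSE_POSITIVE" in query_llm(build_prompt(finding), 0.7)
            else "TRUE_POSITIVE"
            for _ in range(samples)
        )
        return votes.most_common(1)[0][0]

    def tpr_fpr(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
        """TPR = TP / (TP + FN); FPR = FP / (FP + TN)."""
        return tp / (tp + fn), fp / (fp + tn)

Findings voted FALSE_POSITIVE would be filtered out of the report; comparing TPR and FPR before and after this filtering is what produces percentage-point deltas like those reported above.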
Table of Contents

List of Figures
List of Tables
List of Source Codes
List of Abbreviations
1. Motivation
2. Background
3. Related Work
4. Concept
5. Preparing a Security Findings Dataset
6. Implementing a Workflow
7. Identifying Context Items
8. Comparing Prompting Techniques
9. Comparing Large Language Models
10. Evaluating Developed Approach
11. Discussion
12. Conclusion
A. Appendix: Figures
A.1. Repository Directory Tree
A.2. Precision-Recall Curve of Compared Large Language Models
A.3. Performance Metrics Self-Consistency on Mnestix Dataset
B. Appendix: Tables
B.1. Design Science Research Concept
C. Appendix: Code
C.1. Pydantic Base Config Documentation
C.2. Pydantic LLM Client Config Documentation
C.3. LLM BaseClient Class
C.4. Test Cases Removed From Dataset

Identifier: oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:94141
Date: 18 October 2024
Creators: Wagner, Jonas
Contributors: Hochschule für Technik, Wirtschaft und Kultur Leipzig
Source Sets: Hochschulschriftenserver (HSSS) der SLUB Dresden
Language: English
Detected Language: English
Type: info:eu-repo/semantics/acceptedVersion, doc-type:masterThesis, info:eu-repo/semantics/masterThesis, doc-type:Text
Rights: info:eu-repo/semantics/openAccess