The integration of large language models (LLMs) with third party applications has allowed for LLMs to retrieve information from up-to-date or specialized resources. Although this integration offers numerous advantages, it also introduces the risk of indirect prompt injection attacks. In such scenarios, an attacker embeds malicious instructions within the retrieved third party data, which when processed by the LLM, can generate harmful and untruthful outputs for an unsuspecting user. Although previous works have explored how these attacks manifest, there is no benchmarking framework to evaluate indirect prompt injection attacks and defenses at scale, limiting progress in this area. To address this gap, we introduce InjectBench, a framework that empowers the community to create and evaluate custom indirect prompt injection attack samples. Our study demonstrate that InjectBench has the capabilities to produce high quality attack samples that align with specific attack goals, and that our LLM evaluation method aligns with human judgement. Using InjectBench, we investigate the effects of different components of an attack sample on four LLM backends, and subsequently use this newly created dataset to do preliminary testing on defenses against indirect prompt injections. Experiment results suggest that while more capable models are susceptible to attacks, they are better equipped at utilizing defense strategies. To summarize, our work helps the research community to systematically evaluate features of attack samples and defenses by introducing a dataset creation and evaluation framework. / Master of Science / Large language models (LLMs), such as ChatGPT, are now able to retrieve up-to-date information from online resources like Google Flights or Wikipedia. This ultimately allows the LLM to utilize current information to generate truthful, helpful and accurate responses. Despite the numerous advantages, it also exposes a user to a new vector of attacks known as indirect prompt injections. In this attack, an attacker will write a instruction onto an online resource that an LLM will process when retrieved from the online resource. The primary aim of the attacker is to instruct the LLM to say something it is not supposed to, and thus may manifest as a blatant lie or misinformation given to the user. Prior works have studied and showcased the harmfulness of this attack, however not many works have tried to understand which LLMs are more vulnerable to indirect prompt injection attacks and how we may defend from them. We believe that this is mainly due to the non-availability of a benchmarking dataset which allows us to test LLMs and new defenses. To address this gap, we introduce InjectBench, a methodology that allows the automated creation of these benchmarking datasets, and the evaluation of LLMs and defenses. We show that InjectBench can produce a high quality dataset that we can customize to specific attack goals, and that our evaluation process is accurate and agrees with human judgement. Using the benchmarking dataset created from InjectBench, we evaluate four LLMs and investigate defenses for indirect prompt injection attacks.
Identifer | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/120973 |
Date | 20 August 2024 |
Creators | Kong, Nicholas Ka-Shing |
Contributors | Computer Science and#38; Applications, Viswanath, Bimal, Yao, Danfeng, Gao, Peng |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Language | English |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |
Page generated in 0.002 seconds