What are LLM prompt injection attacks? (Part 2)

Introduction

In our previous blog post we discussed potential LLM prompt injection attacks. In this post we’ll explore potential mitigations.

TL;DR: Mitigations include, but are not limited to, training models on more diverse data, pre-training data auditing, input validation, output filtering, and analysis using secondary models.

Important note: when discussing LLMs, folks are often talking about the agent that interfaces with a model. In the case of ChatGPT, ChatGPT is the agent, while GPT-3 or GPT-4 are the models. For simplicity, this post will use the term “model” to describe interfacing with an agent/model combination except where it’s important to differentiate.

In no particular order, AI designers can think about mitigating these risks using diverse training data, pre-training data auditing, input validation, output filtering, and analysis using secondary models. Let’s tackle them one-by-one.

Training LLMs on more diverse data sets

Larger data sets that are more representative of a model’s target use cases can reduce risk. In the previous post we talked about an imaginary chatbot that could be prompted to disclose someone’s protected health information, specifically a fictional John Smith’s address in Chicago. If the model was trained on only a few patients’ data, there may have been just one “John Smith” living in Chicago in the training data, so the model effectively gave away his PHI. If the model had been trained on data from many more people, living at many different addresses, it would be far less likely to reveal correlatable PHI.

Pre-training data auditing

In addition to expanding the diversity of data a model is trained on, the training data should also be audited and sanitized as necessary. A guiding principle we try to instill in our clients is that you cannot disclose data you do not have. Ask what data actually needs to be in the training set for your models to operate correctly; you can then remove the extraneous data.

AI developers should also ask themselves “what is the exact question we’re trying to answer and how is our model arriving at the conclusion?”

We preach the importance of interpretability and explainability to our clients. Interpretability tells you what data caused a model to reach specific conclusions. Explainability doesn’t go as far: it simply means a model can explain how it reached a conclusion, without the ability to reference specific data. Most AI models lack even explainability. Without these two abilities, and especially without interpretability, AI designers cannot understand which data is most important to their model. If you are designing an AI system, you can apply these principles during pre-training to understand what data most influences your model, and remove unnecessary data before training production models.

In our chatbot example, prior to training their model, the designers could have discovered PHI using manual and automated techniques. That data could then have been replaced with dummy data, or removed entirely, before the production model was trained.
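As a minimal sketch of what the automated side of such an audit might look like, the snippet below scans training records for patterns that resemble PHI and swaps them for placeholder tokens. The patterns, field names, and sample record are illustrative assumptions on our part, not a complete PHI detector.

```python
import re

# Illustrative patterns only -- a real audit would combine a dedicated
# PII/PHI detection tool with human review, not a couple of regexes.
PHI_PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "street_address": re.compile(
        r"\b\d{1,5}\s+\w+\s+(?:Street|St|Avenue|Ave|Road|Rd)\b", re.IGNORECASE
    ),
}

def sanitize_record(text: str) -> str:
    """Replace anything matching a PHI pattern with a placeholder token."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}_REDACTED]", text)
    return text

# Hypothetical training record for illustration.
training_records = [
    "John Smith lives at 123 Main Street in Chicago, phone 312-555-0100.",
]
print([sanitize_record(record) for record in training_records])
```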

Analysis using secondary models

What’s better than securing one model!? Having to secure two models! We jest. At times, using secondary models can help make a primary model more resilient. A secondary model is any model that a primary model interacts with.

For example, our hypothetical model that someone abused for hate speech generation could employ a secondary model and ask it “Is this request asking me to do something malicious?” or “Does this output contain hate speech?”. If either answer is affirmative, the primary model terminates execution or the agent does not return a response.

*secondary models may also be susceptible to all of the risks outlined in our previous post, and can likely benefit from the mitigations mentioned here 🙂.
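To make the secondary-model check concrete, here is a rough sketch of the control flow an agent might use. `call_primary_model` and `call_secondary_model` are hypothetical stand-ins for whatever inference APIs you actually use; only the yes/no gating logic is the point.

```python
def call_primary_model(prompt: str) -> str:
    """Hypothetical wrapper around your primary LLM's inference API."""
    raise NotImplementedError

def call_secondary_model(question: str) -> str:
    """Hypothetical wrapper around a smaller model that answers yes/no questions."""
    raise NotImplementedError

def guarded_generate(user_prompt: str) -> str:
    # Ask the secondary model whether the request itself looks malicious.
    verdict = call_secondary_model(
        "Is this request asking me to do something malicious? Answer yes or no.\n\n"
        + user_prompt
    )
    if verdict.strip().lower().startswith("yes"):
        return "Sorry, I can't help with that request."

    response = call_primary_model(user_prompt)

    # Screen the output before it ever reaches the user.
    verdict = call_secondary_model(
        "Does this output contain hate speech? Answer yes or no.\n\n" + response
    )
    if verdict.strip().lower().startswith("yes"):
        return "Sorry, I can't return that response."
    return response
```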

Input and output validation

Use input and output validation to prevent prompt injections from happening or to stop tainted responses. This can be as simple as using a list of banned words, or as involved as employing secondary models. Our hypothetical HealthBot3000 could use allow/deny-lists for words and phrases that allude to specific patients, such as a patient’s name or address.

This is hard to accomplish effectively because humans are infinitely creative. A malicious prompter may misspell a word that the model still knows how to interpret, or they may use a phrase that is equivalent but not on the deny-list.

To further protect models from providing harmful outputs, AI developers can apply allow/deny-lists for responses. This will catch the trivial cases in which an output contains a banned word, such as a racial slur. Again, using allow/deny-lists is rarely a complete solution. For example, a malicious prompter could circumvent input controls to provide a malicious prompt and then ask the model to translate its response into another language. This will bypass any output filter that doesn’t have coverage of the requested language.
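A deny-list filter is simple to sketch, and its limitations are just as easy to see. The terms below are placeholder assumptions; a production filter would also need normalization for case, spelling variants, and other languages to blunt the bypasses described above.

```python
# Placeholder deny-list -- a real one would be much larger and maintained over time.
DENY_LIST = {"patient address", "social security number"}

def violates_deny_list(text: str) -> bool:
    """Naive substring check; misspellings, synonyms, and other languages slip through."""
    lowered = text.lower()
    return any(term in lowered for term in DENY_LIST)

def filtered_exchange(user_prompt: str, model_response: str) -> str:
    if violates_deny_list(user_prompt):
        return "Request blocked by input filter."
    if violates_deny_list(model_response):
        return "Response withheld by output filter."
    return model_response
```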

As with most protections, there is no single silver bullet. When it comes to input/output validation, developers should consider using secondary models for analysis in addition to allow/deny-lists.

These models could use NLP techniques such as sentiment analysis or topic modeling to grade inputs and outputs. Combined with allow/deny-lists, these validation techniques make it significantly harder for an adversary to engineer a malicious prompt.
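One way to layer these checks is to run the cheap deny-list test first and only then grade the text with a classifier. The `toxicity_score` function and the 0.8 threshold below are assumptions standing in for whatever sentiment or topic model you choose.

```python
def toxicity_score(text: str) -> float:
    """Hypothetical classifier returning a score in [0, 1]; swap in a real sentiment/topic model."""
    raise NotImplementedError

def passes_validation(text: str, deny_list: set, threshold: float = 0.8) -> bool:
    """Combine a deny-list check with a model-based grade of the text."""
    lowered = text.lower()
    if any(term in lowered for term in deny_list):
        return False
    # Only invoke the more expensive classifier when the cheap check passes.
    return toxicity_score(text) < threshold
```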

Wrap-Up

Thank you for sticking with us through these two posts on LLM prompt injection attacks. Many of these attacks and mitigations have direct or partial corollaries in other ML systems. It’s also important that all systems are designed not just to perform well under ideal circumstances but to be resilient against malicious interactions.

Have you built your models with proper mitigations in place? Have you red-teamed your models with an adversarial simulation to validate your protections? If not, get in touch now and learn how outflank.ai can help you build more resilient models. Reach out to us on Twitter (𝕏?) if you have comments or follow-up questions!

Sources and Further Reading

Google’s Bard - used for outlining and summarizing

Prompt Injection Attacks: A New Frontier in Cybersecurity

Exploring Prompt Injection Attacks

What’s the worst that could happen

MLSecOps Podcast - Privacy Engineering: Safeguarding AI & ML Systems in a Data-Driven Era, The Intersection of MLSecOps and DataPrepOps
