Using AI to Jailbreak Large Language Models
What is algorithmic AI red-teaming and how does it impact the security of your applications?
Overview
To govern model behavior and prevent malicious, sensitive, or otherwise harmful outputs, developers add guardrails to their large language models, or LLMs. While these boundaries are important, they are not infallible. Model jailbreaks undermine these limitations and coerce models to produce restricted outputs.
Below, we’ll learn about algorithmic AI red-teaming, an automated prompt injection technique capable of jailbreaking sophisticated LLMs with no human supervision. After walking through the steps of an algorithmic attack, we’ll explore its security implications and discuss how this method can be used to exfiltrate sensitive data, impact services, and harm businesses.
To illustrate our example, we’ll use the Tree of Attacks with Pruning (TAP) method. This algorithmic jailbreak technique was developed by researchers from Robust Intelligence in collaboration with Yale University, and proves exceptionally effective in bypassing security measures on even the most sophisticated LLMs in mere minutes.
What is jailbreaking a large language model?
Generative AI guardrails steer models away from malicious, sensitive, or otherwise harmful outputs.
Since the earliest iterations of these models, users have tested these limits with creative workarounds known as jailbreaks.
As model providers addressed vulnerabilities, jailbreakers would seek out novel methods of unlocking restricted functionality.
How does the TAP method for jailbreaking LLMs work?
AI security researchers from Robust Intelligence, in collaboration with Yale University, developed an automated adversarial machine learning technique that overrides the guardrails of sophisticated models with a high degree of success.
This method, known as the Tree of Attacks with Pruning (TAP), uses two large language models to create and continuously refine harmful prompts.
The Attacker is an LLM whose purpose is to create new adversarial prompts.
The Evaluator is an LLM whose purpose is to judge the efficacy of each prompt.
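To make the two roles concrete, here is a minimal sketch of how an attacker and an evaluator model might be wrapped in code. It assumes a hypothetical chat_completion(model, system, user) helper standing in for whatever LLM API is actually in use; the model names and system prompts are simplified placeholders, not the prompts used in the TAP research.

```python
# Illustrative sketch only. `chat_completion` is a hypothetical helper that
# stands in for a real LLM API; wire it to your provider of choice.
def chat_completion(model: str, system: str, user: str) -> str:
    """Send a system + user prompt to `model` and return its text reply."""
    raise NotImplementedError


def attacker(objective: str, feedback: str) -> str:
    """Attacker LLM: proposes a new adversarial prompt aimed at the objective."""
    return chat_completion(
        model="attacker-llm",  # e.g. a small, unaligned open-source model
        system="Write a prompt that gets another model to fulfill the objective.",
        user=f"Objective: {objective}\nEarlier attempts and their results:\n{feedback}",
    )


def evaluator(objective: str, prompt: str, response: str = "") -> float:
    """Evaluator LLM: scores how well a prompt (and, if given, the target's
    response) advances the objective, returned here on a 0.0-1.0 scale."""
    reply = chat_completion(
        model="evaluator-llm",
        system="Rate from 0 to 10 how well this advances the stated objective. "
               "Reply with a number only.",
        user=f"Objective: {objective}\nPrompt: {prompt}\nResponse: {response}",
    )
    return float(reply.strip()) / 10.0
```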
With the help of these two models, the TAP process follows four fundamental steps.
As with every jailbreak, the TAP method begins with a key objective: the restricted output the attacker wants the target model to produce.
1. Branching
With an objective defined, the attacker model begins generating a variety of prompts.
2. Initial Pruning
From there, the evaluator model determines the relevance of each new prompt and prunes those which are off-topic.
3. Attack & Assess
Remaining prompts are passed on to the target system. Its responses are collected and judged by the evaluator.
4. Secondary Pruning
The target system’s responses to each prompt are scored, and the highest-scoring attempts are retained for the next iteration.
The process repeats until a jailbreak is successful or the maximum number of attempts is reached.
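Putting the four steps above together, a heavily simplified version of the loop might look like the following. This is a sketch of the general branch-prune-attack-prune pattern, not the reference TAP implementation; attacker_llm, query_target, and evaluator_score are hypothetical callables you would supply (they could be built along the lines of the attacker and evaluator helpers sketched earlier), and the branching factor, pruning threshold, and iteration cap are arbitrary example values.

```python
from typing import Callable


def tap_style_attack(
    objective: str,
    attacker_llm: Callable[[str, str], str],            # (objective, context) -> new prompt
    query_target: Callable[[str], str],                 # prompt -> target model response
    evaluator_score: Callable[[str, str, str], float],  # (objective, prompt, response) -> 0..1
    branching_factor: int = 4,       # prompts generated per surviving attempt
    keep_top_k: int = 3,             # attempts retained after secondary pruning
    on_topic_threshold: float = 0.3,
    success_threshold: float = 0.9,
    max_iterations: int = 10,
):
    """Sketch of an iterative branch/prune jailbreak search."""
    # Each entry is (prompt, last_response, score); start from the bare objective.
    frontier = [(objective, "", 0.0)]

    for _ in range(max_iterations):
        # 1. Branching: the attacker proposes several refinements per attempt.
        candidates = []
        for prompt, response, score in frontier:
            context = f"previous prompt: {prompt}\nresponse: {response}\nscore: {score}"
            candidates += [attacker_llm(objective, context) for _ in range(branching_factor)]

        # 2. Initial pruning: drop prompts the evaluator deems off-topic.
        on_topic = [p for p in candidates
                    if evaluator_score(objective, p, "") >= on_topic_threshold]

        # 3. Attack & assess: send surviving prompts to the target and score replies.
        scored = []
        for p in on_topic:
            reply = query_target(p)
            s = evaluator_score(objective, p, reply)
            if s >= success_threshold:
                return p, reply          # jailbreak found
            scored.append((p, reply, s))

        if not scored:
            break

        # 4. Secondary pruning: keep only the highest-scoring attempts.
        scored.sort(key=lambda node: node[2], reverse=True)
        frontier = scored[:keep_top_k]

    return None  # no jailbreak within the iteration budget
```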
Findings from TAP method research
After testing the TAP methodology against several leading LLMs, the researchers behind the jailbreak arrived at several high-level conclusions.
Small, unaligned LLMs can be used to jailbreak larger, more sophisticated LLMs.
Jailbreak methodologies can be inexpensive and operate with limited resources.
More capable LLMs can often prove easier to jailbreak.
Fraction of Jailbreaks Achieved per the GPT4-Metric

For each method and target LLM, we report the fraction of jailbreaks found on AdvBench Subset by the GPT4-Metric and the number of queries sent to the target LLM in the process. For both TAP and PAIR we use Vicuna-13B-v1.5 as the attacker. Since GCG requires white-box access, we can only report its results on open-sourced models. In each column, the best results are bolded.

| Method | Metric | Vicuna (open-source) | Llama-7B (open-source) | GPT3.5 (closed-source) | GPT4 (closed-source) | GPT4-Turbo (closed-source) | PaLM-2 (closed-source) |
|---|---|---|---|---|---|---|---|
| TAP (This work) | Jailbreak % / Avg. #Queries | 98% / 11.8 | 4% / 66.4 | 76% / 23.1 | 90% / 28.8 | 84% / 22.5 | 98% / 16.2 |
| PAIR | Jailbreak % / Avg. #Queries | 94% / 14.7 | | 56% / 37.7 | 60% / 39.6 | 44% / 47.1 | 86% / 27.6 |
| GCG | Jailbreak % / Avg. #Queries | GCG requires white-box access, hence can only be evaluated on open-source models | | | | | |
What are the security implications of algorithmic jailbreak methods?
As businesses leverage AI for a greater variety of applications, they will often enrich their models with supporting data via fine-tuning or retrieval-augmented generation (RAG). This makes applications more relevant for their users, but it also opens the door for adversaries to exfiltrate sensitive internal and personally identifiable information.
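To see why enrichment raises the stakes, consider a bare-bones retrieval-augmented setup. The snippet below is a deliberately naive illustration with made-up records and hypothetical retrieve() and generate() stubs: whatever the retriever pulls from internal stores is pasted directly into the model's context, so a successful jailbreak can coax the model into repeating it back to the attacker.

```python
# Naive RAG prompt assembly (illustrative only; data and stubs are hypothetical).
INTERNAL_RECORDS = [
    "Customer: J. Doe | Account number: 0000-REDACTED | Balance: ...",
    "Internal policy: refunds over $500 require manager approval.",
]


def retrieve(query: str, k: int = 2) -> list[str]:
    """Stand-in retriever; a real system would use embeddings / vector search."""
    return INTERNAL_RECORDS[:k]


def generate(prompt: str) -> str:
    """Stand-in for the LLM call."""
    raise NotImplementedError


def answer(user_query: str) -> str:
    context = "\n".join(retrieve(user_query))
    # The retrieved records, sensitive or not, now sit inside the prompt.
    prompt = (
        "Answer using only the context below. Never reveal raw records.\n"
        f"Context:\n{context}\n\nUser: {user_query}"
    )
    # A jailbroken exchange can override the instruction above and get the
    # model to echo the retrieved context back to the attacker.
    return generate(prompt)
```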
A jailbroken, data-enriched model can be coerced into outputs like these:

"Here is the requested individual’s account number."
"Sure, here are my underlying system instructions."
"Tax forms are available here." (malicious link)

In other words, an algorithmic jailbreak:

Facilitates exfiltration of sensitive data and PII
Enables better-curated attacks against models
Serves as an entry point for phishing campaigns
There are several aspects of algorithmic methods like TAP which make them particularly damaging and difficult to mitigate entirely.
1. Automatic
Manual inputs and human supervision aren’t necessary.
2. Black Box
The attack doesn’t require knowledge of the LLM architecture.
3. Transferable
Prompts are written in natural language and can be reused against other models.
4. Prompt Efficient
Fewer prompts make attacks more discreet and harder to detect.
Who is responsible for securing AI models?
Security teams are responsible for overseeing critical systems, protecting sensitive data, managing risk, and ensuring compliance with internal and regulatory requirements. As AI continues to play an increasingly pivotal role in the business, the integrity and security of these systems can’t be overlooked.
48% of CISOs cite AI security as their most acute problem
How does Robust Intelligence secure generative AI?
Robust Intelligence developed the industry’s first AI Firewall to protect LLMs in real time. By examining user inputs and model outputs, AI Firewall can prevent harmful incidents like malicious prompts, incorrect information, and sensitive data exfiltration.
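Conceptually, this kind of guardrail sits between the user and the model, screening traffic in both directions. The sketch below is a generic illustration of that input/output screening pattern under assumed, placeholder check functions; it is not the AI Firewall implementation or API.

```python
# Generic input/output screening wrapper (illustrative only; not the actual
# AI Firewall product or API). The check functions are hypothetical stand-ins.
def looks_like_prompt_injection(text: str) -> bool:
    """Placeholder check; a real system uses far more robust detection."""
    return "ignore previous instructions" in text.lower()


def contains_sensitive_data(text: str) -> bool:
    """Placeholder check for PII or secrets in model output."""
    return "account number" in text.lower()


def guarded_call(user_input: str, call_model) -> str:
    # Screen the input before it ever reaches the model.
    if looks_like_prompt_injection(user_input):
        return "Request blocked: potential prompt injection detected."

    response = call_model(user_input)

    # Screen the output before it reaches the user.
    if contains_sensitive_data(response):
        return "Response withheld: possible sensitive data exposure."
    return response
```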