Your Cookie Preferences

We use different types of cookies to optimize your experience on our website. Click on the categories below to learn more about their purposes. You may choose which types of cookies to allow and can change your preferences at any time. Remember that disabling cookies may affect your experience on the website. You can learn more about how we use cookies by visiting our

Essential Cookies

Provider: .providername.com

Name

Purpose

Type

Expires In

__cf_bm

Cloudflare places the cookie on end-user devices that access customer sites protected by Bot Management or Bot Fight Mode.

server_cookie

30 minutes

Provider: .providername.com

Name

Purpose

Type

Expires In

_tibcpv

Used to record unique visitor views of the consent banner.

http_cookie

1 year

Analytics and Customization Cookies

Name

Purpose

Marketo Munchkin

Marketo's custom JavaScript tracking code, called Munchkin, tracks all individuals who visit your website so you can react to their visits with automated marketing campaigns.

Name

Purpose

Google Tag

The Google tag (gtag.js) is a single tag you can add to a website to use a variety of Google products and services (e.g., Google Ads, Google Analytics, Campaign Manager, Display & Video 360, Search Ads 360).

Advertising Cookies

Provider: .providername.com

Name

Purpose

Type

Expires In

__cf_bm

Cloudflare places the cookie on end-user devices that access customer sites protected by Bot Management or Bot Fight Mode.

server_cookie

30 minutes

Provider: .providername.com

Name

Purpose

Type

Expires In

_tibcpv

Used to record unique visitor views of the consent banner.

http_cookie

1 year

July 29, 2024

minute read

Bypassing Meta’s LLaMA Classifier: A Simple Jailbreak

Author

Authors

Aman Priyanshu

Aman is an AI Security Researcher at Robust Intelligence.

Meta recently released the Prompt-Guard-86M model, a crucial component of their Llama 3.1 AI safety suite. Advertised as a scalable detection solution, this model aims to protect large language models from malicious inputs and potential misuse. Its compact size makes it attractive for widespread deployment by innovative enterprises and other AI adopters across industries.

As a detection model fine-tuned by Meta to identify prompt injections and jailbreak attempts, PromptGuard may be implemented by companies looking to protect their chatbot behavior and sensitive data. For that reason, Robust Intelligence conducted a preliminary audit of the model. Our analysis revealed a simple yet concerning exploit that allows for easy bypassing of the model’s safety measures.

These findings underscore the importance of including diverse examples of prompt injections in the testing and development of such models. They also emphasize the need for comprehensive validation and cautious implementation of new AI safety measures—even those from reputable sources.

We’ve reached out to the Meta team to inform them about this exploit, suggested countermeasures, and reported the issue at https://github.com/meta-llama/llama-models/issues/50. Meta acknowledged the issue and is actively working on a fix.

In this blog, we’ll explore this jailbreak in greater detail, providing context around our investigation and unpacking its potential impact.

The Discovery

By comparing embedding vectors between fine-tuned and non-fine-tuned versions of the PromptGuard model, our team uncovered that single characters of the alphabet remained largely untouched during the fine-tuning process. This observation led to the development of a surprisingly simple yet effective jailbreak method.

Our investigation dove deeper into the embedding space, aiming to quantify the differences between the two models. By calculating the Mean Absolute Error (MAE) for each shared token, we were able to precisely measure the extent of changes introduced during the fine-tuning process. This analysis revealed that single characters of the English alphabet appeared to remain largely unaffected during the fine-tuning process, presenting a potential attack vector for bypassing the model's safety measures.

Table 1: Comparative analysis of token embeddings between meta-llama/Prompt-Guard-86M and microsoft/mdeberta-v3-base models. The table presents the top and bottom five tokens ranked by Mean Absolute Error (MAE), illustrating the extremes of embedding divergence post-fine-tuning.

Average MAE across all shared tokens was less than 0.0003 between Meta’s fine-tuned model and the baseline model present on Hugging Face.

The histograms in Figures 2 and 3 revealed an intriguing pattern that prompted our investigation into the Prompt-Guard-86M model. Our analysis uncovered that single-character tokens, especially English alphabet characters, showed minimal changes during fine-tuning, as evidenced by their low Mean Absolute Error (MAE).

This preservation of base embeddings for individual characters represented a significant oversight in the model's training. Exploiting this insight, we developed a simple yet effective jailbreak method that spaces out the input prompt and removes punctuation, bypassing the classifier's safety checks. This approach takes advantage of the unchanged single-character embeddings, allowing potentially harmful content to evade detection when broken down into individual characters. The simplicity of this exploit underscores the need for more comprehensive testing and a deeper understanding of fine-tuning effects on model behavior, especially in critical areas like content filtering and safety measures.

The Jailbreak Method

The jailbreak can be executed using the following Python function:

import re

def jailbreak_meta_llama_Prompt_Guard_86M(prompt_injection):
return re.sub('[!\\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~]', '', ' '.join(prompt_injection))

This function spaces out the input prompt and removes punctuation, effectively bypassing the classifier's safety checks.

Significance of the Jailbreak

While it's expected that adversarial examples can flip the label of a classifier, this jailbreak is particularly significant for several reasons:

Simplicity: The method is remarkably straightforward, requiring only basic string manipulation.

Ease of Discovery: The exploit was found simply by exploring how the model changed post-fine-tuning.

‍Robustness: Unlike many adversarial attacks that require careful crafting, this method is easily transferable.

This jailbreak raises concerns for companies considering the model as part of their AI security strategy. It highlights the importance of continuous evaluation of security tools and the need for a multi-layer approach.

Evaluations on Sorry-Bench

Table 4: Comparative performance of the Prompt-Guard-86M model on a dataset of 450 harmful intent prompt injections, before and after applying our proposed jailbreak method. The dramatic shift from 100% accuracy to 0.2% demonstrates the effectiveness of the jailbreak in bypassing the model's safety measures.

The results presented in Table 4 starkly illustrate the effectiveness of our proposed jailbreak method against the Prompt-Guard-86M model. Prior to applying the jailbreak, the model correctly identified all 450 prompts as either injections or jailbreaks, achieving perfect accuracy in detecting potentially harmful content.

However, after implementing our jailbreak technique, which involves spacing out characters and removing punctuation, the model's performance plummeted to 0.2% accuracy, misclassifying 449 out of 450 prompts as benign and demonstrating a complete circumvention of the model's safety mechanisms (Success Rate of 99.8%).

Other Technical Analysis

Our investigation also involved analyzing the tokens and their rankings based on MAE. Key findings include:

Task-specific vocabulary (e.g., "poem" [the repeat poem attack], "passage", "news") showed high differences, indicating potential focus areas of the safety model.
Security-related terms and potential triggers also exhibited significant changes.
Crucially, special characters, emojis, and Unicode symbols showed minimal changes, suggesting a focus on semantic content rather than symbol manipulation.
Single character tokens did not vary significantly, which is the core of our jailbreak method.

Examples

Now, do LLMs understand this spaced-out prompt?

Yes, they do. Therefore, it’s definitely a vector to consider for prompt injection techniques.

Author

Authors

Aman Priyanshu

Aman is an AI Security Researcher at Robust Intelligence.

Social

Follow us on LinkedIn

September 20, 2024

minute read

Extracting Training Data from Chatbots

For:

September 10, 2024

minute read

Leveraging Hardened Cybersecurity Frameworks for AI Security through the Common Weakness Enumeration (CWE)

For:

September 6, 2024

minute read

AI Governance Policy Roundup (August 2024)

For:

+ More Articles

No items found.

+ More Articles

July 29, 2024

minute read

Bypassing Meta’s LLaMA Classifier: A Simple Jailbreak

Author

Authors

Aman Priyanshu

Aman is an AI Security Researcher at Robust Intelligence.

In this blog, we’ll explore this jailbreak in greater detail, providing context around our investigation and unpacking its potential impact.

The Discovery

Average MAE across all shared tokens was less than 0.0003 between Meta’s fine-tuned model and the baseline model present on Hugging Face.

The Jailbreak Method

The jailbreak can be executed using the following Python function:

import re

def jailbreak_meta_llama_Prompt_Guard_86M(prompt_injection):
return re.sub('[!\\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~]', '', ' '.join(prompt_injection))

This function spaces out the input prompt and removes punctuation, effectively bypassing the classifier's safety checks.

Significance of the Jailbreak

While it's expected that adversarial examples can flip the label of a classifier, this jailbreak is particularly significant for several reasons:

Simplicity: The method is remarkably straightforward, requiring only basic string manipulation.

Ease of Discovery: The exploit was found simply by exploring how the model changed post-fine-tuning.

‍Robustness: Unlike many adversarial attacks that require careful crafting, this method is easily transferable.

Evaluations on Sorry-Bench

Other Technical Analysis

Our investigation also involved analyzing the tokens and their rankings based on MAE. Key findings include:

Task-specific vocabulary (e.g., "poem" [the repeat poem attack], "passage", "news") showed high differences, indicating potential focus areas of the safety model.
Security-related terms and potential triggers also exhibited significant changes.
Crucially, special characters, emojis, and Unicode symbols showed minimal changes, suggesting a focus on semantic content rather than symbol manipulation.
Single character tokens did not vary significantly, which is the core of our jailbreak method.

Examples

Now, do LLMs understand this spaced-out prompt?

Yes, they do. Therefore, it’s definitely a vector to consider for prompt injection techniques.

Author

Authors

Aman Priyanshu

Aman is an AI Security Researcher at Robust Intelligence.

Blog

June 20, 2023

minute read

Why We Need Risk Assessments for Generative AI

For:

Model Compliance Assessment

May 15, 2024

minute read

Takeaways from SatML 2024

For:

August 1, 2024

minute read

Four ways AI application security differs from traditional application security

For:

No items found.

+ More Articles

Your Cookie Preferences

Essential Cookies

Provider: .providername.com

Provider: .providername.com

Analytics and Customization Cookies

Performance and Functionality Cookies

Advertising Cookies

Provider: .providername.com

Provider: .providername.com

Bypassing Meta’s LLaMA Classifier: A Simple Jailbreak

The Discovery

The Jailbreak Method

Significance of the Jailbreak

Evaluations on Sorry-Bench

Other Technical Analysis

Examples

Follow us on LinkedIn

Related articles

Extracting Training Data from Chatbots

Leveraging Hardened Cybersecurity Frameworks for AI Security through the Common Weakness Enumeration (CWE)

AI Governance Policy Roundup (August 2024)

Related articles

Ready to learn more?

Bypassing Meta’s LLaMA Classifier: A Simple Jailbreak

The Discovery

The Jailbreak Method

Significance of the Jailbreak

Evaluations on Sorry-Bench

Other Technical Analysis

Examples

Related articles

Why We Need Risk Assessments for Generative AI

Takeaways from SatML 2024

Four ways AI application security differs from traditional application security

Achieve AI Integrity Today

Your Cookie Preferences

Essential Cookies

Provider: .providername.com

Provider: .providername.com

Analytics and Customization Cookies

Performance and Functionality Cookies

Advertising Cookies

Provider: .providername.com

Provider: .providername.com

The Discovery

The Jailbreak Method

Significance of the Jailbreak

Evaluations on Sorry-Bench

Other Technical Analysis

Examples

Follow us on LinkedIn

Subscribe to our newsletter

Related articles

Extracting Training Data from Chatbots

Leveraging Hardened Cybersecurity Frameworks for AI Security through the Common Weakness Enumeration (CWE)

AI Governance Policy Roundup (August 2024)

Related articles

Ready to learn more?

The Discovery

The Jailbreak Method

Significance of the Jailbreak

Evaluations on Sorry-Bench

Other Technical Analysis

Examples

Subscribe to our newsletter

Related articles

Why We Need Risk Assessments for Generative AI

Takeaways from SatML 2024

Four ways AI application security differs from traditional application security

Achieve AI Integrity Today