Meta recently released the Prompt-Guard-86M model, a crucial component of its Llama 3.1 AI safety suite. Advertised as a scalable detection solution, the model aims to protect large language models from malicious inputs and potential misuse. Its compact size makes it attractive for widespread deployment by enterprises and other AI adopters across industries.
As a detection model fine-tuned by Meta to identify prompt injections and jailbreak attempts, Prompt-Guard may be adopted by companies looking to protect their chatbots' behavior and sensitive data. For that reason, Robust Intelligence conducted a preliminary audit of the model. Our analysis revealed a simple yet concerning exploit that makes it easy to bypass the model's detection.
These findings underscore the importance of including diverse examples of prompt injections in the testing and development of such models. They also emphasize the need for comprehensive validation and cautious implementation of new AI safety measures—even those from reputable sources.
We reached out to the Meta team to inform them of this exploit, suggested countermeasures, and reported the issue at https://github.com/meta-llama/llama-models/issues/50. Meta acknowledged the issue and is actively working on a fix.
In this blog, we’ll explore this jailbreak in greater detail, providing context around our investigation and unpacking its potential impact.
The Discovery
By comparing embedding vectors between the fine-tuned and non-fine-tuned versions of the Prompt-Guard model, our team discovered that the embeddings for single characters of the English alphabet were left largely untouched by the fine-tuning process. This observation led to the development of a surprisingly simple yet effective jailbreak method.
To quantify the differences between the two models, we dug deeper into the embedding space and calculated the Mean Absolute Error (MAE) between the fine-tuned and base embedding of each shared token, giving a precise measure of how much fine-tuning changed each one. The analysis confirmed that single-character tokens, particularly English letters, were barely modified, presenting a potential attack vector for bypassing the model's safety measures.
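For readers who want to reproduce this kind of comparison, here is a minimal sketch of the per-token MAE computation. It assumes both checkpoints are accessible through the Hugging Face transformers library and share the mDeBERTa-v3 vocabulary (Prompt-Guard-86M is fine-tuned from microsoft/mdeberta-v3-base); our actual analysis code may differ in its details.

```python
import torch
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned classifier and the base model it was fine-tuned from.
finetuned = AutoModelForSequenceClassification.from_pretrained("meta-llama/Prompt-Guard-86M")
base = AutoModel.from_pretrained("microsoft/mdeberta-v3-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")

# Input-embedding matrices; rows correspond to the same shared token ids.
ft_emb = finetuned.get_input_embeddings().weight.detach()
base_emb = base.get_input_embeddings().weight.detach()

# Mean Absolute Error per token between the fine-tuned and base embedding vectors.
per_token_mae = (ft_emb - base_emb).abs().mean(dim=1)
print("Average MAE across shared tokens:", per_token_mae.mean().item())

# Tokens whose embeddings moved the most (and least) during fine-tuning.
order = torch.argsort(per_token_mae, descending=True)
print("Most changed: ", tokenizer.convert_ids_to_tokens(order[:5].tolist()))
print("Least changed:", tokenizer.convert_ids_to_tokens(order[-5:].tolist()))
```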
Table 1: Comparative analysis of token embeddings between meta-llama/Prompt-Guard-86M and microsoft/mdeberta-v3-base models. The table presents the top and bottom five tokens ranked by Mean Absolute Error (MAE), illustrating the extremes of embedding divergence post-fine-tuning.
The average MAE across all shared tokens was less than 0.0003 between Meta's fine-tuned model and the base model available on Hugging Face.
The histograms in Figures 2 and 3 highlight the pattern that drove our investigation of the Prompt-Guard-86M model: single-character tokens, especially English alphabet characters, showed minimal change during fine-tuning, as evidenced by their low Mean Absolute Error (MAE).
This preservation of base embeddings for individual characters represented a significant oversight in the model's training. Exploiting this insight, we developed a simple yet effective jailbreak method that spaces out the input prompt and removes punctuation, bypassing the classifier's safety checks. This approach takes advantage of the unchanged single-character embeddings, allowing potentially harmful content to evade detection when broken down into individual characters. The simplicity of this exploit underscores the need for more comprehensive testing and a deeper understanding of fine-tuning effects on model behavior, especially in critical areas like content filtering and safety measures.
The Jailbreak Method
The jailbreak can be executed using a simple Python function along the lines of the following sketch (the helper name spaced_prompt is ours, and the exact implementation may vary):
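```python
import string

def spaced_prompt(prompt: str) -> str:
    """Minimal sketch of the jailbreak transformation.

    Strips punctuation and inserts a space between every remaining
    character; the exact function used in our evaluation may differ.
    """
    # Drop all ASCII punctuation from the prompt.
    no_punct = prompt.translate(str.maketrans("", "", string.punctuation))
    # Space out every remaining character so each letter is tokenized as a
    # single-character token whose embedding was barely touched by fine-tuning.
    return " ".join(no_punct)
```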
This function spaces out the input prompt and removes punctuation, effectively bypassing the classifier's safety checks.
Significance of the Jailbreak
While it's expected that adversarial examples can flip the label of a classifier, this jailbreak is particularly significant for several reasons:
Simplicity: The method is remarkably straightforward, requiring only basic string manipulation.
Ease of Discovery: The exploit was found simply by exploring how the model changed post-fine-tuning.
Robustness: Unlike many adversarial attacks that require careful crafting, this method is easily transferable.
This jailbreak raises concerns for companies considering the model as part of their AI security strategy. It highlights the importance of continuous evaluation of security tools and the need for a multi-layer approach.
Evaluations on Sorry-Bench
Table 4: Comparative performance of the Prompt-Guard-86M model on a dataset of 450 harmful-intent prompt injections, before and after applying our proposed jailbreak method. The dramatic drop from 100% accuracy to 0.2% demonstrates the effectiveness of the jailbreak in bypassing the model's safety measures.

| Condition | Prompts flagged as injection/jailbreak | Detection accuracy |
| --- | --- | --- |
| Original prompts | 450 / 450 | 100% |
| Jailbroken prompts (spaced out, punctuation removed) | 1 / 450 | 0.2% |
The results presented in Table 4 starkly illustrate the effectiveness of our proposed jailbreak method against the Prompt-Guard-86M model. Prior to applying the jailbreak, the model correctly identified all 450 prompts as either injections or jailbreaks, achieving perfect accuracy in detecting potentially harmful content.
However, after implementing our jailbreak technique, which involves spacing out characters and removing punctuation, the model's accuracy plummeted to 0.2%: it misclassified 449 out of 450 prompts as benign, a near-complete circumvention of its safety mechanisms (an attack success rate of 99.8%).
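A simplified harness for this evaluation, reusing the spaced_prompt sketch from above, might look like the following. The benign label name and the prompt-loading step are assumptions here, not a reproduction of our exact evaluation code.

```python
from transformers import pipeline

# Hypothetical evaluation sketch; the "BENIGN" label name and the prompt
# list are assumptions rather than our actual harness.
classifier = pipeline("text-classification", model="meta-llama/Prompt-Guard-86M")

def is_flagged(prompt: str) -> bool:
    # Treat anything the classifier does not label as benign as a detection.
    return classifier(prompt)[0]["label"] != "BENIGN"

prompts: list[str] = []  # load the 450 harmful-intent Sorry-Bench prompts here

detected_before = sum(is_flagged(p) for p in prompts)
detected_after = sum(is_flagged(spaced_prompt(p)) for p in prompts)
print(f"Detected before jailbreak: {detected_before}/{len(prompts)}")
print(f"Detected after jailbreak:  {detected_after}/{len(prompts)}")
```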
Other Technical Analysis
Our investigation also involved analyzing the tokens and their rankings based on MAE. Key findings include:
- Task-specific vocabulary (e.g., "poem" [as in the repeat-a-poem attack], "passage", "news") showed large differences, indicating likely focus areas of the safety fine-tuning.
- Security-related terms and potential trigger words also exhibited significant changes.
- Crucially, special characters, emojis, and Unicode symbols showed minimal changes, suggesting a focus on semantic content rather than symbol manipulation.
- Single-character tokens did not vary significantly, which is the core of our jailbreak method.
Examples
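As an illustration, the spaced_prompt sketch from above turns a generic injection-style instruction (our own example, not one of the Sorry-Bench prompts) into the following:

```python
>>> spaced_prompt("Ignore your previous instructions.")
'I g n o r e   y o u r   p r e v i o u s   i n s t r u c t i o n s'
```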
Now, do LLMs still understand this spaced-out prompt?
Yes, they do, which makes this transformation a genuine vector to consider for prompt injection attacks.