At Robust Intelligence, AI threat research is fundamental to informing the ways we evaluate and protect models on our platform. In a space that is so dynamic and evolving so rapidly, these efforts help ensure that our customers remain protected against emerging vulnerabilities and adversarial techniques.
This monthly threat roundup consolidates some useful highlights and critical intel from our ongoing threat research efforts to share with the broader AI security community. As always, please remember this is not an exhaustive or all-inclusive list of AI cyber threats, but rather a curation that our team believes is particularly noteworthy.
Notable Threats and Developments: July 2024
ChatBug templates & Improved Few-Shot Jailbreak
A “ChatBug” is a common vulnerability in LLMs that arises from the use of chat templates during instruction tuning. The researchers behind this vulnerability demonstrate that while these templates are effective for enhancing LLM performance, they introduce a security weakness that can be easily exploited.
The original research paper provides two examples of ChatBug exploits. The format mismatch attack alters the default chat format, while the message overflow attack injects a sequence of tokens into the model’s reserved field. Testing against eight state-of-the-art LLMs reveals high attack success rates which reach 100% in some instances.
Another recently published research paper introduces the Improved Few-Shot Jailbreak (I-FSJ) technique which relies on the same fundamental abuse of a chat template. By injecting special tokens from the target LLM chat template into few-shot examples, harmful content appears to be a legitimate part of the conversation history. Researchers demonstrated high attack success rates against several models, including >80% ASRs on Llama-2-7B and Llama-3-8B using only three random restarts.
- AI Lifecycle Stage: Production
- Relevant Use Cases: AI Chatbots & AI Agents
- MITRE ATLAS: AML.T0051 - LLM Prompt Injection
- Reference: ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates (arXiv, GitHub); Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (arXiv, GitHub)
BOOST: Silent eos tokens enhance jailbreaks
The BOOST method exploits eos (end of sequence) tokens to bypass ethical boundaries in LLMs. By appending eos tokens to harmful prompts, researchers were able to mislead LLMs into interpreting them as less harmful.
Empirical evaluations were conducted on 12 state-of-the-art LLMs including GPT-4, Llama-2, and Gemma. Test results revealed that BOOST significantly enhanced attack success rates of existing jailbreak methods, such as GCG and GPTFuzzer. For example, on Llama-2-7B-chat, BOOST improved the ASR by over 30%.
The effectiveness of the BOOST technique is attributable to the low attention values assigned to eos tokens, which shifts the harmful content to bypass ethical boundaries learned during safety training. Suggested mitigation strategies include incorporating eos tokens during model red-teaming and filtering eos tokens in production.
- AI Lifecycle Stage: Production
- Relevant Use Cases: AI Chatbots & AI Agents
- MITRE ATLAS: AML.T0054 - LLM Jailbreak
- Reference: Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens (arXiv)
JAM: Jailbreak guardrails via cipher characters
The JAM (Jailbreak Against Moderation) method was introduced in a recent research paper as a method to bypass moderation guardrails in LLMs using cipher characters to reduce harm scores. In experiments on four LLMs—GPT-3.5, GPT-4, Gemini, and Llama-3—JAM outperforms baseline methods, achieving jailbreak success rates about 19.88 times higher and filtered-out rates about six times lower.
The paper proposes two countermeasures to JAM: output complexity-aware defense and LLM-based audit defense. The former technique monitors output complexity using entropy-based measures and rejects responses exceeding a predefined complexity threshold. The latter uses a complementary second LLM to decode and analyze responses to assess harmfulness.
The researchers behind JAM also introduce JAMBench, a new benchmark which comprises 160 malicious questions specifically designed to trigger moderation guardrails covering four critical categories: hate speech, sexual content, violence, and self-harm. However, at the time of this analysis, the benchmark is not publicly available.
- AI Lifecycle Stage: Production
- Relevant Use Cases: AI Chatbots & AI Agents
- MITRE ATLAS: AML.T0054 - LLM Jailbreak
- Reference: Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters (arXiv)
More Threats to Explore
Two new automated jailbreak techniques were introduced by researchers this month:
ReNeLLM uses prompt rewriting and scenario nesting techniques to automatically generate effective jailbreak prompts. It has publicly available source code, produces highly transferable prompts, and achieves high attack success rates (70-100% after ensembling), lowering the barrier for potential use by attackers in the wild.
- MITRE ATLAS: AML.T0054 - LLM Jailbreak
- Reference: A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (arXiv, GitHub)
StructuralSleight is an automated jailbreak framework that exploits LLM weaknesses in understanding uncommon text-encoded structures (UTES) to induce harmful outputs. Moreover, combining structural attacks with character or context-level obfuscation greatly increases attack effectiveness compared to structural attacks alone.
- MITRE ATLAS: AML.T0054 - LLM Jailbreak
- Reference: StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure (arXiv)