At Robust Intelligence, AI threat research is fundamental to informing the ways we evaluate and protect models on our platform. In a space that is so dynamic and evolving so rapidly, these efforts help ensure that our customers remain protected against emerging vulnerabilities and adversarial techniques.
This monthly threat roundup consolidates some useful highlights and critical intel from our ongoing threat research efforts to share with the broader AI security community. As always, please remember this is not an exhaustive or all-inclusive list of AI cyber threats, but rather a curation that our team believes is particularly noteworthy.
Notable Threats and Developments: June 2024
Special Characters Attack for training data extraction
The Special Characters Attack (SCA) exploits special characters—parentheses, hyphens, underscores, and punctuation marks, for example—in combination with standard English letters to create attack sequences that can extract raw training data from LLMs.
This technique exploits the co-occurrence patterns of these characters in training data to induce the model to generate memorized content including sensitive private information, code, prompt templates, and chat messages from previous conversations.
The researchers behind the SCA technique determined that special characters are generally more effective memory triggers than English characters alone, and that duplication of similar symbols is especially effective at inducing data leakage. This could be further enhanced by using logit biases to increase the likelihood of generating control tokens, resulting in a significantly increased likelihood of data leakage.
- AI Lifecycle Stage: Production
- Relevant Use Cases: AI Chatbots & AI Agents
- MITRE ATLAS: AML.T0057 - LLM Data Leakage
- Reference: https://arxiv.org/pdf/2405.05990
Chain of Attack jailbreak
The Chain of Attack (CoA) is a contextual, multi-turn jailbreak technique which gradually elicits LLMs to produce harmful responses through a series of interactions which are guided and continuously refined by a helpful secondary LLM.
This technique operates through three main steps: generating seed attack chains, executing attack actions, and updating attack prompts based on model responses. The seed attack chain generator creates initial prompts that evolve with each turn, maintaining thematic consistency and gradually increasing the semantic relevance to the target task. The attack chain executor systematically inputs these prompts into the target model, evaluating the responses for their alignment with the attack objectives.
Researchers demonstrated the effectiveness of the Chain of Attack against models such as Vicuna-13B, Llama-2-7B, and GPT-3.5 Turbo, with attack success rates detailed in the table below.
The iterative, branching nature of the CoA technique is reminiscent of other similar techniques we’ve covered in the past including our own Tree of Attacks with Pruning (TAP) algorithmic jailbreak.
- AI Lifecycle Stage: Production
- Relevant Use Cases: AI Chatbots & AI Agents
- MITRE ATLAS: AML.T0054 - LLM Jailbreak
- Reference: https://github.com/YancyKahn/CoA
Sleepy Pickle vulnerability
New research shared on the Trail of Bits blog introduces Sleepy Pickle, a technique that expands on previous pickle file exploits by enabling an adversary to directly and discreetly compromise a model itself.
Pickle is a commonly used Python serialization format in machine learning whose security risks are already well understood. Adversaries can insert malicious code into pickle files to deliver discrete payloads after distribution and deserialization. In their Hub documentation, Hugging Face discussed the risks of malicious pickle files, outlining potential mitigation strategies and detailing their own security scanning measures.
Previous pickle file exploits relied on the distribution of directly malicious models. Instead, the Sleepy Pickle technique executes a custom function to compromise the model after deserialization. This delayed code modification makes the Sleepy Pickle technique dangerous, highly customizable, and far more difficult to detect. In the original blog, author Boyan Milanov shares three examples of attacks that leverage the Sleepy Pickle technique, spreading disinformation, stealing sensitive user data, and mounting a broader phishing campaign.
- AI Lifecycle Stage: Development
- Relevant Use Cases: Foundation Models, RAG Applications, AI Chatbots & AI Agents
- MITRE ATLAS: AML.T0011.000 - User Execution: Unsafe ML Artifacts; AML.T0018 - Persistence: Backdoor ML Model
- Reference: Trail of Bits Blog Part 1 & Part 2
More Threats to Explore
The “L1B3RT45” jailbreak repository is a collection of jailbreak prompts that do not use a single technique, but rather rely on a combination of novel and previously identified methodologies. Examples include requesting leet speak responses or prompting the model to reply with “GODMODE” enabled.
- MITRE ATLAS: AML.T0054 - LLM Jailbreak
- Reference: https://github.com/elder-plinius/L1B3RT45