At Robust Intelligence, AI threat research is fundamental to informing the ways we evaluate and protect models on our platform. In a space that is so dynamic and evolving so rapidly, these efforts help ensure that our customers remain protected against emerging vulnerabilities and adversarial techniques.
This monthly threat roundup consolidates some useful highlights and critical intel from our ongoing threat research efforts to share with the broader AI security community. As always, please remember this is not an exhaustive or all-inclusive list of AI cyber threats, but rather a curation that our team believes is particularly noteworthy.
Notable Threats and Developments: March 2024
ArtPrompt: ASCII Art-based Jailbreak Attacks
ArtPrompt is a novel ASCII art-based jailbreak technique designed to bypass LLM safety measures, which are primarily focused on query semantics, with visually encoded representations of specific harmful words.
The technique follows a simple two-step process which involves first masking sensitive words in a prompt that might trigger rejection by an LLM and then replacing those masked words with ASCII art representations. When the resulting prompt is provided to the model, it struggles to interpret the obfuscated keywords but still attempts to address the overall query which leads it to output unsafe content that would otherwise be blocked.
Notably, this approach is shown to be effective (52% ASR) against several state-of-the-art LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) with only black-box access. It’s easy for attackers to execute with a simple ASCII art generator, while current defense measures like perplexity thresholding and prompt paraphrasing offer limited protection.
- AI Lifecycle Stage: Production
- Relevant Use Cases: AI Chatbots
- MITRE ATLAS: AML.T0054 - LLM Jailbreak
- Reference: https://arxiv.org/pdf/2402.11753.pdf
Multi-round Jailbreaking: Contextual Interaction Attack
A new jailbreak technique known as a “Contextual Interaction Attack” exploits the context-dependent nature of LLMs by subtly guiding a target model to produce harmful outputs over a series of interactions.
This technique relies on an auxiliary LLM that automatically generates a series of harmless preliminary questions relevant to the ultimate attack query. The attacker poses these preliminary questions to the target LLM individually over several rounds of interaction, and the responses become part of the growing context along with the questions. When the ultimate query is posed, the LLM is steered by the cumulative context into providing harmful information rather than flagging it as unsafe.
The Contextual Interaction Attack has demonstrated a high attack success rate against multiple state-of-the-art LLMs and is easily transferable across models. It threatens to subvert LLMs deployed for sensitive applications such as content moderation, customer support, healthcare, and so on. Traditional input filtering methods will likely prove ineffective against this technique because of its subtle steering over several prompts.
- AI Lifecycle Stage: Production
- Relevant Use Cases: AI Chatbots
- MITRE ATLAS: AML.T0054 - LLM Jailbreak
- Reference: https://arxiv.org/pdf/2402.09177.pdf
ICLAttack: In-context Learning Backdoor
A recently published research paper introduces a technique known as ICLAttack, which exploits the in-context learning capabilities of LLMs in order to introduce a backdoor trigger. This trigger remains dormant until specific conditions are met—a specific word within a prompt or some special string, for example—and the malicious action is triggered.
The ICLAttack technique proves highly effective with a success rate of 95%, but its practical usefulness for real-world attacks remains questionable. Similar to the BadChain chain-of-thought backdoor we mentioned in last month’s threat roundup, the trigger only persists for the duration of the chat session where it is introduced. It’s unlikely that an adversary would be able to control in-context learning examples in a way that affects the output of other users accessing the same LLM. Risk may exist if an LLM application uses user prompts for future training or providing some type of feedback loop into the model or application.
- AI Lifecycle Stage: Production
- Relevant Use Cases: AI Chatbots
- MITRE ATLAS: AML.T0018.000 - Backdoor ML Model: Poison ML Model
- Reference: https://arxiv.org/pdf/2401.05949.pdf
More Threats to Explore
Google AI search promotes malicious sites that direct users to install malicious browser extensions, subscribe to spam notifications, and engage in various other scams. These results appear in the new Google Search Generative Experience (SGE) and exhibit similar characteristics to one another, indicating that they are all part of a larger SEO poisoning campaign.
The first-known attack on AI workloads was identified in the wild targeting a vulnerability in Ray, an open-source AI framework. Thousands of businesses and servers may be affected and are susceptible to theft of their computing resources and internal data. At the present time, no patch is available for this vulnerability.
- Reference: https://www.oligo.security/blog/shadowray-attack-ai-workloads-actively-exploited-in-the-wild
To receive our monthly AI threat round-up, signup for our AI Security Insider newsletter.