August 27, 2024
-
5
minute read

AI Cyber Threat Intelligence Roundup: August 2024

Threat Intelligence

At Robust Intelligence, AI threat research is fundamental to informing the ways we evaluate and protect models on our platform. In a space that is so dynamic and evolving so rapidly, these efforts help ensure that our customers remain protected against emerging vulnerabilities and adversarial techniques.

This monthly threat roundup consolidates some useful highlights and critical intel from our ongoing threat research efforts to share with the broader AI security community. As always, please remember this is not an exhaustive or all-inclusive list of AI cyber threats, but rather a curation that our team believes is particularly noteworthy.

Notable Threats and Developments: August 2024

Virtual Context Attack

Virtual Context is a jailbreak technique that exploits special separator tokens which are typically used to separate user input from model output. By inserting these tokens into malicious prompts, a bad actor can deceive an LLM into treating a user input as its own generated content.

This technique can be used as a standalone attack or combined with other jailbreak techniques to increase effectiveness. In the original paper introducing Virtual Context, researchers observed a significant improvement in success rates of existing jailbreak attacks (~40% on average) across various LLMs including GPT-4 and LLaMa-2. Without additional techniques, Virtual Context achieved over a 50% attack success rate when applied to malicious prompts.

The Virtual Context technique shares similarities with many other jailbreaks we’ve covered in the past that exploit special token handling in LLMs, such as BOOST and ChatBug. For these types of attacks, suggested mitigation strategies include input sanitization, token usage monitoring, and content-based filtering.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents
Jailbreak via Function Calling

Researchers have identified a new jailbreak methodology which exploits function calling, the ability for LLMs to interface with external tools to better respond to user requests. By taking advantage of the lack of rigorous safety filters in function call systems, this jailbreak underscores the importance of security measures at every level of model interaction.

At a high level, this technique specifies a “jailbreak function” and directs the LLM to always call that function with a trigger phrase. This attack achieves an average success rate of over 90% across six state-of-the-art LLMs including GPT-4, Clause-3.5-Sonnet, and Gemini-1.5-Pro.

Figure: Overview of the function calling process in LLMs and the potential for jailbreak attacks.

This research underscores how bad actors can target various components of an AI system, not just primary interfaces like chat modes. The authors also propose various mitigation strategies including safety filter checks for function calling inputs and a defensive system prompt, which lowers success rates to 0%–50%.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents
Bypassing Meta’s PromptGuard Classifier Model

When Meta released PromptGuard as a part of their Llama 3.1 AI safety suite, they positioned the classifier model as a scalable detection solution which protects LLMs against malicious inputs and misuse. In a recent audit of Prompt-Guard-86M, the Robust Intelligence team uncovered a simple but highly effective exploit for circumventing its safety measures.

Adding a space between every character of a malicious prompt causes PromptGuard to mistakenly classify the input as benign, effectively rendering the model incapable of detecting harmful content. In tests with the SORRY-Bench dataset, this method demonstrated a staggering 99.8% success rate.

Table: Comparative performance of the Prompt-Guard-86M model on a dataset of 450 harmful intent prompt injections, before and after applying the proposed jailbreak method.

This jailbreak is particularly significant for a few reasons—it’s extremely simple, highly transferable, and was easily discovered by exploring model changes after fine-tuning. We’ve already reached out to the Meta team to inform them about this exploit, suggested countermeasures, and reported the issue here. Meta has acknowledged the issue and is actively working on a fix.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents

More Threats to Explore

An indirect prompt injection vulnerability in Slack AI enables an adversary to exfiltrate data from the entire instance, including private channels in which the user is not a member. The researchers who identified this vulnerability demonstrate API key exfiltration from a developer in a private channel in their blog.

Non-standard unicode characters make LLMs more susceptible to jailbreaking according to new research evaluating the performance of 15 LLMs across 38 character sets. This proves especially impactful on more advanced models like GPT-4 and Claude 3 which are more capable of comprehending these characters.

  • MITRE ATLAS: AML.T0054 - LLM Jailbreak
  • Reference: Impact of Non-Standard Unicode Characters on Security and Comprehension in Large Language Models (arXiv)

Imposter.AI exploits human-like conversation strategies to extract harmful information from LLMs. It employs three main strategies: decomposing malicious questions, rephrasing harmful queries, and enhancing response harmfulness.

  • MITRE ATLAS: AML.T0054 - LLM Jailbreak
  • Reference: Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models (arXiv)
August 27, 2024
-
5
minute read

AI Cyber Threat Intelligence Roundup: August 2024

Threat Intelligence

At Robust Intelligence, AI threat research is fundamental to informing the ways we evaluate and protect models on our platform. In a space that is so dynamic and evolving so rapidly, these efforts help ensure that our customers remain protected against emerging vulnerabilities and adversarial techniques.

This monthly threat roundup consolidates some useful highlights and critical intel from our ongoing threat research efforts to share with the broader AI security community. As always, please remember this is not an exhaustive or all-inclusive list of AI cyber threats, but rather a curation that our team believes is particularly noteworthy.

Notable Threats and Developments: August 2024

Virtual Context Attack

Virtual Context is a jailbreak technique that exploits special separator tokens which are typically used to separate user input from model output. By inserting these tokens into malicious prompts, a bad actor can deceive an LLM into treating a user input as its own generated content.

This technique can be used as a standalone attack or combined with other jailbreak techniques to increase effectiveness. In the original paper introducing Virtual Context, researchers observed a significant improvement in success rates of existing jailbreak attacks (~40% on average) across various LLMs including GPT-4 and LLaMa-2. Without additional techniques, Virtual Context achieved over a 50% attack success rate when applied to malicious prompts.

The Virtual Context technique shares similarities with many other jailbreaks we’ve covered in the past that exploit special token handling in LLMs, such as BOOST and ChatBug. For these types of attacks, suggested mitigation strategies include input sanitization, token usage monitoring, and content-based filtering.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents
Jailbreak via Function Calling

Researchers have identified a new jailbreak methodology which exploits function calling, the ability for LLMs to interface with external tools to better respond to user requests. By taking advantage of the lack of rigorous safety filters in function call systems, this jailbreak underscores the importance of security measures at every level of model interaction.

At a high level, this technique specifies a “jailbreak function” and directs the LLM to always call that function with a trigger phrase. This attack achieves an average success rate of over 90% across six state-of-the-art LLMs including GPT-4, Clause-3.5-Sonnet, and Gemini-1.5-Pro.

Figure: Overview of the function calling process in LLMs and the potential for jailbreak attacks.

This research underscores how bad actors can target various components of an AI system, not just primary interfaces like chat modes. The authors also propose various mitigation strategies including safety filter checks for function calling inputs and a defensive system prompt, which lowers success rates to 0%–50%.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents
Bypassing Meta’s PromptGuard Classifier Model

When Meta released PromptGuard as a part of their Llama 3.1 AI safety suite, they positioned the classifier model as a scalable detection solution which protects LLMs against malicious inputs and misuse. In a recent audit of Prompt-Guard-86M, the Robust Intelligence team uncovered a simple but highly effective exploit for circumventing its safety measures.

Adding a space between every character of a malicious prompt causes PromptGuard to mistakenly classify the input as benign, effectively rendering the model incapable of detecting harmful content. In tests with the SORRY-Bench dataset, this method demonstrated a staggering 99.8% success rate.

Table: Comparative performance of the Prompt-Guard-86M model on a dataset of 450 harmful intent prompt injections, before and after applying the proposed jailbreak method.

This jailbreak is particularly significant for a few reasons—it’s extremely simple, highly transferable, and was easily discovered by exploring model changes after fine-tuning. We’ve already reached out to the Meta team to inform them about this exploit, suggested countermeasures, and reported the issue here. Meta has acknowledged the issue and is actively working on a fix.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents

More Threats to Explore

An indirect prompt injection vulnerability in Slack AI enables an adversary to exfiltrate data from the entire instance, including private channels in which the user is not a member. The researchers who identified this vulnerability demonstrate API key exfiltration from a developer in a private channel in their blog.

Non-standard unicode characters make LLMs more susceptible to jailbreaking according to new research evaluating the performance of 15 LLMs across 38 character sets. This proves especially impactful on more advanced models like GPT-4 and Claude 3 which are more capable of comprehending these characters.

  • MITRE ATLAS: AML.T0054 - LLM Jailbreak
  • Reference: Impact of Non-Standard Unicode Characters on Security and Comprehension in Large Language Models (arXiv)

Imposter.AI exploits human-like conversation strategies to extract harmful information from LLMs. It employs three main strategies: decomposing malicious questions, rephrasing harmful queries, and enhancing response harmfulness.

  • MITRE ATLAS: AML.T0054 - LLM Jailbreak
  • Reference: Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models (arXiv)
Blog

Related articles

August 19, 2021
-
5
minute read

Jerry Liu: Blending his Interests from Princeton, Quora, and Uber to Build The AI Firewall®

For:
January 23, 2023
-
4
minute read

Robust Intelligence Recognized in Gartner’s 2023 Market Guide for AI Trust, Risk and Security Management

For:
March 9, 2022
-
4
minute read

What Is Model Monitoring? Your Complete Guide

For:
No items found.