Your Cookie Preferences

We use different types of cookies to optimize your experience on our website. Click on the categories below to learn more about their purposes. You may choose which types of cookies to allow and can change your preferences at any time. Remember that disabling cookies may affect your experience on the website. You can learn more about how we use cookies by visiting our

Essential Cookies

Provider: .providername.com

Name

Purpose

Type

Expires In

__cf_bm

Cloudflare places the cookie on end-user devices that access customer sites protected by Bot Management or Bot Fight Mode.

server_cookie

30 minutes

Provider: .providername.com

Name

Purpose

Type

Expires In

_tibcpv

Used to record unique visitor views of the consent banner.

http_cookie

1 year

Analytics and Customization Cookies

Name

Purpose

Marketo Munchkin

Marketo's custom JavaScript tracking code, called Munchkin, tracks all individuals who visit your website so you can react to their visits with automated marketing campaigns.

Name

Purpose

Google Tag

The Google tag (gtag.js) is a single tag you can add to a website to use a variety of Google products and services (e.g., Google Ads, Google Analytics, Campaign Manager, Display & Video 360, Search Ads 360).

Advertising Cookies

Provider: .providername.com

Name

Purpose

Type

Expires In

__cf_bm

Cloudflare places the cookie on end-user devices that access customer sites protected by Bot Management or Bot Fight Mode.

server_cookie

30 minutes

Provider: .providername.com

Name

Purpose

Type

Expires In

_tibcpv

Used to record unique visitor views of the consent banner.

http_cookie

1 year

August 27, 2024

minute read

AI Cyber Threat Intelligence Roundup: August 2024

Threat Intelligence

Author

Authors

Adam Swanda

Adam is an AI Security Researcher at Robust Intelligence.

At Robust Intelligence, AI threat research is fundamental to informing the ways we evaluate and protect models on our platform. In a space that is so dynamic and evolving so rapidly, these efforts help ensure that our customers remain protected against emerging vulnerabilities and adversarial techniques.

This monthly threat roundup consolidates some useful highlights and critical intel from our ongoing threat research efforts to share with the broader AI security community. As always, please remember this is not an exhaustive or all-inclusive list of AI cyber threats, but rather a curation that our team believes is particularly noteworthy.

Notable Threats and Developments: August 2024

Virtual Context Attack

Virtual Context is a jailbreak technique that exploits special separator tokens which are typically used to separate user input from model output. By inserting these tokens into malicious prompts, a bad actor can deceive an LLM into treating a user input as its own generated content.

This technique can be used as a standalone attack or combined with other jailbreak techniques to increase effectiveness. In the original paper introducing Virtual Context, researchers observed a significant improvement in success rates of existing jailbreak attacks (~40% on average) across various LLMs including GPT-4 and LLaMa-2. Without additional techniques, Virtual Context achieved over a 50% attack success rate when applied to malicious prompts.

The Virtual Context technique shares similarities with many other jailbreaks we’ve covered in the past that exploit special token handling in LLMs, such as BOOST and ChatBug. For these types of attacks, suggested mitigation strategies include input sanitization, token usage monitoring, and content-based filtering.

AI Lifecycle Stage: Production
Relevant Use Cases: AI Chatbots & AI Agents

MITRE ATLAS: AML.T0054 - LLM Jailbreak
Reference: Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection (arXiv)

Jailbreak via Function Calling

Researchers have identified a new jailbreak methodology which exploits function calling, the ability for LLMs to interface with external tools to better respond to user requests. By taking advantage of the lack of rigorous safety filters in function call systems, this jailbreak underscores the importance of security measures at every level of model interaction.

At a high level, this technique specifies a “jailbreak function” and directs the LLM to always call that function with a trigger phrase. This attack achieves an average success rate of over 90% across six state-of-the-art LLMs including GPT-4, Clause-3.5-Sonnet, and Gemini-1.5-Pro.

**Figure:** Overview of the function calling process in LLMs and the potential for jailbreak attacks.

This research underscores how bad actors can target various components of an AI system, not just primary interfaces like chat modes. The authors also propose various mitigation strategies including safety filter checks for function calling inputs and a defensive system prompt, which lowers success rates to 0%–50%.

AI Lifecycle Stage: Production
Relevant Use Cases: AI Chatbots & AI Agents

MITRE ATLAS: AML.T0054 - LLM Jailbreak
Reference: The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models (arXiv)

Bypassing Meta’s PromptGuard Classifier Model

When Meta released PromptGuard as a part of their Llama 3.1 AI safety suite, they positioned the classifier model as a scalable detection solution which protects LLMs against malicious inputs and misuse. In a recent audit of Prompt-Guard-86M, the Robust Intelligence team uncovered a simple but highly effective exploit for circumventing its safety measures.

Adding a space between every character of a malicious prompt causes PromptGuard to mistakenly classify the input as benign, effectively rendering the model incapable of detecting harmful content. In tests with the SORRY-Bench dataset, this method demonstrated a staggering 99.8% success rate.

**Table:** Comparative performance of the Prompt-Guard-86M model on a dataset of 450 harmful intent prompt injections, before and after applying the proposed jailbreak method.

This jailbreak is particularly significant for a few reasons—it’s extremely simple, highly transferable, and was easily discovered by exploring model changes after fine-tuning. We’ve already reached out to the Meta team to inform them about this exploit, suggested countermeasures, and reported the issue here. Meta has acknowledged the issue and is actively working on a fix.

AI Lifecycle Stage: Production
Relevant Use Cases: AI Chatbots & AI Agents

MITRE ATLAS: AML.T0054 - LLM Jailbreak
Reference: Bypassing Meta’s LLaMA Classifier: A Simple Jailbreak (Robust Intelligence Blog)

More Threats to Explore

An indirect prompt injection vulnerability in Slack AI enables an adversary to exfiltrate data from the entire instance, including private channels in which the user is not a member. The researchers who identified this vulnerability demonstrate API key exfiltration from a developer in a private channel in their blog.

MITRE ATLAS: AML.T0051.001 - LLM Prompt Injection: Indirect, AML.T0057 - LLM Data Leakage
Reference: Data Exfiltration from Slack AI via indirect prompt injection (PromptArmor Blog)

Non-standard unicode characters make LLMs more susceptible to jailbreaking according to new research evaluating the performance of 15 LLMs across 38 character sets. This proves especially impactful on more advanced models like GPT-4 and Claude 3 which are more capable of comprehending these characters.

MITRE ATLAS: AML.T0054 - LLM Jailbreak
Reference: Impact of Non-Standard Unicode Characters on Security and Comprehension in Large Language Models (arXiv)

Imposter.AI exploits human-like conversation strategies to extract harmful information from LLMs. It employs three main strategies: decomposing malicious questions, rephrasing harmful queries, and enhancing response harmfulness.

MITRE ATLAS: AML.T0054 - LLM Jailbreak
Reference: Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models (arXiv)

Author

Authors

Adam Swanda

Adam is an AI Security Researcher at Robust Intelligence.

Social

Follow us on LinkedIn

September 20, 2024

minute read

Extracting Training Data from Chatbots

For:

September 10, 2024

minute read

Leveraging Hardened Cybersecurity Frameworks for AI Security through the Common Weakness Enumeration (CWE)

For:

September 6, 2024

minute read

AI Governance Policy Roundup (August 2024)

For:

+ More Articles

No items found.

+ More Articles

August 27, 2024

minute read

AI Cyber Threat Intelligence Roundup: August 2024

Threat Intelligence

Author

Authors

Adam Swanda

Adam is an AI Security Researcher at Robust Intelligence.

Notable Threats and Developments: August 2024

Virtual Context Attack

AI Lifecycle Stage: Production
Relevant Use Cases: AI Chatbots & AI Agents

MITRE ATLAS: AML.T0054 - LLM Jailbreak
Reference: Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection (arXiv)

Jailbreak via Function Calling

AI Lifecycle Stage: Production
Relevant Use Cases: AI Chatbots & AI Agents

MITRE ATLAS: AML.T0054 - LLM Jailbreak
Reference: The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models (arXiv)

Bypassing Meta’s PromptGuard Classifier Model

AI Lifecycle Stage: Production
Relevant Use Cases: AI Chatbots & AI Agents

MITRE ATLAS: AML.T0054 - LLM Jailbreak
Reference: Bypassing Meta’s LLaMA Classifier: A Simple Jailbreak (Robust Intelligence Blog)

More Threats to Explore

MITRE ATLAS: AML.T0051.001 - LLM Prompt Injection: Indirect, AML.T0057 - LLM Data Leakage
Reference: Data Exfiltration from Slack AI via indirect prompt injection (PromptArmor Blog)

MITRE ATLAS: AML.T0054 - LLM Jailbreak
Reference: Impact of Non-Standard Unicode Characters on Security and Comprehension in Large Language Models (arXiv)

MITRE ATLAS: AML.T0054 - LLM Jailbreak
Reference: Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models (arXiv)

Author

Authors

Adam Swanda

Adam is an AI Security Researcher at Robust Intelligence.

Blog

March 2, 2022

minute read

Make RIME Yours (with Custom Tests)

For:

August 19, 2021

minute read

Jerry Liu: Blending his Interests from Princeton, Quora, and Uber to Build The AI Firewall®

For:

January 19, 2022

minute read

Pranay Patil: Organization as Key to Startup Success

For:

No items found.

+ More Articles

Your Cookie Preferences

Essential Cookies

Provider: .providername.com

Provider: .providername.com

Analytics and Customization Cookies

Performance and Functionality Cookies

Advertising Cookies

Provider: .providername.com

Provider: .providername.com

AI Cyber Threat Intelligence Roundup: August 2024

Notable Threats and Developments: August 2024

Jailbreak via Function Calling

Bypassing Meta’s PromptGuard Classifier Model

More Threats to Explore

Follow us on LinkedIn

Related articles

Extracting Training Data from Chatbots

Leveraging Hardened Cybersecurity Frameworks for AI Security through the Common Weakness Enumeration (CWE)

AI Governance Policy Roundup (August 2024)

Related articles

Ready to learn more?

AI Cyber Threat Intelligence Roundup: August 2024

Notable Threats and Developments: August 2024

Jailbreak via Function Calling

Bypassing Meta’s PromptGuard Classifier Model

More Threats to Explore

Related articles

Make RIME Yours (with Custom Tests)

Jerry Liu: Blending his Interests from Princeton, Quora, and Uber to Build The AI Firewall®

Pranay Patil: Organization as Key to Startup Success

Achieve AI Integrity Today

Your Cookie Preferences

Essential Cookies

Provider: .providername.com

Provider: .providername.com

Analytics and Customization Cookies

Performance and Functionality Cookies

Advertising Cookies

Provider: .providername.com

Provider: .providername.com

Notable Threats and Developments: August 2024

Jailbreak via Function Calling

Bypassing Meta’s PromptGuard Classifier Model

More Threats to Explore

Follow us on LinkedIn

Subscribe to our newsletter

Related articles

Extracting Training Data from Chatbots

Leveraging Hardened Cybersecurity Frameworks for AI Security through the Common Weakness Enumeration (CWE)

AI Governance Policy Roundup (August 2024)

Related articles

Ready to learn more?

Notable Threats and Developments: August 2024

Jailbreak via Function Calling

Bypassing Meta’s PromptGuard Classifier Model

More Threats to Explore

Subscribe to our newsletter

Related articles

Make RIME Yours (with Custom Tests)

Jerry Liu: Blending his Interests from Princeton, Quora, and Uber to Build The AI Firewall®

Pranay Patil: Organization as Key to Startup Success

Achieve AI Integrity Today