May 29, 2024 - 5 minute read

AI Cyber Threat Intelligence Roundup: May 2024

Threat Intelligence

At Robust Intelligence, AI threat research is fundamental to how we evaluate and protect models on our platform. In a space this dynamic and fast-moving, these efforts help ensure that our customers remain protected against emerging vulnerabilities and adversarial techniques.

This monthly threat roundup consolidates useful highlights and critical intel from our ongoing threat research efforts to share with the broader AI security community. As always, please remember that this is not an exhaustive list of AI cyber threats, but rather a curation of items our team believes are particularly noteworthy.

Notable Threats and Developments: May 2024

Sandwich Multi-language Mixture Adaptive Attack

Recent research from the SAIL Lab at the University of New Haven introduced the “Sandwich” attack, a multi-language mixture adaptive attack technique for LLMs.

The attack exploits models’ inability to identify malicious requests when they are obfuscated across multiple languages, particularly relatively uncommon ones. A multilingual prompt is crafted containing a harmful question sandwiched between several innocuous ones. The technique proved highly effective against several state-of-the-art LLMs, even when using questions drawn from prominent jailbreak datasets that have almost certainly already been incorporated into model guardrails.
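
To make the structure concrete, here is a minimal sketch of how such a sandwich prompt could be assembled. The template, helper function, and placeholder questions are illustrative assumptions rather than the exact prompts used in the SAIL Lab paper, and in practice each question would be translated into a different (ideally low-resource) language.

```python
# Minimal sketch of the "sandwich" prompt structure described above.
# Everything here is illustrative; it is not the paper's exact template.

def build_sandwich_prompt(harmful_question: str, benign_questions: list[str]) -> str:
    """Bury one harmful question in the middle of several innocuous ones.

    In the actual attack, each question would also be written in a different
    language, with low-resource languages proving especially effective.
    """
    middle = len(benign_questions) // 2
    questions = benign_questions[:middle] + [harmful_question] + benign_questions[middle:]
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    return "Please answer every question below, in order:\n" + numbered


if __name__ == "__main__":
    benign = [
        "What is the best way to cook rice?",
        "How do I make a paper airplane?",
        "How can I learn to play the guitar?",
        "What is a good beginner houseplant?",
    ]
    # Placeholder standing in for a genuinely harmful question.
    print(build_sandwich_prompt("<harmful question goes here>", benign))
```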

This technique is the latest example of an adaptive attack: the underlying method is universal, but successful jailbreaks still require manual adjustment for each target. It also underscores the risks of long context windows, which can enable more complex prompting and potentially erode the effectiveness of safety mechanisms.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents

AdvPrompter: Fast Adversarial Prompting

Researchers at Meta released a paper detailing AdvPrompter, a method that trains an LLM to automatically generate human-readable adversarial prompts capable of jailbreaking target LLMs and eliciting harmful responses.

The technique alternates between using an optimization algorithm to generate high-quality adversarial prompts and fine-tuning the prompter model on those prompts. It generates prompts orders of magnitude faster than previous methods, enabling multi-shot jailbreak attacks that significantly increase success rates. Attacks transfer to closed-source LLMs with jailbreak success rates of 48-90%.
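
As a rough illustration of that alternating scheme, the sketch below uses hypothetical stub callables in place of the real components (an attacker LLM, a target LLM, and the paper's suffix-optimization step); it captures the loop structure only, not the authors' actual implementation.

```python
# Schematic of an AdvPrompter-style alternating training loop.
# All helpers are hypothetical stand-ins for real components.

from typing import Callable, List, Tuple

def train_adversarial_prompter(
    instructions: List[str],
    optimize_suffix: Callable[[str], str],               # search for a suffix that elicits a non-refusal
    fine_tune: Callable[[List[Tuple[str, str]]], None],  # fine-tune the attacker LLM on (instruction, suffix) pairs
    num_rounds: int = 10,
) -> None:
    """Alternate between generating high-quality adversarial suffixes and
    fine-tuning the attacker model to reproduce them directly."""
    for round_idx in range(num_rounds):
        # Step 1: for each harmful instruction, find a human-readable suffix
        # that lowers the target model's tendency to refuse.
        pairs = [(inst, optimize_suffix(inst)) for inst in instructions]
        # Step 2: fine-tune the attacker LLM on those pairs so it learns to
        # emit adversarial suffixes without an expensive search at inference.
        fine_tune(pairs)
        print(f"round {round_idx + 1}: fine-tuned on {len(pairs)} pairs")


if __name__ == "__main__":
    # Trivial stubs so the loop runs end to end; the real components are LLMs.
    train_adversarial_prompter(
        ["<harmful instruction placeholder>"],
        optimize_suffix=lambda inst: "<optimized suffix>",
        fine_tune=lambda pairs: None,
        num_rounds=2,
    )
```

Once trained, the attacker model can generate fresh adversarial suffixes in a single forward pass, which is what makes the multi-shot attacks described above so fast.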

With jailbreak attempts that are quick, scalable, and difficult to detect, AdvPrompter poses a risk to any public-facing AI application powered by an LLM. Notably, it shares some similarities with the TAP algorithmic jailbreak technique, which was developed by Robust Intelligence researchers and covered previously here.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents

Fine-tuning LLMs breaks safety and security alignment

This month, the Robust Intelligence team published original research exploring the risks of fine-tuning for the safety and security alignment of large language models.

The team conducted a series of experiments evaluating model responses before and after fine-tuning, beginning with an initial test of Llama-2-7B and three fine-tuned AdaptLLM variants published by Microsoft for domain-specific tasks in biomedicine, finance, and law. Prompts from a jailbreak benchmark dataset were used to query each model, and outputs were scored for understanding, compliance with instructions, and harmfulness.
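
A stripped-down version of that evaluation loop might look like the following sketch. The model and judge interfaces are hypothetical placeholders, not the team's actual harness; the full study used a jailbreak benchmark dataset and scored each response for understanding, compliance, and harmfulness.

```python
# Sketch of a before/after fine-tuning jailbreak comparison, as described above.
# The callables are hypothetical placeholders, not the team's actual harness.

from typing import Callable, Dict, List

def compare_jailbreak_susceptibility(
    models: Dict[str, Callable[[str], str]],        # model name -> function returning the model's response
    jailbreak_prompts: List[str],
    judge: Callable[[str, str], Dict[str, float]],  # (prompt, response) -> metric scores
) -> Dict[str, Dict[str, float]]:
    """Query every model with every jailbreak prompt and average the judge's scores."""
    results: Dict[str, Dict[str, float]] = {}
    for name, generate in models.items():
        totals: Dict[str, float] = {}
        for prompt in jailbreak_prompts:
            for metric, value in judge(prompt, generate(prompt)).items():
                totals[metric] = totals.get(metric, 0.0) + value
        results[name] = {m: v / len(jailbreak_prompts) for m, v in totals.items()}
    return results
```

Comparing the averaged compliance and harmfulness scores of the base model against each fine-tuned variant mirrors, in miniature, the comparison the team ran in the full study.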

The results demonstrated significantly greater jailbreak susceptibility in the three fine-tuned variants of Llama-2-7B compared to the original foundation model. Despite the efficacy of their domain-specific training, the fine-tuned models proved more than 3 times more compliant with jailbreak instructions and exhibited over 22 times greater odds of producing a harmful response.

  • AI Lifecycle Stage: Production
  • Relevant Use Cases: AI Chatbots & AI Agents

More Threats to Explore

The Logic Chain Injection Jailbreak hides malicious queries within benign articles of text, exploiting the ability of LLMs to connect a logical chain of thought. The research paper does not report attack success rates, but it shows successful demonstrations against ChatGPT.

Wiz Research discovered a critical vulnerability in the AI-as-a-service provider Replicate which, if exploited, would have allowed unauthorized access to the AI prompts and results of all of Replicate’s platform customers. The researchers demonstrated this by uploading a malicious Cog container to Replicate and executing code remotely on its infrastructure.

Visual instruction tuning makes LLMs more prone to jailbreaking, as evidenced by new research comparing three state-of-the-art visual language models (VLMs) against their base LLM counterparts. This reinforces the research into the safety and security risks of fine-tuning discussed above.
