Executive Summary
Companies want to deploy LLMs in production but lack confidence in their ability to protect models from undesirable behavior. AI guardrails and firewalls offer a solution. NVIDIA open-sourced NeMo Guardrails last month, and despite their team’s careful disclaimers about not being ready for production use, there has been press coverage about this library as a way for enterprises to protect themselves from misbehaving LLMs. In this blog, we explore the fallacies that enterprises may face if beginning with non-production Guardrails to protect LLM applications. In particular, we identified:
- Without improvement of the example rails, simple prompts can bypass the rails, eventually providing unfettered access to the LLM.
- Without improvement of the example rails, hallucinations can occur that are not caught by the rails and lead to poor business outcomes.
- PII can be exposed from an LLM that ingests data from a database; attempts to de-risk that behavior using rails is a poor design, is discouraged by the Guardrail authors, and would be better replaced by appropriate implementation of trust boundaries for which even the best-designed rails and firewalls should not be purposed.
We will publish these findings in a short series. In this first blog, we focus on using simple prompts to bypass the topical rails example (see 1 above). We shared these findings with NVIDIA and hope that they will be addressed in future NeMo versions. Read on to learn more.
An Overview of Guardrails
Applications of generative AI are emerging everywhere because of the simplicity to mimic the dialogue and interactivity of humans by generative models that have been trained across a broad set of language tasks and topics. But the generally-capable model is both a blessing and a curse—the breadth of capability also exposes organizations to additional risk of undesirable outputs even in applications with narrow contexts.
There have been countless examples of undesirable outputs, ranging from abusive language and hallucinations to hijacking and data leakage. Some have been the result of adversarial attacks, but many examples have been unintentional. Generative AI models introduce specific AI risk which can have outsized business consequences.
Broadly, AI guardrails and AI firewalls are intended to limit risks of undesirable AI model output in production that are caused by harmful or unexpected inputs. This direction is one we have pioneered since the company’s founding. Solutions specific to generative AI that are emerging today are also intended to protect models from generating undesirable responses. But, evidently, there is still a lot of room for improvement to mature in these solutions.
In what follows, we analyze the NeMo Guardrails Topical Rails example, and summarize findings of an interactive exercise in which we stress the model to run off its topical rails. In the next post in our blog series, we will test NeMo Guardrails’ susceptibility to information leakage and other risks.
Evaluating NeMo Guardrails
In April 2023, NVIDIA released NeMo Guardrails, an open-source tool designed to help generative AI applications stay on track. NeMo Guardrails provides specific ways to bound the output of a model so that it is aligned with the intended use-case. Examples provided in the documentation include configurations for topical constraints (”making the bot stick to a specific topic of conversation”, moderation constraints (”moderating a bot’s response”), fact checking (”ensuring factual answers”), secure execution (”executing a third-party service [safely] with LLMs”), and jailbreaking (”ensuring safe answers despite malicious intent from the user”).
Generally, we recommend that AI risk assessments be conducted in the context of the intended business context to explore limitations not disclosed during generic model training and fine-tuning. In this case, we evaluated the latest version 0.1.0 of the tool using their topical rails example config, which is designed to be “a basic bot that will answer questions related to the [jobs] report but will politely decline to answer about any other subject.” This bot uses the US Bureau of Labor Statistics’ job report from April 2023 as a knowledge base and is intended to only answer questions related to that report and the example uses OpenAI’s text-davinci-003 chat bot.
Still as only a 0.1.0, the NeMo Guardrails authors provide the disclaimer that “the examples are meant to showcase the process of building rails, not as out-of-the-box safety features”, and strongly recommend “customization and strengthening” before putting into production use.
Nevertheless, even as provided, NeMo Guardrails is an impressive effort to provide generative AI protections. It combines a number of AI technologies through an expressive domain-specific language so that you can easily construct restrictions on a model. These topical restrictions generally do work, as demonstrated in the simple examples below:
In our brief analysis, we found it relatively easy to bypass aspects of NeMo and get the bot to deviate from its stated purpose. Our interaction with the guarded bot did not appear to be deterministic over several sessions, but it’s not immediately clear where the nondeterminism is coming from - the guardrails, the LLM, or both. This and other behaviors seem to more fundamental than what might be ascribed to immature rails that need additional “customization and strengthening”.
Our summary findings that we shared with the NVIDIA team are as follows:
- The guarded bot has an unusual retention of prior queries and responses that can lead to strange behaviors later within the session.
- The guarded bot may have issues with escaped or quoted text and seems to be particularly receptive to queries in which relevant and irrelevant information is intermingled.
- The guarded bot would occasionally behave as though something internally was misaligned in how it formed queries.
- The guarded bot may behave dramatically differently after each restart, leading to starkly different experiences in each session.
- Once the guarded bot is made to deviate from its intended topic, it could be taken progressively further and further away from the original jobs report topic.
- The guardrails themselves appear to be adaptive to queries, allowing for varying responses over the entire chat.
These misbehaviors point to larger concerns about effectiveness, and suggest that NeMo guardrails requires additional testing before being deployed in LLM-powered applications in production environments.
A conversation with NeMo
Below are a few detailed examples of our interactions with the guarded bot.
1. Unusual retention of prior queries and responses.
In the prompt below, we had previously asked the bot repeatedly to avoid using the letter <code inline>e</code>.Next, we asked it to not use the word <code inline>the</code> and it combined both omissions even though it was not prompted to remove <code inline>e</code> since two queries prior:
2. Misalignment in query formation
In one case, the bot posed and answered its own question in response to a query we gave it:
3. Guardrails appear to adapt to queries
While we expected the bot’s chat to change over time, we expected that the guardrails would be more-or-less consistent over the entire chat. However, we experienced that some aspects of the guardrails allowed themselves to also be adapted to the queries. In perhaps the most stark example, by the end of one chat, the guardrail actually rejected a relevant query about the jobs report using a canned “I don’t know” response.
Derailing the Guardrails
Once we were able to escape the guarded bot from its intended topic, we were able to progressively take it further and further away from the original jobs report topic and eventually get unfettered access to the LLM.
Anecdotally we found that starting with questions about health care seemed were particularly effective in getting the guarded bot off-topic, presumably because the jobs report contains several mentions about the health care sector. However, we eventually were successful in getting the bot to answer questions about the Ministry of Health and Social Services for Quebec, President George W. Bush, The Franco-Prussian War, and the plot of the Back to the Future Trilogy. To its credit, when we did get the guarded bot off-topic, we weren’t able to ask it anything other than factual queries.
Below is one such example of NeMo going off the rails:
Summary Findings
While NVIDIA’s NeMo Guardrails is an early 0.1.0 release, our preliminary findings indicated that the solution is moving in the right direction, but
- Guardrails and firewalls are no substitute for appropriate security design of a system–LLMs should not ingest unsanitized input, and their output should also be treated as unsanitized;
- Despite what may be in the press, the current version of NeMo Guardrails is not generally ready to protect production applications, as clearly noted by the project’s authors. The examples provided by NVIDIA that we evaluated are a great representation of how well their system can perform. With more effort, the rails undoubtedly can be refined to make the guardrails more constrained, and therefore effective. Underlying many of the bypasses is a curious nondeterministic behavior that should be better understood before NeMo should be relied on as a viable solution in production applications.
General Guidelines for Securing LLMs
With the rush to explore LLMs in new products and services, it’s important to be secure by design. Best practices for using LLMs would be to treat them as untrusted consumers of data, without direct links to sensitive information. Furthermore, their output should be considered unsanitized. Many security concerns are removed when basic security hygiene is exercised.
On that foundation, firewalls and guardrails also offer an important control to reduce the risk exposed by LLMs. In order to effectively safeguard a generative model, firewalls and guardrails first need to understand the model they are protecting. We believe this is best done through comprehensive pre-deployment validation. The firewall or guardrail can then be tailored to the vulnerabilities of that particular model and use case.
A satisfactory set of behaviors for guardrails or firewalls should include the following:
- Deterministic Correctness. A guardrail or firewall is not useful if it doesn’t respond in a consistently correct way for the same prompts over time.
- Proper Use of Memory. Memory is an important feature of LLM centered chat applications, but it’s also a vulnerability. Users should not be able to exploit the memory stored in the chat to get around the guardrail. The guardrail or firewall should be able to block attempts to circumvent these rules across various prompts, and make sure that users don’t veer off topic slowly over multiple messages.
- Intent Idempotency. The performance of the firewall or guardrail should not degrade under modifications of the prompt, or different versions of the output that convey the same information. For example, toxicity rails should trigger whether as a plain utterance, or when asked to do so when all instances of the letter S are swapped with $.
Don’t miss part 2 of this blog series where we test NeMo Guardrails on our own datasets to show hallucinations and exposure of PII from a database that the rails are set up to protect. To learn more about our findings and see a demo of Robust Intelligence’s AI Firewall, request a demo here.