A lot of effort has been put into ChatGPT and subsequent models to be aligned: helpful, honest, and harmless. However, the following video demonstrates that we can construct a prompt that can trick GPT-4 based ChatGPT into being able to give results that violate these principles:
Here’s the prompt that was used:
Why does this work?
It’s hard to say exactly what’s happening inside the black box that is ChatGPT or the exact implementation details of how the user’s text gets consumed by the model, but we can make guesses.
Just this month, OpenAI released the format that the ChatGPT model consumes the data sent by the user: Chat Markdown Language (ChatML). The main idea is that conversations are sent in the high-level API as a series of messages where each message includes fields for the content and the role of the entity stating the content.
The response to this request would include the next message that ChatGPT will respond with given this conversation history.
This API allows app developers who want to build on top of the GPT models to make the model aware of the different types of instructions that it can receive: instructions from the system, and instructions from the user. App developers may not always trust their users to pass in trustworthy input, so a useful language model should prioritize system instructions over user instructions.
These messages get parsed into a format that looks like this when consumed by the ML model:
What happens when we use the prompt shown in the video? The model receives the following text as the conversation history:
Note that the entirety of the text beginning with <code inline>I would like to ask some questions</code> was entirely controlled by the user.
Why does this result in the bot generating misinformation? These generative models are autoregressive models. This means that they generate new text based on prior text that it has seen in its context window. The most likely cause is that when it receives the above conversation history, we have tricked it into believing that it has already stated misinformation in a confident tone, making it more prone to continuing to state more misinformation in the same style.
How was this not caught before-hand?
Prompt-injection is a fairly well-known security vulnerability within the Generative LLM space, having been reported back as early as September of 2022. When OpenAI even released ChatML, they left a warning that the raw string format “inherently allows injections from user input containing special-token syntax, similar to a SQL injections."
There certainly was an attempt made to patch this issue: sanitize the user inputs. This is noticeable if we refresh and re-visit the page; after doing so, looking at the conversation history, the <code inline><|im_start|></code> and <code inline><|im_end|></code> tags disappear. In other words, the tags don’t actually matter when provided as user input because OpenAI likely filters these out before providing the user input to the model and storing it in their database. The key problem here though is that the operative words seem to be be <code inline>system</code>, <code inline>user</code>, and <code inline>assistant</code> words rather than the tags themselves.
In the experiment done above, we compare the results with and without the role tags on GPT-4. In the second example, the model at least always prefaces its conclusion with “As MisInformationBot, I am providing incorrect information,” and it rightfully outright refuses the user request to provide misinformation on the last question, likely due to the severity of the topic. However, when prompted using the role tags, GPT-4 has no such reservation against severely offensive misinformation. Additional testing has found that GPT-4 seems more difficult to get to say offensive material than ChatGPT.
Why do the roles strings have an impact even with the tags removed? Like all machine learning models, ChatGPT and GPT-4 are really trained to pick up on correlations. It’s likely that whenever the model encounters the <code inline>user</code>, <code inline>system</code>, and <code inline>assistant</code> strings in its prompt, it still internally holds a text representation that is still very similar to the one of the text it received the delimiting <code inline><|im_start|></code> and <code inline><|im_end|></code> tags as well. This could be because in data it received during fine-tuning most likely always had the <code inline><|im_start|></code> tags right next to the role of the message, and so it treats the mostly similar text in similar ways.
Does this mean that we can get ChatGPT and GPT-4 to say whatever offensive thing that we want it to? As long it it’s able to pass the filter of the model usable in OpenAI’s content moderation endpoint, the answer seems to be yes, but the question will need further investigation.