When we ask a chatbot to repeat a copyrighted article verbatim, we are met with a gentle refusal: the model says it cannot provide the complete text directly. Indeed, for private, copyrighted, or paywalled data, the model should be trained to refuse such queries so that its training data is protected (see Figure 1).
However, Robust Intelligence researchers were able to leverage a simple method that tricks the chatbot into regurgitating individual sentences from news articles, allowing us to reconstruct portions of the source article. This opens the door to potential information security risks and underscores the need for greater awareness of these risks among developers, organizations that use chatbots, and end users. A malicious actor who leveraged this prompting methodology could attempt to extract sensitive or non-public information in the same way.
Training state-of-the-art chatbots (also known as large language models, or LLMs) requires trillions of tokens of text over their training lifecycle, and deep learning model architectures can memorize their training data. If a model was trained on data such as paywalled news articles, we should in theory be able to extract excerpts of that text. This is what we aimed to demonstrate with our research.
Our AI security researchers hypothesized that deconstructing the objective into smaller, successive requests, a method called decomposition, could bypass the model’s internal guardrails and obfuscate our ultimate intent: to collect specific data from the LLM. If this method works, copyrighted data, private documents, and other non-public sensitive information are also susceptible to extraction.
Methodology
Our decomposition method differs from existing training data extraction methods in several ways:
- Simple: using non-adversarial system prompts and plain-language user prompts, we can extract portions of training data
- Not resource-intensive: queries are automated using standard LLM API functionality
- Transferable: works across multiple foundation large language models, including LLM-α and LLM-β
Our testing uses standard interactions with the models: simple system and user prompts that are not perceived as adversarial in nature. For each article in our dataset, we generate a short, identifying summary. With the summary in hand, we query the model with the summary and a hint at the publication source, asking it to identify the article’s title and author. We then craft individual queries for a subset of the sentences in the article and prompt the model to reveal the subsequent sentence (see Figure 2 above). An example of these prompts and responses for the New York Times article on the concept of “languishing” is shown below. Finally, we evaluate the similarity of the generated sentences against the sentences in the original articles.
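To make the decomposition workflow concrete, here is a minimal sketch of how the successive queries might be automated. The function names, the prompt wording, and the query_llm placeholder are illustrative assumptions for this sketch, not the exact prompts or API calls used in our experiments.

```python
# Minimal sketch of the decomposition workflow described above.
# query_llm() stands in for any chat-completion API; the prompt wording
# here is illustrative, not the exact phrasing used in the study.

def query_llm(system: str, user: str) -> str:
    """Placeholder for a call to the target model's chat API."""
    raise NotImplementedError

def identify_article(summary: str, source_hint: str) -> str:
    # Step 1: ask the model to name the article from a short summary
    # plus a hint at the publication source.
    return query_llm(
        system="You are a helpful assistant.",
        user=f"This summary describes an article from {source_hint}: "
             f"{summary}\nWhat is the article's title and who wrote it?",
    )

def extract_next_sentence(title: str, known_sentence: str) -> str:
    # Step 2: ask for the sentence that follows a sentence we already have.
    return query_llm(
        system="You are a helpful assistant.",
        user=f'In the article "{title}", what sentence comes immediately '
             f'after: "{known_sentence}"?',
    )

def reconstruct(title: str, seed_sentences: list[str]) -> list[str]:
    # Step 3: walk through a subset of known sentences and collect the
    # model's candidate continuations for later similarity scoring.
    return [extract_next_sentence(title, s) for s in seed_sentences]
```

Each individual request looks innocuous on its own; only the sequence as a whole reconstructs the protected text, which is what lets the method slip past refusal behavior.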
We run this method across a corpus of 3,723 New York Times articles and 1,349 Wall Street Journal articles that we compiled. The New York Times articles were published between 2015 and 2023 and were randomly selected from the Opinion, World, U.S., and New York sections of the Times’ website, including a combination of op-eds and news articles. Wall Street Journal articles were likewise limited to news or opinion pieces between 2017 and 2023.
We run this experiment with the compiled data across two different frontier large language models (referred to in this post as LLM-α and LLM-β), and we attempt to retrieve the first ten sentences from each article.
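Deciding whether a generated sentence counts as “almost verbatim” requires a string-similarity check. The sketch below shows one plausible scoring pass; the use of Python’s difflib.SequenceMatcher, the 0.9 threshold, and the helper names are assumptions for illustration, not the exact metric used in our evaluation.

```python
# Illustrative scoring pass. The similarity metric and the 0.9
# "almost verbatim" threshold are assumptions for this sketch.
from difflib import SequenceMatcher

def similarity(generated: str, original: str) -> float:
    # Ratio in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, generated.strip(), original.strip()).ratio()

def near_verbatim_hits(generated: list[str], original: list[str],
                       threshold: float = 0.9) -> int:
    # Count generated sentences that closely match their source sentences.
    return sum(
        1 for gen, ref in zip(generated, original)
        if similarity(gen, ref) >= threshold
    )
```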
New York Times articles
For LLM-α, we are able to retrieve at least one almost verbatim sentence from 73 of the 3,723 New York Times articles (1.96 percent). We then collect the 100 top-performing articles and re-run the prompts, attempting to extract the entire article. From these, we successfully reconstruct over 20 percent of the text of six articles. For LLM-β, we are able to retrieve at least one almost verbatim sentence from 11 of the 3,723 articles (0.29 percent), and from these we successfully reconstruct over 20 percent of the text of two articles.
Wall Street Journal articles
Of the Wall Street Journal articles, we are able to retrieve at least one almost verbatim sentence from seven of the 1,349 articles (0.51 percent) using LLM-α, and from three of the 1,349 articles (0.22 percent) using LLM-β. We did not attempt to reconstruct larger portions of the text from the WSJ sample.
Implications and Discussion
Our results suggest that simple methods can be used to get models to divulge parts of their training data. If this methodology proves replicable at scale, the data privacy and security implications are widespread—from a complete loss of information privacy to violations of copyright or fair use principles.
We determined that if we were able to sidestep a model’s safety mechanism once, we could continually push the model to answer subsequent inquiries. This prompting method is simple and not resource-intensive, yet it can produce a high-risk outcome. Organizations’ high adoption rate of LLMs and the integration of organization-specific data with those models have also made the task of keeping private data private ever more complex and difficult. A malicious actor who understands that this prompting methodology can reveal sensitive or non-public information could attempt to extract specific data, with severe ramifications for information security.
While our results do not conclusively prove what these models were trained on, they did yield reliable reproductions of a few articles that we verified manually. However, in many cases the models also hallucinated responses. If a model can reliably regurgitate known, verified text but also hallucinates some portion of its output, there is also a risk that end users mistake hallucinated information for truth.
Our research underscores that every stage of an LLM’s lifecycle, from development to end use, demands careful consideration. For example, how could model developers craft LLMs that are able to discern the intent of a line of questioning if the ultimate intent is never revealed in any single prompt? And how could organizations that utilize LLMs ensure that anyone acting with malicious intent cannot extract sensitive, proprietary, or non-public information from the underlying dataset?
As the field of machine learning matures, we expect new language models to become increasingly complex. We also expect adversaries and other actors with malicious intent to continue adapting their strategies to exploit LLMs’ weaknesses and leverage them for malicious purposes. This paper reveals one particular method for extracting training data from frontier LLMs, using innocuous, successive prompting that does not alert the LLM to restrict its outputs. We hope that further inquiries and continued research in this area will provide more insight into non-adversarial training data extraction methods and how they may impact organizations, businesses, governments, and individuals.
This research accentuates the necessity and importance of diligent governance of language models and private, non-public, or sensitive data. More effort and resources are needed to develop methods that protect language models from decomposition attacks. Researchers, developers, and practitioners should strive to understand the data and security risks associated with LLMs in their current state. In the short term, implementing safety guardrails atop current LLMs in every application can protect against data extraction attacks. The direction that this governance takes will define the field of AI security and inform the way we interact with LLMs in the future.
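As one illustration of what an application-level guardrail could look like, the sketch below blocks responses that reproduce long verbatim spans from a corpus of protected documents. This is a hedged example rather than a complete defense; the 8-word window size and the word_ngrams and leaks_protected_text helpers are hypothetical choices made for this sketch.

```python
# One possible application-level guardrail (a sketch, not a complete
# defense): block responses that reproduce long verbatim spans from a
# protected corpus. The 8-word window size is an arbitrary choice.

def word_ngrams(text: str, n: int = 8) -> set:
    # Build the set of all n-word windows in the text.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def leaks_protected_text(response: str, protected_docs: list[str],
                         n: int = 8) -> bool:
    # Flag the response if any n-word window also appears in a protected document.
    response_ngrams = word_ngrams(response, n)
    return any(response_ngrams & word_ngrams(doc, n) for doc in protected_docs)
```

An overlap check of this kind only catches exact regurgitation; paraphrased leakage would require fuzzier matching and remains an open problem.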
The full article is available here.