Al Security Risks
AI Security Necessitates a New Approach
AI application security necessitates a fundamentally new approach which considers threats to every stage of the AI lifecycle, from data poisoning and model backdoors to prompt injection and sensitive data leakage. We developed this taxonomy to educate organizations on the top risks to AI security today, complete with descriptions, examples, and mitigation techniques.
Top AI Security and Safety Risks
Prompt Injection (Direct)
Direct prompt injections are adversarial attacks that attempt to alter or control the output of an LLM by providing instructions via prompt that override existing instructions. These outputs can include harmful content, misinformation, or extracted sensitive information such as PII or model instructions.
AI Firewall
MITRE ATLAS
AML.T0051.000 - LLM Prompt Injection: Direct
OWASP TOP 10 for LLM Applications
LLM01 - Prompt Injection
Prompt Injection (Indirect)
Indirect prompt injection requires an adversary to control or manipulate a resource consumed by an LLM such as a document, website, or content retrieved from a database. This can direct the model to expose data or perform a malicious action like distribution of a phishing link.
AI Firewall; sanitize data sources
MITRE ATLAS
AML.T0051.001 - LLM Prompt Injection: Indirect
OWASP TOP 10 for LLM Applications
LLM01 - Prompt Injection
Jailbreaks
A jailbreak refers to any prompt-based attack designed to bypass model safeguards to produce LLM outputs that are inappropriate, harmful, or unaligned with the intended purpose. With well-crafted prompts, adversaries can access restricted functionalities or data, and compromise the integrity of the model itself.
AI Firewall
MITRE ATLAS
AML.T0054 - LLM Jailbreak Injection: Direct
OWASP TOP 10 for LLM Applications
LLM01 - Prompt Injection
Meta Prompt Extraction
Meta prompt extraction attacks aim to derive a system prompt which effectively guides the behavior of a LLM application. This information can be exploited by attackers and harm the intellectual property, competitive advantage, and reputation of a business.
AI Firewall
MITRE ATLAS
AML.T0051.000 - LLM Prompt Injection: Direct
Sensitive Information Disclosure
Sensitive information disclosure refers to any instance where confidential or sensitive data such as PII or business records is exposed through vulnerabilities in an AI application. These privacy violations can lead to loss of trust and result in legal or regulatory consequences.
AI Firewall; sanitize data sources
OWASP TOP 10 for LLM Applications
LLM06 - Sensitive Information Disclosure
Privacy Attack
A privacy attack refers broadly to any attack aimed at extracting sensitive information from an AI model or its data. This category includes model extraction, which recreates a functionally equivalent model by probing target model outputs, and membership inference attacks, which determine if specific data records were used for model training.
Data scrubbing; AI Firewall
MITRE ATLAS
AML.T0024.000 - Infer Training Data Membership
AML.T0024.001 - Invert ML Model
AML.T0024.002 - Extract ML Model
OWASP TOP 10 for LLM Applications
LLM06 - Sensitive Information Disclosure
Training Data Poisoning
Training data poisoning is the deliberate manipulation of training data in order to compromise the integrity of an AI model. This can lead to skewed or biased outputs, backdoor trigger insertions—malicious links, for example—and ultimately, a loss of user trust.
Sanitize training data
MITRE ATLAS
AML.T0020 - Poison Training Data
OWASP TOP 10 for LLM Applications
LLM03 - Training Data Poisoning
Factual Inconsistency
Generated text contains information that is not accurate or true while being presented in a plausible manner, also known as hallucinations. This may include incorrect details, made-up facts, mismatches with known information, or entirely fictional details.
AI Firewall; RLHF; data cleaning and filtering
MITRE ATLAS
AML.T0048.002 - Societal Harm
OWASP TOP 10 for LLM Applications
LLM01: Prompt Injection
Denial of Service
An attack designed to degrade or shut down an ML model by flooding the system with requests, requesting large responses, or exploiting a vulnerability.
AI Firewall; rate limiting; token counting
MITRE ATLAS
AML.T0029 - Denial of ML Service
OWASP TOP 10 for LLM Applications
LLM04 - Model Denial of Service
Cost Harvesting
Threat actors using a model in a way the developer did not intend while increasing the cost of running services at the target organization. For example, an attacker could jailbreak a public-facing customer support chatbot in order to generate Python code or perform some other task the application was not designed for.
AI Firewall
MITRE ATLAS
AML.T0034 - Cost Harvesting
Exfiltration
Techniques used to get data out of a target network. Exfiltration of ML artifacts (e.g., data from privacy attacks) or other sensitive information.
AI Firewall; threat modeling application
MITRE ATLAS
AML.T0025 - Exfiltration via Cyber Means
Toxicity
Unintended responses that are offensive to the user. This can include hate speech and discrimination, profanity, sexual content, violence, harassment, unsafe actions, and more.
AI Firewall
MITRE ATLAS
AML.T0048.002 - Societal Harm
OWASP TOP 10 for LLM Applications
LLM01: Prompt Injection
Misalignment
Discrepancy between the model's behavior and the intended objectives or values of its developers and users. This may present as a misalignment of goals, safety, values, or other specifications.
AI Validation
MITRE ATLAS
AML.T0048.002 - Societal Harm
Model Backdoor
Insertion of hidden backdoors into an ML model which can be triggered by specific inputs to cause a specific, unexpected output or grant some level of unauthorized access.
Validate ML model; Sanitize training data
MITRE ATLAS
AML.T0018: Backdoor ML Model