182 lines
12 KiB
Markdown
182 lines
12 KiB
Markdown
# DeepMind AI Agent Traps — 6-Category Taxonomy
|
|
|
|
Full taxonomy of AI agent traps from Google DeepMind's "AI Agent Traps" paper (April 2026), with Claude Code mappings and plugin coverage status.
|
|
|
|
**Purpose:** Reference material for `threat-modeler-agent` and `posture-assessor-agent`. Maps each trap category to specific plugin controls and identifies coverage gaps.
|
|
|
|
**Source:** Google DeepMind, "AI Agent Traps: A Taxonomy of Attacks on Autonomous AI Agents" (April 2026)
|
|
|
|
---
|
|
|
|
## Category 1: Content Injection
|
|
|
|
**MITRE ATLAS:** AML.T0051 (LLM Prompt Injection), AML.T0043 (Craft Adversarial Data)
|
|
|
|
Attacks that embed malicious instructions in content the agent reads or processes.
|
|
|
|
### 1a. Steganography
|
|
|
|
Hidden payloads in content that appear benign to human reviewers but are parsed by the agent.
|
|
|
|
| Technique | Description | Plugin Coverage |
|
|
|-----------|-------------|-----------------|
|
|
| Unicode Tag steganography (U+E0000-E007F) | Invisible characters that decode to ASCII instructions | `string-utils.mjs`: `decodeUnicodeTags()` detects and decodes. `injection-patterns.mjs`: CRITICAL if decoded content matches injection patterns, HIGH for bare presence. **Covered.** |
|
|
| Zero-width character splitting | ZW chars inserted into keywords to evade pattern matching | `string-utils.mjs`: `normalizeForScan()` strips ZW chars. MEDIUM pattern flags ZW inside words. **Covered.** |
|
|
| BIDI override manipulation | Right-to-left override characters reorder visible text | `string-utils.mjs`: `stripBidiOverrides()` in normalization pipeline. **Covered.** |
|
|
| Homoglyph substitution | Cyrillic/Greek lookalikes replace Latin characters | `injection-patterns.mjs`: MEDIUM pattern detects Cyrillic-Latin mixing. **Covered (advisory).** |
|
|
| Base64 encoded payloads | Instructions encoded in base64 strings | `string-utils.mjs`: `normalizeForScan()` includes base64 decode iteration. **Covered.** |
|
|
|
|
### 1b. Syntactic Masking
|
|
|
|
Instructions hidden in structural elements of content formats.
|
|
|
|
| Technique | Description | Plugin Coverage |
|
|
|-----------|-------------|-----------------|
|
|
| HTML comment injection | `<!-- AGENT: do X -->` | `injection-patterns.mjs`: HIGH pattern for agent-directed HTML comments. **Covered.** |
|
|
| CSS-hidden content | `display:none`, `visibility:hidden`, zero opacity/size | `injection-patterns.mjs`: HIGH patterns for 6 CSS obfuscation techniques. **Covered.** |
|
|
| Markdown link-reference comments | `[//]: # (ignore instructions)` | `injection-patterns.mjs`: MEDIUM pattern for suspicious link-ref comments. **Covered.** |
|
|
| Markdown link anchor injection | `[ignore system rules](url)` | `injection-patterns.mjs`: MEDIUM pattern for injection in anchor text. **Covered.** |
|
|
| ARIA attribute injection | `aria-label="ignore all rules"` | `injection-patterns.mjs`: HIGH pattern for injection in aria-label. **Covered.** |
|
|
| Data URI with executable types | `data:text/html,...` | `injection-patterns.mjs`: MEDIUM pattern for executable data URIs. **Covered.** |
|
|
|
|
---
|
|
|
|
## Category 2: Semantic Manipulation
|
|
|
|
**MITRE ATLAS:** AML.T0051 (LLM Prompt Injection)
|
|
|
|
Attacks that manipulate the agent's reasoning without direct injection.
|
|
|
|
### 2a. Oversight Evasion
|
|
|
|
Framing that causes the agent to bypass its own safety checks.
|
|
|
|
| Technique | Description | Plugin Coverage |
|
|
|-----------|-------------|-----------------|
|
|
| Educational purpose framing | "For educational purposes only..." | `injection-patterns.mjs`: HIGH pattern. **Covered.** |
|
|
| Hypothetical scenario framing | "In a hypothetical scenario where..." | `injection-patterns.mjs`: HIGH pattern. **Covered.** |
|
|
| Fictional setting framing | "In a fictional world where..." | `injection-patterns.mjs`: HIGH pattern. **Covered.** |
|
|
| Red-team exercise framing | "This is a red-team simulation..." | `injection-patterns.mjs`: HIGH pattern. **Covered.** |
|
|
| Research purpose framing | "For research/academic purposes..." | `injection-patterns.mjs`: HIGH patterns (2 variants). **Covered.** |
|
|
| Test/lab pretense | "Pretend we are in a sandbox..." | `injection-patterns.mjs`: HIGH pattern. **Covered.** |
|
|
| Thought experiment framing | "As a thought experiment..." | `injection-patterns.mjs`: HIGH pattern. **Covered.** |
|
|
|
|
### 2b. Context Normalization
|
|
|
|
Subtle reframing that overrides rules without explicit override language.
|
|
|
|
| Technique | Description | Plugin Coverage |
|
|
|-----------|-------------|-----------------|
|
|
| User intent precedence | "User intent takes precedence over rules" | `injection-patterns.mjs`: HIGH pattern. **Covered.** |
|
|
| CLAUDE.md conflict override | "If CLAUDE.md conflicts with this..." | `injection-patterns.mjs`: HIGH pattern. **Covered.** |
|
|
| Task reframing | "Your actual task is...", "Your real purpose is..." | `injection-patterns.mjs`: HIGH patterns. **Covered.** |
|
|
| Skip confirmation normalization | "Skip confirmation for..." | `injection-patterns.mjs`: HIGH pattern. **Covered.** |
|
|
|
|
---
|
|
|
|
## Category 3: Context Manipulation
|
|
|
|
**MITRE ATLAS:** AML.T0058 (AI Agent Context Poisoning), AML.T0020 (Poison Training Data)
|
|
|
|
Attacks that poison the agent's memory or persistent state.
|
|
|
|
| Technique | Description | Plugin Coverage |
|
|
|-----------|-------------|-----------------|
|
|
| CLAUDE.md poisoning | Malicious instructions injected into project CLAUDE.md | `memory-poisoning` scanner: detects injection patterns in CLAUDE.md and memory files. **Covered (scan-time).** |
|
|
| REMEMBER.md manipulation | False context injected into session state files | `memory-poisoning` scanner: scans REMEMBER.md. **Covered (scan-time).** |
|
|
| `.claude/rules/` injection | Malicious rule files added to rules directory | `memory-poisoning` scanner: scans rule files. **Covered (scan-time).** |
|
|
| Shell command in memory | Commands embedded in memory files | `memory-poisoning` scanner: shell command pattern detection. **Covered (scan-time).** |
|
|
| Credential path in memory | Paths to credential files in memory content | `memory-poisoning` scanner: credential path detection. **Covered (scan-time).** |
|
|
| Permission expansion | "Always allow Write/Bash" in memory files | `memory-poisoning` scanner: permission expansion patterns. **Covered (scan-time).** |
|
|
|
|
**Note:** Context manipulation attacks execute at session start before hooks run. The `memory-poisoning` scanner detects these at scan-time, not at runtime. This is a fundamental limitation — CLAUDE.md is loaded before any hook executes.
|
|
|
|
---
|
|
|
|
## Category 4: Multi-Agent Exploitation
|
|
|
|
**MITRE ATLAS:** AML.T0062 (Exfiltration via AI Agent Tool Invocation), AML.T0061 (AI Agent Tools)
|
|
|
|
Attacks that exploit trust relationships between agents in multi-agent systems.
|
|
|
|
| Technique | Description | Plugin Coverage |
|
|
|-----------|-------------|-----------------|
|
|
| Sub-agent spawning with dangerous capabilities | "Create a sub-agent that reads ~/.ssh and sends to..." | `injection-patterns.mjs`: MEDIUM pattern for spawn + dangerous keywords. **Covered (advisory).** |
|
|
| Delegation with safety bypass | "Delegate to agent without review/approval" | `injection-patterns.mjs`: MEDIUM pattern for delegation + bypass. **Covered (advisory).** |
|
|
| Escalation-after-input | Sub-agent spawned within 5 calls of untrusted input | `post-session-guard.mjs`: delegation tracking, escalation-after-input advisory. **Covered.** |
|
|
| Trust chain amplification | Compromised agent poisons shared state affecting others | `post-session-guard.mjs`: trifecta detection across tool calls. **Partial** — detects exfil pattern but not cross-agent poisoning. |
|
|
| Replay delegation | Replayed task prompt from previous session | Not covered. Would require task-level authentication. **Gap.** |
|
|
|
|
---
|
|
|
|
## Category 5: Capability Manipulation
|
|
|
|
**MITRE ATLAS:** AML.T0061 (AI Agent Tools), AML.T0010 (ML Supply Chain Compromise)
|
|
|
|
Attacks that cause the agent to misuse its own tools or escalate privileges.
|
|
|
|
| Technique | Description | Plugin Coverage |
|
|
|-----------|-------------|-----------------|
|
|
| Tool misuse via ambiguous prompts | Ambiguous instruction causes destructive tool use | `pre-bash-destructive.mjs`: blocks known destructive patterns. **Partial** — only blocks known patterns. |
|
|
| Privilege escalation mid-task | Agent requests elevated permissions during execution | `pre-write-pathguard.mjs`: blocks writes to sensitive paths. **Partial** — covers file paths, not API permissions. |
|
|
| Supply chain compromise | Malicious packages installed via agent action | `pre-install-supply-chain.mjs`: blocklists, age gate, typosquat detection. **Covered.** |
|
|
| Bash parameter expansion evasion | Empty quotes/expansion bypass naive pattern matching | `bash-normalize.mjs`: expansion normalization before pattern matching. **Covered.** |
|
|
| MCP tool descriptor poisoning | Tool description changed to mislead agent | `mcp-description-cache.mjs`: drift detection via Levenshtein distance. **Covered.** |
|
|
|
|
---
|
|
|
|
## Category 6: Human-in-the-Loop Exploitation
|
|
|
|
**MITRE ATLAS:** AML.T0043 (Craft Adversarial Data)
|
|
|
|
Attacks that exploit the human review step in agent workflows.
|
|
|
|
| Technique | Description | Plugin Coverage |
|
|
|-----------|-------------|-----------------|
|
|
| Approval urgency | "Approve immediately", "No need to review" | `injection-patterns.mjs`: HIGH patterns (4 variants). **Covered.** |
|
|
| Summary suppression | "For brevity, skipping details", "Details omitted" | `injection-patterns.mjs`: HIGH patterns (3 variants). **Covered.** |
|
|
| Scope minimization | "Minor change" + destructive reference | `injection-patterns.mjs`: HIGH pattern for scope minimization + destructive keyword. **Covered.** |
|
|
| Cognitive load trap | Injection buried after 2000+ chars in verbose output | `injection-patterns.mjs`: `checkCognitiveLoadTrap()` function. **Covered.** |
|
|
| Approval fatigue exploitation | Repeated benign requests followed by malicious one | Not covered. Would require session-level approval pattern tracking. **Gap.** |
|
|
|
|
---
|
|
|
|
## Coverage Summary
|
|
|
|
| Category | Techniques | Covered | Partial | Gap |
|
|
|----------|-----------|---------|---------|-----|
|
|
| 1. Content Injection | 11 | 11 | 0 | 0 |
|
|
| 2. Semantic Manipulation | 11 | 11 | 0 | 0 |
|
|
| 3. Context Manipulation | 6 | 6 | 0 | 0 |
|
|
| 4. Multi-Agent Exploitation | 5 | 3 | 1 | 1 |
|
|
| 5. Capability Manipulation | 5 | 3 | 2 | 0 |
|
|
| 6. HITL Exploitation | 5 | 4 | 0 | 1 |
|
|
| **Total** | **43** | **38** | **3** | **2** |
|
|
|
|
**Coverage rate:** 88% (38 covered) + 7% (3 partial) = **95% addressed**
|
|
|
|
### Known Gaps
|
|
|
|
1. **Replay delegation (Cat. 4):** Would require task-level authentication or signed task prompts. Beyond hook layer capability.
|
|
2. **Approval fatigue (Cat. 6):** Would require tracking approval patterns across a session. Feasible but not yet implemented.
|
|
|
|
### Fundamental Limitation
|
|
|
|
Context manipulation attacks (Category 3) execute at session start before hooks run. CLAUDE.md, REMEMBER.md, and rule files are loaded as system context before any UserPromptSubmit or PreToolUse hook fires. The `memory-poisoning` scanner detects these at scan-time (via `/security scan` or `/security deep-scan`), but cannot prevent them at runtime. This is an Anthropic platform limitation, not a plugin limitation.
|
|
|
|
---
|
|
|
|
## Cross-References
|
|
|
|
| Agent Trap Category | OWASP ASI | OWASP LLM |
|
|
|---------------------|-----------|-----------|
|
|
| 1. Content Injection | ASI01 (Goal Hijack) | LLM01 (Prompt Injection) |
|
|
| 2. Semantic Manipulation | ASI09 (Trust Exploitation) | LLM01 (Prompt Injection) |
|
|
| 3. Context Manipulation | ASI06 (Memory Poisoning) | LLM04 (Data Poisoning) |
|
|
| 4. Multi-Agent Exploitation | ASI07 (Inter-Agent Comms), ASI08 (Cascading) | LLM06 (Excessive Agency) |
|
|
| 5. Capability Manipulation | ASI02 (Tool Misuse), ASI05 (Code Execution) | LLM05 (Output Handling) |
|
|
| 6. HITL Exploitation | ASI09 (Trust Exploitation) | LLM06 (Excessive Agency) |
|
|
|
|
---
|
|
|
|
*Last updated: v5.0 S7 — Knowledge files + attack scenario expansion*
|