12 KiB
DeepMind AI Agent Traps — 6-Category Taxonomy
Full taxonomy of AI agent traps from Google DeepMind's "AI Agent Traps" paper (April 2026), with Claude Code mappings and plugin coverage status.
Purpose: Reference material for threat-modeler-agent and posture-assessor-agent. Maps each trap category to specific plugin controls and identifies coverage gaps.
Source: Google DeepMind, "AI Agent Traps: A Taxonomy of Attacks on Autonomous AI Agents" (April 2026)
Category 1: Content Injection
MITRE ATLAS: AML.T0051 (LLM Prompt Injection), AML.T0043 (Craft Adversarial Data)
Attacks that embed malicious instructions in content the agent reads or processes.
1a. Steganography
Hidden payloads in content that appear benign to human reviewers but are parsed by the agent.
| Technique | Description | Plugin Coverage |
|---|---|---|
| Unicode Tag steganography (U+E0000-E007F) | Invisible characters that decode to ASCII instructions | string-utils.mjs: decodeUnicodeTags() detects and decodes. injection-patterns.mjs: CRITICAL if decoded content matches injection patterns, HIGH for bare presence. Covered. |
| Zero-width character splitting | ZW chars inserted into keywords to evade pattern matching | string-utils.mjs: normalizeForScan() strips ZW chars. MEDIUM pattern flags ZW inside words. Covered. |
| BIDI override manipulation | Right-to-left override characters reorder visible text | string-utils.mjs: stripBidiOverrides() in normalization pipeline. Covered. |
| Homoglyph substitution | Cyrillic/Greek lookalikes replace Latin characters | injection-patterns.mjs: MEDIUM pattern detects Cyrillic-Latin mixing. Covered (advisory). |
| Base64 encoded payloads | Instructions encoded in base64 strings | string-utils.mjs: normalizeForScan() includes base64 decode iteration. Covered. |
1b. Syntactic Masking
Instructions hidden in structural elements of content formats.
| Technique | Description | Plugin Coverage |
|---|---|---|
| HTML comment injection | <!-- AGENT: do X --> |
injection-patterns.mjs: HIGH pattern for agent-directed HTML comments. Covered. |
| CSS-hidden content | display:none, visibility:hidden, zero opacity/size |
injection-patterns.mjs: HIGH patterns for 6 CSS obfuscation techniques. Covered. |
| Markdown link-reference comments | [//]: # (ignore instructions) |
injection-patterns.mjs: MEDIUM pattern for suspicious link-ref comments. Covered. |
| Markdown link anchor injection | [ignore system rules](url) |
injection-patterns.mjs: MEDIUM pattern for injection in anchor text. Covered. |
| ARIA attribute injection | aria-label="ignore all rules" |
injection-patterns.mjs: HIGH pattern for injection in aria-label. Covered. |
| Data URI with executable types | data:text/html,... |
injection-patterns.mjs: MEDIUM pattern for executable data URIs. Covered. |
Category 2: Semantic Manipulation
MITRE ATLAS: AML.T0051 (LLM Prompt Injection)
Attacks that manipulate the agent's reasoning without direct injection.
2a. Oversight Evasion
Framing that causes the agent to bypass its own safety checks.
| Technique | Description | Plugin Coverage |
|---|---|---|
| Educational purpose framing | "For educational purposes only..." | injection-patterns.mjs: HIGH pattern. Covered. |
| Hypothetical scenario framing | "In a hypothetical scenario where..." | injection-patterns.mjs: HIGH pattern. Covered. |
| Fictional setting framing | "In a fictional world where..." | injection-patterns.mjs: HIGH pattern. Covered. |
| Red-team exercise framing | "This is a red-team simulation..." | injection-patterns.mjs: HIGH pattern. Covered. |
| Research purpose framing | "For research/academic purposes..." | injection-patterns.mjs: HIGH patterns (2 variants). Covered. |
| Test/lab pretense | "Pretend we are in a sandbox..." | injection-patterns.mjs: HIGH pattern. Covered. |
| Thought experiment framing | "As a thought experiment..." | injection-patterns.mjs: HIGH pattern. Covered. |
2b. Context Normalization
Subtle reframing that overrides rules without explicit override language.
| Technique | Description | Plugin Coverage |
|---|---|---|
| User intent precedence | "User intent takes precedence over rules" | injection-patterns.mjs: HIGH pattern. Covered. |
| CLAUDE.md conflict override | "If CLAUDE.md conflicts with this..." | injection-patterns.mjs: HIGH pattern. Covered. |
| Task reframing | "Your actual task is...", "Your real purpose is..." | injection-patterns.mjs: HIGH patterns. Covered. |
| Skip confirmation normalization | "Skip confirmation for..." | injection-patterns.mjs: HIGH pattern. Covered. |
Category 3: Context Manipulation
MITRE ATLAS: AML.T0058 (AI Agent Context Poisoning), AML.T0020 (Poison Training Data)
Attacks that poison the agent's memory or persistent state.
| Technique | Description | Plugin Coverage |
|---|---|---|
| CLAUDE.md poisoning | Malicious instructions injected into project CLAUDE.md | memory-poisoning scanner: detects injection patterns in CLAUDE.md and memory files. Covered (scan-time). |
| REMEMBER.md manipulation | False context injected into session state files | memory-poisoning scanner: scans REMEMBER.md. Covered (scan-time). |
.claude/rules/ injection |
Malicious rule files added to rules directory | memory-poisoning scanner: scans rule files. Covered (scan-time). |
| Shell command in memory | Commands embedded in memory files | memory-poisoning scanner: shell command pattern detection. Covered (scan-time). |
| Credential path in memory | Paths to credential files in memory content | memory-poisoning scanner: credential path detection. Covered (scan-time). |
| Permission expansion | "Always allow Write/Bash" in memory files | memory-poisoning scanner: permission expansion patterns. Covered (scan-time). |
Note: Context manipulation attacks execute at session start before hooks run. The memory-poisoning scanner detects these at scan-time, not at runtime. This is a fundamental limitation — CLAUDE.md is loaded before any hook executes.
Category 4: Multi-Agent Exploitation
MITRE ATLAS: AML.T0062 (Exfiltration via AI Agent Tool Invocation), AML.T0061 (AI Agent Tools)
Attacks that exploit trust relationships between agents in multi-agent systems.
| Technique | Description | Plugin Coverage |
|---|---|---|
| Sub-agent spawning with dangerous capabilities | "Create a sub-agent that reads ~/.ssh and sends to..." | injection-patterns.mjs: MEDIUM pattern for spawn + dangerous keywords. Covered (advisory). |
| Delegation with safety bypass | "Delegate to agent without review/approval" | injection-patterns.mjs: MEDIUM pattern for delegation + bypass. Covered (advisory). |
| Escalation-after-input | Sub-agent spawned within 5 calls of untrusted input | post-session-guard.mjs: delegation tracking, escalation-after-input advisory. Covered. |
| Trust chain amplification | Compromised agent poisons shared state affecting others | post-session-guard.mjs: trifecta detection across tool calls. Partial — detects exfil pattern but not cross-agent poisoning. |
| Replay delegation | Replayed task prompt from previous session | Not covered. Would require task-level authentication. Gap. |
Category 5: Capability Manipulation
MITRE ATLAS: AML.T0061 (AI Agent Tools), AML.T0010 (ML Supply Chain Compromise)
Attacks that cause the agent to misuse its own tools or escalate privileges.
| Technique | Description | Plugin Coverage |
|---|---|---|
| Tool misuse via ambiguous prompts | Ambiguous instruction causes destructive tool use | pre-bash-destructive.mjs: blocks known destructive patterns. Partial — only blocks known patterns. |
| Privilege escalation mid-task | Agent requests elevated permissions during execution | pre-write-pathguard.mjs: blocks writes to sensitive paths. Partial — covers file paths, not API permissions. |
| Supply chain compromise | Malicious packages installed via agent action | pre-install-supply-chain.mjs: blocklists, age gate, typosquat detection. Covered. |
| Bash parameter expansion evasion | Empty quotes/expansion bypass naive pattern matching | bash-normalize.mjs: expansion normalization before pattern matching. Covered. |
| MCP tool descriptor poisoning | Tool description changed to mislead agent | mcp-description-cache.mjs: drift detection via Levenshtein distance. Covered. |
Category 6: Human-in-the-Loop Exploitation
MITRE ATLAS: AML.T0043 (Craft Adversarial Data)
Attacks that exploit the human review step in agent workflows.
| Technique | Description | Plugin Coverage |
|---|---|---|
| Approval urgency | "Approve immediately", "No need to review" | injection-patterns.mjs: HIGH patterns (4 variants). Covered. |
| Summary suppression | "For brevity, skipping details", "Details omitted" | injection-patterns.mjs: HIGH patterns (3 variants). Covered. |
| Scope minimization | "Minor change" + destructive reference | injection-patterns.mjs: HIGH pattern for scope minimization + destructive keyword. Covered. |
| Cognitive load trap | Injection buried after 2000+ chars in verbose output | injection-patterns.mjs: checkCognitiveLoadTrap() function. Covered. |
| Approval fatigue exploitation | Repeated benign requests followed by malicious one | Not covered. Would require session-level approval pattern tracking. Gap. |
Coverage Summary
| Category | Techniques | Covered | Partial | Gap |
|---|---|---|---|---|
| 1. Content Injection | 11 | 11 | 0 | 0 |
| 2. Semantic Manipulation | 11 | 11 | 0 | 0 |
| 3. Context Manipulation | 6 | 6 | 0 | 0 |
| 4. Multi-Agent Exploitation | 5 | 3 | 1 | 1 |
| 5. Capability Manipulation | 5 | 3 | 2 | 0 |
| 6. HITL Exploitation | 5 | 4 | 0 | 1 |
| Total | 43 | 38 | 3 | 2 |
Coverage rate: 88% (38 covered) + 7% (3 partial) = 95% addressed
Known Gaps
- Replay delegation (Cat. 4): Would require task-level authentication or signed task prompts. Beyond hook layer capability.
- Approval fatigue (Cat. 6): Would require tracking approval patterns across a session. Feasible but not yet implemented.
Fundamental Limitation
Context manipulation attacks (Category 3) execute at session start before hooks run. CLAUDE.md, REMEMBER.md, and rule files are loaded as system context before any UserPromptSubmit or PreToolUse hook fires. The memory-poisoning scanner detects these at scan-time (via /security scan or /security deep-scan), but cannot prevent them at runtime. This is an Anthropic platform limitation, not a plugin limitation.
Cross-References
| Agent Trap Category | OWASP ASI | OWASP LLM |
|---|---|---|
| 1. Content Injection | ASI01 (Goal Hijack) | LLM01 (Prompt Injection) |
| 2. Semantic Manipulation | ASI09 (Trust Exploitation) | LLM01 (Prompt Injection) |
| 3. Context Manipulation | ASI06 (Memory Poisoning) | LLM04 (Data Poisoning) |
| 4. Multi-Agent Exploitation | ASI07 (Inter-Agent Comms), ASI08 (Cascading) | LLM06 (Excessive Agency) |
| 5. Capability Manipulation | ASI02 (Tool Misuse), ASI05 (Code Execution) | LLM05 (Output Handling) |
| 6. HITL Exploitation | ASI09 (Trust Exploitation) | LLM06 (Excessive Agency) |
Last updated: v5.0 S7 — Knowledge files + attack scenario expansion