ktg-plugin-marketplace/plugins/llm-security-copilot/knowledge/deepmind-agent-traps.md
Kjell Tore Guttormsen f418a8fe08 feat(llm-security-copilot): port llm-security v5.1.0 to GitHub Copilot CLI
2026-04-09 21:56:10 +02:00


DeepMind AI Agent Traps — 6-Category Taxonomy

Full taxonomy of AI agent traps from Google DeepMind's "AI Agent Traps" paper (April 2026), with Claude Code mappings and plugin coverage status.

Purpose: Reference material for threat-modeler-agent and posture-assessor-agent. Maps each trap category to specific plugin controls and identifies coverage gaps.

Source: Google DeepMind, "AI Agent Traps: A Taxonomy of Attacks on Autonomous AI Agents" (April 2026)


Category 1: Content Injection

Attacks that embed malicious instructions in content the agent reads or processes.

1a. Steganography

Hidden payloads in content that appear benign to human reviewers but are parsed by the agent.

| Technique | Description | Plugin Coverage |
|---|---|---|
| Unicode Tag steganography (U+E0000-E007F) | Invisible characters that decode to ASCII instructions | string-utils.mjs: decodeUnicodeTags() detects and decodes. injection-patterns.mjs: CRITICAL if decoded content matches injection patterns, HIGH for bare presence. Covered. |
| Zero-width character splitting | ZW chars inserted into keywords to evade pattern matching | string-utils.mjs: normalizeForScan() strips ZW chars. MEDIUM pattern flags ZW inside words. Covered. |
| BIDI override manipulation | Right-to-left override characters reorder visible text | string-utils.mjs: stripBidiOverrides() in normalization pipeline. Covered. |
| Homoglyph substitution | Cyrillic/Greek lookalikes replace Latin characters | injection-patterns.mjs: MEDIUM pattern detects Cyrillic-Latin mixing. Covered (advisory). |
| Base64 encoded payloads | Instructions encoded in base64 strings | string-utils.mjs: normalizeForScan() includes base64 decode iteration. Covered. |
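
The decoding and stripping steps named in this table can be illustrated with a minimal sketch. The function names `decodeTagChars` and `stripInvisible` are hypothetical and are not the plugin's `decodeUnicodeTags()` or `normalizeForScan()`; the character ranges are assumptions based on the Unicode Tags block and commonly abused zero-width and BIDI control characters.

```javascript
// Hypothetical sketch, not the plugin's string-utils.mjs implementation.

// Recover ASCII text hidden in Unicode Tag characters.
function decodeTagChars(text) {
  let decoded = "";
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    // Tag characters U+E0020..U+E007E mirror printable ASCII 0x20..0x7E.
    if (cp >= 0xe0020 && cp <= 0xe007e) {
      decoded += String.fromCodePoint(cp - 0xe0000);
    }
  }
  return decoded;
}

// Remove zero-width characters and BIDI controls before pattern matching.
// Assumed set: U+200B-U+200D, U+2060, U+FEFF, U+202A-U+202E, U+2066-U+2069.
function stripInvisible(text) {
  return text.replace(
    /[\u200B-\u200D\u2060\uFEFF\u202A-\u202E\u2066-\u2069]/g,
    ""
  );
}
```

A scanner built this way would run its injection patterns twice: once on the decoded tag payload and once on the stripped text, so that a keyword split by invisible characters still matches.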

1b. Syntactic Masking

Instructions hidden in structural elements of content formats.

| Technique | Description | Plugin Coverage |
|---|---|---|
| HTML comment injection | `<!-- AGENT: do X -->` | injection-patterns.mjs: HIGH pattern for agent-directed HTML comments. Covered. |
| CSS-hidden content | `display:none`, `visibility:hidden`, zero opacity/size | injection-patterns.mjs: HIGH patterns for 6 CSS obfuscation techniques. Covered. |
| Markdown link-reference comments | `[//]: # (ignore instructions)` | injection-patterns.mjs: MEDIUM pattern for suspicious link-ref comments. Covered. |
| Markdown link anchor injection | `[ignore system rules](url)` | injection-patterns.mjs: MEDIUM pattern for injection in anchor text. Covered. |
| ARIA attribute injection | `aria-label="ignore all rules"` | injection-patterns.mjs: HIGH pattern for injection in aria-label. Covered. |
| Data URI with executable types | `data:text/html,...` | injection-patterns.mjs: MEDIUM pattern for executable data URIs. Covered. |
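
As a rough illustration of how structural carriers like these might be checked, the sketch below extracts the text inside three of them and tests it for directive-like wording. The carrier list, the `DIRECTIVE` regex, and the name `findMaskedDirectives` are all assumptions made for this example; the actual patterns in injection-patterns.mjs are not reproduced here.

```javascript
// Hypothetical sketch, not the patterns shipped in injection-patterns.mjs.

// Each carrier regex captures the hidden text in group 1.
const HIDDEN_CARRIERS = [
  { name: "html-comment", re: /<!--([\s\S]*?)-->/g },
  { name: "md-linkref-comment", re: /^\[\/\/\]:\s*#\s*\((.*)\)\s*$/gm },
  { name: "aria-label", re: /aria-label\s*=\s*"([^"]*)"/gi },
];

// Assumed directive wording: "ignore ... rules/instructions" or "AGENT:".
const DIRECTIVE =
  /\b(ignore|disregard|override)\b[^.]*\b(instructions|rules)\b|\bagent\s*:/i;

function findMaskedDirectives(text) {
  const findings = [];
  for (const { name, re } of HIDDEN_CARRIERS) {
    for (const m of text.matchAll(re)) {
      if (DIRECTIVE.test(m[1])) {
        findings.push({ carrier: name, payload: m[1].trim() });
      }
    }
  }
  return findings;
}
```

Separating "where text can hide" from "what counts as a directive" keeps ordinary comments such as `<!-- TODO: fix layout -->` from being flagged.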

Category 2: Semantic Manipulation

Attacks that manipulate the agent's reasoning without direct injection.

2a. Oversight Evasion

Framing that causes the agent to bypass its own safety checks.

| Technique | Description | Plugin Coverage |
|---|---|---|
| Educational purpose framing | "For educational purposes only..." | injection-patterns.mjs: HIGH pattern. Covered. |
| Hypothetical scenario framing | "In a hypothetical scenario where..." | injection-patterns.mjs: HIGH pattern. Covered. |
| Fictional setting framing | "In a fictional world where..." | injection-patterns.mjs: HIGH pattern. Covered. |
| Red-team exercise framing | "This is a red-team simulation..." | injection-patterns.mjs: HIGH pattern. Covered. |
| Research purpose framing | "For research/academic purposes..." | injection-patterns.mjs: HIGH patterns (2 variants). Covered. |
| Test/lab pretense | "Pretend we are in a sandbox..." | injection-patterns.mjs: HIGH pattern. Covered. |
| Thought experiment framing | "As a thought experiment..." | injection-patterns.mjs: HIGH pattern. Covered. |

2b. Context Normalization

Subtle reframing that overrides rules without explicit override language.

| Technique | Description | Plugin Coverage |
|---|---|---|
| User intent precedence | "User intent takes precedence over rules" | injection-patterns.mjs: HIGH pattern. Covered. |
| CLAUDE.md conflict override | "If CLAUDE.md conflicts with this..." | injection-patterns.mjs: HIGH pattern. Covered. |
| Task reframing | "Your actual task is...", "Your real purpose is..." | injection-patterns.mjs: HIGH patterns. Covered. |
| Skip confirmation normalization | "Skip confirmation for..." | injection-patterns.mjs: HIGH pattern. Covered. |
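
The Category 2 techniques are all matched by phrase, so a severity-tagged pattern list is one plausible way to organize them. The sketch below is an assumption about shape only: the ids, the regexes, and the function `scanFraming` are invented for illustration and cover just a handful of the phrases quoted in the tables.

```javascript
// Hypothetical sketch; ids and regexes are illustrative, not the plugin's.
const FRAMING_PATTERNS = [
  { id: "educational-framing", severity: "HIGH", re: /\bfor educational purposes\b/i },
  { id: "hypothetical-framing", severity: "HIGH", re: /\bin a hypothetical scenario\b/i },
  { id: "intent-precedence", severity: "HIGH", re: /\buser intent takes precedence over\b/i },
  { id: "task-reframing", severity: "HIGH", re: /\byour (actual task|real purpose) is\b/i },
  { id: "skip-confirmation", severity: "HIGH", re: /\bskip confirmation for\b/i },
];

// Return every pattern that matches, with its severity.
function scanFraming(text) {
  return FRAMING_PATTERNS
    .filter((p) => p.re.test(text))
    .map(({ id, severity }) => ({ id, severity }));
}
```

Phrase matching of this kind is inherently easy to paraphrase around, which is worth keeping in mind when reading "Covered" in the tables for this category.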

Category 3: Context Manipulation

Attacks that poison the agent's memory or persistent state.

| Technique | Description | Plugin Coverage |
|---|---|---|
| CLAUDE.md poisoning | Malicious instructions injected into project CLAUDE.md | memory-poisoning scanner: detects injection patterns in CLAUDE.md and memory files. Covered (scan-time). |
| REMEMBER.md manipulation | False context injected into session state files | memory-poisoning scanner: scans REMEMBER.md. Covered (scan-time). |
| .claude/rules/ injection | Malicious rule files added to rules directory | memory-poisoning scanner: scans rule files. Covered (scan-time). |
| Shell command in memory | Commands embedded in memory files | memory-poisoning scanner: shell command pattern detection. Covered (scan-time). |
| Credential path in memory | Paths to credential files in memory content | memory-poisoning scanner: credential path detection. Covered (scan-time). |
| Permission expansion | "Always allow Write/Bash" in memory files | memory-poisoning scanner: permission expansion patterns. Covered (scan-time). |

Note: Context manipulation attacks take effect at session start, before hooks run. The memory-poisoning scanner therefore detects them at scan-time, not at runtime. This is a fundamental limitation: CLAUDE.md is loaded before any hook executes.
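
To make the scan-time idea concrete, the following sketch checks the text of a memory file line by line for three of the signals listed above. It is a simplified assumption of how such a scanner could work: the check ids, the regexes, and `scanMemoryText` are hypothetical, and the function takes file content as a string rather than reading from disk.

```javascript
// Hypothetical sketch of a scan-time memory check, not the plugin's scanner.
const MEMORY_CHECKS = [
  // "Always allow Write/Bash"-style permission expansion.
  { id: "permission-expansion", re: /\balways allow\b.*\b(Write|Bash)\b/i },
  // References to well-known credential locations (assumed short list).
  { id: "credential-path", re: /~\/\.ssh\b|~\/\.aws\/credentials\b/ },
  // Download piped straight into a shell.
  { id: "shell-command", re: /\b(curl|wget)\b[^\n]*\|\s*(ba)?sh\b/ },
];

function scanMemoryText(fileName, content) {
  const findings = [];
  content.split(/\r?\n/).forEach((line, i) => {
    for (const { id, re } of MEMORY_CHECKS) {
      if (re.test(line)) {
        findings.push({ file: fileName, line: i + 1, id });
      }
    }
  });
  return findings;
}
```

Because a check like this runs over files on disk, it can report a poisoned CLAUDE.md before the next session starts, but it cannot intervene once the file has already been loaded as context.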


Category 4: Multi-Agent Exploitation

Attacks that exploit trust relationships between agents in multi-agent systems.

| Technique | Description | Plugin Coverage |
|---|---|---|
| Sub-agent spawning with dangerous capabilities | "Create a sub-agent that reads ~/.ssh and sends to..." | injection-patterns.mjs: MEDIUM pattern for spawn + dangerous keywords. Covered (advisory). |
| Delegation with safety bypass | "Delegate to agent without review/approval" | injection-patterns.mjs: MEDIUM pattern for delegation + bypass. Covered (advisory). |
| Escalation-after-input | Sub-agent spawned within 5 calls of untrusted input | post-session-guard.mjs: delegation tracking, escalation-after-input advisory. Covered. |
| Trust chain amplification | Compromised agent poisons shared state affecting others | post-session-guard.mjs: trifecta detection across tool calls. Partial: detects exfil pattern but not cross-agent poisoning. |
| Replay delegation | Replayed task prompt from previous session | Not covered. Would require task-level authentication. Gap. |
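
The escalation-after-input rule ("sub-agent spawned within 5 calls of untrusted input") lends itself to a small stateful tracker. The sketch below is one possible reading of that rule, not the logic in post-session-guard.mjs: the tool names in `UNTRUSTED_TOOLS` and `DELEGATION_TOOLS`, the window semantics, and `createDelegationTracker` are all assumptions.

```javascript
// Hypothetical sketch; tool names and window semantics are assumptions.
const UNTRUSTED_TOOLS = new Set(["WebFetch", "WebSearch"]); // assumed sources of untrusted input
const DELEGATION_TOOLS = new Set(["Task"]); // assumed sub-agent spawning tool
const WINDOW = 5; // "within 5 calls", taken from the table above

function createDelegationTracker() {
  let callIndex = 0;
  let lastUntrustedIndex = null;

  // Call once per tool invocation; returns an advisory object or null.
  return function record(toolName) {
    callIndex += 1;
    if (UNTRUSTED_TOOLS.has(toolName)) {
      lastUntrustedIndex = callIndex;
      return null;
    }
    if (DELEGATION_TOOLS.has(toolName) && lastUntrustedIndex !== null) {
      const distance = callIndex - lastUntrustedIndex;
      if (distance <= WINDOW) {
        return { advisory: "escalation-after-input", distance };
      }
    }
    return null;
  };
}
```

The tracker only correlates timing, so it raises an advisory rather than blocking: a delegation shortly after a web fetch is suspicious but often legitimate.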

Category 5: Capability Manipulation

Attacks that cause the agent to misuse its own tools or escalate privileges.

| Technique | Description | Plugin Coverage |
|---|---|---|
| Tool misuse via ambiguous prompts | Ambiguous instruction causes destructive tool use | pre-bash-destructive.mjs: blocks known destructive patterns. Partial: only blocks known patterns. |
| Privilege escalation mid-task | Agent requests elevated permissions during execution | pre-write-pathguard.mjs: blocks writes to sensitive paths. Partial: covers file paths, not API permissions. |
| Supply chain compromise | Malicious packages installed via agent action | pre-install-supply-chain.mjs: blocklists, age gate, typosquat detection. Covered. |
| Bash parameter expansion evasion | Empty quotes/expansion bypass naive pattern matching | bash-normalize.mjs: expansion normalization before pattern matching. Covered. |
| MCP tool descriptor poisoning | Tool description changed to mislead agent | mcp-description-cache.mjs: drift detection via Levenshtein distance. Covered. |
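
Drift detection via Levenshtein distance can be sketched as follows. The `levenshtein` function is the standard dynamic-programming edit distance; `descriptionDrift`, its length normalization, and the 0.2 threshold are assumptions for illustration and may differ from what mcp-description-cache.mjs does.

```javascript
// Standard edit distance using two rolling rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical drift check: the normalization and threshold are assumed.
function descriptionDrift(cached, current, threshold = 0.2) {
  const maxLen = Math.max(cached.length, current.length);
  const ratio = maxLen === 0 ? 0 : levenshtein(cached, current) / maxLen;
  return { ratio, drifted: ratio > threshold };
}
```

Normalizing by the longer string's length makes the threshold behave consistently for short and long descriptions; a small typo fix stays below it, while an appended instruction pushes the ratio well above.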

Category 6: Human-in-the-Loop Exploitation

Attacks that exploit the human review step in agent workflows.

| Technique | Description | Plugin Coverage |
|---|---|---|
| Approval urgency | "Approve immediately", "No need to review" | injection-patterns.mjs: HIGH patterns (4 variants). Covered. |
| Summary suppression | "For brevity, skipping details", "Details omitted" | injection-patterns.mjs: HIGH patterns (3 variants). Covered. |
| Scope minimization | "Minor change" + destructive reference | injection-patterns.mjs: HIGH pattern for scope minimization + destructive keyword. Covered. |
| Cognitive load trap | Injection buried after 2000+ chars in verbose output | injection-patterns.mjs: checkCognitiveLoadTrap() function. Covered. |
| Approval fatigue exploitation | Repeated benign requests followed by malicious one | Not covered. Would require session-level approval pattern tracking. Gap. |
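
The cognitive load trap row describes a positional check: an injection phrase matters more when it sits deep inside long output that a reviewer is unlikely to read in full. A minimal sketch of that idea is shown below. It is not the plugin's `checkCognitiveLoadTrap()`; the function name `checkBuriedInjection` and the single regex are assumptions, and only the 2000-character threshold comes from the table.

```javascript
// Hypothetical sketch; the regex is illustrative only.
const BURIED_INJECTION =
  /\b(ignore|disregard) (all |any )?(previous|prior) (instructions|rules)\b/i;
const DEPTH_THRESHOLD = 2000; // "after 2000+ chars", from the table above

function checkBuriedInjection(output) {
  const match = BURIED_INJECTION.exec(output);
  if (!match) return null;
  // An injection is "buried" when it starts at or beyond the threshold.
  return { offset: match.index, buried: match.index >= DEPTH_THRESHOLD };
}
```

Reporting the offset alongside the flag lets a caller surface the exact location to the human reviewer instead of only raising a generic warning.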

Coverage Summary

| Category | Techniques | Covered | Partial | Gap |
|---|---|---|---|---|
| 1. Content Injection | 11 | 11 | 0 | 0 |
| 2. Semantic Manipulation | 11 | 11 | 0 | 0 |
| 3. Context Manipulation | 6 | 6 | 0 | 0 |
| 4. Multi-Agent Exploitation | 5 | 3 | 1 | 1 |
| 5. Capability Manipulation | 5 | 3 | 2 | 0 |
| 6. HITL Exploitation | 5 | 4 | 0 | 1 |
| Total | 43 | 38 | 3 | 2 |

Coverage rate: 38 of 43 techniques covered (88%) plus 3 partial (7%), for 95% addressed.
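
The percentages follow directly from the per-category counts. The short sketch below recomputes them from the summary table; the `rows` data mirrors the table, while `summarize` and its rounding to whole percentages are assumptions about how the figures were derived.

```javascript
// Per-category counts copied from the Coverage Summary table.
const rows = [
  { category: "1. Content Injection", covered: 11, partial: 0, gap: 0 },
  { category: "2. Semantic Manipulation", covered: 11, partial: 0, gap: 0 },
  { category: "3. Context Manipulation", covered: 6, partial: 0, gap: 0 },
  { category: "4. Multi-Agent Exploitation", covered: 3, partial: 1, gap: 1 },
  { category: "5. Capability Manipulation", covered: 3, partial: 2, gap: 0 },
  { category: "6. HITL Exploitation", covered: 4, partial: 0, gap: 1 },
];

// Assumed derivation: totals summed across categories, rounded to whole percent.
function summarize(data) {
  const sum = (key) => data.reduce((acc, r) => acc + r[key], 0);
  const covered = sum("covered");
  const partial = sum("partial");
  const gap = sum("gap");
  const total = covered + partial + gap;
  const pct = (n) => Math.round((n / total) * 100);
  return {
    total,
    covered,
    partial,
    gap,
    coveredPct: pct(covered),
    partialPct: pct(partial),
    addressedPct: pct(covered + partial),
  };
}
```

Note that the rates count techniques equally; an advisory-only MEDIUM pattern and a blocking hook both count as one covered technique.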

Known Gaps

  1. Replay delegation (Cat. 4): Closing this gap would require task-level authentication or signed task prompts, which is beyond what the hook layer can provide.
  2. Approval fatigue (Cat. 6): Closing this gap would require tracking approval patterns across a session. This is feasible but not yet implemented.

Fundamental Limitation

Context manipulation attacks (Category 3) execute at session start before hooks run. CLAUDE.md, REMEMBER.md, and rule files are loaded as system context before any UserPromptSubmit or PreToolUse hook fires. The memory-poisoning scanner detects these at scan-time (via /security scan or /security deep-scan), but cannot prevent them at runtime. This is an Anthropic platform limitation, not a plugin limitation.


Cross-References

| Agent Trap Category | OWASP ASI | OWASP LLM |
|---|---|---|
| 1. Content Injection | ASI01 (Goal Hijack) | LLM01 (Prompt Injection) |
| 2. Semantic Manipulation | ASI09 (Trust Exploitation) | LLM01 (Prompt Injection) |
| 3. Context Manipulation | ASI06 (Memory Poisoning) | LLM04 (Data Poisoning) |
| 4. Multi-Agent Exploitation | ASI07 (Inter-Agent Comms), ASI08 (Cascading) | LLM06 (Excessive Agency) |
| 5. Capability Manipulation | ASI02 (Tool Misuse), ASI05 (Code Execution) | LLM05 (Output Handling) |
| 6. HITL Exploitation | ASI09 (Trust Exploitation) | LLM06 (Excessive Agency) |

Last updated: v5.0 S7 — Knowledge files + attack scenario expansion