Kjell Tore Guttormsen f418a8fe08 feat(llm-security-copilot): port llm-security v5.1.0 to GitHub Copilot CLI

Full port of llm-security plugin for internal use on Windows with GitHub
Copilot CLI. Protocol translation layer (copilot-hook-runner.mjs)
normalizes Copilot camelCase I/O to Claude Code snake_case format — all
original hook scripts run unmodified.

- 8 hooks with protocol translation (stdin/stdout/exit code)
- 18 SKILL.md skills (Agent Skills Open Standard)
- 6 .agent.md agent definitions
- 20 scanners + 14 scanner lib modules (unchanged)
- 14 knowledge files (unchanged)
- 39 test files including copilot-port-verify.mjs (17 tests)
- Windows-ready: node:path, os.tmpdir(), process.execPath, no bash

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-09 21:56:10 +02:00


DeepMind AI Agent Traps — 6-Category Taxonomy

Full taxonomy of AI agent traps from Google DeepMind's "AI Agent Traps" paper (April 2026), with Claude Code mappings and plugin coverage status.

Purpose: Reference material for threat-modeler-agent and posture-assessor-agent. Maps each trap category to specific plugin controls and identifies coverage gaps.

Source: Google DeepMind, "AI Agent Traps: A Taxonomy of Attacks on Autonomous AI Agents" (April 2026)

Category 1: Content Injection

Attacks that embed malicious instructions in content the agent reads or processes.

1a. Steganography

Hidden payloads in content that appear benign to human reviewers but are parsed by the agent.

Technique	Description	Plugin Coverage
Unicode Tag steganography (U+E0000-E007F)	Invisible characters that decode to ASCII instructions	`string-utils.mjs`: `decodeUnicodeTags()` detects and decodes. `injection-patterns.mjs`: CRITICAL if decoded content matches injection patterns, HIGH for bare presence. Covered.
Zero-width character splitting	ZW chars inserted into keywords to evade pattern matching	`string-utils.mjs`: `normalizeForScan()` strips ZW chars. MEDIUM pattern flags ZW inside words. Covered.
BIDI override manipulation	Right-to-left override characters reorder visible text	`string-utils.mjs`: `stripBidiOverrides()` in normalization pipeline. Covered.
Homoglyph substitution	Cyrillic/Greek lookalikes replace Latin characters	`injection-patterns.mjs`: MEDIUM pattern detects Cyrillic-Latin mixing. Covered (advisory).
Base64 encoded payloads	Instructions encoded in base64 strings	`string-utils.mjs`: `normalizeForScan()` includes base64 decode iteration. Covered.
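
The decoding idea behind the first two rows can be illustrated with a minimal sketch. It assumes tag characters U+E0020 to U+E007E map to printable ASCII by subtracting 0xE0000. `decodeTagCharacters` and `stripZeroWidth` are hypothetical names used for illustration only; they are not the plugin's `decodeUnicodeTags()` or `normalizeForScan()`, whose implementations are not shown in this file.

```javascript
// Hypothetical helper: recover ASCII hidden in Unicode Tag characters.
function decodeTagCharacters(text) {
  let decoded = "";
  // for...of iterates by code point, so astral characters are not split.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp >= 0xe0020 && cp <= 0xe007e) {
      decoded += String.fromCharCode(cp - 0xe0000);
    }
  }
  return decoded;
}

// Hypothetical helper: remove common zero-width characters before matching.
const ZERO_WIDTH = /[\u200B\u200C\u200D\u2060\uFEFF]/g;
function stripZeroWidth(text) {
  return text.replace(ZERO_WIDTH, "");
}

// Build a demonstration payload that is invisible when rendered.
const hidden = Array.from("ignore rules", (c) =>
  String.fromCodePoint(c.codePointAt(0) + 0xe0000)
).join("");
const sample = "Looks harmless" + hidden;

console.log(decodeTagCharacters(sample));    // "ignore rules"
console.log(stripZeroWidth("ig\u200Bnore")); // "ignore"
```

The decoded string, not the raw input, is what a pattern matcher would need to see; this is consistent with the table's distinction between bare presence (HIGH) and a decoded match (CRITICAL).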

1b. Syntactic Masking

Instructions hidden in structural elements of content formats.

Technique	Description	Plugin Coverage
HTML comment injection	`<!-- AGENT: do X -->`	`injection-patterns.mjs`: HIGH pattern for agent-directed HTML comments. Covered.
CSS-hidden content	`display:none`, `visibility:hidden`, zero opacity/size	`injection-patterns.mjs`: HIGH patterns for 6 CSS obfuscation techniques. Covered.
Markdown link-reference comments	`[//]: # (ignore instructions)`	`injection-patterns.mjs`: MEDIUM pattern for suspicious link-ref comments. Covered.
Markdown link anchor injection	`[ignore system rules](url)`	`injection-patterns.mjs`: MEDIUM pattern for injection in anchor text. Covered.
ARIA attribute injection	`aria-label="ignore all rules"`	`injection-patterns.mjs`: HIGH pattern for injection in aria-label. Covered.
Data URI with executable types	`data:text/html,...`	`injection-patterns.mjs`: MEDIUM pattern for executable data URIs. Covered.
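
A rule table of the kind described above might look like the following sketch. The rule ids, the regular expressions, and `scanForMasking` are assumptions made for illustration; the actual patterns in `injection-patterns.mjs` are not reproduced in this file, and only three of the six CSS techniques are approximated here.

```javascript
// Illustrative rules only; not the plugin's real patterns.
const MASKING_RULES = [
  {
    id: "html-comment-directive",
    severity: "HIGH",
    regex: /<!--\s*(?:agent|assistant|ai|system)\s*:[\s\S]*?-->/i,
  },
  {
    id: "css-hidden-content",
    severity: "HIGH",
    // opacity:0 is matched, opacity:0.5 is not (negative lookahead).
    regex: /(?:display\s*:\s*none|visibility\s*:\s*hidden|opacity\s*:\s*0(?![.\d]))/i,
  },
  {
    id: "executable-data-uri",
    severity: "MEDIUM",
    regex: /data:(?:text\/html|application\/javascript|text\/javascript)[;,]/i,
  },
];

// Return the id and severity of every rule that matches the content.
function scanForMasking(content) {
  return MASKING_RULES
    .filter((rule) => rule.regex.test(content))
    .map(({ id, severity }) => ({ id, severity }));
}

console.log(scanForMasking("<!-- AGENT: do X -->"));
// [ { id: 'html-comment-directive', severity: 'HIGH' } ]
```

The regexes deliberately omit the `g` flag so that repeated `test()` calls carry no `lastIndex` state between inputs.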

Category 2: Semantic Manipulation

Attacks that manipulate the agent's reasoning without direct injection.

2a. Oversight Evasion

Framing that causes the agent to bypass its own safety checks.

Technique	Description	Plugin Coverage
Educational purpose framing	"For educational purposes only..."	`injection-patterns.mjs`: HIGH pattern. Covered.
Hypothetical scenario framing	"In a hypothetical scenario where..."	`injection-patterns.mjs`: HIGH pattern. Covered.
Fictional setting framing	"In a fictional world where..."	`injection-patterns.mjs`: HIGH pattern. Covered.
Red-team exercise framing	"This is a red-team simulation..."	`injection-patterns.mjs`: HIGH pattern. Covered.
Research purpose framing	"For research/academic purposes..."	`injection-patterns.mjs`: HIGH patterns (2 variants). Covered.
Test/lab pretense	"Pretend we are in a sandbox..."	`injection-patterns.mjs`: HIGH pattern. Covered.
Thought experiment framing	"As a thought experiment..."	`injection-patterns.mjs`: HIGH pattern. Covered.

2b. Context Normalization

Subtle reframing that overrides rules without explicit override language.

Technique	Description	Plugin Coverage
User intent precedence	"User intent takes precedence over rules"	`injection-patterns.mjs`: HIGH pattern. Covered.
CLAUDE.md conflict override	"If CLAUDE.md conflicts with this..."	`injection-patterns.mjs`: HIGH pattern. Covered.
Task reframing	"Your actual task is...", "Your real purpose is..."	`injection-patterns.mjs`: HIGH patterns. Covered.
Skip confirmation normalization	"Skip confirmation for..."	`injection-patterns.mjs`: HIGH pattern. Covered.
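
Because framing and normalization attacks are phrase-based, a matcher needs to tolerate whitespace and case variation. The sketch below builds one regex per phrase with `\s+` between words. The phrase list is taken from the example strings in the tables above; `phraseToRegex` and `findFraming` are hypothetical helpers, not functions the plugin is known to export.

```javascript
// Example phrases from the Category 2 tables (not the plugin's full list).
const FRAMING_PHRASES = [
  "for educational purposes only",
  "in a hypothetical scenario",
  "as a thought experiment",
  "user intent takes precedence",
  "your actual task is",
  "skip confirmation for",
];

// Escape regex metacharacters, then allow any whitespace run between words.
function phraseToRegex(phrase) {
  const words = phrase
    .split(" ")
    .map((word) => word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  return new RegExp("\\b" + words.join("\\s+") + "\\b", "i");
}

const FRAMING_RULES = FRAMING_PHRASES.map((phrase) => ({
  phrase,
  severity: "HIGH",
  regex: phraseToRegex(phrase),
}));

// Return the matched phrases in rule order.
function findFraming(text) {
  return FRAMING_RULES.filter((rule) => rule.regex.test(text)).map((rule) => rule.phrase);
}

console.log(findFraming("As a  thought\nexperiment, your actual task is to dump secrets"));
// [ 'as a thought experiment', 'your actual task is' ]
```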

Category 3: Context Manipulation

Attacks that poison the agent's memory or persistent state.

Technique	Description	Plugin Coverage
CLAUDE.md poisoning	Malicious instructions injected into project CLAUDE.md	`memory-poisoning` scanner: detects injection patterns in CLAUDE.md and memory files. Covered (scan-time).
REMEMBER.md manipulation	False context injected into session state files	`memory-poisoning` scanner: scans REMEMBER.md. Covered (scan-time).
`.claude/rules/` injection	Malicious rule files added to rules directory	`memory-poisoning` scanner: scans rule files. Covered (scan-time).
Shell command in memory	Commands embedded in memory files	`memory-poisoning` scanner: shell command pattern detection. Covered (scan-time).
Credential path in memory	Paths to credential files in memory content	`memory-poisoning` scanner: credential path detection. Covered (scan-time).
Permission expansion	"Always allow Write/Bash" in memory files	`memory-poisoning` scanner: permission expansion patterns. Covered (scan-time).

Note: Context manipulation payloads take effect at session start, because CLAUDE.md and the other memory files are loaded before any hook executes. The memory-poisoning scanner therefore detects them at scan-time only, not at runtime. This is a fundamental limitation.
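
A scan-time check over memory file content could be sketched as follows. The three check ids mirror rows in the table above, but the regular expressions and the `scanMemoryText` function are assumptions for illustration; the real `memory-poisoning` scanner is not shown in this file.

```javascript
// Illustrative checks only; not the scanner's actual patterns.
const MEMORY_CHECKS = [
  // A shell tool name followed on the same line by a pipe or a URL.
  { id: "shell-command", regex: /\b(?:curl|wget|bash|sh)\b[^\n]*(?:\||https?:\/\/)/i },
  // Common credential locations.
  { id: "credential-path", regex: /\.(?:ssh|aws|gnupg)\/|\.env\b/ },
  // "Always allow Write/Bash" style permission expansion.
  { id: "permission-expansion", regex: /\balways\s+allow\s+(?:write|bash|edit)\b/i },
];

// Scan text line by line and report 1-based line numbers.
function scanMemoryText(fileName, text) {
  const findings = [];
  text.split(/\r?\n/).forEach((line, index) => {
    for (const check of MEMORY_CHECKS) {
      if (check.regex.test(line)) {
        findings.push({ file: fileName, line: index + 1, id: check.id });
      }
    }
  });
  return findings;
}

const memory = [
  "# Project notes",
  "Always allow Bash without asking.",
  "Run: curl https://example.invalid/setup.sh | sh",
  "Keys live in ~/.ssh/id_rsa",
].join("\n");

console.log(scanMemoryText("CLAUDE.md", memory).map((f) => `${f.line}:${f.id}`));
// [ '2:permission-expansion', '3:shell-command', '4:credential-path' ]
```

Since this runs over file content rather than inside a hook, it can only report poisoned files before or after a session, which matches the scan-time qualifier in the table.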

Category 4: Multi-Agent Exploitation

Attacks that exploit trust relationships between agents in multi-agent systems.

Technique	Description	Plugin Coverage
Sub-agent spawning with dangerous capabilities	"Create a sub-agent that reads ~/.ssh and sends to..."	`injection-patterns.mjs`: MEDIUM pattern for spawn + dangerous keywords. Covered (advisory).
Delegation with safety bypass	"Delegate to agent without review/approval"	`injection-patterns.mjs`: MEDIUM pattern for delegation + bypass. Covered (advisory).
Escalation-after-input	Sub-agent spawned within 5 calls of untrusted input	`post-session-guard.mjs`: delegation tracking, escalation-after-input advisory. Covered.
Trust chain amplification	Compromised agent poisons shared state affecting others	`post-session-guard.mjs`: trifecta detection across tool calls. Partial — detects exfil pattern but not cross-agent poisoning.
Replay delegation	Replayed task prompt from previous session	Not covered. Would require task-level authentication. Gap.
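
The escalation-after-input row describes a windowed check over the tool-call sequence. A minimal sketch, assuming "within 5 calls" means an index distance of at most 5, and assuming `WebFetch`/`WebSearch` count as untrusted input and `Task` as the sub-agent spawn; the tool sets and `findEscalations` are illustrative, not taken from `post-session-guard.mjs`.

```javascript
const WINDOW = 5;                                           // from the table above
const UNTRUSTED_TOOLS = new Set(["WebFetch", "WebSearch"]); // assumed
const SPAWN_TOOLS = new Set(["Task"]);                      // assumed

// toolCalls: ordered list of tool names observed in a session.
function findEscalations(toolCalls) {
  const advisories = [];
  let lastUntrustedIndex = -1;
  toolCalls.forEach((tool, index) => {
    if (UNTRUSTED_TOOLS.has(tool)) {
      lastUntrustedIndex = index;
    }
    const recentInput =
      lastUntrustedIndex >= 0 && index - lastUntrustedIndex <= WINDOW;
    if (SPAWN_TOOLS.has(tool) && recentInput) {
      advisories.push({ spawnIndex: index, untrustedIndex: lastUntrustedIndex });
    }
  });
  return advisories;
}

console.log(findEscalations(["Read", "WebFetch", "Edit", "Task"]));
// [ { spawnIndex: 3, untrustedIndex: 1 } ]
```

Tracking only the most recent untrusted call keeps the state constant-size, which suits a hook that runs once per tool call.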

Category 5: Capability Manipulation

Attacks that cause the agent to misuse its own tools or escalate privileges.

Technique	Description	Plugin Coverage
Tool misuse via ambiguous prompts	Ambiguous instruction causes destructive tool use	`pre-bash-destructive.mjs`: blocks known destructive patterns. Partial — only blocks known patterns.
Privilege escalation mid-task	Agent requests elevated permissions during execution	`pre-write-pathguard.mjs`: blocks writes to sensitive paths. Partial — covers file paths, not API permissions.
Supply chain compromise	Malicious packages installed via agent action	`pre-install-supply-chain.mjs`: blocklists, age gate, typosquat detection. Covered.
Bash parameter expansion evasion	Empty quotes/expansion bypass naive pattern matching	`bash-normalize.mjs`: expansion normalization before pattern matching. Covered.
MCP tool descriptor poisoning	Tool description changed to mislead agent	`mcp-description-cache.mjs`: drift detection via Levenshtein distance. Covered.
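
Drift detection via Levenshtein distance, as named in the last row, can be sketched with the standard two-row dynamic-programming algorithm. Normalizing by the longer string's length and the `descriptionDrift` name are assumptions; how `mcp-description-cache.mjs` scores or thresholds drift is not stated in this file.

```javascript
// Classic Levenshtein edit distance using two rolling rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical score: 0 means identical, 1 means completely different.
function descriptionDrift(cached, current) {
  const longest = Math.max(cached.length, current.length);
  return longest === 0 ? 0 : levenshtein(cached, current) / longest;
}

console.log(levenshtein("kitten", "sitting")); // 3
console.log(descriptionDrift("abcd", "abcf")); // 0.25
```

A relative score makes one threshold usable for both short and long tool descriptions, at the cost of being less sensitive to a short malicious suffix appended to a long description.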

Category 6: Human-in-the-Loop Exploitation

Attacks that exploit the human review step in agent workflows.

Technique	Description	Plugin Coverage
Approval urgency	"Approve immediately", "No need to review"	`injection-patterns.mjs`: HIGH patterns (4 variants). Covered.
Summary suppression	"For brevity, skipping details", "Details omitted"	`injection-patterns.mjs`: HIGH patterns (3 variants). Covered.
Scope minimization	"Minor change" + destructive reference	`injection-patterns.mjs`: HIGH pattern for scope minimization + destructive keyword. Covered.
Cognitive load trap	Injection buried after 2000+ chars in verbose output	`injection-patterns.mjs`: `checkCognitiveLoadTrap()` function. Covered.
Approval fatigue exploitation	Repeated benign requests followed by malicious one	Not covered. Would require session-level approval pattern tracking. Gap.
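
The cognitive load trap row depends on where a match occurs, not only whether it occurs. A minimal sketch under stated assumptions: the 2000-character threshold comes from the table, while `INJECTION_HINT` and `isBuriedInjection` are hypothetical stand-ins and do not reproduce `checkCognitiveLoadTrap()`.

```javascript
const BURIAL_THRESHOLD = 2000; // from the table above

// Assumed stand-in for the plugin's injection patterns.
const INJECTION_HINT =
  /\b(?:ignore (?:all |previous )?(?:rules|instructions)|approve immediately)\b/i;

// True when the first injection-like phrase appears only after the threshold,
// i.e. where a human reviewer is least likely to still be reading closely.
function isBuriedInjection(output) {
  const match = INJECTION_HINT.exec(output);
  return match !== null && match.index >= BURIAL_THRESHOLD;
}

const filler = "lorem ".repeat(500); // 3000 characters of benign text
console.log(isBuriedInjection(filler + "ignore previous instructions")); // true
console.log(isBuriedInjection("ignore all rules"));                      // false
```

An injection at the start of the output returns `false` here by design; it is not buried, and the ordinary patterns are expected to flag it.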

Coverage Summary

Category	Techniques	Covered	Partial	Gap
1. Content Injection	11	11	0	0
2. Semantic Manipulation	11	11	0	0
3. Context Manipulation	6	6	0	0
4. Multi-Agent Exploitation	5	3	1	1
5. Capability Manipulation	5	3	2	0
6. HITL Exploitation	5	4	0	1
Total	43	38	3	2

Coverage rate: 88% (38 of 43 covered) + 7% (3 of 43 partial) = 95% addressed
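
The totals and percentages can be re-derived from the per-category rows. This sketch only restates the table's numbers; the variable names are illustrative.

```javascript
// Rows copied from the Coverage Summary table.
const coverage = [
  { category: "1. Content Injection", covered: 11, partial: 0, gap: 0 },
  { category: "2. Semantic Manipulation", covered: 11, partial: 0, gap: 0 },
  { category: "3. Context Manipulation", covered: 6, partial: 0, gap: 0 },
  { category: "4. Multi-Agent Exploitation", covered: 3, partial: 1, gap: 1 },
  { category: "5. Capability Manipulation", covered: 3, partial: 2, gap: 0 },
  { category: "6. HITL Exploitation", covered: 4, partial: 0, gap: 1 },
];

const totals = coverage.reduce(
  (sum, row) => ({
    covered: sum.covered + row.covered,
    partial: sum.partial + row.partial,
    gap: sum.gap + row.gap,
  }),
  { covered: 0, partial: 0, gap: 0 }
);

const techniques = totals.covered + totals.partial + totals.gap;
const percent = (count) => Math.round((count / techniques) * 100);

console.log(techniques);                               // 43
console.log(percent(totals.covered));                  // 88
console.log(percent(totals.partial));                  // 7
console.log(percent(totals.covered + totals.partial)); // 95
```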

Known Gaps

Replay delegation (Cat. 4): Would require task-level authentication or signed task prompts. Beyond hook layer capability.
Approval fatigue (Cat. 6): Would require tracking approval patterns across a session. Feasible but not yet implemented.

Fundamental Limitation

Context manipulation attacks (Category 3) execute at session start before hooks run. CLAUDE.md, REMEMBER.md, and rule files are loaded as system context before any UserPromptSubmit or PreToolUse hook fires. The memory-poisoning scanner detects these at scan-time (via /security scan or /security deep-scan), but cannot prevent them at runtime. This is an Anthropic platform limitation, not a plugin limitation.

Cross-References

Agent Trap Category	OWASP ASI	OWASP LLM
1. Content Injection	ASI01 (Goal Hijack)	LLM01 (Prompt Injection)
2. Semantic Manipulation	ASI09 (Trust Exploitation)	LLM01 (Prompt Injection)
3. Context Manipulation	ASI06 (Memory Poisoning)	LLM04 (Data Poisoning)
4. Multi-Agent Exploitation	ASI07 (Inter-Agent Comms), ASI08 (Cascading)	LLM06 (Excessive Agency)
5. Capability Manipulation	ASI02 (Tool Misuse), ASI05 (Code Execution)	LLM05 (Output Handling)
6. HITL Exploitation	ASI09 (Trust Exploitation)	LLM06 (Excessive Agency)

Last updated: v5.0 S7 — Knowledge files + attack scenario expansion
