Kjell Tore Guttormsen f93d6abdae feat: initial open marketplace with llm-security, config-audit, ultraplan-local

2026-04-06 18:47:49 +02:00

10 KiB

Raw Blame History

Prompt Injection Research 2025-2026

Research summary for the llm-security plugin. Documents what the field has learned about prompt injection, what can and cannot be defended deterministically, and how each finding maps to plugin controls.

Purpose: Reference material for posture-assessor-agent, threat-modeler-agent, and the "Known Limitations" section of documentation. Not loaded by default — only referenced when deep context is needed.

1. OpenAI — "Continuously Hardening ChatGPT Atlas" (December 2025)

Key findings:

RL-trained attacker agent discovered multi-step injection chains spanning hundreds of tool calls
Long-horizon attacks evade sliding-window detectors that only examine recent calls
More capable models are NOT inherently more robust to injection
Indirect injection via tool outputs (files, web pages, API responses) remains the primary attack vector

Implications for hook defenses:

Sliding-window trifecta detection (20 calls) is insufficient for long-horizon attacks
Extended 100-call window (v5.0 S3) addresses the gap but cannot catch attacks spread over 200+ calls
Behavioral drift detection (Jensen-Shannon divergence) provides a complementary signal
No deterministic defense can fully prevent multi-hundred-step attack chains

Plugin controls:

post-session-guard.mjs: 100-call long-horizon window, slow-burn trifecta detection
post-session-guard.mjs: Behavioral drift via Jensen-Shannon divergence on tool distributions
Gap: Attacks exceeding 100 calls without detectable pattern remain undefended

2. Joint Paper — "The Attacker Moves Second" (arXiv 2510.09023, October 2025)

Authors: 14 researchers from Google DeepMind, ETH Zurich, MIRI, and others

Key findings:

Tested 12 proposed defenses against adaptive attackers
All 12 defenses broken with 95-100% attack success rate (ASR)
Defenses tested include: instruction hierarchy, delimiters, input/output filtering, sandwich defense, XML tagging, spotlighting, signed prompts, LLM-as-judge, known-answer detection, prompt shield, task-oriented, and repeat-back
Fundamental result: any defense that operates within the same token space as the attacker can be bypassed by a sufficiently adaptive attacker

Implications for hook defenses:

Pattern-matching hooks (regex-based) are a necessary but insufficient layer
No single defense mechanism achieves reliable protection against adaptive attackers
Defense-in-depth is the only viable strategy: raise attack cost, not prevent attacks
Fixed payloads in red-team testing give false confidence; adaptive testing essential

Plugin controls:

attack-simulator.mjs --adaptive: 5 mutation rounds test evasion resistance
All hooks: defense-in-depth layers (input scan + output scan + session monitoring + supply chain)
Gap: Novel synonym substitutions and semantic-level evasions bypass regex patterns

3. Meta — "Agents Rule of Two" (October 2025)

Key findings:

Formalized the "lethal trifecta" as a constraint: untrusted input (A) + sensitive data (B) + state change/exfiltration (C)
Rule of Two: an agent should never simultaneously hold all three capabilities
Proposed architectural constraint rather than detection-based defense
Block mode enforces constraint at runtime; warn mode provides monitoring

Implications for hook defenses:

Trifecta detection transitions from advisory to enforceable constraint
MCP-concentrated trifecta (all legs from same server) warrants elevated severity
Blocking mode must be opt-in to avoid breaking legitimate workflows
Sensitive path patterns need expansion as new sensitive files emerge

Plugin controls:

post-session-guard.mjs: LLM_SECURITY_TRIFECTA_MODE=block|warn|off
Block mode: exit 2 for MCP-concentrated trifecta or sensitive path + exfil
Default warn mode preserves backward compatibility
Gap: Rule of Two is approximate — false positives possible for legitimate multi-tool workflows

4. Google DeepMind — "AI Agent Traps: A Taxonomy" (April 2026)

Key findings:

6-category taxonomy of traps targeting AI agents (see deepmind-agent-traps.md for full mapping)
Category 1: Content injection (steganography, syntactic masking)
Category 2: Semantic manipulation (oversight evasion, critic suppression)
Category 3: Context manipulation (memory poisoning, preference injection)
Category 4: Multi-agent exploitation (delegation abuse, trust chain attacks)
Category 5: Capability manipulation (tool misuse, privilege escalation)
Category 6: Human-in-the-loop exploitation (approval fatigue, summary suppression)

Implications for hook defenses:

Unicode Tag steganography (U+E0000-E007F) is a real vector for invisible injection
HITL traps exploit the human review step that security depends on
Sub-agent spawning creates trust delegation chains that amplify other attacks
Memory/context poisoning is persistent — survives session boundaries

Plugin controls:

injection-patterns.mjs: Unicode Tag detection (CRITICAL/HIGH), HITL trap patterns (HIGH), sub-agent spawn patterns (MEDIUM)
string-utils.mjs: decodeUnicodeTags(), stripBidiOverrides()
post-session-guard.mjs: Sub-agent delegation tracking, escalation-after-input advisory
See deepmind-agent-traps.md for complete coverage mapping

5. Google DeepMind — "Lessons from Defending Gemini" (May 2025)

Key findings:

Production-scale defense requires multiple independent layers
Instruction hierarchy helps but does not eliminate injection
Monitoring and alerting on anomalous agent behavior is essential for detection
More capable models show improved instruction-following but also improved attack surface
Real-world attacks often combine multiple techniques (hybrid attacks)

Implications for hook defenses:

Defense layers should be independently effective (not cascading dependencies)
Hook architecture (PreToolUse + PostToolUse + session guard) provides independent layers
Each hook should fail-safe (allow on error, not block)
Monitoring hooks should emit structured data for downstream analysis

Plugin controls:

Independent hook layers: input (pre-prompt-inject-scan), output (post-mcp-verify), session (post-session-guard), file (pre-edit-secrets, pre-write-pathguard), command (pre-bash-destructive, pre-install-supply-chain)
Each hook exits 0 on parse errors (fail-open for availability)
Structured JSON output for all advisories

6. Preamble — "Prompt Injection 2.0" (arXiv 2507.13169, January 2026)

Key findings:

Hybrid attacks combine prompt injection with other vulnerability classes:
- P2SQL: Injection text contains SQL keywords targeting downstream database operations
- Recursive injection: Injected text instructs the model to inject into its own output
- XSS in agent context: Script/event handlers in content processed by agents
Bash parameter expansion evasion: c${u}rl, w''get, r""m bypass naive pattern matching
Natural language indirection: instructions phrased as natural language requests rather than commands
Attacks succeed because each component alone appears benign; the combination is malicious

Implications for hook defenses:

Bash hooks need expansion normalization before pattern matching
Output scanning must check for cross-domain patterns (SQL + injection, XSS + injection)
NL indirection has inherent FP risk — deterministic hooks can only catch keyword patterns
Recursive injection is particularly dangerous for multi-agent systems

Plugin controls:

bash-normalize.mjs: Strips '', "", ${x}, \ before pattern matching
injection-patterns.mjs: HYBRID_PATTERNS for P2SQL, recursive, XSS
injection-patterns.mjs: NL indirection MEDIUM patterns (high FP caution)
post-mcp-verify.mjs: Hybrid pattern check on tool output
Gap: Novel NL indirection phrasing evades keyword patterns

7. Google DeepMind — CaMeL Defense Proposal (2025)

Key findings:

Proposed data flow tagging: track provenance of data through agent tool chains
Each data item receives a tag (hash) when produced by a tool
Tags propagate when data flows from one tool's output to another's input
Trifecta with linked data flows (provenance-tracked) has higher confidence than coincidental trifecta
Full CaMeL requires platform-level control plane — not implementable in hook layer

Implications for hook defenses:

Lightweight data-tagging (~30% of benefit, ~5% of complexity) is feasible in hooks
Hash first 200 chars of tool output as data tag; check substring match in next tool input
Linked flows elevate trifecta severity (higher confidence of intentional exfiltration chain)
Full provenance tracking requires platform support beyond what hooks can provide

Plugin controls:

post-session-guard.mjs: SHA-256 data tag on tool output, substring match on next input
Linked-flow trifecta reported with elevated severity
State file extended with dataTag field per entry
Gap: Substring matching is approximate; transformed data loses tag linkage

Summary: What Deterministic Hooks Can and Cannot Defend

Can defend (raise attack cost):

Known injection patterns (regex matching on critical/high/medium patterns)
Known evasion techniques (Unicode normalization, bash expansion, base64 decoding)
Known bad packages (blocklist-based supply chain protection)
Structural anomalies (trifecta patterns, behavioral drift, data volume spikes)
Known sensitive paths and secret patterns

Cannot defend (fundamental limitations):

Novel natural language indirection without keyword patterns
Adaptive attacks from motivated human red-teamers (100% ASR per joint paper)
Long-horizon attacks spanning hundreds of steps without detectable pattern
Semantic-level prompt injection (meaning-preserving rewording)
CLAUDE.md loading before hooks execute (Anthropic platform limitation)
Full data provenance tracking (requires platform-level control plane)

Design philosophy (v5.0):

Defense-in-depth: Multiple independent layers, each raising attack cost
Honest limitations: Document what cannot be defended, don't claim prevention
Advisory over blocking: MEDIUM patterns advise, never block (FP risk)
Opt-in enforcement: Rule of Two blocking requires explicit opt-in
Adaptive testing: Red-team with mutations, not just fixed payloads

Last updated: v5.0 S7 — Knowledge files + attack scenario expansion Sources verified against published papers as of 2026-04

10 KiB Raw Blame History

Prompt Injection Research 2025-2026

1. OpenAI — "Continuously Hardening ChatGPT Atlas" (December 2025)

2. Joint Paper — "The Attacker Moves Second" (arXiv 2510.09023, October 2025)

3. Meta — "Agents Rule of Two" (October 2025)

4. Google DeepMind — "AI Agent Traps: A Taxonomy" (April 2026)

5. Google DeepMind — "Lessons from Defending Gemini" (May 2025)

6. Preamble — "Prompt Injection 2.0" (arXiv 2507.13169, January 2026)

7. Google DeepMind — CaMeL Defense Proposal (2025)

Summary: What Deterministic Hooks Can and Cannot Defend

Can defend (raise attack cost):

Cannot defend (fundamental limitations):

Design philosophy (v5.0):

10 KiB

Raw Blame History