# Prompt Injection Research 2025-2026 Research summary for the llm-security plugin. Documents what the field has learned about prompt injection, what can and cannot be defended deterministically, and how each finding maps to plugin controls. **Purpose:** Reference material for `posture-assessor-agent`, `threat-modeler-agent`, and the "Known Limitations" section of documentation. Not loaded by default — only referenced when deep context is needed. --- ## 1. OpenAI — "Continuously Hardening ChatGPT Atlas" (December 2025) **Key findings:** - RL-trained attacker agent discovered multi-step injection chains spanning hundreds of tool calls - Long-horizon attacks evade sliding-window detectors that only examine recent calls - More capable models are NOT inherently more robust to injection - Indirect injection via tool outputs (files, web pages, API responses) remains the primary attack vector **Implications for hook defenses:** - Sliding-window trifecta detection (20 calls) is insufficient for long-horizon attacks - Extended 100-call window (v5.0 S3) addresses the gap but cannot catch attacks spread over 200+ calls - Behavioral drift detection (Jensen-Shannon divergence) provides a complementary signal - No deterministic defense can fully prevent multi-hundred-step attack chains **Plugin controls:** - `post-session-guard.mjs`: 100-call long-horizon window, slow-burn trifecta detection - `post-session-guard.mjs`: Behavioral drift via Jensen-Shannon divergence on tool distributions - **Gap:** Attacks exceeding 100 calls without detectable pattern remain undefended --- ## 2. Joint Paper — "The Attacker Moves Second" (arXiv 2510.09023, October 2025) **Authors:** 14 researchers from Google DeepMind, ETH Zurich, MIRI, and others **Key findings:** - Tested 12 proposed defenses against adaptive attackers - All 12 defenses broken with 95-100% attack success rate (ASR) - Defenses tested include: instruction hierarchy, delimiters, input/output filtering, sandwich defense, XML tagging, spotlighting, signed prompts, LLM-as-judge, known-answer detection, prompt shield, task-oriented, and repeat-back - Fundamental result: any defense that operates within the same token space as the attacker can be bypassed by a sufficiently adaptive attacker **Implications for hook defenses:** - Pattern-matching hooks (regex-based) are a necessary but insufficient layer - No single defense mechanism achieves reliable protection against adaptive attackers - Defense-in-depth is the only viable strategy: raise attack cost, not prevent attacks - Fixed payloads in red-team testing give false confidence; adaptive testing essential **Plugin controls:** - `attack-simulator.mjs --adaptive`: 5 mutation rounds test evasion resistance - All hooks: defense-in-depth layers (input scan + output scan + session monitoring + supply chain) - **Gap:** Novel synonym substitutions and semantic-level evasions bypass regex patterns --- ## 3. Meta — "Agents Rule of Two" (October 2025) **Key findings:** - Formalized the "lethal trifecta" as a constraint: untrusted input (A) + sensitive data (B) + state change/exfiltration (C) - Rule of Two: an agent should never simultaneously hold all three capabilities - Proposed architectural constraint rather than detection-based defense - Block mode enforces constraint at runtime; warn mode provides monitoring **Implications for hook defenses:** - Trifecta detection transitions from advisory to enforceable constraint - MCP-concentrated trifecta (all legs from same server) warrants elevated severity - Blocking mode must be opt-in to avoid breaking legitimate workflows - Sensitive path patterns need expansion as new sensitive files emerge **Plugin controls:** - `post-session-guard.mjs`: `LLM_SECURITY_TRIFECTA_MODE=block|warn|off` - Block mode: exit 2 for MCP-concentrated trifecta or sensitive path + exfil - Default warn mode preserves backward compatibility - **Gap:** Rule of Two is approximate — false positives possible for legitimate multi-tool workflows --- ## 4. Google DeepMind — "AI Agent Traps: A Taxonomy" (April 2026) **Key findings:** - 6-category taxonomy of traps targeting AI agents (see `deepmind-agent-traps.md` for full mapping) - Category 1: Content injection (steganography, syntactic masking) - Category 2: Semantic manipulation (oversight evasion, critic suppression) - Category 3: Context manipulation (memory poisoning, preference injection) - Category 4: Multi-agent exploitation (delegation abuse, trust chain attacks) - Category 5: Capability manipulation (tool misuse, privilege escalation) - Category 6: Human-in-the-loop exploitation (approval fatigue, summary suppression) **Implications for hook defenses:** - Unicode Tag steganography (U+E0000-E007F) is a real vector for invisible injection - HITL traps exploit the human review step that security depends on - Sub-agent spawning creates trust delegation chains that amplify other attacks - Memory/context poisoning is persistent — survives session boundaries **Plugin controls:** - `injection-patterns.mjs`: Unicode Tag detection (CRITICAL/HIGH), HITL trap patterns (HIGH), sub-agent spawn patterns (MEDIUM) - `string-utils.mjs`: `decodeUnicodeTags()`, `stripBidiOverrides()` - `post-session-guard.mjs`: Sub-agent delegation tracking, escalation-after-input advisory - See `deepmind-agent-traps.md` for complete coverage mapping --- ## 5. Google DeepMind — "Lessons from Defending Gemini" (May 2025) **Key findings:** - Production-scale defense requires multiple independent layers - Instruction hierarchy helps but does not eliminate injection - Monitoring and alerting on anomalous agent behavior is essential for detection - More capable models show improved instruction-following but also improved attack surface - Real-world attacks often combine multiple techniques (hybrid attacks) **Implications for hook defenses:** - Defense layers should be independently effective (not cascading dependencies) - Hook architecture (PreToolUse + PostToolUse + session guard) provides independent layers - Each hook should fail-safe (allow on error, not block) - Monitoring hooks should emit structured data for downstream analysis **Plugin controls:** - Independent hook layers: input (`pre-prompt-inject-scan`), output (`post-mcp-verify`), session (`post-session-guard`), file (`pre-edit-secrets`, `pre-write-pathguard`), command (`pre-bash-destructive`, `pre-install-supply-chain`) - Each hook exits 0 on parse errors (fail-open for availability) - Structured JSON output for all advisories --- ## 6. Preamble — "Prompt Injection 2.0" (arXiv 2507.13169, January 2026) **Key findings:** - Hybrid attacks combine prompt injection with other vulnerability classes: - P2SQL: Injection text contains SQL keywords targeting downstream database operations - Recursive injection: Injected text instructs the model to inject into its own output - XSS in agent context: Script/event handlers in content processed by agents - Bash parameter expansion evasion: `c${u}rl`, `w''get`, `r""m` bypass naive pattern matching - Natural language indirection: instructions phrased as natural language requests rather than commands - Attacks succeed because each component alone appears benign; the combination is malicious **Implications for hook defenses:** - Bash hooks need expansion normalization before pattern matching - Output scanning must check for cross-domain patterns (SQL + injection, XSS + injection) - NL indirection has inherent FP risk — deterministic hooks can only catch keyword patterns - Recursive injection is particularly dangerous for multi-agent systems **Plugin controls:** - `bash-normalize.mjs`: Strips `''`, `""`, `${x}`, `\` before pattern matching - `injection-patterns.mjs`: HYBRID_PATTERNS for P2SQL, recursive, XSS - `injection-patterns.mjs`: NL indirection MEDIUM patterns (high FP caution) - `post-mcp-verify.mjs`: Hybrid pattern check on tool output - **Gap:** Novel NL indirection phrasing evades keyword patterns --- ## 7. Google DeepMind — CaMeL Defense Proposal (2025) **Key findings:** - Proposed data flow tagging: track provenance of data through agent tool chains - Each data item receives a tag (hash) when produced by a tool - Tags propagate when data flows from one tool's output to another's input - Trifecta with linked data flows (provenance-tracked) has higher confidence than coincidental trifecta - Full CaMeL requires platform-level control plane — not implementable in hook layer **Implications for hook defenses:** - Lightweight data-tagging (~30% of benefit, ~5% of complexity) is feasible in hooks - Hash first 200 chars of tool output as data tag; check substring match in next tool input - Linked flows elevate trifecta severity (higher confidence of intentional exfiltration chain) - Full provenance tracking requires platform support beyond what hooks can provide **Plugin controls:** - `post-session-guard.mjs`: SHA-256 data tag on tool output, substring match on next input - Linked-flow trifecta reported with elevated severity - State file extended with `dataTag` field per entry - **Gap:** Substring matching is approximate; transformed data loses tag linkage --- ## Summary: What Deterministic Hooks Can and Cannot Defend ### Can defend (raise attack cost): - Known injection patterns (regex matching on critical/high/medium patterns) - Known evasion techniques (Unicode normalization, bash expansion, base64 decoding) - Known bad packages (blocklist-based supply chain protection) - Structural anomalies (trifecta patterns, behavioral drift, data volume spikes) - Known sensitive paths and secret patterns ### Cannot defend (fundamental limitations): - Novel natural language indirection without keyword patterns - Adaptive attacks from motivated human red-teamers (100% ASR per joint paper) - Long-horizon attacks spanning hundreds of steps without detectable pattern - Semantic-level prompt injection (meaning-preserving rewording) - CLAUDE.md loading before hooks execute (Anthropic platform limitation) - Full data provenance tracking (requires platform-level control plane) ### Design philosophy (v5.0): 1. **Defense-in-depth:** Multiple independent layers, each raising attack cost 2. **Honest limitations:** Document what cannot be defended, don't claim prevention 3. **Advisory over blocking:** MEDIUM patterns advise, never block (FP risk) 4. **Opt-in enforcement:** Rule of Two blocking requires explicit opt-in 5. **Adaptive testing:** Red-team with mutations, not just fixed payloads --- *Last updated: v5.0 S7 — Knowledge files + attack scenario expansion* *Sources verified against published papers as of 2026-04*