Prompt Injection Research 2025-2026
Research summary for the llm-security plugin. Documents what the field has learned about prompt injection, what can and cannot be defended deterministically, and how each finding maps to plugin controls.
Purpose: Reference material for posture-assessor-agent, threat-modeler-agent, and the "Known Limitations" section of documentation. Not loaded by default — only referenced when deep context is needed.
1. OpenAI — "Continuously Hardening ChatGPT Atlas" (December 2025)
Key findings:
- RL-trained attacker agent discovered multi-step injection chains spanning hundreds of tool calls
- Long-horizon attacks evade sliding-window detectors that only examine recent calls
- More capable models are NOT inherently more robust to injection
- Indirect injection via tool outputs (files, web pages, API responses) remains the primary attack vector
Implications for hook defenses:
- Sliding-window trifecta detection (20 calls) is insufficient for long-horizon attacks
- Extended 100-call window (v5.0 S3) addresses the gap but cannot catch attacks spread over 200+ calls
- Behavioral drift detection (Jensen-Shannon divergence) provides a complementary signal
- No deterministic defense can fully prevent multi-hundred-step attack chains
Plugin controls:
- `post-session-guard.mjs`: 100-call long-horizon window, slow-burn trifecta detection
- `post-session-guard.mjs`: Behavioral drift via Jensen-Shannon divergence on tool distributions
- Gap: Attacks exceeding 100 calls without detectable pattern remain undefended
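The drift signal above can be sketched as follows. This is a minimal reconstruction, not the plugin's actual code: it assumes tool-call counts are accumulated per window, and the function names are illustrative.

```javascript
// Behavioral drift via Jensen-Shannon divergence between two tool-call
// distributions (baseline window vs. recent window). JSD with log base 2
// is symmetric and bounded in [0, 1]; higher values mean stronger drift.

// Convert raw counts to a probability distribution over a shared key set.
function toDistribution(counts, keys) {
  const total = keys.reduce((s, k) => s + (counts[k] || 0), 0) || 1;
  return keys.map((k) => (counts[k] || 0) / total);
}

function klDivergence(p, q) {
  return p.reduce((s, pi, i) => (pi > 0 ? s + pi * Math.log2(pi / q[i]) : s), 0);
}

function jsDivergence(countsA, countsB) {
  const keys = [...new Set([...Object.keys(countsA), ...Object.keys(countsB)])];
  const p = toDistribution(countsA, keys);
  const q = toDistribution(countsB, keys);
  const m = p.map((pi, i) => (pi + q[i]) / 2); // mixture is never 0 where p or q > 0
  return (klDivergence(p, m) + klDivergence(q, m)) / 2;
}

// Example: a session pivots from read-heavy to exfil-heavy tool usage.
const baseline = { Read: 40, Grep: 30, Edit: 10 };
const recent = { Read: 5, Bash: 30, WebFetch: 45 };
const drift = jsDivergence(baseline, recent); // high value → drift advisory
```

A threshold on `drift` (tuned against benign sessions) would then trigger the advisory; the actual hook's threshold and window handling are not specified here.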
2. Joint Paper — "The Attacker Moves Second" (arXiv 2510.09023, October 2025)
Authors: 14 researchers from Google DeepMind, ETH Zurich, MIRI, and others
Key findings:
- Tested 12 proposed defenses against adaptive attackers
- All 12 defenses broken with 95-100% attack success rate (ASR)
- Defenses tested include: instruction hierarchy, delimiters, input/output filtering, sandwich defense, XML tagging, spotlighting, signed prompts, LLM-as-judge, known-answer detection, prompt shield, task-oriented, and repeat-back
- Fundamental result: any defense that operates within the same token space as the attacker can be bypassed by a sufficiently adaptive attacker
Implications for hook defenses:
- Pattern-matching hooks (regex-based) are a necessary but insufficient layer
- No single defense mechanism achieves reliable protection against adaptive attackers
- Defense-in-depth is the only viable strategy: raise attack cost, not prevent attacks
- Fixed payloads in red-team testing give false confidence; adaptive testing essential
Plugin controls:
- `attack-simulator.mjs --adaptive`: 5 mutation rounds test evasion resistance
- All hooks: defense-in-depth layers (input scan + output scan + session monitoring + supply chain)
- Gap: Novel synonym substitutions and semantic-level evasions bypass regex patterns
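A toy sketch of what adaptive mutation testing looks like in practice. The transform set and the target pattern below are illustrative assumptions, not the simulator's actual rounds; the point is that a fixed payload passes the regex while cheap mutations do not.

```javascript
// Apply simple evasion transforms to a fixed payload and record which
// ones slip past a representative detection regex.

const PATTERN = /ignore (all )?previous instructions/i;

const mutations = [
  (s) => s.replace(/ /g, "\u00a0"),         // non-breaking spaces
  (s) => s.split("").join("\u200b"),        // zero-width space padding
  (s) => s.replace(/previous/i, "prior"),   // synonym substitution
  (s) => s.toUpperCase(),                   // case change
  (s) => Buffer.from(s).toString("base64"), // encoding evasion
];

function testEvasion(payload) {
  return mutations.map((mutate, round) => {
    const mutated = mutate(payload);
    return { round, evaded: !PATTERN.test(mutated) };
  });
}

const results = testEvasion("ignore all previous instructions");
// The case change fails (the /i flag covers it); the other mutations evade
// the raw regex, which is exactly the false confidence fixed payloads give.
```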
3. Meta — "Agents Rule of Two" (October 2025)
Key findings:
- Formalized the "lethal trifecta" as a constraint: untrusted input (A) + sensitive data (B) + state change/exfiltration (C)
- Rule of Two: an agent should never simultaneously hold all three capabilities
- Proposed architectural constraint rather than detection-based defense
- Block mode enforces constraint at runtime; warn mode provides monitoring
Implications for hook defenses:
- Trifecta detection transitions from advisory to enforceable constraint
- MCP-concentrated trifecta (all legs from same server) warrants elevated severity
- Blocking mode must be opt-in to avoid breaking legitimate workflows
- Sensitive path patterns need expansion as new sensitive files emerge
Plugin controls:
- `post-session-guard.mjs`: `LLM_SECURITY_TRIFECTA_MODE=block|warn|off`
- Block mode: exit 2 for MCP-concentrated trifecta or sensitive path + exfil
- Default warn mode preserves backward compatibility
- Gap: Rule of Two is approximate — false positives possible for legitimate multi-tool workflows
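A minimal sketch of a Rule of Two check over a session's tool-call log. The leg names and mode variable mirror the description above, but the per-call classification heuristics are illustrative assumptions, not the plugin's real ones.

```javascript
// Classify each tool call against the three trifecta legs, then enforce
// the Rule of Two: once all three legs are present, block mode exits 2.

function classifyCall(call) {
  return {
    untrustedInput: ["WebFetch", "mcp__browser"].some((t) => call.tool.startsWith(t)),
    sensitiveData: /\.env|\.ssh|credentials/.test(call.input || ""),
    exfiltration: call.tool === "Bash" && /curl|wget|nc /.test(call.input || ""),
  };
}

function checkRuleOfTwo(calls, mode = process.env.LLM_SECURITY_TRIFECTA_MODE || "warn") {
  const legs = { untrustedInput: false, sensitiveData: false, exfiltration: false };
  for (const call of calls) {
    const c = classifyCall(call);
    for (const leg of Object.keys(legs)) legs[leg] ||= c[leg];
  }
  const trifecta = legs.untrustedInput && legs.sensitiveData && legs.exfiltration;
  if (!trifecta || mode === "off") return { trifecta, exitCode: 0 };
  return { trifecta, exitCode: mode === "block" ? 2 : 0 }; // warn mode only reports
}
```

Any two legs pass; the third one is what flips the decision, which is why false positives concentrate on legitimate workflows that genuinely need all three capabilities.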
4. Google DeepMind — "AI Agent Traps: A Taxonomy" (April 2026)
Key findings:
- 6-category taxonomy of traps targeting AI agents (see `deepmind-agent-traps.md` for full mapping)
- Category 1: Content injection (steganography, syntactic masking)
- Category 2: Semantic manipulation (oversight evasion, critic suppression)
- Category 3: Context manipulation (memory poisoning, preference injection)
- Category 4: Multi-agent exploitation (delegation abuse, trust chain attacks)
- Category 5: Capability manipulation (tool misuse, privilege escalation)
- Category 6: Human-in-the-loop exploitation (approval fatigue, summary suppression)
Implications for hook defenses:
- Unicode Tag steganography (U+E0000-E007F) is a real vector for invisible injection
- HITL traps exploit the human review step that security depends on
- Sub-agent spawning creates trust delegation chains that amplify other attacks
- Memory/context poisoning is persistent — survives session boundaries
Plugin controls:
- `injection-patterns.mjs`: Unicode Tag detection (CRITICAL/HIGH), HITL trap patterns (HIGH), sub-agent spawn patterns (MEDIUM)
- `string-utils.mjs`: `decodeUnicodeTags()`, `stripBidiOverrides()`
- `post-session-guard.mjs`: Sub-agent delegation tracking, escalation-after-input advisory
- See `deepmind-agent-traps.md` for complete coverage mapping
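The Unicode Tag vector works because tag characters map one-to-one onto ASCII but render as nothing. The sketch below shows detection and decoding; it borrows the `decodeUnicodeTags()` name from `string-utils.mjs` but is a simplified reconstruction, not the plugin's exact helper.

```javascript
// Unicode Tag characters (U+E0000–U+E007F) are invisible in most renderers;
// U+E0020–U+E007E mirror printable ASCII, so subtracting the block offset
// reveals any smuggled instruction text.

const TAG_RANGE = /[\u{E0000}-\u{E007F}]/u;

function hasUnicodeTags(text) {
  return TAG_RANGE.test(text);
}

function decodeUnicodeTags(text) {
  let out = "";
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp >= 0xe0020 && cp <= 0xe007e) out += String.fromCodePoint(cp - 0xe0000);
  }
  return out;
}

// A string that renders as "hello" but smuggles "ignore rules" in tag chars.
const hidden = [..."ignore rules"]
  .map((c) => String.fromCodePoint(0xe0000 + c.codePointAt(0)))
  .join("");
const smuggled = "hello" + hidden;
```

A hook would flag `hasUnicodeTags()` hits at CRITICAL/HIGH and run the decoded text back through the injection patterns.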
5. Google DeepMind — "Lessons from Defending Gemini" (May 2025)
Key findings:
- Production-scale defense requires multiple independent layers
- Instruction hierarchy helps but does not eliminate injection
- Monitoring and alerting on anomalous agent behavior is essential for detection
- More capable models show improved instruction-following but also improved attack surface
- Real-world attacks often combine multiple techniques (hybrid attacks)
Implications for hook defenses:
- Defense layers should be independently effective (not cascading dependencies)
- Hook architecture (PreToolUse + PostToolUse + session guard) provides independent layers
- Each hook should fail-safe (allow on error, not block)
- Monitoring hooks should emit structured data for downstream analysis
Plugin controls:
- Independent hook layers: input (`pre-prompt-inject-scan`), output (`post-mcp-verify`), session (`post-session-guard`), file (`pre-edit-secrets`, `pre-write-pathguard`), command (`pre-bash-destructive`, `pre-install-supply-chain`)
- Each hook exits 0 on parse errors (fail-open for availability)
- Structured JSON output for all advisories
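The fail-open contract can be sketched as a thin wrapper. The field names and the `scan` callback are illustrative, not the hooks' actual interface; the invariant is that a malformed event never blocks the tool call, and findings come out as structured JSON.

```javascript
// Fail-open hook skeleton: parse errors return exit 0 with no advisory;
// findings are emitted as structured JSON for downstream analysis.

function runHook(rawEvent, scan) {
  let event;
  try {
    event = JSON.parse(rawEvent);
  } catch {
    // Fail open: a hook bug or malformed input must not break the session.
    return { exitCode: 0, advisory: null };
  }
  const findings = scan(event);
  if (findings.length === 0) return { exitCode: 0, advisory: null };
  return {
    exitCode: 0, // advisory only; a blocking hook would use exit 2 here
    advisory: JSON.stringify({ hook: "example-scan", severity: "MEDIUM", findings }),
  };
}
```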
6. Preamble — "Prompt Injection 2.0" (arXiv 2507.13169, January 2026)
Key findings:
- Hybrid attacks combine prompt injection with other vulnerability classes:
- P2SQL: Injection text contains SQL keywords targeting downstream database operations
- Recursive injection: Injected text instructs the model to inject into its own output
- XSS in agent context: Script/event handlers in content processed by agents
- Bash parameter expansion evasion: `c${u}rl`, `w''get`, `r""m` bypass naive pattern matching
- Natural language indirection: instructions phrased as natural language requests rather than commands
- Attacks succeed because each component alone appears benign; the combination is malicious
Implications for hook defenses:
- Bash hooks need expansion normalization before pattern matching
- Output scanning must check for cross-domain patterns (SQL + injection, XSS + injection)
- NL indirection has inherent FP risk — deterministic hooks can only catch keyword patterns
- Recursive injection is particularly dangerous for multi-agent systems
Plugin controls:
- `bash-normalize.mjs`: Strips `''`, `""`, `${x}`, `\` before pattern matching
- `injection-patterns.mjs`: HYBRID_PATTERNS for P2SQL, recursive, XSS
- `injection-patterns.mjs`: NL indirection MEDIUM patterns (high FP caution)
- `post-mcp-verify.mjs`: Hybrid pattern check on tool output
- Gap: Novel NL indirection phrasing evades keyword patterns
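A sketch of the normalization step, simplified relative to `bash-normalize.mjs`. One assumption is worth flagging: `${u}` is rewritten to the variable's own name, on the theory that an attacker can set `u=u` earlier in the command so that `c${u}rl` really does expand to `curl`.

```javascript
// Collapse common bash evasions before running destructive-command regexes.

function normalizeBash(cmd) {
  return cmd
    .replace(/\$\{(\w+)\}/g, "$1") // assume ${u} may expand to its own name (u=u trick)
    .replace(/''|""/g, "")         // drop empty quote pairs splitting keywords
    .replace(/\\(?=\w)/g, "");     // drop backslash escapes splitting keywords
}

// Representative destructive-command pattern (illustrative, not the real list).
const DESTRUCTIVE = /\b(curl|wget|rm)\b/;

// Each listed evasion normalizes back to a detectable command word:
// c${u}rl → curl, w''get → wget, r""m → rm.
```

Normalization runs first, then the ordinary patterns run on the normalized string, so the pattern list itself does not need evasion variants.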
7. Google DeepMind — CaMeL Defense Proposal (2025)
Key findings:
- Proposed data flow tagging: track provenance of data through agent tool chains
- Each data item receives a tag (hash) when produced by a tool
- Tags propagate when data flows from one tool's output to another's input
- Trifecta with linked data flows (provenance-tracked) has higher confidence than coincidental trifecta
- Full CaMeL requires platform-level control plane — not implementable in hook layer
Implications for hook defenses:
- Lightweight data-tagging (~30% of benefit, ~5% of complexity) is feasible in hooks
- Hash first 200 chars of tool output as data tag; check substring match in next tool input
- Linked flows elevate trifecta severity (higher confidence of intentional exfiltration chain)
- Full provenance tracking requires platform support beyond what hooks can provide
Plugin controls:
- `post-session-guard.mjs`: SHA-256 data tag on tool output, substring match on next input
- Linked-flow trifecta reported with elevated severity
- State file extended with `dataTag` field per entry
- Gap: Substring matching is approximate; transformed data loses tag linkage
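The lightweight tagging scheme can be sketched as below. Field names are illustrative; the mechanism (hash a 200-character prefix, then look for that prefix verbatim in the next tool input) follows the description above.

```javascript
// Lightweight CaMeL-style data tagging: tag each tool output, then flag
// the next call whose input contains the tagged prefix verbatim.
import { createHash } from "node:crypto";

function makeDataTag(toolOutput) {
  const prefix = toolOutput.slice(0, 200);
  return { prefix, tag: createHash("sha256").update(prefix).digest("hex") };
}

// Approximate linkage: a plain substring match on the untransformed prefix.
function isLinkedFlow(previousTag, nextToolInput) {
  return previousTag.prefix.length > 0 && nextToolInput.includes(previousTag.prefix);
}

const tag = makeDataTag("AWS_SECRET_ACCESS_KEY=abc123");
// An exfil call embedding the read output verbatim is a linked flow;
// base64-encoding the data first breaks the substring match — the gap above.
```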
Summary: What Deterministic Hooks Can and Cannot Defend
Can defend (raise attack cost):
- Known injection patterns (regex matching on critical/high/medium patterns)
- Known evasion techniques (Unicode normalization, bash expansion, base64 decoding)
- Known bad packages (blocklist-based supply chain protection)
- Structural anomalies (trifecta patterns, behavioral drift, data volume spikes)
- Known sensitive paths and secret patterns
Cannot defend (fundamental limitations):
- Novel natural language indirection without keyword patterns
- Adaptive attacks from motivated human red-teamers (95-100% ASR per the joint paper)
- Long-horizon attacks spanning hundreds of steps without detectable pattern
- Semantic-level prompt injection (meaning-preserving rewording)
- CLAUDE.md loading before hooks execute (Anthropic platform limitation)
- Full data provenance tracking (requires platform-level control plane)
Design philosophy (v5.0):
- Defense-in-depth: Multiple independent layers, each raising attack cost
- Honest limitations: Document what cannot be defended, don't claim prevention
- Advisory over blocking: MEDIUM patterns advise, never block (FP risk)
- Opt-in enforcement: Rule of Two blocking requires explicit opt-in
- Adaptive testing: Red-team with mutations, not just fixed payloads
Last updated: v5.0 S7 — Knowledge files + attack scenario expansion
Sources verified against published papers as of 2026-04