198 lines
10 KiB
Markdown
198 lines
10 KiB
Markdown
# Prompt Injection Research 2025-2026
|
|
|
|
Research summary for the llm-security plugin. Documents what the field has learned about prompt injection, what can and cannot be defended deterministically, and how each finding maps to plugin controls.
|
|
|
|
**Purpose:** Reference material for `posture-assessor-agent`, `threat-modeler-agent`, and the "Known Limitations" section of documentation. Not loaded by default — only referenced when deep context is needed.
|
|
|
|
---
|
|
|
|
## 1. OpenAI — "Continuously Hardening ChatGPT Atlas" (December 2025)
|
|
|
|
**Key findings:**
|
|
- RL-trained attacker agent discovered multi-step injection chains spanning hundreds of tool calls
|
|
- Long-horizon attacks evade sliding-window detectors that only examine recent calls
|
|
- More capable models are NOT inherently more robust to injection
|
|
- Indirect injection via tool outputs (files, web pages, API responses) remains the primary attack vector
|
|
|
|
**Implications for hook defenses:**
|
|
- Sliding-window trifecta detection (20 calls) is insufficient for long-horizon attacks
|
|
- Extended 100-call window (v5.0 S3) addresses the gap but cannot catch attacks spread over 200+ calls
|
|
- Behavioral drift detection (Jensen-Shannon divergence) provides a complementary signal
|
|
- No deterministic defense can fully prevent multi-hundred-step attack chains
|
|
|
|
**Plugin controls:**
|
|
- `post-session-guard.mjs`: 100-call long-horizon window, slow-burn trifecta detection
|
|
- `post-session-guard.mjs`: Behavioral drift via Jensen-Shannon divergence on tool distributions
|
|
- **Gap:** Attacks exceeding 100 calls without detectable pattern remain undefended
|
|
|
|
---
|
|
|
|
## 2. Joint Paper — "The Attacker Moves Second" (arXiv 2510.09023, October 2025)
|
|
|
|
**Authors:** 14 researchers from Google DeepMind, ETH Zurich, MIRI, and others
|
|
|
|
**Key findings:**
|
|
- Tested 12 proposed defenses against adaptive attackers
|
|
- All 12 defenses broken with 95-100% attack success rate (ASR)
|
|
- Defenses tested include: instruction hierarchy, delimiters, input/output filtering, sandwich defense, XML tagging, spotlighting, signed prompts, LLM-as-judge, known-answer detection, prompt shield, task-oriented, and repeat-back
|
|
- Fundamental result: any defense that operates within the same token space as the attacker can be bypassed by a sufficiently adaptive attacker
|
|
|
|
**Implications for hook defenses:**
|
|
- Pattern-matching hooks (regex-based) are a necessary but insufficient layer
|
|
- No single defense mechanism achieves reliable protection against adaptive attackers
|
|
- Defense-in-depth is the only viable strategy: raise attack cost, not prevent attacks
|
|
- Fixed payloads in red-team testing give false confidence; adaptive testing essential
|
|
|
|
**Plugin controls:**
|
|
- `attack-simulator.mjs --adaptive`: 5 mutation rounds test evasion resistance
|
|
- All hooks: defense-in-depth layers (input scan + output scan + session monitoring + supply chain)
|
|
- **Gap:** Novel synonym substitutions and semantic-level evasions bypass regex patterns
|
|
|
|
---
|
|
|
|
## 3. Meta — "Agents Rule of Two" (October 2025)
|
|
|
|
**Key findings:**
|
|
- Formalized the "lethal trifecta" as a constraint: untrusted input (A) + sensitive data (B) + state change/exfiltration (C)
|
|
- Rule of Two: an agent should never simultaneously hold all three capabilities
|
|
- Proposed architectural constraint rather than detection-based defense
|
|
- Block mode enforces constraint at runtime; warn mode provides monitoring
|
|
|
|
**Implications for hook defenses:**
|
|
- Trifecta detection transitions from advisory to enforceable constraint
|
|
- MCP-concentrated trifecta (all legs from same server) warrants elevated severity
|
|
- Blocking mode must be opt-in to avoid breaking legitimate workflows
|
|
- Sensitive path patterns need expansion as new sensitive files emerge
|
|
|
|
**Plugin controls:**
|
|
- `post-session-guard.mjs`: `LLM_SECURITY_TRIFECTA_MODE=block|warn|off`
|
|
- Block mode: exit 2 for MCP-concentrated trifecta or sensitive path + exfil
|
|
- Default warn mode preserves backward compatibility
|
|
- **Gap:** Rule of Two is approximate — false positives possible for legitimate multi-tool workflows
|
|
|
|
---
|
|
|
|
## 4. Google DeepMind — "AI Agent Traps: A Taxonomy" (April 2026)
|
|
|
|
**Key findings:**
|
|
- 6-category taxonomy of traps targeting AI agents (see `deepmind-agent-traps.md` for full mapping)
|
|
- Category 1: Content injection (steganography, syntactic masking)
|
|
- Category 2: Semantic manipulation (oversight evasion, critic suppression)
|
|
- Category 3: Context manipulation (memory poisoning, preference injection)
|
|
- Category 4: Multi-agent exploitation (delegation abuse, trust chain attacks)
|
|
- Category 5: Capability manipulation (tool misuse, privilege escalation)
|
|
- Category 6: Human-in-the-loop exploitation (approval fatigue, summary suppression)
|
|
|
|
**Implications for hook defenses:**
|
|
- Unicode Tag steganography (U+E0000-E007F) is a real vector for invisible injection
|
|
- HITL traps exploit the human review step that security depends on
|
|
- Sub-agent spawning creates trust delegation chains that amplify other attacks
|
|
- Memory/context poisoning is persistent — survives session boundaries
|
|
|
|
**Plugin controls:**
|
|
- `injection-patterns.mjs`: Unicode Tag detection (CRITICAL/HIGH), HITL trap patterns (HIGH), sub-agent spawn patterns (MEDIUM)
|
|
- `string-utils.mjs`: `decodeUnicodeTags()`, `stripBidiOverrides()`
|
|
- `post-session-guard.mjs`: Sub-agent delegation tracking, escalation-after-input advisory
|
|
- See `deepmind-agent-traps.md` for complete coverage mapping
|
|
|
|
---
|
|
|
|
## 5. Google DeepMind — "Lessons from Defending Gemini" (May 2025)
|
|
|
|
**Key findings:**
|
|
- Production-scale defense requires multiple independent layers
|
|
- Instruction hierarchy helps but does not eliminate injection
|
|
- Monitoring and alerting on anomalous agent behavior is essential for detection
|
|
- More capable models show improved instruction-following but also improved attack surface
|
|
- Real-world attacks often combine multiple techniques (hybrid attacks)
|
|
|
|
**Implications for hook defenses:**
|
|
- Defense layers should be independently effective (not cascading dependencies)
|
|
- Hook architecture (PreToolUse + PostToolUse + session guard) provides independent layers
|
|
- Each hook should fail-safe (allow on error, not block)
|
|
- Monitoring hooks should emit structured data for downstream analysis
|
|
|
|
**Plugin controls:**
|
|
- Independent hook layers: input (`pre-prompt-inject-scan`), output (`post-mcp-verify`), session (`post-session-guard`), file (`pre-edit-secrets`, `pre-write-pathguard`), command (`pre-bash-destructive`, `pre-install-supply-chain`)
|
|
- Each hook exits 0 on parse errors (fail-open for availability)
|
|
- Structured JSON output for all advisories
|
|
|
|
---
|
|
|
|
## 6. Preamble — "Prompt Injection 2.0" (arXiv 2507.13169, January 2026)
|
|
|
|
**Key findings:**
|
|
- Hybrid attacks combine prompt injection with other vulnerability classes:
|
|
- P2SQL: Injection text contains SQL keywords targeting downstream database operations
|
|
- Recursive injection: Injected text instructs the model to inject into its own output
|
|
- XSS in agent context: Script/event handlers in content processed by agents
|
|
- Bash parameter expansion evasion: `c${u}rl`, `w''get`, `r""m` bypass naive pattern matching
|
|
- Natural language indirection: instructions phrased as natural language requests rather than commands
|
|
- Attacks succeed because each component alone appears benign; the combination is malicious
|
|
|
|
**Implications for hook defenses:**
|
|
- Bash hooks need expansion normalization before pattern matching
|
|
- Output scanning must check for cross-domain patterns (SQL + injection, XSS + injection)
|
|
- NL indirection has inherent FP risk — deterministic hooks can only catch keyword patterns
|
|
- Recursive injection is particularly dangerous for multi-agent systems
|
|
|
|
**Plugin controls:**
|
|
- `bash-normalize.mjs`: Strips `''`, `""`, `${x}`, `\` before pattern matching
|
|
- `injection-patterns.mjs`: HYBRID_PATTERNS for P2SQL, recursive, XSS
|
|
- `injection-patterns.mjs`: NL indirection MEDIUM patterns (high FP caution)
|
|
- `post-mcp-verify.mjs`: Hybrid pattern check on tool output
|
|
- **Gap:** Novel NL indirection phrasing evades keyword patterns
|
|
|
|
---
|
|
|
|
## 7. Google DeepMind — CaMeL Defense Proposal (2025)
|
|
|
|
**Key findings:**
|
|
- Proposed data flow tagging: track provenance of data through agent tool chains
|
|
- Each data item receives a tag (hash) when produced by a tool
|
|
- Tags propagate when data flows from one tool's output to another's input
|
|
- Trifecta with linked data flows (provenance-tracked) has higher confidence than coincidental trifecta
|
|
- Full CaMeL requires platform-level control plane — not implementable in hook layer
|
|
|
|
**Implications for hook defenses:**
|
|
- Lightweight data-tagging (~30% of benefit, ~5% of complexity) is feasible in hooks
|
|
- Hash first 200 chars of tool output as data tag; check substring match in next tool input
|
|
- Linked flows elevate trifecta severity (higher confidence of intentional exfiltration chain)
|
|
- Full provenance tracking requires platform support beyond what hooks can provide
|
|
|
|
**Plugin controls:**
|
|
- `post-session-guard.mjs`: SHA-256 data tag on tool output, substring match on next input
|
|
- Linked-flow trifecta reported with elevated severity
|
|
- State file extended with `dataTag` field per entry
|
|
- **Gap:** Substring matching is approximate; transformed data loses tag linkage
|
|
|
|
---
|
|
|
|
## Summary: What Deterministic Hooks Can and Cannot Defend
|
|
|
|
### Can defend (raise attack cost):
|
|
- Known injection patterns (regex matching on critical/high/medium patterns)
|
|
- Known evasion techniques (Unicode normalization, bash expansion, base64 decoding)
|
|
- Known bad packages (blocklist-based supply chain protection)
|
|
- Structural anomalies (trifecta patterns, behavioral drift, data volume spikes)
|
|
- Known sensitive paths and secret patterns
|
|
|
|
### Cannot defend (fundamental limitations):
|
|
- Novel natural language indirection without keyword patterns
|
|
- Adaptive attacks from motivated human red-teamers (100% ASR per joint paper)
|
|
- Long-horizon attacks spanning hundreds of steps without detectable pattern
|
|
- Semantic-level prompt injection (meaning-preserving rewording)
|
|
- CLAUDE.md loading before hooks execute (Anthropic platform limitation)
|
|
- Full data provenance tracking (requires platform-level control plane)
|
|
|
|
### Design philosophy (v5.0):
|
|
1. **Defense-in-depth:** Multiple independent layers, each raising attack cost
|
|
2. **Honest limitations:** Document what cannot be defended, don't claim prevention
|
|
3. **Advisory over blocking:** MEDIUM patterns advise, never block (FP risk)
|
|
4. **Opt-in enforcement:** Rule of Two blocking requires explicit opt-in
|
|
5. **Adaptive testing:** Red-team with mutations, not just fixed payloads
|
|
|
|
---
|
|
|
|
*Last updated: v5.0 S7 — Knowledge files + attack scenario expansion*
|
|
*Sources verified against published papers as of 2026-04*
|