ktg-plugin-marketplace/plugins/llm-security/knowledge/prompt-injection-research-2025-2026.md

198 lines
10 KiB
Markdown

# Prompt Injection Research 2025-2026
Research summary for the llm-security plugin. Documents what the field has learned about prompt injection, what can and cannot be defended deterministically, and how each finding maps to plugin controls.
**Purpose:** Reference material for `posture-assessor-agent`, `threat-modeler-agent`, and the "Known Limitations" section of documentation. Not loaded by default — only referenced when deep context is needed.
---
## 1. OpenAI — "Continuously Hardening ChatGPT Atlas" (December 2025)
**Key findings:**
- RL-trained attacker agent discovered multi-step injection chains spanning hundreds of tool calls
- Long-horizon attacks evade sliding-window detectors that only examine recent calls
- More capable models are NOT inherently more robust to injection
- Indirect injection via tool outputs (files, web pages, API responses) remains the primary attack vector
**Implications for hook defenses:**
- Sliding-window trifecta detection (20 calls) is insufficient for long-horizon attacks
- Extended 100-call window (v5.0 S3) addresses the gap but cannot catch attacks spread over 200+ calls
- Behavioral drift detection (Jensen-Shannon divergence) provides a complementary signal
- No deterministic defense can fully prevent multi-hundred-step attack chains
**Plugin controls:**
- `post-session-guard.mjs`: 100-call long-horizon window, slow-burn trifecta detection
- `post-session-guard.mjs`: Behavioral drift via Jensen-Shannon divergence on tool distributions
- **Gap:** Attacks exceeding 100 calls without detectable pattern remain undefended
---
## 2. Joint Paper — "The Attacker Moves Second" (arXiv 2510.09023, October 2025)
**Authors:** 14 researchers from Google DeepMind, ETH Zurich, MIRI, and others
**Key findings:**
- Tested 12 proposed defenses against adaptive attackers
- All 12 defenses broken with 95-100% attack success rate (ASR)
- Defenses tested include: instruction hierarchy, delimiters, input/output filtering, sandwich defense, XML tagging, spotlighting, signed prompts, LLM-as-judge, known-answer detection, prompt shield, task-oriented, and repeat-back
- Fundamental result: any defense that operates within the same token space as the attacker can be bypassed by a sufficiently adaptive attacker
**Implications for hook defenses:**
- Pattern-matching hooks (regex-based) are a necessary but insufficient layer
- No single defense mechanism achieves reliable protection against adaptive attackers
- Defense-in-depth is the only viable strategy: raise attack cost, not prevent attacks
- Fixed payloads in red-team testing give false confidence; adaptive testing essential
**Plugin controls:**
- `attack-simulator.mjs --adaptive`: 5 mutation rounds test evasion resistance
- All hooks: defense-in-depth layers (input scan + output scan + session monitoring + supply chain)
- **Gap:** Novel synonym substitutions and semantic-level evasions bypass regex patterns
---
## 3. Meta — "Agents Rule of Two" (October 2025)
**Key findings:**
- Formalized the "lethal trifecta" as a constraint: untrusted input (A) + sensitive data (B) + state change/exfiltration (C)
- Rule of Two: an agent should never simultaneously hold all three capabilities
- Proposed architectural constraint rather than detection-based defense
- Block mode enforces constraint at runtime; warn mode provides monitoring
**Implications for hook defenses:**
- Trifecta detection transitions from advisory to enforceable constraint
- MCP-concentrated trifecta (all legs from same server) warrants elevated severity
- Blocking mode must be opt-in to avoid breaking legitimate workflows
- Sensitive path patterns need expansion as new sensitive files emerge
**Plugin controls:**
- `post-session-guard.mjs`: `LLM_SECURITY_TRIFECTA_MODE=block|warn|off`
- Block mode: exit 2 for MCP-concentrated trifecta or sensitive path + exfil
- Default warn mode preserves backward compatibility
- **Gap:** Rule of Two is approximate — false positives possible for legitimate multi-tool workflows
---
## 4. Google DeepMind — "AI Agent Traps: A Taxonomy" (April 2026)
**Key findings:**
- 6-category taxonomy of traps targeting AI agents (see `deepmind-agent-traps.md` for full mapping)
- Category 1: Content injection (steganography, syntactic masking)
- Category 2: Semantic manipulation (oversight evasion, critic suppression)
- Category 3: Context manipulation (memory poisoning, preference injection)
- Category 4: Multi-agent exploitation (delegation abuse, trust chain attacks)
- Category 5: Capability manipulation (tool misuse, privilege escalation)
- Category 6: Human-in-the-loop exploitation (approval fatigue, summary suppression)
**Implications for hook defenses:**
- Unicode Tag steganography (U+E0000-E007F) is a real vector for invisible injection
- HITL traps exploit the human review step that security depends on
- Sub-agent spawning creates trust delegation chains that amplify other attacks
- Memory/context poisoning is persistent — survives session boundaries
**Plugin controls:**
- `injection-patterns.mjs`: Unicode Tag detection (CRITICAL/HIGH), HITL trap patterns (HIGH), sub-agent spawn patterns (MEDIUM)
- `string-utils.mjs`: `decodeUnicodeTags()`, `stripBidiOverrides()`
- `post-session-guard.mjs`: Sub-agent delegation tracking, escalation-after-input advisory
- See `deepmind-agent-traps.md` for complete coverage mapping
---
## 5. Google DeepMind — "Lessons from Defending Gemini" (May 2025)
**Key findings:**
- Production-scale defense requires multiple independent layers
- Instruction hierarchy helps but does not eliminate injection
- Monitoring and alerting on anomalous agent behavior is essential for detection
- More capable models show improved instruction-following but also improved attack surface
- Real-world attacks often combine multiple techniques (hybrid attacks)
**Implications for hook defenses:**
- Defense layers should be independently effective (not cascading dependencies)
- Hook architecture (PreToolUse + PostToolUse + session guard) provides independent layers
- Each hook should fail-safe (allow on error, not block)
- Monitoring hooks should emit structured data for downstream analysis
**Plugin controls:**
- Independent hook layers: input (`pre-prompt-inject-scan`), output (`post-mcp-verify`), session (`post-session-guard`), file (`pre-edit-secrets`, `pre-write-pathguard`), command (`pre-bash-destructive`, `pre-install-supply-chain`)
- Each hook exits 0 on parse errors (fail-open for availability)
- Structured JSON output for all advisories
---
## 6. Preamble — "Prompt Injection 2.0" (arXiv 2507.13169, January 2026)
**Key findings:**
- Hybrid attacks combine prompt injection with other vulnerability classes:
- P2SQL: Injection text contains SQL keywords targeting downstream database operations
- Recursive injection: Injected text instructs the model to inject into its own output
- XSS in agent context: Script/event handlers in content processed by agents
- Bash parameter expansion evasion: `c${u}rl`, `w''get`, `r""m` bypass naive pattern matching
- Natural language indirection: instructions phrased as natural language requests rather than commands
- Attacks succeed because each component alone appears benign; the combination is malicious
**Implications for hook defenses:**
- Bash hooks need expansion normalization before pattern matching
- Output scanning must check for cross-domain patterns (SQL + injection, XSS + injection)
- NL indirection has inherent FP risk — deterministic hooks can only catch keyword patterns
- Recursive injection is particularly dangerous for multi-agent systems
**Plugin controls:**
- `bash-normalize.mjs`: Strips `''`, `""`, `${x}`, `\` before pattern matching
- `injection-patterns.mjs`: HYBRID_PATTERNS for P2SQL, recursive, XSS
- `injection-patterns.mjs`: NL indirection MEDIUM patterns (high FP caution)
- `post-mcp-verify.mjs`: Hybrid pattern check on tool output
- **Gap:** Novel NL indirection phrasing evades keyword patterns
---
## 7. Google DeepMind — CaMeL Defense Proposal (2025)
**Key findings:**
- Proposed data flow tagging: track provenance of data through agent tool chains
- Each data item receives a tag (hash) when produced by a tool
- Tags propagate when data flows from one tool's output to another's input
- Trifecta with linked data flows (provenance-tracked) has higher confidence than coincidental trifecta
- Full CaMeL requires platform-level control plane — not implementable in hook layer
**Implications for hook defenses:**
- Lightweight data-tagging (~30% of benefit, ~5% of complexity) is feasible in hooks
- Hash first 200 chars of tool output as data tag; check substring match in next tool input
- Linked flows elevate trifecta severity (higher confidence of intentional exfiltration chain)
- Full provenance tracking requires platform support beyond what hooks can provide
**Plugin controls:**
- `post-session-guard.mjs`: SHA-256 data tag on tool output, substring match on next input
- Linked-flow trifecta reported with elevated severity
- State file extended with `dataTag` field per entry
- **Gap:** Substring matching is approximate; transformed data loses tag linkage
---
## Summary: What Deterministic Hooks Can and Cannot Defend
### Can defend (raise attack cost):
- Known injection patterns (regex matching on critical/high/medium patterns)
- Known evasion techniques (Unicode normalization, bash expansion, base64 decoding)
- Known bad packages (blocklist-based supply chain protection)
- Structural anomalies (trifecta patterns, behavioral drift, data volume spikes)
- Known sensitive paths and secret patterns
### Cannot defend (fundamental limitations):
- Novel natural language indirection without keyword patterns
- Adaptive attacks from motivated human red-teamers (100% ASR per joint paper)
- Long-horizon attacks spanning hundreds of steps without detectable pattern
- Semantic-level prompt injection (meaning-preserving rewording)
- CLAUDE.md loading before hooks execute (Anthropic platform limitation)
- Full data provenance tracking (requires platform-level control plane)
### Design philosophy (v5.0):
1. **Defense-in-depth:** Multiple independent layers, each raising attack cost
2. **Honest limitations:** Document what cannot be defended, don't claim prevention
3. **Advisory over blocking:** MEDIUM patterns advise, never block (FP risk)
4. **Opt-in enforcement:** Rule of Two blocking requires explicit opt-in
5. **Adaptive testing:** Red-team with mutations, not just fixed payloads
---
*Last updated: v5.0 S7 — Knowledge files + attack scenario expansion*
*Sources verified against published papers as of 2026-04*