ktg-plugin-marketplace/plugins/llm-security/knowledge/prompt-injection-research-2025-2026.md

10 KiB

Prompt Injection Research 2025-2026

Research summary for the llm-security plugin. Documents what the field has learned about prompt injection, what can and cannot be defended deterministically, and how each finding maps to plugin controls.

Purpose: Reference material for posture-assessor-agent, threat-modeler-agent, and the "Known Limitations" section of documentation. Not loaded by default — only referenced when deep context is needed.


1. OpenAI — "Continuously Hardening ChatGPT Atlas" (December 2025)

Key findings:

  • RL-trained attacker agent discovered multi-step injection chains spanning hundreds of tool calls
  • Long-horizon attacks evade sliding-window detectors that only examine recent calls
  • More capable models are NOT inherently more robust to injection
  • Indirect injection via tool outputs (files, web pages, API responses) remains the primary attack vector

Implications for hook defenses:

  • Sliding-window trifecta detection (20 calls) is insufficient for long-horizon attacks
  • Extended 100-call window (v5.0 S3) addresses the gap but cannot catch attacks spread over 200+ calls
  • Behavioral drift detection (Jensen-Shannon divergence) provides a complementary signal
  • No deterministic defense can fully prevent multi-hundred-step attack chains

Plugin controls:

  • post-session-guard.mjs: 100-call long-horizon window, slow-burn trifecta detection
  • post-session-guard.mjs: Behavioral drift via Jensen-Shannon divergence on tool distributions
  • Gap: Attacks exceeding 100 calls without detectable pattern remain undefended

2. Joint Paper — "The Attacker Moves Second" (arXiv 2510.09023, October 2025)

Authors: 14 researchers from Google DeepMind, ETH Zurich, MIRI, and others

Key findings:

  • Tested 12 proposed defenses against adaptive attackers
  • All 12 defenses broken with 95-100% attack success rate (ASR)
  • Defenses tested include: instruction hierarchy, delimiters, input/output filtering, sandwich defense, XML tagging, spotlighting, signed prompts, LLM-as-judge, known-answer detection, prompt shield, task-oriented, and repeat-back
  • Fundamental result: any defense that operates within the same token space as the attacker can be bypassed by a sufficiently adaptive attacker

Implications for hook defenses:

  • Pattern-matching hooks (regex-based) are a necessary but insufficient layer
  • No single defense mechanism achieves reliable protection against adaptive attackers
  • Defense-in-depth is the only viable strategy: raise attack cost, not prevent attacks
  • Fixed payloads in red-team testing give false confidence; adaptive testing essential

Plugin controls:

  • attack-simulator.mjs --adaptive: 5 mutation rounds test evasion resistance
  • All hooks: defense-in-depth layers (input scan + output scan + session monitoring + supply chain)
  • Gap: Novel synonym substitutions and semantic-level evasions bypass regex patterns

3. Meta — "Agents Rule of Two" (October 2025)

Key findings:

  • Formalized the "lethal trifecta" as a constraint: untrusted input (A) + sensitive data (B) + state change/exfiltration (C)
  • Rule of Two: an agent should never simultaneously hold all three capabilities
  • Proposed architectural constraint rather than detection-based defense
  • Block mode enforces constraint at runtime; warn mode provides monitoring

Implications for hook defenses:

  • Trifecta detection transitions from advisory to enforceable constraint
  • MCP-concentrated trifecta (all legs from same server) warrants elevated severity
  • Blocking mode must be opt-in to avoid breaking legitimate workflows
  • Sensitive path patterns need expansion as new sensitive files emerge

Plugin controls:

  • post-session-guard.mjs: LLM_SECURITY_TRIFECTA_MODE=block|warn|off
  • Block mode: exit 2 for MCP-concentrated trifecta or sensitive path + exfil
  • Default warn mode preserves backward compatibility
  • Gap: Rule of Two is approximate — false positives possible for legitimate multi-tool workflows

4. Google DeepMind — "AI Agent Traps: A Taxonomy" (April 2026)

Key findings:

  • 6-category taxonomy of traps targeting AI agents (see deepmind-agent-traps.md for full mapping)
  • Category 1: Content injection (steganography, syntactic masking)
  • Category 2: Semantic manipulation (oversight evasion, critic suppression)
  • Category 3: Context manipulation (memory poisoning, preference injection)
  • Category 4: Multi-agent exploitation (delegation abuse, trust chain attacks)
  • Category 5: Capability manipulation (tool misuse, privilege escalation)
  • Category 6: Human-in-the-loop exploitation (approval fatigue, summary suppression)

Implications for hook defenses:

  • Unicode Tag steganography (U+E0000-E007F) is a real vector for invisible injection
  • HITL traps exploit the human review step that security depends on
  • Sub-agent spawning creates trust delegation chains that amplify other attacks
  • Memory/context poisoning is persistent — survives session boundaries

Plugin controls:

  • injection-patterns.mjs: Unicode Tag detection (CRITICAL/HIGH), HITL trap patterns (HIGH), sub-agent spawn patterns (MEDIUM)
  • string-utils.mjs: decodeUnicodeTags(), stripBidiOverrides()
  • post-session-guard.mjs: Sub-agent delegation tracking, escalation-after-input advisory
  • See deepmind-agent-traps.md for complete coverage mapping

5. Google DeepMind — "Lessons from Defending Gemini" (May 2025)

Key findings:

  • Production-scale defense requires multiple independent layers
  • Instruction hierarchy helps but does not eliminate injection
  • Monitoring and alerting on anomalous agent behavior is essential for detection
  • More capable models show improved instruction-following but also improved attack surface
  • Real-world attacks often combine multiple techniques (hybrid attacks)

Implications for hook defenses:

  • Defense layers should be independently effective (not cascading dependencies)
  • Hook architecture (PreToolUse + PostToolUse + session guard) provides independent layers
  • Each hook should fail-safe (allow on error, not block)
  • Monitoring hooks should emit structured data for downstream analysis

Plugin controls:

  • Independent hook layers: input (pre-prompt-inject-scan), output (post-mcp-verify), session (post-session-guard), file (pre-edit-secrets, pre-write-pathguard), command (pre-bash-destructive, pre-install-supply-chain)
  • Each hook exits 0 on parse errors (fail-open for availability)
  • Structured JSON output for all advisories

6. Preamble — "Prompt Injection 2.0" (arXiv 2507.13169, January 2026)

Key findings:

  • Hybrid attacks combine prompt injection with other vulnerability classes:
    • P2SQL: Injection text contains SQL keywords targeting downstream database operations
    • Recursive injection: Injected text instructs the model to inject into its own output
    • XSS in agent context: Script/event handlers in content processed by agents
  • Bash parameter expansion evasion: c${u}rl, w''get, r""m bypass naive pattern matching
  • Natural language indirection: instructions phrased as natural language requests rather than commands
  • Attacks succeed because each component alone appears benign; the combination is malicious

Implications for hook defenses:

  • Bash hooks need expansion normalization before pattern matching
  • Output scanning must check for cross-domain patterns (SQL + injection, XSS + injection)
  • NL indirection has inherent FP risk — deterministic hooks can only catch keyword patterns
  • Recursive injection is particularly dangerous for multi-agent systems

Plugin controls:

  • bash-normalize.mjs: Strips '', "", ${x}, \ before pattern matching
  • injection-patterns.mjs: HYBRID_PATTERNS for P2SQL, recursive, XSS
  • injection-patterns.mjs: NL indirection MEDIUM patterns (high FP caution)
  • post-mcp-verify.mjs: Hybrid pattern check on tool output
  • Gap: Novel NL indirection phrasing evades keyword patterns

7. Google DeepMind — CaMeL Defense Proposal (2025)

Key findings:

  • Proposed data flow tagging: track provenance of data through agent tool chains
  • Each data item receives a tag (hash) when produced by a tool
  • Tags propagate when data flows from one tool's output to another's input
  • Trifecta with linked data flows (provenance-tracked) has higher confidence than coincidental trifecta
  • Full CaMeL requires platform-level control plane — not implementable in hook layer

Implications for hook defenses:

  • Lightweight data-tagging (~30% of benefit, ~5% of complexity) is feasible in hooks
  • Hash first 200 chars of tool output as data tag; check substring match in next tool input
  • Linked flows elevate trifecta severity (higher confidence of intentional exfiltration chain)
  • Full provenance tracking requires platform support beyond what hooks can provide

Plugin controls:

  • post-session-guard.mjs: SHA-256 data tag on tool output, substring match on next input
  • Linked-flow trifecta reported with elevated severity
  • State file extended with dataTag field per entry
  • Gap: Substring matching is approximate; transformed data loses tag linkage

Summary: What Deterministic Hooks Can and Cannot Defend

Can defend (raise attack cost):

  • Known injection patterns (regex matching on critical/high/medium patterns)
  • Known evasion techniques (Unicode normalization, bash expansion, base64 decoding)
  • Known bad packages (blocklist-based supply chain protection)
  • Structural anomalies (trifecta patterns, behavioral drift, data volume spikes)
  • Known sensitive paths and secret patterns

Cannot defend (fundamental limitations):

  • Novel natural language indirection without keyword patterns
  • Adaptive attacks from motivated human red-teamers (100% ASR per joint paper)
  • Long-horizon attacks spanning hundreds of steps without detectable pattern
  • Semantic-level prompt injection (meaning-preserving rewording)
  • CLAUDE.md loading before hooks execute (Anthropic platform limitation)
  • Full data provenance tracking (requires platform-level control plane)

Design philosophy (v5.0):

  1. Defense-in-depth: Multiple independent layers, each raising attack cost
  2. Honest limitations: Document what cannot be defended, don't claim prevention
  3. Advisory over blocking: MEDIUM patterns advise, never block (FP risk)
  4. Opt-in enforcement: Rule of Two blocking requires explicit opt-in
  5. Adaptive testing: Red-team with mutations, not just fixed payloads

Last updated: v5.0 S7 — Knowledge files + attack scenario expansion Sources verified against published papers as of 2026-04