Kjell Tore Guttormsen f460814fe9 chore: WIP marketplace doc adjustments across plugins

Pre-trekexecute snapshot of in-progress CLAUDE.md/SKILL.md edits and
extracted docs/ files. Captured as one commit so /trekexecute claude-design
can run against a clean working tree.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-18 12:04:02 +02:00

2.9 KiB

Raw Blame History

LLM Security — Defense philosophy (v5.0)

Imported from CLAUDE.md via @docs/defense-philosophy.md.

Prompt injection is structurally unsolvable with current architectures (joint paper, 14 researchers, 95-100% ASR against all 12 tested defenses). v5.0 does not claim to "prevent" injection. Instead, it implements defense-in-depth:

Broader detection — MEDIUM advisory for obfuscation signals (leetspeak, homoglyphs, zero-width, multi-language), Unicode Tag steganography, bash expansion evasion
Increased attack cost — Rule of Two detection (configurable block/warn/off for lethal trifecta; default warn, blocks on high-confidence trifectas in opt-in block mode; distributed trifectas across MCP servers are detected but not blocked by default), bash normalization before gate matching
Longer monitoring windows — 100-call long-horizon alongside 20-call sliding window, slow-burn trifecta detection, behavioral drift via Jensen-Shannon divergence
Architectural constraints — opportunistic byte-matching of truncated output fingerprints (first 200 bytes, SHA-256/16-hex tag; not semantic lineage; trivially bypassed by mutation or summarisation of tool output), sub-agent delegation tracking, HITL trap detection. Inspired by CaMeL (DeepMind, 2025), but this is a lightweight byte-fingerprint, not semantic capability tracking
Honest documentation — Known Limitations section acknowledges what deterministic hooks cannot detect

Bash evasion layers (T1-T6): bash-normalize.mjs collapses six known obfuscation techniques before gate matching as a defense-in-depth layer. T1 empty quotes (rm''-rf), T2 ${} parameter expansion, T3 backslash continuation, T4 tab/whitespace splitting, T5 ${IFS} word-splitting, T6 ANSI-C hex quoting ($'\x72\x6d'). These layers complement — not replace — Claude Code 2.1.98+ harness-level protections. Full reference: docs/security-hardening-guide.md.

Opus 4.7 system card alignment:

System card §5.2.1 (agentic safety evaluations) documents that multi-layer defenses outperform single-layer defenses against adaptive attacks. This plugin's posture (prompt-scan + pathguard + trifecta-guard + pre-compact-scan operating in depth) matches that guidance.
System card §6.3.1.1 (instruction following and hierarchy) documents that Opus 4.7 interprets agent instructions more literally. Stacked imperatives (e.g., "MUST NOT do X") are therefore less useful than tool-level enforcement via tools: frontmatter. Agent files in this plugin have been updated accordingly.
See docs/security-hardening-guide.md §5 for the full mapping.

What v5.0 cannot do:

Prevent adaptive attacks from motivated human red-teamers (100% ASR per joint paper)
Fix CLAUDE.md loading before hooks (platform limitation)
Detect novel NL indirection without ML
Prevent long-horizon attacks without detectable patterns
Provide formal worst-case guarantees

2.9 KiB Raw Blame History

LLM Security — Defense philosophy (v5.0)

2.9 KiB

Raw Blame History