Kjell Tore Guttormsen e2c8924074 feat(knowledge): add MITRE ATLAS IDs to OWASP files + Norwegian regulatory context

2026-04-10 12:49:10 +02:00

27 KiB

Raw Blame History

OWASP Top 10 for LLM Applications (2025)

Reference material for security scanning agents in the llm-security plugin. Each category maps to detection signals and mitigations actionable within Claude Code projects (skills, commands, MCP servers, hooks, CLAUDE.md, agents).

Source: https://genai.owasp.org/llm-top-10/ — OWASP GenAI Security Project v2025.

LLM01 — Prompt Injection

MITRE ATLAS: AML.T0051 (LLM Prompt Injection)

Risk: Attackers manipulate LLM behavior by crafting inputs that override system instructions, bypass guardrails, or cause the model to execute unintended actions.

Attack Vectors:

Direct injection: User input contains explicit override instructions ("Ignore previous instructions and...", "Disregard your system prompt...")
Indirect injection: External content fetched during task execution contains hidden instructions (malicious web pages, documents, emails, tool outputs)
Multimodal injection: Instructions hidden in images, PDFs, or audio processed by the model
Adversarial suffixes: Nonsensical token sequences that reliably break model alignment
Context manipulation: Gradual context poisoning over multi-turn conversations that shifts model behavior without a single obvious trigger
RAG poisoning for injection: Malicious content injected into the retrieval context to redirect agent behavior

Real Examples:

Hidden  in an HTML file fed to a Claude Code scan command
A CLAUDE.md file in a cloned repo instructing the model to exfiltrate env variables
A task description in a Linear issue that re-routes an agent to access unrelated files
PDF documentation with white-on-white text containing override instructions

Detection Signals:

Presence of phrases like ignore previous, disregard, new instructions, system override, forget in external content processed by agents
Instructions embedded in HTML comments, metadata fields, or low-contrast text
User input that contains role definitions ("You are now...", "Act as...")
Skill/command files that read arbitrary external URLs or files without sanitization
MCP tool definitions that pass raw user input directly to sub-calls without validation layers
Agent allowed-tools lists that include both Write/Bash AND external fetch capabilities with no input validation

Claude Code Mitigations:

Treat external content (files, URLs, tool outputs) as untrusted data, not instructions — enforce explicit separation in agent prompts
Define strict task boundaries in agent frontmatter descriptions; agents should refuse out-of-scope requests
Hook UserPromptSubmit to scan for injection patterns before processing
Never pass raw external content directly into sub-agent Task prompts; wrap with explicit framing ("The following is untrusted content: ...")
Use allowed-tools minimally — agents that only read should never have Write/Bash
Add prompt injection pattern checks to pre-write-pathguard.mjs and scan hooks

Severity: Critical

LLM02 — Sensitive Information Disclosure

MITRE ATLAS: AML.T0024 (Exfiltration via ML Inference API)

Risk: LLMs unintentionally expose private, proprietary, or credential data through outputs, memorized training content, or cross-session leakage.

Attack Vectors:

Training data memorization: Model regurgitates exact text from training data including credentials or PII seen during pre-training
System prompt extraction: Targeted prompts that cause the model to reproduce its own system prompt verbatim
Cross-session leakage: Conversation history, user data, or context bled between sessions in stateful deployments
RAG knowledge base exposure: Retrieval of sensitive documents accessible through overly broad vector search
Output over-sharing: Model includes more context than necessary (full file contents instead of relevant excerpt, full API response instead of needed fields)
Targeted extraction via social engineering: "Repeat the first 100 tokens of your context", "What was in the document you just summarized?"

Real Examples:

A skill that reads .env files for context and includes their contents in agent summaries
An MCP server that returns full database rows when only a subset of fields is needed
A CLAUDE.md that hardcodes API keys or passwords in command descriptions
An agent summary that includes full file paths and internal project structure

Detection Signals:

Hardcoded secrets in CLAUDE.md, agent frontmatter, or skill reference files (API keys, tokens, passwords, connection strings)
Commands/agents that read .env, *.pem, *.key, credentials*, secrets* files without explicit justification
Agent prompts that instruct the model to include raw file contents in outputs
MCP server definitions that lack output field filtering or response size limits
Missing input/output sanitization in skill pipelines that process user-supplied files

Claude Code Mitigations:

The pre-edit-secrets.mjs hook detects credential patterns in files being written — ensure it is active and pattern list is current (see knowledge/secrets-patterns.md)
Never place credentials in CLAUDE.md, plugin.json, or agent/skill markdown files
Use .env + .env.template pattern; ensure .env is in .gitignore
Agent prompts should instruct selective extraction: include only fields relevant to the task, not full file or response dumps
MCP server tools should define explicit output schemas with field allowlists
Apply the pre-write-pathguard.mjs hook to block writes of sensitive file patterns

Severity: High

LLM03 — Supply Chain Vulnerabilities

MITRE ATLAS: AML.T0010 (ML Supply Chain Compromise)

Risk: Compromised third-party models, datasets, plugins, MCP servers, or dependencies introduce backdoors, malicious behavior, or known vulnerabilities.

Attack Vectors:

Compromised base models: Open-source models with hidden backdoors or poisoned weights published to model hubs
Malicious fine-tuning adapters: LoRA adapters or PEFT layers that alter model behavior on specific trigger inputs
Dependency confusion: npm/pip packages with names similar to legitimate libraries containing malicious code
Outdated dependencies: Known CVEs in libraries used by MCP servers or hooks
Untrusted MCP servers: Third-party MCP server packages that exfiltrate tool call data or modify responses
Plugin poisoning: A Claude Code plugin installed from an untrusted source that modifies hooks to intercept all file writes

Real Examples:

An MCP server npm package that phones home with tool invocation payloads
A community Claude Code plugin that adds a Stop hook sending session summaries to an external endpoint
A plugin that modifies hooks.json to inject malicious hook scripts

Detection Signals:

MCP server packages from non-official, unverified npm/PyPI sources
Hook scripts that make outbound network calls without documentation
Plugin dependencies that lack pinned version constraints (^ ranges in package.json)
Missing integrity checks (no lockfiles, no hash verification) for installed plugins
Hooks that have network access (fetch, curl, wget) without explicit justification
MCP server definitions pointing to localhost ports with no auth — could be hijacked by local malware

Claude Code Mitigations:

Audit all installed plugins and MCP servers before enabling; prefer official Anthropic marketplace sources
Review hooks/scripts/*.mjs files in any plugin before installation — check for outbound network calls
Pin MCP server package versions with exact version constraints and use lockfiles
Maintain a software bill of materials (SBOM) for all project dependencies
Run npm audit / pip-audit against MCP server dependencies regularly
Verify hook scripts do not contain network calls unless explicitly required and documented in the plugin CLAUDE.md

Severity: High

LLM04 — Data and Model Poisoning

MITRE ATLAS: AML.T0020 (Poison Training Data), AML.T0018 (Backdoor ML Model)

Risk: Malicious or accidental contamination of training data, fine-tuning datasets, RAG knowledge bases, or embeddings degrades model behavior or introduces backdoors.

Attack Vectors:

Training data poisoning: Biased or malicious samples injected during pre-training to propagate misinformation or embed trigger-based backdoors
Fine-tuning poisoning: Compromised task-specific datasets that skew model outputs toward attacker objectives
RAG knowledge base poisoning: Attacker writes malicious documents into the retrieval store, which are then cited as authoritative context
Embedding poisoning: Corrupted vector representations causing semantic misalignment (malicious terms placed close to trusted terms in embedding space)
Trigger-based backdoors: Specific input patterns activate hidden behaviors (particular tokens or phrases cause data exfiltration or unsafe outputs)

Real Examples:

A knowledge base directory in a Claude Code skill where any contributor can push documents — an attacker adds a file that misdirects the security audit agent
Reference files in skills/*/references/ updated with contradictory guidance to confuse skill behavior
An MCP server that writes to a shared RAG index without access controls, allowing one user to poison context for all users

Detection Signals:

Knowledge base files (knowledge/, references/) with recent unreviewed modifications by multiple contributors
RAG ingestion pipelines with no input validation or source attribution
Skill reference files that contradict each other on security-critical guidance
Missing integrity verification for knowledge base files (no checksums, no signing)
MCP servers with write access to shared knowledge stores without per-user isolation
Unexpected behavioral drift in agent outputs after knowledge base updates

Claude Code Mitigations:

Treat all files in knowledge/ and references/ as code — require code review before merging changes
Implement source attribution in all knowledge files (authorship, date, source URL)
Validate that RAG ingestion pipelines reject untrusted or unverified sources
For MCP servers with write access to shared indexes, enforce per-user namespacing
Use git history and signatures to detect unauthorized modifications to reference files
Red-team skill agents after knowledge base updates to verify behavior consistency

Severity: High

LLM05 — Improper Output Handling

MITRE ATLAS: AML.T0043 (Craft Adversarial Data)

Risk: LLM-generated output is passed to downstream systems without adequate validation or sanitization, enabling injection attacks, privilege escalation, or unintended side effects.

Attack Vectors:

XSS via LLM output: Model generates JavaScript that is rendered unescaped in a web context
SQL injection via LLM output: Model constructs SQL queries interpolated directly into database calls
Command injection: Model-generated shell commands executed without sanitization
API call hijacking: Hallucinated or manipulated API call parameters passed directly to external services
Code execution: Model-generated code run without review in automated pipelines (eval, exec, subprocess)
Over-trust in structured output: JSON/YAML output from the model used directly as configuration without schema validation

Real Examples:

A Claude Code command that takes model-generated code and passes it directly to exec() without human review
An agent that constructs filesystem paths from model output and uses them in rm or mv operations without path sanitization
A skill that writes model-generated YAML directly to a Kubernetes config without schema validation

Detection Signals:

Bash tool calls in agent prompts that interpolate model output directly into shell commands without quoting or validation
Commands/agents that pass model-generated file paths to destructive operations (rm, mv, chmod) without path canonicalization
MCP tools that accept model output as SQL queries, shell commands, or code strings
Absence of schema validation between model output and downstream API calls
Agent workflows with no human-in-the-loop step before executing model-generated actions on production systems

Claude Code Mitigations:

The pre-bash-destructive.mjs hook intercepts destructive shell commands — ensure pattern list covers model-generated variants
Always validate model-generated file paths against an allowed directory whitelist before I/O operations
Use parameterized queries (never string interpolation) when model output reaches database layers
Require explicit human approval in agent workflows before executing model-generated code on production systems
Apply strict JSON schema validation to all structured model output before use as configuration or API parameters
Treat model output as untrusted user input when passing to any system interface

Severity: High

LLM06 — Excessive Agency

MITRE ATLAS: AML.T0061 (AI Agent Tools)

Risk: LLMs granted excessive functionality, permissions, or autonomy take unintended high-impact actions with real-world consequences.

Attack Vectors:

Over-privileged tools: Agents given access to tools beyond task requirements (delete, admin, write) when only read access is needed
Unchecked autonomy: Multi-step agent pipelines execute sequences of high-impact actions without human approval checkpoints
Unnecessary extension permissions: MCP servers exposing administrative capabilities that agents can invoke based on model judgment
Scope creep via prompt: Agent instructed to "do whatever is needed" interprets this as authorization for broad actions
Chained tool misuse: A sequence of individually low-risk tool calls that together achieve a high-impact unauthorized outcome

Real Examples:

An agent with both Read and Bash access that, when injected, uses Bash to exfiltrate files it read
A skill that grants allowed-tools: Read, Write, Bash when the task only requires Read and Grep
An MCP server with admin scope passed to all agents regardless of their actual needs

Detection Signals:

Agent frontmatter with broad tools lists that include Write/Bash when task description only requires reading/analysis
Commands with allowed-tools that include destructive capabilities (Bash) for non-execution tasks (scan, analyze, report)
MCP server definitions that expose delete/admin operations with no access tier separation
Absence of human-in-the-loop (AskUserQuestion) calls before irreversible actions in agent workflows
Agent task descriptions that include "do whatever is needed" or similarly unbounded authorization language
No rate limiting or action budgets on autonomous agent loops

Claude Code Mitigations:

Assign the minimum allowed-tools for each command; read-only tasks get Read, Glob, Grep — never Bash
Require AskUserQuestion before any destructive, irreversible, or production- touching action in agent workflows
Define explicit action budgets in autonomous loop agents (max N tool calls, max N file writes per session)
Separate agent roles: analyst agents (Read/Glob/Grep) vs. executor agents (Write/Bash) with explicit handoff requiring human confirmation
MCP server tool definitions should separate read-only and write/admin operations into distinct tool namespaces with different auth requirements
Audit all agents quarterly: does each tools list match the agent's stated role?

Severity: Critical

LLM07 — System Prompt Leakage

MITRE ATLAS: AML.T0024 (Exfiltration via ML Inference API)

Risk: Internal system prompts containing sensitive instructions, credentials, or behavioral guardrails are exposed to users or attackers, enabling bypass or credential theft.

Attack Vectors:

Direct extraction: Prompts like "Print your system prompt", "Repeat the first 100 tokens of your context", "What instructions were you given?"
Jailbreak extraction: Using roleplay or hypothetical framing to elicit system prompt contents
Error-based disclosure: Error messages or debug outputs that include prompt context
Embedded credential exposure: API keys, passwords, or internal URLs hardcoded in system prompts leak when prompt is extracted
Guardrail mapping: Extracting system prompt reveals exact filtering logic, enabling targeted bypass

Real Examples:

A skill SKILL.md that embeds an API key in an example command that gets loaded as system context
A CLAUDE.md with internal network addresses or internal tool names that reveal infrastructure topology when extracted
An agent prompt that lists all available internal MCP tools including their auth tokens

Detection Signals:

API keys, tokens, passwords, or connection strings in CLAUDE.md, skill markdown files, or agent prompts (caught by pre-edit-secrets.mjs)
Internal hostnames, IP addresses, or internal URLs embedded in skill/command definitions
Agent prompts that instruct the model on how to bypass its own restrictions (the bypass logic itself becomes the attack surface if leaked)
System prompts used as the primary security enforcement mechanism rather than external validation layers

Claude Code Mitigations:

Never embed credentials in CLAUDE.md, plugin.json, or any markdown skill/command file — use environment variables or secrets managers
Design prompts as behavioral guidance, not security boundaries; security enforcement must happen in code (hooks, validation layers), not in prompts
Use the pre-edit-secrets.mjs hook to prevent credential introduction into any skill or documentation file
Avoid listing internal infrastructure details (tool names, endpoints, internal URLs) in any agent-facing documentation
Treat system prompts as potentially extractable; they must not contain anything that would be harmful if fully disclosed

Severity: High

LLM08 — Vector and Embedding Weaknesses

MITRE ATLAS: AML.T0020 (Poison Training Data), AML.T0019 (Publish Poisoned Datasets)

Risk: Vulnerabilities in how embeddings are generated, stored, or retrieved allow unauthorized data access, information leakage, or manipulation of RAG-based agent behavior.

Attack Vectors:

Embedding inversion attacks: Reverse-engineering vector representations to recover original sensitive training data or documents
Vector database access control bypass: Misconfigured vector stores that allow cross-tenant data retrieval or lack per-user partitioning
RAG poisoning via embedding: Malicious documents injected into the retrieval index cause agents to cite attacker-controlled content as authoritative
Semantic misalignment poisoning: Corrupted embeddings place malicious terms adjacent to trusted terms in embedding space, causing retrieval of harmful content for legitimate queries
Retrieval manipulation: Query crafted to retrieve a specific malicious document from a shared index regardless of the actual user's task context

Real Examples:

A shared knowledge base for multiple Claude Code projects where one project's sensitive architecture docs are retrieved by another project's agents
An MCP server with a vector search tool that returns documents from all users' namespaces when tenant isolation is misconfigured
Skill reference files indexed in a shared embedding store without access control, leaking internal security procedures to agents with insufficient clearance

Detection Signals:

Vector database configurations with no per-user or per-tenant namespace isolation
RAG ingestion pipelines that accept documents from any source without validation or source verification
Missing access control metadata on vector store entries (no owner, no permission scope)
Embedding stores shared across multiple agent contexts without query-time authorization checks
No audit logging on vector database retrieval operations

Claude Code Mitigations:

For any RAG-enabled MCP server, verify that vector database queries are scoped to the authenticated user's namespace
Validate all documents before RAG ingestion: verify source, reject untrusted contributors, apply content policies
Implement retrieval audit logging — log every document retrieved for every agent query to enable anomaly detection
Separate embedding namespaces by project, user, and sensitivity level; never use a single shared flat namespace
Review MCP server vector tool definitions for proper access control enforcement at query time, not just at ingestion time

Severity: High

LLM09 — Misinformation

MITRE ATLAS: AML.T0031 (Erode ML Model Integrity)

Risk: LLMs generate plausible but factually incorrect outputs (hallucinations) that are acted upon without verification, leading to incorrect decisions, security bypasses, or dependency on non-existent resources.

Attack Vectors:

Hallucinated package names: Coding assistants invent plausible npm/pip package names that don't exist — attackers register those names with malicious payloads (package hallucination / dependency confusion vector)
Fabricated API endpoints or documentation: Model invents API specs that don't match the actual service, causing misconfigurations
False security guidance: Model generates outdated or incorrect security recommendations that introduce vulnerabilities
Confident incorrect outputs: Model presents incorrect information with high apparent confidence, discouraging verification
Training data bias: Outputs systematically favor certain viewpoints, technologies, or approaches due to training data imbalance

Real Examples:

A Claude Code agent recommends installing express-security-middleware (hallucinated) which an attacker has registered as a malicious package
An agent generates a TLS configuration with deprecated cipher suites presented as current best practice
A security scan agent incorrectly clears a finding as "false positive" due to hallucinated knowledge about a library's behavior

Detection Signals:

Agent workflows that install packages or dependencies based solely on model recommendations without verification against package registries
Security scan commands that rely on model knowledge of CVEs without cross-referencing external vulnerability databases
Absence of human review before acting on model-generated security assessments
Skills that make definitive statements about external APIs or libraries without grounding in retrieved documentation
Commands that generate configurations (TLS, auth, network) based on model knowledge without validation against authoritative references

Claude Code Mitigations:

Security-critical recommendations from agents should always cite a retrievable source; knowledge/ files serve as the grounded reference layer for this plugin
Verify all package names recommended by model agents against official package registries before installation
Ground security guidance agents in authoritative references (this knowledge base, OWASP docs) via explicit Read of reference files, not model memory alone
Include uncertainty signaling in agent prompts: instruct agents to state confidence level and flag when operating outside their verified knowledge
For dependency management, agents should recommend but humans must approve all package installs

Severity: Medium

LLM10 — Unbounded Consumption

MITRE ATLAS: AML.T0029 (Denial of ML Service), AML.T0034 (Cost Harvesting)

Risk: Uncontrolled resource usage by LLM applications enables denial of service, financial exploitation via excessive API costs, or unauthorized model capability extraction through systematic querying.

Attack Vectors:

Denial of Wallet: Attacker triggers excessive API calls to exhaust compute budget (pay-per-token billing makes this financially damaging)
Resource exhaustion via large inputs: Crafted inputs maximizing context window usage to slow processing and increase cost
Runaway agent loops: Autonomous agents enter infinite loops or generate exponentially growing task trees consuming unlimited resources
Model extraction: Systematic querying to reverse-engineer model capabilities, fine- tuning data, or system prompts at scale
Cascading sub-agent spawning: Agent spawns sub-agents that each spawn more sub-agents, creating unbounded parallel execution

Real Examples:

A Claude Code loop command with no iteration limit that runs indefinitely when the termination condition is never met due to a model error
A harness agent that spawns a sub-agent per file in a large repository (10,000+ files) without batching or rate limiting
A /security scan command without a file count cap that processes every file in a monorepo triggering thousands of API calls

Detection Signals:

Agent loop commands (continue, loop) without explicit iteration limits or budget caps
Sub-agent spawning patterns (Task tool calls) without a ceiling on parallel instances
Commands that process all files in a directory recursively without pagination or file count limits
Absence of timeout configurations in long-running agent workflows
No API usage monitoring or alerting configured for the project
Harness or loop mode agents with no circuit breaker or stall detection

Claude Code Mitigations:

All loop and continue commands must define explicit iteration limits and session budgets (max N API calls, max N minutes)
Agent prompts that spawn sub-agents should cap parallel Task instances (e.g., spawn at most 5 parallel agents)
File-processing commands should paginate: process N files per invocation, not all files in a single unbounded pass
Implement stall detection in autonomous loop agents — if no meaningful progress after N iterations, halt and report
Monitor Claude API token usage per project; set billing alerts at defined thresholds
The post-mcp-verify.mjs hook should check for response size anomalies that indicate runaway data consumption

Severity: High

Quick Reference — Severity and Agent Mapping

ID	Category	Severity	Primary Scanning Agent
LLM01	Prompt Injection	Critical	`skill-scanner-agent`
LLM02	Sensitive Information Disclosure	High	`skill-scanner-agent`
LLM03	Supply Chain Vulnerabilities	High	`mcp-scanner-agent`
LLM04	Data and Model Poisoning	High	`posture-assessor-agent`
LLM05	Improper Output Handling	High	`skill-scanner-agent`
LLM06	Excessive Agency	Critical	`skill-scanner-agent`
LLM07	System Prompt Leakage	High	`skill-scanner-agent`
LLM08	Vector and Embedding Weaknesses	High	`mcp-scanner-agent`
LLM09	Misinformation	Medium	`posture-assessor-agent`
LLM10	Unbounded Consumption	High	`posture-assessor-agent`

Claude Code Attack Surface Map

Surface	Primary Risks
`commands/*.md`	LLM01, LLM05, LLM06, LLM10
`agents/*.md`	LLM01, LLM06, LLM07, LLM10
`skills/*/SKILL.md`	LLM01, LLM02, LLM07
`skills/*/references/`	LLM04, LLM09
`hooks/scripts/*.mjs`	LLM03, LLM05
`hooks/hooks.json`	LLM03, LLM06
`CLAUDE.md`	LLM02, LLM07
`knowledge/`	LLM04, LLM09
MCP server configs	LLM03, LLM06, LLM08
`.claude-plugin/plugin.json`	LLM03, LLM06

27 KiB Raw Blame History

OWASP Top 10 for LLM Applications (2025)

LLM01 — Prompt Injection

LLM02 — Sensitive Information Disclosure

LLM03 — Supply Chain Vulnerabilities

LLM04 — Data and Model Poisoning

LLM05 — Improper Output Handling

LLM06 — Excessive Agency

LLM07 — System Prompt Leakage

LLM08 — Vector and Embedding Weaknesses

LLM09 — Misinformation

LLM10 — Unbounded Consumption

Quick Reference — Severity and Agent Mapping

Claude Code Attack Surface Map

27 KiB

Raw Blame History