test(ultraplan-local): add skill-factory calibration fixtures

3 source/draft pairs that pin n-gram-overlap verdicts to representative prose:
- accepted (containment 0.014 / longestRun 3)
- needs-review (containment 0.211 / longestRun 12)
- rejected (containment 0.676 / longestRun 74)

Topics: session-start hooks, subagent delegation, output styles — proximate to
the production source material the skill-factory will ingest.

Plan: .claude/projects/2026-04-18-skill-factory-fase-1-mvp/plan.md (step 5)
Kjell Tore Guttormsen 2026-04-18 15:16:28 +02:00
commit 486f544d39
7 changed files with 280 additions and 0 deletions


@@ -0,0 +1,42 @@
# Skill-factory calibration fixtures
These fixtures calibrate the IP-hygiene thresholds used by `scripts/ngram-overlap.mjs`. Each
pair (`source-*`, `draft-*`) is hand-tuned so that the n-gram containment verdict lands in a
specific band, anchoring the empirical thresholds against representative prose.
## Pairs
| Pair | Target verdict | Containment | Longest run | Notes |
|------|----------------|-------------|-------------|-------|
| `source-accepted.md` → `draft-accepted.md` | **accepted** | 0.014 | 3 | Heavy paraphrase; concept-equivalent without phrase reuse |
| `source-needs-review.md` → `draft-needs-review.md` | **needs-review** | 0.211 | 12 | Mixed: paraphrased frame, retained domain phrasing |
| `source-rejected.md` → `draft-rejected.md` | **rejected** | 0.676 | 74 | Light edit on top of source; verbatim runs survive |
## Verdict bands
The verdict bands match the constants in `scripts/ngram-overlap.mjs`:
- **accepted** — containment < 0.15 AND longestRun < 8
- **needs-review** — between accepted and rejected
- **rejected** — containment ≥ 0.35 OR longestRun ≥ 15
If you change the thresholds in `ngram-overlap.mjs`, re-verify each fixture pair to confirm
the calibration still holds. The fixtures are content-stable; the thresholds are the variable.
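As a sketch of that banding (constant names here are illustrative; the authoritative values live in `scripts/ngram-overlap.mjs`):
```js
// Illustrative banding logic mirroring the documented thresholds;
// not a copy of the script's actual identifiers.
const ACCEPT_CONTAINMENT = 0.15;
const ACCEPT_LONGEST_RUN = 8;
const REJECT_CONTAINMENT = 0.35;
const REJECT_LONGEST_RUN = 15;

function verdict({ containment, longestRun }) {
  if (containment >= REJECT_CONTAINMENT || longestRun >= REJECT_LONGEST_RUN) return 'rejected';
  if (containment < ACCEPT_CONTAINMENT && longestRun < ACCEPT_LONGEST_RUN) return 'accepted';
  return 'needs-review';
}

verdict({ containment: 0.014, longestRun: 3 });  // 'accepted'
verdict({ containment: 0.211, longestRun: 12 }); // 'needs-review'
verdict({ containment: 0.676, longestRun: 74 }); // 'rejected'
```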
## Why these topics
The fixtures use Claude Code reference prose (session-start hooks, subagent delegation, output
styles) so they live near the kind of source material the skill-factory will actually paraphrase
in production. Drift between fixture domain and production domain would weaken the calibration
signal.
## Regeneration
These files are committed to the repo as ground-truth fixtures. Do not regenerate them ad hoc —
edit deliberately, re-run the verification commands listed in `plan.md` Step 5, and commit
intentionally.
```bash
node scripts/ngram-overlap.mjs tests/fixtures/skill-factory/draft-accepted.md \
tests/fixtures/skill-factory/source-accepted.md
```
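To re-verify all three pairs in one pass, the same invocation generalizes (paths as listed in the table above):
```bash
for verdict in accepted needs-review rejected; do
  node scripts/ngram-overlap.mjs \
    tests/fixtures/skill-factory/draft-$verdict.md \
    tests/fixtures/skill-factory/source-$verdict.md
done
```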


@@ -0,0 +1,55 @@
---
name: session-startup-hook-pattern
description: Pattern for warming up agent context at conversation boot
layer: pattern
cc_feature: hooks
source: ./source-accepted.md
concept: warm-start briefing via boot hook
last_verified: 2026-04-18
ngram_overlap_score: null
review_status: pending
---
## Use this when
You want each new conversation to begin with the agent already aware of project status — pending tasks, recent commits, key file paths — instead of burning early turns on orientation.
## Shape
Bind one script to the boot lifecycle event. Inside, gather whatever situational data a fresh conversation should open with: git log summary, todo file digest, branch identity, env snapshot. Format as terse markdown headings. Emit on stdout. The runtime grafts whatever you print into the model's opening context window.
## Forces
Cold turns cost real seconds. Anything you compute inside this script delays the first user prompt visibly. Therefore: prefer cached lookups, avoid synchronous calls to remote services, and timebox expensive operations. If a piece of data takes longer than roughly 200 ms to fetch, demote it from the briefing or front-load a cache refresh asynchronously elsewhere.
Multiple installation layers can each register their own boot script. They run in this order: marketplace-supplied first, your personal home-level second, and project-rooted third. Outputs concatenate without merging. Two scripts both reporting the branch name will both show up. There is no built-in dedupe.
Crash semantics are partial-write friendly: anything you streamed to stdout before exiting non-zero still reaches the agent. The fix is to assemble the full briefing in a buffer and only flush on the success path. That way a midpoint failure leaves an empty intro rather than a torn one.
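A minimal sketch of that buffer-then-flush discipline as a Node script (the briefing sources, `TODO.md` and git, are illustrative):
```js
#!/usr/bin/env node
// Hypothetical boot script: assemble the whole briefing in memory,
// flush once on success, emit nothing on failure.
import { execSync } from 'node:child_process';
import { readFileSync } from 'node:fs';

const chunks = [];
try {
  chunks.push('### Branch', execSync('git branch --show-current').toString().trim());
  chunks.push('### Recent commits', execSync('git log --oneline -5').toString().trim());
  chunks.push('### Open todos', readFileSync('TODO.md', 'utf8').slice(0, 2000));
} catch {
  // Midpoint failure: exit without writing, so the agent sees an
  // empty briefing rather than a torn one.
  process.exit(1);
}
// Single flush on the success path.
process.stdout.write(chunks.join('\n') + '\n');
```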
## Gotchas
- No native simulator. Test by piping a hand-crafted payload directly into the script and reading the stdout it returns. Keep one canonical payload checked in.
- No selective firing — boot scripts cannot be scoped to subdirectories or file patterns. Workaround: branch inside the script body and short-circuit early when the heuristic does not match.
- Privileges are unfiltered: these scripts run with full operator credentials before any agent sandbox spins up. Audit third-party scripts before enabling them, and pin their versions to defeat silent updates.
- Telemetry is your problem: the runtime does not emit structured execution logs. If you want metrics, append to a local jsonl yourself from inside the script.
## Anti-patterns
- Doing remote network calls on the hot path without caching.
- Skipping output buffering and streaming partial briefings.
- Allowing two layers to print conflicting information without coordinating.
- Trusting installed marketplace scripts implicitly.
## Decision quick-check
Pick this approach when:
- Pending state across sessions is non-trivial enough to merit the boot delay.
- You can keep the script under a 250 ms wall-clock budget.
- The information surfaced changes meaningfully between sessions.
Skip it when:
- Project state rarely shifts; static instructions in CLAUDE.md serve the same purpose at zero startup cost.
- The data you would print is sensitive and could leak into copies of the conversation transcript.
- You are still iterating on what should appear; build the briefing as a plain script the operator runs manually before nailing it down as a boot binding.


@@ -0,0 +1,57 @@
---
name: subagent-delegation-pattern
description: Pattern for spawning isolated worker contexts to handle bounded sub-tasks
layer: pattern
cc_feature: subagents
source: ./source-needs-review.md
concept: context-isolated worker delegation
last_verified: 2026-04-18
ngram_overlap_score: null
review_status: pending
---
## Use this when
A chunk of work is logically self-contained, would otherwise drown the main thread in noise, or could profitably run in parallel with other branches of investigation. Reach for delegation whenever the orchestrator's transcript would be polluted by intermediate steps that nobody upstream cares about.
## Shape
The orchestrator constructs a complete prompt at spawn time and hands it off to a worker. That worker operates in its own conversation, with its own context window, and answers with one structured response when finished. The parent treats that response as a tool result and proceeds. Whatever the worker reads, generates, or discards never appears in the orchestrator's transcript — only the final return value crosses the boundary. That bound is precious in long sessions where every token of orchestrator context counts.
For parallel decomposition, the parent kicks off several workers in a single turn and waits for them all to complete. The runtime executes them concurrently, so a research task that would otherwise be three sequential delegations finishes in roughly the time of the slowest single one. For workflows that legitimately decompose into independent investigations, the speedup is substantial.
## Forces
Each spawn carries a startup tax. The runtime loads a fresh system prompt and the relevant tool schemas — several thousand tokens before the worker has done anything useful. Spawning ten workers to do work that one careful Grep would have handled is wasteful. Heuristic: do not delegate anything you could finish in two or three direct tool calls in the main thread.
Tool scoping cuts both noise and overhead. A search worker has no need for write or edit. A summarizer has no business running shell commands. Tightening the toolset reduces the chance of off-task behavior and shrinks the system prompt overhead. The default open-tool-set is appropriate only for general-purpose subagents that genuinely need flexibility.
Briefing is the second discipline. The worker has no view into the parent's broader conversation. Anything implicit upstream — user intent, file paths already in play, active constraints — has to be restated explicitly inside the spawn payload. Skip that step and the worker drifts into plausible-looking tangents that have nothing to do with what the parent was actually trying to accomplish.
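A sketch of a well-briefed, narrowly scoped spawn payload (every field name here is hypothetical, not a documented runtime schema):
```js
// Illustrative spawn payload: restate upstream context explicitly and
// scope the toolset to what the task actually needs.
const worker = {
  task: 'Find every call site of parseConfig() and report file:line pairs',
  // The worker sees none of the parent conversation; restate it.
  context: [
    'Repo root: ~/projects/acme',
    'Only src/ is in scope; ignore generated code under dist/',
  ].join('\n'),
  tools: ['read', 'grep', 'glob'], // no write, no shell
  // Structure the return value up front so the parent can parse it.
  resultSchema: { callSites: [{ file: 'string', line: 'number' }] },
};
```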
## Gotchas
- Trust failure is the largest pitfall. The parent only sees the worker's final structured response. If the worker hallucinates that it wrote a file, the orchestrator has no easy way to verify. Best practice: independently confirm every claim that matters — read the file, check the test result, inspect the diff.
- Failure handling is on you. The runtime does not auto-retry. When a worker crashes, decide deliberately whether to retry, fall back to in-line execution, or surface the failure to the user. Cap retries at two or three to prevent loops, and log enough to iterate on the prompt later (a minimal wrapper is sketched after this list).
- Cost compounds quickly. Workers bill against the same account as the orchestrator. A workflow that spawns five workers per orchestrator turn can multiply token spend by an order of magnitude. Watch the burn rate and treat hitting cost ceilings as a signal that something is over-delegated.
- Deep hierarchies seldom pay off. Workers can spawn further workers, producing a tree, but in practice the overhead compounds at every layer and trust verification becomes intractable. One orchestrator, one layer of workers — that is the default.
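A minimal sketch of the retry discipline from the failure-handling note above (`spawnWorker` is a hypothetical stand-in for whatever delegation primitive the orchestrator exposes):
```js
// Bounded retries: cap attempts, log each failure, then surface the
// error instead of looping.
async function delegateWithRetry(spawnWorker, payload, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await spawnWorker(payload);
    } catch (err) {
      lastError = err;
      console.error(`worker attempt ${attempt}/${maxAttempts} failed: ${err.message}`);
    }
  }
  throw lastError;
}
```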
## Anti-patterns
- Delegating work that two or three direct tool calls would have covered.
- Granting the open default toolset when a narrow scope would suffice.
- Trusting a worker's self-report without separate verification.
- Building deep hierarchies where every additional layer compounds overhead and weakens the trust chain.
## Decision quick-check
Pick this approach when:
- The sub-task is bounded and produces an output you can structure ahead of time.
- You can brief it adequately in a single spawn prompt without dragging in the whole transcript.
- The context savings or parallelism gain pay back the spawn tax.
Skip it when:
- The work fits in two or three direct tool calls from the main thread.
- The brief required would essentially recapitulate the orchestrator's whole conversation anyway.
- The result is hard for you to verify and the cost of a fabricated success would be material.


@@ -0,0 +1,55 @@
---
name: output-styles-pattern
description: Pattern for reshaping how the agent presents responses
layer: pattern
cc_feature: output-styles
source: ./source-rejected.md
concept: presentation-only voice and format directives
last_verified: 2026-04-18
ngram_overlap_score: null
review_status: pending
---
## Use this when
You want to change the voice, format, or rendering of agent output without changing what the agent actually does. Output styles let operators reshape how Claude Code presents its responses without changing the underlying behavior of the agent. They live as small markdown files in a known directory and bind by name. When a style is active, the runtime injects its directives into the system prompt area before the agent generates each turn.
## Shape
A style file is a short markdown document. A handful of bullets describing the desired tone, target reading level, and any structural conventions like always using level-three headings or always wrapping commands in fenced code blocks.
The runtime does not enforce any of these directives mechanically. It is the agent itself that reads the style file at turn time and chooses to comply. This means style enforcement is best-effort and degrades when the agent's primary task pulls strongly in a different direction.
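A hypothetical style body, to scale (the file name and every directive are illustrative):
```markdown
<!-- engineer-brief.md -->
- Terse, direct tone; skip pleasantries.
- Assume a senior-engineer reader; no beginner explanations.
- Use level-three headings for all section structure.
- Wrap every command in a fenced code block.
```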
## Forces
Layering matters. Styles can be defined at the user level, the project level, or supplied by an installed plugin. When more than one style is active simultaneously, the runtime concatenates them in a documented order. There is no merging or conflict resolution. If two styles disagree about whether to use markdown tables, both directives are presented to the agent, which then has to pick one. Practitioners learn to keep styles narrow and orthogonal so that conflicts are rare.
Performance is rarely an issue with styles. They add a small fixed cost to each turn for prompt assembly and that is essentially all. Unlike session-start hooks, they do not block first-turn rendering. Unlike subagent spawns, they do not introduce additional billable workers. The cost is purely the marginal tokens of the style body itself, multiplied by the number of turns in the session.
## Gotchas
- A pitfall worth flagging is the temptation to encode behavior in styles when a hook or a skill would be more appropriate. Styles change presentation, not capability. If you find yourself writing imperative instructions like "always run the linter before reporting success," that belongs in a hook, not a style.
- Testing styles is awkward because their effect is on free-form output. The standard approach is to keep a small set of canonical prompts and run them once with the style and once without, then eyeball the diff.
- Security boundaries are minimal because styles only inject text into the prompt area. They cannot run code, cannot read files, and cannot make network calls. The risk surface is therefore mostly social.
- Observability is essentially DIY. The runtime does not log which style was active for which turn or whether the agent obeyed the style's directives.
## Anti-patterns
- Encoding procedural behavior in a style instead of a hook.
- Stacking many overlapping styles whose directives conflict.
- Trusting a third-party plugin's style without reading its body.
- Treating style enforcement as a hard guarantee rather than best-effort guidance.
## Decision quick-check
Pick this approach when:
- You have multiple distinct audiences and consistent voice matters across sessions.
- The directive is purely about format, voice, or rendering — not capability.
- You can keep the style narrow enough to compose with others without conflict.
Skip it when:
- The behavior you want is procedural and would belong in a hook or skill.
- The team has only one audience and the default voice is already adequate.
- You need hard enforcement guarantees rather than best-effort agent compliance.


@@ -0,0 +1,23 @@
# Session-Start Hooks Reference
Session-start hooks fire when a Claude Code conversation begins. Their job is to inject opening context into the agent's working memory before the user has typed anything. Practitioners reach for them when a project carries cross-cutting state that must be surfaced on every session entry: open todo items, recent commit history, current branch, the location of relevant config files, and so on.
The runtime invokes these hooks once per session. Each hook receives a JSON payload describing the session id, the working directory at startup, the launch flags, and a few environment hints. Output written to stdout is appended verbatim into the system prompt area before the agent reads its first user message. A non-zero exit code blocks the session from progressing. Implementations should treat that path with care — a misbehaving session-start hook can soft-brick the project for the operator until they remember to disable the binding.
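An illustrative payload shape matching the fields described above (the key names are assumptions, not a documented schema):
```json
{
  "sessionId": "b3f1c2e0",
  "cwd": "/home/operator/projects/acme",
  "launchFlags": ["--resume"],
  "env": { "TERM": "xterm-256color" }
}
```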
A typical pattern is the briefing script. The hook reads a TODO file, summarizes recent activity in git, and emits a short markdown block with three or four headings. The agent then begins its first turn already aware of where the user left off. This collapses the warm-up phase that would otherwise consume several tool calls every time someone opens the project.
Performance discipline matters. Session-start hooks add latency to every cold start. Operators have reported sessions that take eight or ten seconds just to render the first prompt because a chain of synchronous network calls runs at session start. Cache where possible. Keep the hot path local. If you genuinely need a remote lookup, make it asynchronous and wrap the result so a slow upstream cannot stall the entire session.
Conflict handling shows up when multiple layers register a session-start hook. Plugin hooks land first, then user-level hooks, then project-level hooks. Each layer's stdout output is concatenated in that order. There is no merging or deduplication. A plugin and the user might both decide to print the current branch name, and the agent will see both. The runtime makes no attempt to resolve such collisions; the operator must coordinate.
Failure recovery deserves explicit thought. If a session-start hook crashes mid-way, partial stdout that was already emitted may still reach the agent. This can produce a confusing initial state where the briefing was truncated and the agent reasons over a half-written summary. Defensive implementations buffer their entire output and emit it as a single write at the very end, after all I/O has succeeded. That way a crash leaves a clean empty briefing rather than a partial one.
Testing these hooks is awkward because there is no native simulation harness. The standard workaround is to invoke the hook script directly with a synthesized JSON payload on stdin, then inspect the resulting stdout. Operators with tight feedback loops keep a small fixture file containing the canonical payload and run the hook through it after every change. This catches the most common failure modes — JSON parse errors, missing keys, broken pipes — without needing to spin up a real session.
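In practice that check is a single pipe (both paths hypothetical):
```bash
node .claude/hooks/session-start.mjs < tests/fixtures/session-start-payload.json
```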
Security boundaries deserve a second look. Session-start hooks run with the operator's full filesystem privileges, before any sandbox is established for the agent itself. A compromised hook can read any path the operator can read. Plugin marketplaces distribute hooks as plain shell or node scripts, often without strong signing. The mitigation is review discipline: read the hook source before installing a plugin, and pin plugin versions so an upstream rug-pull does not silently swap the script.
Observability is mostly DIY. The runtime does not log hook execution to any structured channel. If you want telemetry — how often each hook ran, how long it took, whether it succeeded — you must wire it yourself, typically by appending to a local jsonl file from inside the hook script. A few teams have built lightweight aggregators that scrape these jsonl files into a dashboard, but no shared community tool exists for this yet.
The flag landscape is small. There is no built-in flag to skip session-start hooks for a particular run. Operators sometimes work around this by toggling a sentinel environment variable that their own hook checks at the top, exiting fast when present. This is purely a convention; the runtime will not interpret it.
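The guard itself is one line at the top of the hook (the variable name is whatever the team agrees on):
```js
// Sentinel check: a team convention the runtime does not interpret.
if (process.env.SKIP_SESSION_HOOK) process.exit(0);
```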
Looking ahead, the most-requested feature on community boards is selective firing — the ability to bind a session-start hook only when the session opens in a particular subdirectory, or only when a particular file pattern matches. The current matcher syntax is too coarse for this, and operators end up writing internal switch logic inside the hook to bail out early when the heuristic does not apply. A future runtime release may address this, but as of the current shipping version it remains an open gap that operators must paper over themselves.


@@ -0,0 +1,23 @@
# Subagent Delegation Reference
Subagents are isolated worker contexts that the primary agent can spawn to handle bounded sub-tasks. Each subagent runs in its own conversation, with its own context window, and returns a single structured response when finished. The orchestrating agent treats this response as a tool result and continues from there. The pattern shines when a piece of work is independent enough that running it in the main thread would either pollute the context or block other concurrent work.
The headline benefit is context isolation. The primary thread can stay focused on the user's intent while a subagent burns its own context on noisy intermediate steps — searching the codebase, reading large files, summarizing transcripts. Whatever the subagent reads or generates does not appear in the orchestrator's transcript. Only the final return value crosses the boundary. This bound is precious in long sessions where every token of orchestrator context counts.
The second benefit is parallelism. The orchestrator can launch several subagents in a single turn and wait for them all to finish. The runtime executes them concurrently. A research task that would have taken three sequential subagent calls finishes in roughly the time of the slowest single call. For workflows that legitimately decompose into independent investigations, the speedup is substantial.
There are real costs. Each subagent loads a fresh system prompt and any allow-listed tool schemas. That startup cost is several thousand tokens per subagent. Spawning ten subagents to do work that one careful Grep would have handled is wasteful. The break-even threshold depends on subagent complexity and on how much orchestrator context you save by not inlining the work, but the heuristic is: do not spawn a subagent for anything you could complete in two or three direct tool calls in the main thread.
The biggest pitfall is trust failure. The orchestrator only sees the subagent's final structured response. If the subagent hallucinates that it wrote a file, the orchestrator has no easy way to verify. Best practice is to have the orchestrator independently confirm every claim that matters: read the file, check the test result, inspect the diff. Treat the subagent's self-report as input to verification, not as ground truth.
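A sketch of that verification step (the claimed path is illustrative):
```js
import { existsSync, readFileSync } from 'node:fs';

// Treat the subagent's structured response as input to verification,
// not as ground truth.
const claim = { wroteFile: 'src/generated/types.ts' };

if (!existsSync(claim.wroteFile)) {
  throw new Error(`subagent claimed to write ${claim.wroteFile}, but it is missing`);
}
// Check content, not just existence: an empty file also counts as failure.
if (readFileSync(claim.wroteFile, 'utf8').trim().length === 0) {
  throw new Error(`${claim.wroteFile} exists but is empty`);
}
```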
Subagent prompts are the second pitfall. The subagent does not see the orchestrator's full conversation; it only receives the prompt the orchestrator constructs at spawn time. Anything the orchestrator implicitly knew about the user's intent, the file paths discussed, or the constraints in play must be re-stated in the spawn prompt. Otherwise the subagent will go off on tangents that look plausible in isolation but miss the point of the parent task. Briefing matters.
Tool scoping is the third dimension. The orchestrator can restrict the tool set the subagent receives. A code-search subagent does not need Write or Edit. A summarizer subagent does not need Bash. Tighter scopes reduce both the chance of off-task behavior and the system prompt overhead. Most workflows benefit from this discipline. The default open-tool-set is appropriate only for general-purpose subagents that genuinely need flexibility.
Failure modes deserve explicit thought. If a subagent crashes or returns an error, the orchestrator receives that as a tool error and must decide whether to retry, fall back to in-line execution, or propagate the failure to the user. The runtime does not auto-retry. Build the retry policy into the orchestrator's logic. Limit retries to two or three to prevent loops. Log the failure mode so the operator can iterate on the prompt.
Cost accounting matters. Subagents bill against the same account as the orchestrator. A workflow that spawns five subagents per orchestrator turn can multiply token spend by an order of magnitude. Watch the burn rate. If your workflow is hitting cost ceilings, the first place to look is whether subagents are being used for problems that did not warrant the overhead.
Composability deserves a final note. Subagents can themselves spawn further subagents, producing a tree. In practice this rarely pays off; the overhead compounds at every layer and trust verification becomes intractable. Keep the tree shallow — one orchestrator, one layer of workers — unless the problem genuinely requires deeper hierarchy.
Looking forward, community discussions show steady demand for richer return-value schemas, persistent subagent state across calls, and cleaner cancellation semantics. None of these are shipping today. Operators must work within the constraints of single-shot stateless workers with the result schema they define themselves at spawn time.


@@ -0,0 +1,25 @@
# Output Styles Reference
Output styles let operators reshape how Claude Code presents its responses without changing the underlying behavior of the agent. They live as small markdown files in a known directory and bind by name. When a style is active, the runtime injects its directives into the system prompt area before the agent generates each turn. The agent treats those directives as binding instructions about format, voice, and rendering, and adapts its output accordingly.
The simplest style files are short. A handful of bullets describing the desired tone, target reading level, and any structural conventions like always using level-three headings or always wrapping commands in fenced code blocks. The runtime does not enforce any of these directives mechanically. It is the agent itself that reads the style file at turn time and chooses to comply. This means style enforcement is best-effort and degrades when the agent's primary task pulls strongly in a different direction.
A common pattern is the audience-specific style. A team might keep one style tuned for non-technical stakeholders, with longer prose, fewer code blocks, and a polite explanatory tone, alongside another style tuned for engineers, with terse responses, dense code, and minimal hedging. Operators switch styles per session depending on who is asking. The style file itself stays in version control so the conventions are reviewable and shareable.
Layering matters. Styles can be defined at the user level, the project level, or supplied by an installed plugin. When more than one style is active simultaneously, the runtime concatenates them in a documented order. There is no merging or conflict resolution. If two styles disagree about whether to use markdown tables, both directives are presented to the agent, which then has to pick one. Practitioners learn to keep styles narrow and orthogonal so that conflicts are rare.
A pitfall worth flagging is the temptation to encode behavior in styles when a hook or a skill would be more appropriate. Styles change presentation, not capability. If you find yourself writing imperative instructions like "always run the linter before reporting success," that belongs in a hook, not a style. Styles whose body looks like a procedure are usually mis-scoped and will not be followed reliably because the agent reads them as voice guidance rather than as obligations.
Performance is rarely an issue with styles. They add a small fixed cost to each turn for prompt assembly and that is essentially all. Unlike session-start hooks, they do not block first-turn rendering. Unlike subagent spawns, they do not introduce additional billable workers. The cost is purely the marginal tokens of the style body itself, multiplied by the number of turns in the session.
Testing styles is awkward because their effect is on free-form output. The standard approach is to keep a small set of canonical prompts and run them once with the style and once without, then eyeball the diff. Some teams have built lightweight golden-file harnesses for this, but no shared community tool exists. Most operators rely on quick manual checks during style development and trust the style to keep working in normal use.
Security boundaries are minimal because styles only inject text into the prompt area. They cannot run code, cannot read files, and cannot make network calls. The risk surface is therefore mostly social: a malicious style could try to coax the agent into ignoring safety instructions or revealing prompt-internal information. The mitigation is review discipline. Read the style body before installing a third-party plugin that ships one, the same way you would read any prompt-injecting artifact.
Observability is essentially DIY. The runtime does not log which style was active for which turn or whether the agent obeyed the style's directives. If you want to track style adherence, you will need to build it yourself, typically by sampling sessions and comparing the output to the style spec. A few teams have wired this into their prompt evaluation suites.
The flag landscape is mostly per-session. There is no global way to disable styles for a particular run, but operators can switch to a no-op style or remove the binding before launching. Some setups bind styles to environment variables so that a single export controls which voice the session uses, but this is a convention, not a runtime feature.
Looking ahead, the most-requested feature is conditional style activation: the ability to bind a style only when certain conditions are met, like when the working directory matches a pattern or when a specific user is operating the session. Currently this kind of selectivity has to be handled by manually swapping styles between sessions, which is friction operators would gladly trade for declarative bindings. A future runtime release may address this.
The takeaway is that styles are a low-risk, low-cost lever for tuning how the agent talks, and they pay off most when teams have multiple distinct audiences or when consistent voice across sessions matters for the brand or the team's external communication. Used within those constraints, they earn their keep without surprising anyone.