feat(ultraplan-local): v2.1.0 — dynamic quality-gated interview

Replace hardcoded Q1-Q8 in /ultrabrief-local with a section-driven
completeness loop (Phase 3) and a draft/review/revise loop with
brief-reviewer as stop-gate (Phase 4). Quality drives the interview,
not a question counter.

brief-reviewer now emits a machine-readable JSON block with per-dimension
scores (1-5) and detail arrays alongside the existing prose report;
planning-orchestrator continues to consume the prose verdict unchanged.

Phase 4 gate: all dimensions >= 4 AND research_plan = 5. On fail, a
targeted follow-up is generated from the weakest dimension's detail
field and the draft is re-reviewed. Max 3 review iterations bound cost;
exhaustion writes brief.md with brief_quality: partial and an explicit
Brief Quality section. Force-stop surfaces per-dimension findings before
the user chooses continue or partial.

Not breaking. /ultrabrief-local [--quick] <task> interface unchanged.
--quick now means compact start with escalation, not a max-N cap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Kjell Tore Guttormsen 2026-04-18 09:43:43 +02:00
commit 1634197853
8 changed files with 501 additions and 120 deletions

View file

@ -6,7 +6,7 @@ model: opus
allowed-tools: Agent, Read, Glob, Grep, Write, Edit, Bash, AskUserQuestion
---
# Ultrabrief Local v2.0
# Ultrabrief Local v2.1
Interactive requirements-gathering command. Produces a **task brief** — a
structured markdown file that declares intent, goal, constraints, and an
@ -33,11 +33,14 @@ foreground if the user opts in.
Parse `$ARGUMENTS`:
1. If arguments start with `--quick`: set **mode = quick**. Interview is
shorter (3-4 questions instead of 5-8). Strip the flag; remainder is
the task description.
1. If arguments start with `--quick`: set **mode = quick**. The interview
starts more compactly (fewer opening probes per section) but still
escalates automatically if quality gates fail. There is no hard cap on
question count — quality drives the loop, not a counter. Strip the flag;
remainder is the task description.
2. Otherwise: **mode = default**. Full interview (5-8 questions).
2. Otherwise: **mode = default**. Interview probes each section until the
completeness gate (Phase 3) and brief-review gate (Phase 4) both pass.
If no task description is provided, output usage and stop:
@ -46,8 +49,8 @@ Usage: /ultrabrief-local <task description>
/ultrabrief-local --quick <task description>
Modes:
default Full interview (5-8 questions) → brief with research plan
--quick Short interview (3-4 questions) → brief with research plan
default Dynamic interview until quality gates pass — brief with research plan
--quick Compact start; still escalates on weak sections — brief with research plan
Examples:
/ultrabrief-local Add user authentication with JWT tokens
@ -86,74 +89,144 @@ If the directory already exists and is non-empty, warn and ask:
Use `AskUserQuestion` with three options. If "pick new slug", ask for a
new slug and restart Phase 2.
## Phase 3 — Interview
## Phase 3 — Completeness loop
Use `AskUserQuestion` throughout. **Ask one question at a time.** Never
dump all questions at once. Follow up based on answers.
Phase 3 is a **section-driven completeness loop**. Instead of a numbered
question list, maintain an internal state of brief sections and keep asking
until every required section has substantive content. Quality drives the
loop — there is no hard cap on question count.
### Interview flow
Use `AskUserQuestion` for every question. **Ask one question at a time.**
Never dump multiple questions.
**Question 1 (always) — Intent:**
> "What is the motivation for this task? Why does it matter? What happens
> if we don't do it? (The plan will use this to justify every implementation
> decision.)"
### Internal state
**Question 2 (always) — Goal:**
> "What does success look like concretely? Describe the end state in
> 1-3 sentences — specific enough to disagree with."
Track this structure in memory as the loop runs:
**Question 3 (always) — Success criteria:**
> "How will we verify it's done? List 2-4 specific, testable conditions
> (commands to run, observations, metrics). Avoid 'it works'."
```
state = {
intent: { content: "", probes: 0 }, # required
goal: { content: "", probes: 0 }, # required
success_criteria: { content: [], probes: 0 }, # required
research_plan: { topics: [], probes: 0 }, # required
non_goals: { content: [], probes: 0 }, # optional
constraints: { content: [], probes: 0 }, # optional
preferences: { content: [], probes: 0 }, # optional
nfrs: { content: [], probes: 0 }, # optional
prior_attempts: { content: "", probes: 0 }, # optional
question_history: [] # list of questions asked
}
```
**Question 4 (usually) — Non-goals:**
> "What is explicitly NOT in scope? (Prevents scope-guardian flagging gaps
> for things we deliberately don't do.)"
`content` is raw user answers merged; `probes` is how many times this
section has been asked; `question_history` prevents re-asking the same
variant twice.
**Question 5 (conditional) — Constraints:**
> "Are there technical, time, or resource constraints? (Dependencies,
> compatibility, deadlines, budget.)"
>
> Skip if the user already mentioned constraints.
### Required sections (initial-signal gate)
**Question 6 (conditional) — Preferences:**
> "Preferences on libraries, patterns, or architectural style?"
>
> Skip for small tasks or when constraints already imply them.
Four sections MUST have substantive content before exiting Phase 3:
**Question 7 (conditional) — Non-functional requirements:**
> "Performance, security, accessibility, or scalability targets? (Quantified
> where possible.)"
>
> Skip if not applicable.
1. **Intent** — full sentence or paragraph (not a single word or phrase)
2. **Goal** — full sentence or paragraph
3. **Success Criteria** — at least one concrete, testable item
4. **Research Plan** — either ≥ 1 topic probed, OR the user has explicitly
confirmed "no external research needed"
**Question 8 (conditional) — Prior attempts:**
> "Has this been tried before? What worked or failed?"
>
> Skip if the task is clearly fresh.
"Substantive" means: non-empty, not a trivial one-word reply, not
"I don't know" without a recorded assumption. The strict falsifiability
check happens in Phase 4 (brief-review gate); Phase 3 is just the
initial-signal bar.
### Adaptive depth
Optional sections (Non-Goals, Constraints, Preferences, NFRs, Prior
Attempts) do not gate exit. If they remain empty after the required
sections pass, they will be recorded as "Not discussed — no constraints
assumed" in Phase 4's draft.
After each answer:
### Question bank (per section)
- **Detailed technical answer (2+ sentences, domain vocabulary):** skip
obvious follow-ups the user already covered. Aim for 4-5 total questions.
- **Short or uncertain answer ("I don't know", "not sure", vague):** offer
alternatives instead of open questions. Record uncertainty as an
open assumption.
- **"Skip" / "just make it" / "proceed":** stop interviewing. Write a
minimal brief from the task description and answers so far. Mark
uncovered sections as "Not discussed — no constraints assumed."
Pick the next question from the section's bank based on `content` and
`probes`. Wording must stay conversational — only the *selection* is
section-driven, not the tone.
### Quick mode
**Intent** (required):
- _Anchor_ (probes=0, content empty): "Why are we doing this? What is the
motivation, the user need, or the strategic context behind the task?"
- _Follow-up_ (probes≥1, content present but shallow): "What happens if
we do nothing? Who is affected?"
- _Sharpen_ (user mentioned a symptom): "You mentioned {X}. Is {X} the
symptom or the underlying cause?"
If **mode = quick**, ask only Questions 1, 2, 3, and 4. Maximum 4 questions.
Skip the rest.
**Goal** (required):
- _Anchor_: "Describe the end state in 13 sentences — specific enough to
disagree with."
- _Follow-up_: "How would you recognize this is done when looking at
the UI / API / codebase?"
### Research topic identification (CRITICAL)
**Success Criteria** (required):
- _Anchor_: "How do we verify it is actually done? List 24 concrete,
testable conditions — commands to run, observations, or metrics."
- _Sharpen_ (criterion is vague): "'{quoted criterion}' is subjective.
Which command, observation, or metric would prove this is met?"
- _Quantify_ (performance/quality claim): "You mentioned it should be
{fast/reliable/secure}. What number or threshold counts as success?"
As the interview progresses, identify topics that will need research for
the plan to be high-confidence. Watch for:
**Research Plan** (required, strictest):
- _Anchor_ (no topics yet): "Are there technologies, libraries, or
decisions in this task you do not have solid current knowledge of?
Examples might be library choice, a protocol, or a security pattern."
- _Per-topic sharpen_ (topic exists but incomplete): "For topic
'{title}': which parts of the plan depend on the answer? What
confidence level do you need — high, medium, or low?"
- _Scope question_: "Is '{topic}' answerable from the existing codebase,
from external docs, or both?"
- _Confirm none_ (user refuses all topics): "Confirming: no external
research needed — you already know everything the plan will depend on?"
**Non-Goals** (optional):
- _Anchor_: "What is explicitly NOT in scope? This prevents scope-guardian
from flagging gaps for things we deliberately don't do."
**Constraints** (optional):
- _Anchor_: "Technical, time, or resource constraints the plan must
respect? Dependencies, compatibility, deadlines, or budget."
- _Sharpen_: "You mentioned {deadline / budget / compatibility}. Is it
hard or guidance?"
**Preferences** (optional):
- _Anchor_: "Preferences for libraries, patterns, or architectural style?"
**NFRs** (optional):
- _Anchor_: "Performance, security, accessibility, or scalability targets?
Quantified wherever possible."
**Prior Attempts** (optional):
- _Anchor_: "Has this been attempted before? What worked or failed?"
### Selection rule
On each loop iteration:
1. Compute the next section to probe:
- If any required section is below the initial-signal gate → pick the
weakest required section in this priority order:
Intent → Goal → Success Criteria → Research Plan.
- Else if an optional section is clearly missing and likely material
to scope (heuristic: the task description hints at constraints or
NFRs) → probe it at most once.
- Else: exit Phase 3.
2. Within the chosen section, pick the question variant:
- If `probes == 0` and content is empty → _Anchor_.
- If content exists but is shallow → _Follow-up_ or _Sharpen_.
- If the section is Research Plan and topics exist → iterate per-topic
sharpen across incomplete topics.
3. Ensure the exact question is NOT already in `question_history`. If it
is, pick the next variant or skip to the next weakest section.
4. Ask via `AskUserQuestion`. Append question to history. Increment probes.
5. Record the answer into `content`. Never overwrite — merge.
### Research topic identification
As the user answers Intent, Goal, or Success Criteria, listen for:
- **Unfamiliar technologies** — libraries, frameworks, protocols not
clearly present in the codebase
@ -163,48 +236,226 @@ the plan to be high-confidence. Watch for:
- **Unknown integrations** — third-party APIs, external services
- **Compliance / legal** — GDPR, accessibility, industry regulations
For each potential topic, probe briefly:
> "Do you already know {topic}, or should I plan a research step for it?"
Record:
- Topic title (short)
- Why it matters for the plan
- Exact research question (phrased for `/ultraresearch-local`)
- Suggested scope (local / external / both)
When you hear one, add a *candidate* topic to `research_plan.topics` with
only a title and why-it-matters. Probe it on the next Research Plan
iteration using the per-topic sharpen question to fill in:
- Research question (must end in `?`)
- Required for plan steps
- Scope (local / external / both)
- Confidence needed (high / medium / low)
- Estimated cost (quick / standard / deep)
If the user says "I know this" — do not add it as a topic. Trust the user.
If the user says "I know this" to a candidate topic, remove it from the
list. Trust the user. If no topics emerge after probing, the user confirms
"no external research needed" → `research_plan` gate passes with 0 topics.
If no topics emerge: `research_topics = 0` is valid. Not every task needs
external research.
### Quick mode adjustments
## Phase 4 — Write the brief
If **mode = quick**:
- For optional sections, cap probes at 1 each. Do not revisit optional
sections during Phase 3.
- Required sections still have no probe cap — quality gate still applies.
- Prefer _Anchor_ variants over _Sharpen_ on the first pass.
Read the brief template:
### Force-stop path
If the user says "skip", "stop asking", "just proceed", "enough", or
similar, break the loop immediately:
- Mark any required sections still below the initial-signal gate as
`{ incomplete_forced_stop: true }` in state.
- Proceed to Phase 4 with a note that the brief will carry a reduced
confidence flag.
### Exit condition
Exit Phase 3 when:
- All four required sections meet the initial-signal gate, OR
- The user has force-stopped.
Report:
```
Phase 3 complete: {N} questions asked across {M} sections.
Proceeding to draft and review.
```
## Phase 4 — Draft, review, and revise
Phase 4 runs a **draft → brief-reviewer → revise** loop. The draft is
not written to disk until the brief-review quality gate passes (or the
iteration cap is hit). This ensures the brief that reaches `/ultraplan-local`
has already survived a critical review.
Read the brief template first:
`@${CLAUDE_PLUGIN_ROOT}/templates/ultrabrief-template.md`
Write the brief to: `{PROJECT_DIR}/brief.md`.
### Loop bound
Fill in every section based on interview answers:
**Maximum 3 review iterations.** This bounds cost in the worst case while
leaving room for two rounds of targeted follow-ups.
- **Frontmatter:** populate `task`, `slug`, `project_dir`, `research_topics`,
`research_status: pending`, `auto_research: false` (will update in Phase 5
if user opts in), `interview_turns`, `source: interview`.
- **Intent:** expand the user's motivation into 3-5 sentences. This is
load-bearing — be explicit.
### Iteration step-by-step
**Step 4a — Draft in memory**
Build the brief text from Phase 3 state by filling the template:
- **Frontmatter:** populate `task`, `slug`, `project_dir`, `research_topics`
(count of topics), `research_status: pending`, `auto_research: false`
(will update in Phase 5 if user opts in), `interview_turns` (total
questions asked across Phase 3 + Phase 4), `source: interview`.
- **Intent:** expand the user's motivation into 35 sentences. Load-bearing.
- **Goal:** concrete end state.
- **Non-Goals:** from user's answer, or empty list with note.
- **Constraints / Preferences / NFRs:** from user's answers. Mark
"Not discussed — no constraints assumed" if not covered.
- **Success Criteria:** falsifiable commands/observations. Reject vague
criteria from the user — ask for a concrete version if needed.
- **Research Plan:** one section per identified topic, with the full
structure from the template. If 0 topics, write the "No external
research needed" note.
- **Open Questions / Assumptions:** from "I don't know" answers and
implicit gaps.
- **Prior Attempts:** from user's answer, or "None — fresh task."
- **Non-Goals:** from state, or "- None explicitly stated" bullet if empty.
- **Constraints / Preferences / NFRs:** from state, or "Not discussed — no
constraints assumed" note if empty.
- **Success Criteria:** falsifiable commands/observations from state.
- **Research Plan:** one `### Topic N: {title}` section per topic with the
full structure from the template. If 0 topics: write the "No external
research needed — user confirmed solid knowledge of all plan
dependencies" note.
- **Open Questions / Assumptions:** from any `"I don't know"` answers
recorded during Phase 3, plus implicit gaps.
- **Prior Attempts:** from state, or "None — fresh task."
**Step 4b — Write draft to disk**
Write the draft to `{PROJECT_DIR}/brief.md.draft` (not `brief.md` — the
final file is only written after the gate passes).
**Step 4c — Launch brief-reviewer**
Launch the `brief-reviewer` agent (foreground, blocking) with the prompt:
> "Review this task brief for quality: `{PROJECT_DIR}/brief.md.draft`.
> Check completeness, consistency, testability, scope clarity, and
> research-plan validity. Report findings, verdict, and the required
> machine-readable JSON block."
**Step 4d — Parse JSON scores**
Parse the agent's output. Locate the **last** fenced ```json``` block.
Extract per-dimension scores:
```
review = {
completeness: { score, gaps },
consistency: { score, issues },
testability: { score, weak_criteria },
scope_clarity: { score, unclear_sections },
research_plan: { score, invalid_topics },
verdict: "PROCEED | PROCEED_WITH_RISKS | REVISE"
}
```
**JSON fallback:** if the JSON block is missing, invalid, or a dimension
is missing, treat all dimensions as `score: 3` and set the `verdict` from
the prose verdict if present, otherwise `PROCEED_WITH_RISKS`. Emit an
internal note that the reviewer output was degraded. This ensures the
loop never deadlocks on a parser error.
**Step 4e — Gate evaluation**
The gate **passes** when all of the following are true:
- `completeness.score ≥ 4`
- `consistency.score ≥ 4`
- `testability.score ≥ 4`
- `scope_clarity.score ≥ 4`
- `research_plan.score == 5`
(Research Plan requires a perfect score because its format is checked
mechanically: ends in `?`, `Required for plan steps` filled, scope is
one of `local | external | both`, confidence is `high | medium | low`.
Anything less means at least one topic is malformed and planning will
stumble.)
**If gate passes:**
1. Move `brief.md.draft``brief.md` (atomic rename).
2. Delete the draft file if rename is not possible on the OS; write
`brief.md` fresh.
3. Break the loop and proceed to Step 4g.
**If gate fails AND iteration count < 3:**
1. Identify the weakest dimension (lowest score; tie broken by priority:
research_plan > testability > completeness > consistency > scope_clarity).
2. Generate a targeted follow-up question from the dimension's detail
field (gaps / issues / weak_criteria / unclear_sections / invalid_topics).
Example generators:
- `completeness.gaps: ["Non-Goals empty, unclear if deliberate"]`
→ "You did not specify anything out-of-scope. Is that deliberate, or
are there things we should explicitly exclude?"
- `testability.weak_criteria: ["'system should be fast'"]`
→ "'System should be fast' is not falsifiable. Which metric or
threshold proves this is met — e.g., p95 < 200ms, or throughput
≥ X requests/sec?"
- `research_plan.invalid_topics: [{"topic":"JWT","issue":"Required for plan steps empty"}]`
→ "For research topic 'JWT': which plan steps depend on the answer?
Give one or two concrete kinds of step (e.g., 'library selection',
'threat model', 'migration strategy')."
3. Ask via `AskUserQuestion`. Record the answer into Phase 3 state.
4. Return to Step 4a with incremented iteration count. The reviewer sees
an updated draft, so you MUST re-read the brief and regenerate the
review each iteration — do not reuse stale scores.
5. When launching the reviewer on iteration 2 or 3, include prior
questions in the prompt so it does not produce circular follow-ups:
> "Questions already asked during this interview: {list from
> question_history}. Focus on issues that remain after those answers —
> do not re-raise gaps that have already been addressed."
**If gate fails AND iteration count == 3 (loop exhausted):**
1. Move `brief.md.draft``brief.md`.
2. Add `brief_quality: partial` to the frontmatter (edit the file
post-rename — insert the key above the closing `---`).
3. Add a `## Brief Quality` section near the end with the failing
dimensions and their `detail` arrays from the final review, formatted:
```
## Brief Quality
Review loop exhausted after 3 iterations. The following dimensions
did not reach the pass threshold:
- **Research Plan (score 2/5):** Topic 'JWT library' missing
Required-for-plan-steps field.
- **Testability (score 3/5):** Success criterion "works correctly"
is not falsifiable.
Downstream planning will treat these as reduced-confidence areas.
```
4. Break the loop and proceed to Step 4g.
### Step 4f — Force-stop handling
If during any `AskUserQuestion` in Step 4e the user says "stop", "skip",
"enough", "just write it", or similar, do NOT exit the loop immediately.
Instead, surface the current review findings in plain text:
```
Brief-reviewer would flag these issues:
- Research Plan (score 2/5): Topic 'JWT library choice' missing Required-for-plan-steps field.
- Testability (score 3/5): Success criterion "works correctly" is not falsifiable.
Continue anyway? The plan will have lower confidence in these areas.
```
Then ask via `AskUserQuestion`:
| Option | Action |
|--------|--------|
| **Answer one more follow-up** | Return to Step 4e with the current weakest-dimension question. |
| **Stop now (accept partial brief)** | Finalize brief with `brief_quality: partial` and the `## Brief Quality` section (same path as iteration-cap exhaustion). Break loop. |
The force-stop path is distinct from a silent iteration cap: the user
sees exactly which dimensions are weak and chooses informed.
### Step 4g — Finalize
After the loop exits (pass, cap, or force-stop), ensure:
- `brief.md` exists at `{PROJECT_DIR}/brief.md`.
- `brief.md.draft` no longer exists.
- If the loop ended without a clean pass, frontmatter contains
`brief_quality: partial` and a `## Brief Quality` section exists.
- If the loop ended with a clean pass, `brief_quality` is either
absent or set to `complete`.
Populate the "How to continue" footer with the actual project path and
topic questions.
@ -212,6 +463,8 @@ topic questions.
Report:
```
Brief written: {PROJECT_DIR}/brief.md
Review iterations: {1..3}
Final quality: {complete | partial}
Research topics identified: {N}
```
@ -387,6 +640,8 @@ Append one record to `${CLAUDE_PLUGIN_DATA}/ultrabrief-stats.jsonl`:
"slug": "{slug}",
"mode": "{default | quick}",
"interview_turns": {N},
"review_iterations": {1..3},
"brief_quality": "{complete | partial}",
"research_topics": {N},
"auto_research": {true | false},
"auto_result": "{completed | cancelled | failed | manual}",
@ -417,4 +672,11 @@ Never let stats failures block the workflow.
7. **Auto mode blocks foreground.** If the user opts into auto, this
session waits for research + planning to complete. Document this in
the opt-in question.
8. **Privacy:** never log prompt text, secrets, or credentials.
8. **Quality gates, not question counts.** Phase 3 and Phase 4 are
quality-gated loops; do not enforce a hard cap on interview questions.
The brief-review gate (Phase 4) caps at 3 review iterations to bound
cost, but Phase 3 has no cap — required-section content drives exit.
9. **Never write `brief.md` while the review gate is still pending.**
Draft lives in `brief.md.draft` until the loop terminates. A caller
that sees `brief.md` must be able to trust that Phase 4 finished.
10. **Privacy:** never log prompt text, secrets, or credentials.