Steg 20 (remediation Wave 4 / S5, SOLO): measure whether the 7-agent long-form review stack carries redundant gates. Method: cross-reference each agent's check taxonomy against its in-repo fasit fixture; four fixtures (editorial, content, language, fact-reviewer) target the SAME Del 4 edition, enabling a real cross-gate overlap comparison on one piece (not a live run — fixtures' own live-run notes require a reload + cross-repo Maskinrommet access, out of scope). Finding: every gate has >=1 unique catch on Del 4. The four genuine overlaps (verbatim repetition, the Vi/Vi-i-Nav quote, the postulated number, the small-orgs thread) are each justified — a cold re-take (Endring 9's reason to exist), the same symptom via a different operation (flag-absence vs web-verify), or two distinct defects sharing a surface topic — with no subsumption either way. The fact-checker <-> fact-reviewer overlap is load-bearing (the pivot premise arrived after Step 5, so only the cold re-run caught it). Decision: NO TRIM. voice-scrubber has no fixture -> inconclusive; redundancy retained (Step 20 On-failure = skip). Counts unchanged 19 agents / 27 commands; count contract (EXPECT_AGENTS=19) untouched. test-runner 62/62 green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Long-form Review-Pass Overlap Measurement — Steg 20
Remediation Voyage, Wave 4 / S5. Measures whether the long-form review stack carries redundant gates, and trims only where a gate catches nothing the others don't. Written 2026-05-30, SOLO (no subagent fan-out).
The question and the trim rule
The long-form pipeline runs seven review agents. Endring 9 (v3.1.0) added a
cold/headless package (content-reviewer, language-reviewer, fact-reviewer)
whose agent prompts argue, in their own words, that they overlap the in-session
gates on purpose (fact-reviewer: «the redundancy is load-bearing, not
waste»; language-reviewer anti-pattern: «'De-duplicate' yourself against
editorial-reviewer — the overlap is the cold re-take»). Steg 20 tests that claim
against evidence instead of taking it on faith:
Trim a gate ONLY where it catches nothing the others don't (then merge/remove it + update the count contract). If the redundancy is justified, record that and keep it. If the fixture is insufficient to decide, record «inconclusive; redundancy retained» and do NOT trim. (Step 20 On-failure = skip the trim.)
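The rule above can be read as a three-way decision. The following is a minimal sketch, assuming each gate's catches can be modelled as a set of defect labels and that a missing fixture is represented as `None`; the names `Decision` and `trim_decision` are hypothetical and not part of the repo.

```python
from enum import Enum


class Decision(Enum):
    TRIM = "trim: merge/remove the gate and update the count contract"
    KEEP = "redundancy justified: record and keep"
    INCONCLUSIVE = "inconclusive; redundancy retained"


def trim_decision(gate_catches, other_catches):
    """Apply the Steg 20 trim rule to one gate.

    gate_catches:  set of defect labels the gate caught, or None when the
                   gate has no fixture to measure (hypothetical convention).
    other_catches: union of the defect labels caught by all other gates.
    """
    if gate_catches is None:
        # No evidence either way, so the trim is skipped.
        return Decision.INCONCLUSIVE
    if gate_catches <= other_catches:
        # The gate catches nothing the others don't.
        return Decision.TRIM
    return Decision.KEEP


print(trim_decision({"a", "c"}, {"a", "b"}).name)  # has a unique catch
print(trim_decision({"a"}, {"a", "b"}).name)       # fully covered by others
print(trim_decision(None, {"a", "b"}).name)        # no fixture
```

The subset test is deliberately strict: a single catch outside the other gates' union is enough to keep the gate.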
Method — and its honest limit
I measured the documented catch-sets: each agent's check taxonomy (the agent
.md) cross-referenced against its in-repo fasit (answer-key) fixture
(agents/fixtures/*-cases.md). I did not run the agents live: every fixture's
own live-run note states a live cold run needs (a) a session reload and (b)
read access to the frozen Del 4 draft in the Maskinrommet series folder —
cross-repo, explicitly out of scope this session. By each fixture's own
declaration the fasit is «the gold-standard of record» until both hold, so the
fasit catch-sets are the legitimate measurement surface.
The lucky break that makes this more than taxonomy-reasoning: four of the six
fixtures target the same edition — Del 4 (Security Champions, Maskinrommet).
editorial-reviewer reviewed v5 (2026-05-28, in-session); the cold trio
(content/language/fact-reviewer) re-read the frozen/pivoted version
(2026-05-29). That shared edition lets me compare what each gate actually caught
on one piece — a real cross-gate overlap measurement, not just a boundary
restatement.
| Fixture | Edition under review | Cases | Enables shared-edition compare? |
|---|---|---|---|
| editorial-reviewer-cases.md | Del 4 v5 (28.05, in-session) | 8 | ✅ yes |
| content-reviewer-cases.md | Del 4 frozen/pivoted (29.05, cold) | 6 | ✅ yes |
| language-reviewer-cases.md | Del 4 frozen (29.05, cold) | 6 | ✅ yes |
| fact-reviewer-cases.md | Del 4 frozen/pivoted (29.05, cold) | 6 | ✅ yes |
| persona-reviewer-cases.md | separate jargon-wall sample (+ documented Del 4 behaviour) | 6 axes | partial |
| fact-checker-cases.md | 3 generic reference claims (not Del 4) | 3 | role only |
| voice-scrubber | NO FIXTURE | — | ❌ inconclusive |
The seven agents — axis map
| Agent | Step | Axis (the one question it answers) | When | Fixture |
|---|---|---|---|---|
| fact-checker | 5 | factual truth — is it true? | in-session, moving draft | generic (3 claims) |
| editorial-reviewer | 5.5 | prose craft + narrative architecture — is it well-made? | in-session | Del 4 v5 |
| persona-reviewer | 2.5/6/9 | reader response — does it land? | in-session | sample + Del 4 behaviour |
| voice-scrubber | 4 | de-AI + chronicle voice drift — does it sound like the author? | in-session (applies edits) | none |
| content-reviewer | 6.5 | argument integrity — does the reasoning hold? | cold/frozen | Del 4 frozen |
| language-reviewer | 6.5 | Norwegian language — does it read clean? | cold/frozen | Del 4 frozen |
| fact-reviewer | 6.5 | factual truth, re-verified — is every claim, incl. pivot, true? | cold/frozen+pivoted | Del 4 frozen |
Per-reviewer catch table (what each gate caught on the fixtures)
Legend: U = unique catch (no other gate's fixture surfaces this defect) · O = overlaps another gate's catch (overlap analysed in the matrix below).
editorial-reviewer — Del 4 v5 (8 catches)
| # | Check | Defect caught | Sev | U/O |
|---|---|---|---|---|
| 1 | A1 | abstract figure never instantiated (craft/vividness) | REWORK | O → content C4 (adjacent) |
| 2 | P3 | postulated number, no source/hedge — flags absence, no search | REWORK | O → fact-reviewer F3 |
| 3 | A2 | trust-effect hypothesis with no SDT/theory anchor | BLOCK | U |
| 4 | A3 | broken series-title symmetry (part floats free) | REWORK | U |
| 5 | A4 | small-business addressee stranded — no usable action | BLOCK | O → content C5 (adjacent) |
| 6 | P2 | verbatim repetition | REWORK | O → language L1 |
| 7 | P1 | em-dash over-density | REWORK | U |
| 8 | P4 | prose-level internal contradiction (two passages) | BLOCK | O → content C3 (adjacent) |
content-reviewer — Del 4 frozen (6 catches) — argument integrity
| # | Check | Defect caught | Sev | U/O |
|---|---|---|---|---|
| 1 | C2 | Security-Champions pivot premise asserted unsupported | BLOCK | U |
| 2 | C5 | unanswered «what about small orgs?» objection | BLOCK | O → editorial A4 (adjacent) |
| 3 | C1 | logical hole «Champions finnes» → «dømmekraft bevart» («Champions exist» → «judgment preserved») | REWORK | U |
| 4 | C4 | role section needs one concrete org for the argument | REWORK | O → editorial A1 (adjacent) |
| 5 | C3 | recommendation delegates the judgment the series premise rules out | BLOCK | U |
| 6 | C2 | claimed gain (gevinst) assumes widespread org maturity | REWORK | U |
language-reviewer — Del 4 frozen (6 catches) — Norwegian language quality
| # | Check | Defect caught | Sev | U/O |
|---|---|---|---|---|
| 1 | L4 | quote error «Vi» vs «Vi i Nav» (wording misrepresents source) | BLOCK | O → fact-reviewer F2 |
| 2 | L2 | anglicism «adressere problemet» | REWORK | U |
| 3 | L2 | anglicism «på en daglig basis» | REWORK | U |
| 4 | L1 | verbatim repetition 3× across §1/§4/§6 | REWORK | O → editorial P2 |
| 5 | L3 | «det vises til» («reference is made to»): officialese (kanselli-stil) in a personal chronicle | REWORK | U |
| 6 | L5 | monotone cadence (5 same-length sentences) | NICE | U |
fact-reviewer — Del 4 frozen/pivoted (6 catches) — factual correctness (cold)
| # | Check | Defect caught | Verdict | U/O |
|---|---|---|---|---|
| 1 | F1 | pivot premise never met Step 5 (PIVOT-RISK headline) | 🔴 | U |
| 2 | F1+F2 | misattribution to wrong originator | 🔴 | U |
| 3 | F2 | quote precision «Vi» vs «Vi i Nav» (vs source) | 🟡 | O → language L4 |
| 4 | F3 | postulated number, no provenance — searches, finds none | 🟡 | O → editorial P3 |
| 5 | F1 | «Security Champions» as a settled standard that varies per org (PIVOT-RISK) | 🔴 | U |
| 6 | F4+F3 | secondary source for a precise figure («~a third» ≠ «37 %») | 🟡 | U |
fact-checker — role on Del 4 (generic fixture, 3 claims)
Catches truth defects cheaply and early, on the moving draft (Step 5). Its
fixture is 3 generic ground-truth claims (EU AI Act 🟢 / GPT-4-by-Anthropic 🔴 /
unverifiable 37 % 🟡), not Del 4. Its measured role on Del 4 is documented by
the fact-reviewer fixture: the Security-Champions pivot arrived after the
Step 5 sweep, so fact-checker structurally never saw the pivot premise. It
is necessary (early/cheap truth gate) but provably insufficient — which is the
entire reason fact-reviewer exists. U by pipeline position.
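The pipeline-position argument above can be made concrete with a small sketch. It assumes only the two step numbers from the axis map (fact-checker at Step 5, fact-reviewer at Step 6.5); the function name `gates_that_see` and the idea of tagging a claim with the step after which it entered the draft are illustrative assumptions, not something the pipeline itself records.

```python
# Step numbers for the two truth gates, taken from the axis map.
TRUTH_GATES = {"fact-checker": 5.0, "fact-reviewer": 6.5}


def gates_that_see(claim_added_after_step: float) -> list[str]:
    """Return the truth gates that run after a claim enters the draft."""
    return sorted(
        gate
        for gate, step in TRUTH_GATES.items()
        if step > claim_added_after_step
    )


# A claim present from the first draft is seen by both gates.
print(gates_that_see(0))
# The pivot premise entered after the Step 5 sweep had already run.
print(gates_that_see(5))
```

Under this model, removing the Step 6.5 entry leaves the second call with an empty list, which is the gap the cold re-run closes.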
persona-reviewer — resonance/response
On Del 4 the persona sweep returned 15 flags across 3 personas and every persona PASS / ready-to-publish (per the editorial fixture). Its own fixture (jargon-wall sample) shows the 6 response axes (Krok IKKE, Leder-takeaway IKKE, …). Catches reader-response defects no other gate measures. U by axis.
voice-scrubber — de-AI + chronicle voice drift
No fixture exists. Its axis (mechanical AI-tells + Norwegian-chronicle voice drift, judged against approved Norwegian editions) is measured by no other gate, and uniquely it applies edits (Pass 1) and maintains a drift-log — it is not even part of the review-report package. Overlap inconclusive from in-repo fixtures; see decision below.
Cross-gate overlap matrix (the shared Del 4 edition)
Four genuine overlaps surface on Del 4. The decisive test for each: does either gate's catch-set subsume the other's? In every case — no.
| # | Defect | Gates that catch it | Same defect or same symptom? | Subsumption? | Justification |
|---|---|---|---|---|---|
| O1 | verbatim repetition | editorial P2 (in-session, v5) ↔ language L1 (cold, frozen) | same defect | neither | Cold re-take. Editorial caught it in-session sharing the author's framing; language re-caught it cold on the frozen version. The agent prompts mandate this overlap explicitly. The value is the independent reading, not a second checklist. |
| O2 | quote «Vi» vs «Vi i Nav» | language L4 (BLOCK) ↔ fact-reviewer F2 (🟡) | same defect, two operations | neither | language flags the wording misrepresenting the source without web access; fact-reviewer verifies against the actual source via web search. Different tools, different severities — one catches it if the source is unreachable, the other if the wording reads clean but the source differs. |
| O3 | postulated number | editorial P3 (REWORK) ↔ fact-reviewer F3 (🟡) | same symptom, two operations | neither | editorial flags the absence of a source/hedge (no search); fact-reviewer searches for provenance and finds none. The prompts draw this boundary by hand. A bare number with a findable source passes editorial (it has none inline) but is exactly what fact-reviewer's search resolves. |
| O4 | small-orgs thread | editorial A4 (stranded addressee) ↔ content C5 (unanswered objection) | adjacent — different defects | n/a | Same surface topic (small orgs) decomposes into two genuinely different defects: A4 = «the small-business reader leaves with no action» (architecture); C5 = «the argument never meets the obvious counter and collapses for that class» (logic). Not redundancy — two gates needed to see both faces. |
Plus the fact-checker ↔ fact-reviewer time-axis overlap (deliberate, not in the matrix because it spans pipeline stages, not one defect): Step 5 runs in-session on the moving draft; Step 6.5 re-runs cold on the frozen/pivoted draft. Case 1 (pivot premise) is the proof it's load-bearing — the pivot arrived after Step 5, so only the cold re-run could catch it. Collapsing the two would re-open the exact gap that motivated Endring 9.
Adjacent (not overlap) pairs the prompts separate by design and the Del 4 cases confirm as distinct defects: editorial P4 (prose contradiction) vs content C3 (argument-logic contradiction); editorial A1 (vividness) vs content C4 (a load-bearing claim a skeptic won't believe abstractly).
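The subsumption test used in the matrix can be sketched with plain set operations. The defect labels below are hypothetical shorthand, one per row of the four Del 4 catch tables. As a modelling assumption, the O1 to O3 pairs share a single label, while adjacent pairs (for example A4 and C5) get distinct labels because the analysis treats them as different defects.

```python
from itertools import combinations

# Hypothetical defect labels, one per row of the four Del 4 catch tables.
# Assumption: O1-O3 pairs share a label; adjacent pairs get distinct labels.
CATCHES = {
    "editorial-reviewer": {
        "abstract-figure", "postulated-number", "no-theory-anchor",
        "series-title-symmetry", "stranded-addressee", "verbatim-repetition",
        "em-dash-density", "prose-contradiction",
    },
    "content-reviewer": {
        "pivot-premise-unsupported", "unanswered-small-orgs-objection",
        "logical-hole", "needs-concrete-org",
        "recommendation-delegates-judgment", "gain-assumes-maturity",
    },
    "language-reviewer": {
        "quote-vi-i-nav", "anglicism-adressere", "anglicism-daglig-basis",
        "verbatim-repetition", "kanselli-stil", "monotone-cadence",
    },
    "fact-reviewer": {
        "pivot-premise-unchecked", "misattribution", "quote-vi-i-nav",
        "postulated-number", "champions-as-settled-standard",
        "secondary-source-figure",
    },
}


def compare(a, b):
    """Return (shared defects, True if either catch-set contains the other)."""
    set_a, set_b = CATCHES[a], CATCHES[b]
    return set_a & set_b, (set_a <= set_b or set_b <= set_a)


for a, b in combinations(CATCHES, 2):
    shared, subsumed = compare(a, b)
    if shared:
        print(f"{a} <-> {b}: shared={sorted(shared)} subsumed={subsumed}")
```

With these labels the loop reports three overlapping pairs (O1, O2, O3), each with `subsumed=False`; the editorial and content sets share no label, which mirrors the "adjacent, different defects" reading of O4.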
Unique catch per gate — none is a subset of another
Every one of the seven has ≥1 catch no other gate's fixture surfaces:
- fact-checker — early/cheap truth on the moving draft; provably insufficient alone (never saw the pivot), which is the case for keeping fact-reviewer.
- editorial-reviewer — A2 theory-anchor and A3 series-title symmetry are pure blind spots no other gate measures (and were persona-blind on Del 4).
- persona-reviewer — reader response (Krok/resonans/takeaway); the only gate on that axis. The «PASS yet 8 editorial + 6 argument + 6 language points» result is the whole motivation for the stack.
- content-reviewer — argument logic (C1/C2/C3 unique; C5 is adjacent to editorial A4 but a distinct defect); the only gate that asks does the reasoning hold?
- language-reviewer — anglicisms, kanselli-stil, cadence; the only gate on Norwegian idiom/register/rhythm.
- fact-reviewer — the pivot-risk catches (Cases 1, 5); the only cold post-pivot truth re-run.
- voice-scrubber — de-AI tells + chronicle voice drift; the only gate that applies edits and keeps a drift-log.
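For the four gates with a Del 4 fixture, the "at least one unique catch" claim can be checked by tallying the U/O column of the catch tables. This is a small sketch: the strings below are the U/O marks copied row by row from those tables (adjacent pairs count as O here, as they do in the tables), and the names `MARKS` and `tally` are hypothetical.

```python
# U/O marks copied row by row from the four Del 4 catch tables.
MARKS = {
    "editorial-reviewer": "OOUUOOUO",
    "content-reviewer": "UOUOUU",
    "language-reviewer": "OUUOUU",
    "fact-reviewer": "UUOOUU",
}


def tally(marks: str) -> dict[str, int]:
    """Count unique (U) and overlapping (O) catches for one gate."""
    return {"unique": marks.count("U"), "overlap": marks.count("O")}


for gate, marks in MARKS.items():
    counts = tally(marks)
    print(f"{gate:20} unique={counts['unique']} overlap={counts['overlap']}")

# A gate would be a trim candidate only if it had no U mark at all.
gates_without_unique = [gate for gate, marks in MARKS.items() if "U" not in marks]
print("gates with no unique catch:", gates_without_unique)
```

The remaining three gates (fact-checker, persona-reviewer, voice-scrubber) are not in this tally because their uniqueness is argued by pipeline position or axis rather than from Del 4 rows.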
Trim decision — NO TRIM
No gate catches nothing the others don't. Every gate has ≥1 unique catch on the fixtures, and every one of the four genuine overlaps (O1–O4) is justified — a cold re-take (O1), the same symptom via a different operation (O2, O3), or two distinct defects sharing a surface topic (O4) — with no subsumption in any direction. The fact-checker ↔ fact-reviewer overlap is load-bearing by construction (proven by the pivot-premise catch). Per the Steg 20 rule this is the «redundancy is justified — record and keep» case for all measurable gates.
voice-scrubber specifically: no in-repo fixture, so its overlap cannot be
measured here → «measurement inconclusive; redundancy retained pending a real
edition» (Step 20 On-failure = skip the trim). Its axis is orthogonal by design
and it is not part of the review-report package, so there is no redundancy claim to
adjudicate even in principle.
Consequence for the count contract: no gate removed → counts unchanged.
| Count | Value | Touched? |
|---|---|---|
| Agents | 19 | no |
| Commands | 27 | no |
The count contract (EXPECT_AGENTS=19, the CLAUDE.md/README agent tables) is not
modified this step — there is nothing to update because nothing was trimmed.
Steg 21 (version bump + count recompute) inherits an unchanged 19/27 baseline.
Verification
- `test -f docs/remediation/overlap-measurement.md` → present (this file).
- Per-reviewer catch table present (one per gate) + cross-gate overlap matrix.
- No gate removed → count contract untouched; `EXPECT_AGENTS` stays 19. (The trim branch's `test-runner.sh exit 0 + same-commit count update` is N/A — no trim.)
- `bash scripts/test-runner.sh` run for hygiene regardless → expect exit 0 (repo green, nothing changed but a doc).