Kjell Tore Guttormsen 0d3da7828d docs(linkedin-studio): measure long-form review-pass overlap, trim where unjustified

Steg 20 (remediation Wave 4 / S5, SOLO): measure whether the 7-agent long-form
review stack carries redundant gates. Method: cross-reference each agent's check
taxonomy against its in-repo fasit fixture; four fixtures (editorial, content,
language, fact-reviewer) target the SAME Del 4 edition, enabling a real
cross-gate overlap comparison on one piece (not a live run — fixtures' own
live-run notes require a reload + cross-repo Maskinrommet access, out of scope).

Finding: every gate has >=1 unique catch on Del 4. The four genuine overlaps
(verbatim repetition, the Vi/Vi-i-Nav quote, the postulated number, the
small-orgs thread) are each justified — a cold re-take (Endring 9's reason to
exist), the same symptom via a different operation (flag-absence vs web-verify),
or two distinct defects sharing a surface topic — with no subsumption either way.
The fact-checker <-> fact-reviewer overlap is load-bearing (the pivot premise
arrived after Step 5, so only the cold re-run caught it).

Decision: NO TRIM. voice-scrubber has no fixture -> inconclusive; redundancy
retained (Step 20 On-failure = skip). Counts unchanged 19 agents / 27 commands;
count contract (EXPECT_AGENTS=19) untouched. test-runner 62/62 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 07:17:55 +02:00

Long-form Review-Pass Overlap Measurement — Step 20

Remediation Voyage, Wave 4 / S5. Measures whether the long-form review stack carries redundant gates, and trims only where a gate catches nothing the others don't. Written 2026-05-30, SOLO (no subagent fan-out).

The question and the trim rule

The long-form pipeline runs seven review agents. Endring 9 (v3.1.0) added a cold/headless package (content-reviewer, language-reviewer, fact-reviewer). The agent prompts for that package argue, in their own words, that they overlap the in-session gates on purpose (fact-reviewer: «the redundancy is load-bearing, not waste»; language-reviewer anti-pattern: «'De-duplicate' yourself against editorial-reviewer — the overlap is the cold re-take»). Step 20 tests that claim against evidence instead of taking it on faith:

Trim a gate ONLY where it catches nothing the others don't (then merge/remove it + update the count contract). If the redundancy is justified, record that and keep it. If the fixture is insufficient to decide, record «inconclusive; redundancy retained» and do NOT trim. (Step 20 On-failure = skip the trim.)
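The rule above can be sketched as a small decision function. This is a minimal illustration, not part of the repository: the names `Decision` and `trim_decision` are hypothetical, and a gate's catch-set is assumed to be a plain set of defect labels, with `None` standing for "no fixture".

```python
from __future__ import annotations

from enum import Enum
from typing import Optional


class Decision(Enum):
    TRIM = "trim: merge/remove the gate and update the count contract"
    KEEP = "redundancy justified: record and keep"
    INCONCLUSIVE = "inconclusive; redundancy retained"


def trim_decision(own: Optional[set[str]], others: set[str]) -> Decision:
    """Apply the Step 20 trim rule to one gate (hypothetical helper).

    own    -- defects this gate's fixture shows it catching, or None
              when the gate has no fixture to measure against.
    others -- defects caught by all remaining gates combined.
    """
    if own is None:
        # Fixture insufficient to decide: do NOT trim.
        return Decision.INCONCLUSIVE
    if own <= others:
        # The gate catches nothing the others don't.
        return Decision.TRIM
    # At least one catch is unique to this gate.
    return Decision.KEEP
```

Under this sketch a gate is trimmed only on the subset condition, so a missing fixture can never produce a trim by accident.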

Method — and its honest limit

I measured the documented catch-sets: each agent's check taxonomy (the agent .md) cross-referenced against its in-repo fasit (answer-key) fixture (agents/fixtures/*-cases.md). I did not run the agents live. Every fixture's own live-run note states that a live cold run needs (a) a session reload and (b) read access to the frozen Del 4 draft in the Maskinrommet series folder, which is cross-repo and explicitly out of scope this session. By each fixture's own declaration the fasit is «the gold-standard of record» until both hold, so the fasit catch-sets are the legitimate measurement surface.

The lucky break that makes this more than taxonomy-reasoning: four of the six fixtures target the same edition — Del 4 (Security Champions, Maskinrommet). editorial-reviewer reviewed v5 (2026-05-28, in-session); the cold trio (content/language/fact-reviewer) re-read the frozen/pivoted version (2026-05-29). That shared edition lets me compare what each gate actually caught on one piece — a real cross-gate overlap measurement, not just a boundary restatement.

Fixture	Edition under review	Cases	Enables shared-edition compare?
`editorial-reviewer-cases.md`	Del 4 v5 (28.05, in-session)	8	✅ yes
`content-reviewer-cases.md`	Del 4 frozen/pivoted (29.05, cold)	6	✅ yes
`language-reviewer-cases.md`	Del 4 frozen (29.05, cold)	6	✅ yes
`fact-reviewer-cases.md`	Del 4 frozen/pivoted (29.05, cold)	6	✅ yes
`persona-reviewer-cases.md`	separate jargon-wall sample (+ documented Del 4 behaviour)	6 axes	partial
`fact-checker-cases.md`	3 generic reference claims (not Del 4)	3	role only
`voice-scrubber`	NO FIXTURE	—	❌ inconclusive

The seven agents — axis map

Agent	Step	Axis (the one question it answers)	When	Fixture
`fact-checker`	5	factual truth — is it true?	in-session, moving draft	generic (3 claims)
`editorial-reviewer`	5.5	prose craft + narrative architecture — is it well-made?	in-session	Del 4 v5
`persona-reviewer`	2.5/6/9	reader response — does it land?	in-session	sample + Del 4 behaviour
`voice-scrubber`	4	de-AI + chronicle voice drift — does it sound like the author?	in-session (applies edits)	none
`content-reviewer`	6.5	argument integrity — does the reasoning hold?	cold/frozen	Del 4 frozen
`language-reviewer`	6.5	Norwegian language — does it read clean?	cold/frozen	Del 4 frozen
`fact-reviewer`	6.5	factual truth, re-verified — is every claim, incl. pivot, true?	cold/frozen+pivoted	Del 4 frozen

Per-reviewer catch table (what each gate caught on the fixtures)

Legend: U = unique catch (no other gate's fixture surfaces this defect) · O = overlaps another gate's catch (overlap analysed in the matrix below).

`editorial-reviewer` — Del 4 v5 (8 catches)

#	Check	Defect caught	Sev	U/O
1	A1	abstract figure never instantiated (craft/vividness)	REWORK	O → content C4 (adjacent)
2	P3	postulated number, no source/hedge — flags absence, no search	REWORK	O → fact-reviewer F3
3	A2	trust-effect hypothesis with no SDT/theory anchor	BLOCK	U
4	A3	broken series-title symmetry (part floats free)	REWORK	U
5	A4	small-business addressee stranded — no usable action	BLOCK	O → content C5 (adjacent)
6	P2	verbatim repetition	REWORK	O → language L1
7	P1	em-dash over-density	REWORK	U
8	P4	prose-level internal contradiction (two passages)	BLOCK	O → content C3 (adjacent)

`content-reviewer` — Del 4 frozen (6 catches) — argument integrity

#	Check	Defect caught	Sev	U/O
1	C2	Security-Champions pivot premise asserted unsupported	BLOCK	U
2	C5	unanswered «what about small orgs?» objection	BLOCK	O → editorial A4 (adjacent)
3	C1	logical hole «Champions finnes» → «dømmekraft bevart»	REWORK	U
4	C4	role section needs one concrete org for the argument	REWORK	O → editorial A1 (adjacent)
5	C3	recommendation delegates the judgment the series premise rules out	BLOCK	O → editorial P4 (adjacent)
6	C2	benefit claim (gevinst) assumes widespread org maturity	REWORK	U

`language-reviewer` — Del 4 frozen (6 catches) — Norwegian language quality

#	Check	Defect caught	Sev	U/O
1	L4	quote error «Vi» vs «Vi i Nav» (wording misrepresents source)	BLOCK	O → fact-reviewer F2
2	L2	anglicism «adressere problemet»	REWORK	U
3	L2	anglicism «på en daglig basis»	REWORK	U
4	L1	verbatim repetition 3× across §1/§4/§6	REWORK	O → editorial P2
5	L3	«det vises til» officialese (kanselli-stil) in a personal chronicle	REWORK	U
6	L5	monotone cadence (5 same-length sentences)	NICE	U

`fact-reviewer` — Del 4 frozen/pivoted (6 catches) — factual correctness (cold)

#	Check	Defect caught	Verdict	U/O
1	F1	pivot premise never met Step 5 (PIVOT-RISK headline)	🔴	U
2	F1+F2	misattribution to wrong originator	🔴	U
3	F2	quote precision «Vi» vs «Vi i Nav» (vs source)	🟡	O → language L4
4	F3	postulated number, no provenance — searches, finds none	🟡	O → editorial P3
5	F1	«Security Champions» as a settled standard that varies per org (PIVOT-RISK)	🔴	U
6	F4+F3	secondary source for a precise figure («~a third» ≠ «37 %»)	🟡	U

`fact-checker` — role on Del 4 (generic fixture, 3 claims)

Catches truth defects cheaply and early, on the moving draft (Step 5). Its fixture is 3 generic ground-truth claims (EU AI Act 🟢 / GPT-4-by-Anthropic 🔴 / unverifiable 37 % 🟡), not Del 4. Its measured role on Del 4 is documented by the fact-reviewer fixture: the Security-Champions pivot arrived after the Step 5 sweep, so fact-checker structurally never saw the pivot premise. It is necessary (early/cheap truth gate) but provably insufficient — which is the entire reason fact-reviewer exists. U by pipeline position.
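The structural blind spot described above can be illustrated with a toy model of pipeline position. This is a sketch under stated assumptions: `Gate` and `gates_that_see` are hypothetical names, and a claim's arrival is reduced to "the last step that had already run when it entered the draft", which is all the fixture tells us about the pivot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    step: float  # pipeline step at which the gate reads the draft


def gates_that_see(entered_after_step: float, gates: list[Gate]) -> list[str]:
    """Names of gates that run after the claim entered the draft."""
    return [g.name for g in gates if g.step > entered_after_step]


# Step numbers taken from the axis map; only the two truth gates are modelled.
GATES = [Gate("fact-checker", 5), Gate("fact-reviewer", 6.5)]

# The pivot premise arrived after the Step 5 sweep had run.
print(gates_that_see(5, GATES))  # ['fact-reviewer']

# A claim present from the first draft is seen by both gates.
print(gates_that_see(0, GATES))  # ['fact-checker', 'fact-reviewer']
```

In this model the early gate is never wrong about what it sees; it simply cannot see anything that arrives after its step.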

`persona-reviewer` — resonance/response

On Del 4 the persona sweep returned 15 flags across 3 personas and every persona PASS / ready-to-publish (per the editorial fixture). Its own fixture (jargon-wall sample) shows the 6 response axes (Krok IKKE, Leder-takeaway IKKE, …). Catches reader-response defects no other gate measures. U by axis.

`voice-scrubber` — de-AI + chronicle voice drift

No fixture exists. Its axis (mechanical AI-tells + Norwegian-chronicle voice drift, judged against approved Norwegian editions) is measured by no other gate, and uniquely it applies edits (Pass 1) and maintains a drift-log — it is not even part of the review-report package. Overlap inconclusive from in-repo fixtures; see decision below.

Cross-gate overlap matrix (the shared Del 4 edition)

Four genuine overlaps surface on Del 4. The decisive test for each: does either gate's catch-set subsume the other's? In every case — no.

#	Defect	Gates that catch it	Same defect or same symptom?	Subsumption?	Justification
O1	verbatim repetition	editorial P2 (in-session, v5) ↔ language L1 (cold, frozen)	same defect	neither	Cold re-take. Editorial caught it in-session sharing the author's framing; language re-caught it cold on the frozen version. The agent prompts mandate this overlap explicitly. The value is the independent reading, not a second checklist.
O2	quote «Vi» vs «Vi i Nav»	language L4 (BLOCK) ↔ fact-reviewer F2 (🟡)	same defect, two operations	neither	language flags the wording misrepresenting the source without web access; fact-reviewer verifies against the actual source via web search. Different tools, different severities — one catches it if the source is unreachable, the other if the wording reads clean but the source differs.
O3	postulated number	editorial P3 (REWORK) ↔ fact-reviewer F3 (🟡)	same symptom, two operations	neither	editorial flags the absence of a source/hedge (no search); fact-reviewer searches for provenance and finds none. The prompts draw this boundary by hand. A bare number whose source exists but is not cited inline is still flagged by editorial (it sees nothing inline), and is exactly the case fact-reviewer's search resolves.
O4	small-orgs thread	editorial A4 (stranded addressee) ↔ content C5 (unanswered objection)	adjacent — different defects	n/a	Same surface topic (small orgs) decomposes into two genuinely different defects: A4 = «the small-business reader leaves with no action» (architecture); C5 = «the argument never meets the obvious counter and collapses for that class» (logic). Not redundancy — two gates needed to see both faces.

Plus the fact-checker ↔ fact-reviewer time-axis overlap (deliberate, not in the matrix because it spans pipeline stages, not one defect): Step 5 runs in-session on the moving draft; Step 6.5 re-runs cold on the frozen/pivoted draft. Case 1 (pivot premise) is the proof it's load-bearing — the pivot arrived after Step 5, so only the cold re-run could catch it. Collapsing the two would re-open the exact gap that motivated Endring 9.

Adjacent (not overlap) pairs the prompts separate by design and the Del 4 cases confirm as distinct defects: editorial P4 (prose contradiction) vs content C3 (argument-logic contradiction); editorial A1 (vividness) vs content C4 (a load-bearing claim a skeptic won't believe abstractly).
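The subsumption test used for the matrix can be expressed as set operations over the four Del 4 catch-sets. The sketch below is illustrative only: the labels are hypothetical shorthand for the table rows, the three same-defect or same-symptom overlaps (O1 to O3) are given one shared label each, and O4 plus the adjacent pairs are given distinct labels. That encoding follows the judgement recorded in this document, so the code restates the conclusion; it does not prove it independently.

```python
from itertools import combinations, permutations

# Hypothetical labels for the Del 4 fixture catches, keyed by gate.
CATCHES = {
    "editorial-reviewer": {
        "A1-abstract-figure", "postulated-number", "A2-theory-anchor",
        "A3-title-symmetry", "A4-stranded-addressee", "verbatim-repetition",
        "P1-em-dash-density", "P4-prose-contradiction",
    },
    "content-reviewer": {
        "C2-pivot-unsupported", "C5-small-orgs-objection", "C1-logical-hole",
        "C4-needs-concrete-org", "C3-delegated-judgment", "C2-org-maturity",
    },
    "language-reviewer": {
        "vi-i-nav-quote", "L2-adressere", "L2-daglig-basis",
        "verbatim-repetition", "L3-kanselli-stil", "L5-cadence",
    },
    "fact-reviewer": {
        "F1-pivot-premise", "F1F2-misattribution", "vi-i-nav-quote",
        "postulated-number", "F1-settled-standard", "F4F3-secondary-source",
    },
}


def subsumed_pairs(catches):
    """Ordered pairs (a, b) where everything a catches is also caught by b."""
    return [(a, b) for a, b in permutations(catches, 2)
            if catches[a] <= catches[b]]


def shared_catches(catches):
    """Unordered gate pairs whose catch-sets intersect, with the shared labels."""
    return {(a, b): catches[a] & catches[b]
            for a, b in combinations(catches, 2)
            if catches[a] & catches[b]}


def unique_catches(catches):
    """Per gate, the catches that appear in no other gate's set."""
    result = {}
    for gate, own in catches.items():
        others = set().union(*(c for g, c in catches.items() if g != gate))
        result[gate] = own - others
    return result
```

With these sets, `subsumed_pairs(CATCHES)` is empty, `shared_catches(CATCHES)` contains exactly the three O1 to O3 pairs, and every gate keeps a non-empty entry in `unique_catches(CATCHES)`.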

Unique catch per gate — none is a subset of another

Every one of the seven has ≥1 catch no other gate's fixture surfaces:

fact-checker — early/cheap truth on the moving draft; provably insufficient alone (never saw the pivot), which is the case for keeping fact-reviewer.
editorial-reviewer — A2 theory-anchor and A3 series-title symmetry are pure blind spots no other gate measures (and were persona-blind on Del 4).
persona-reviewer — reader response (hook/resonance/takeaway); the only gate on that axis. The «PASS yet 8 editorial + 6 argument + 6 language points» result is the whole motivation for the stack.
content-reviewer — argument logic (C1 and both C2 catches are unique outright; C3 and C5 are adjacent to editorial P4 and A4 but are distinct defects); the only gate that asks whether the reasoning holds.
language-reviewer — anglicisms, officialese (kanselli-stil), cadence; the only gate on Norwegian idiom/register/rhythm.
fact-reviewer — the pivot-risk catches (Cases 1, 5); the only cold post-pivot truth re-run.
voice-scrubber — de-AI tells + chronicle voice drift; the only gate that applies edits and keeps a drift-log.

Trim decision — NO TRIM

No gate meets the trim condition: none catches only what the others already catch. Every gate has ≥1 unique catch on the fixtures, and each of the four genuine overlaps (O1–O4) is justified: a cold re-take (O1), the same defect or symptom reached via a different operation (O2, O3), or two distinct defects sharing a surface topic (O4), with no subsumption in any direction. The fact-checker ↔ fact-reviewer overlap is load-bearing by construction (proven by the pivot-premise catch). Per the Step 20 rule this is the «redundancy is justified, record and keep» case for all measurable gates.

voice-scrubber specifically: no in-repo fixture, so its overlap cannot be measured here → «measurement inconclusive; redundancy retained pending a real edition» (Step 20 On-failure = skip the trim). Its axis is orthogonal by design and it is not part of the review-report package, so there is no redundancy claim to adjudicate even in principle.

Consequence for the count contract: no gate removed → counts unchanged.

Count	Value	Touched?
Agents	19	no
Commands	27	no

The count contract (EXPECT_AGENTS=19, the CLAUDE.md/README agent tables) is not modified in this step: nothing was trimmed, so there is nothing to update. Step 21 (version bump + count recompute) inherits an unchanged 19/27 baseline.
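A count contract of this kind can be checked mechanically. The sketch below is an assumption-heavy illustration, not the repository's test-runner: it assumes one `.md` file per agent directly under `agents/` and one per command directly under `commands/` (both layouts hypothetical), and `EXPECT_COMMANDS` is a name invented here to mirror `EXPECT_AGENTS`.

```python
from pathlib import Path

EXPECT_AGENTS = 19    # value stated by the count contract
EXPECT_COMMANDS = 27  # hypothetical constant name; value from the table above


def count_markdown(directory: Path) -> int:
    """Count .md files directly inside directory (subfolders are ignored)."""
    if not directory.is_dir():
        return 0
    return sum(1 for p in directory.glob("*.md") if p.is_file())


def check_counts(repo_root: Path) -> list[str]:
    """Return one message per count that deviates from the contract."""
    problems = []
    expected = (("agents", EXPECT_AGENTS), ("commands", EXPECT_COMMANDS))
    for subdir, want in expected:
        found = count_markdown(repo_root / subdir)
        if found != want:
            problems.append(f"{subdir}: expected {want}, found {found}")
    return problems
```

Because the glob is non-recursive, fixture files under `agents/fixtures/` would not be counted as agents in this sketch. An empty list from `check_counts(Path("."))` would correspond to the unchanged 19/27 baseline.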

Verification

test -f docs/remediation/overlap-measurement.md → present (this file).
Per-reviewer catch table present (one per gate) + cross-gate overlap matrix.
No gate removed → count contract untouched; EXPECT_AGENTS stays 19. (The trim branch's test-runner.sh exit 0 + same-commit count update is N/A — no trim.)
bash scripts/test-runner.sh run for hygiene regardless → expect exit 0 (repo green, nothing changed but a doc).
