ktg-plugin-marketplace/plugins/linkedin-studio/docs/remediation/overlap-measurement.md
Kjell Tore Guttormsen 0d3da7828d docs(linkedin-studio): measure long-form review-pass overlap, trim where unjustified
Steg 20 (remediation Wave 4 / S5, SOLO): measure whether the 7-agent long-form
review stack carries redundant gates. Method: cross-reference each agent's check
taxonomy against its in-repo fasit fixture; four fixtures (editorial, content,
language, fact-reviewer) target the SAME Del 4 edition, enabling a real
cross-gate overlap comparison on one piece (not a live run — fixtures' own
live-run notes require a reload + cross-repo Maskinrommet access, out of scope).

Finding: every gate has >=1 unique catch on Del 4. The four genuine overlaps
(verbatim repetition, the Vi/Vi-i-Nav quote, the postulated number, the
small-orgs thread) are each justified — a cold re-take (Endring 9's reason to
exist), the same symptom via a different operation (flag-absence vs web-verify),
or two distinct defects sharing a surface topic — with no subsumption either way.
The fact-checker <-> fact-reviewer overlap is load-bearing (the pivot premise
arrived after Step 5, so only the cold re-run caught it).

Decision: NO TRIM. voice-scrubber has no fixture -> inconclusive; redundancy
retained (Step 20 On-failure = skip). Counts unchanged 19 agents / 27 commands;
count contract (EXPECT_AGENTS=19) untouched. test-runner 62/62 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 07:17:55 +02:00

# Long-form Review-Pass Overlap Measurement — Steg 20
_Remediation Voyage, Wave 4 / S5. Measures whether the long-form review stack
carries redundant gates, and trims **only** where a gate catches nothing the
others don't. Written 2026-05-30, SOLO (no subagent fan-out)._
## The question and the trim rule
The long-form pipeline runs **seven** review agents. Endring 9 (v3.1.0) added a
cold/headless package (`content-reviewer`, `language-reviewer`, `fact-reviewer`)
whose agent prompts argue, in their own words, that they overlap the in-session
gates **on purpose** (`fact-reviewer`: «the redundancy is load-bearing, not
waste»; `language-reviewer` anti-pattern: «'De-duplicate' yourself against
editorial-reviewer — the overlap is the cold re-take»). Steg 20 tests that claim
against evidence instead of taking it on faith:
> **Trim a gate ONLY where it catches nothing the others don't** (then merge/remove
> it + update the count contract). **If the redundancy is justified, record that
> and keep it. If the fixture is insufficient to decide, record «inconclusive;
> redundancy retained» and do NOT trim.** (Step 20 On-failure = skip the trim.)
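
Read as a decision procedure, the rule has three outcomes. The sketch below is one possible encoding, not code from the pipeline: the `Decision` enum, the `trim_decision` function and the defect labels are hypothetical names introduced for illustration, and a gate without a fixture is modelled as `None`.

```python
from enum import Enum
from typing import Optional, Set


class Decision(Enum):
    TRIM = "trim"
    KEEP = "redundancy justified; keep"
    INCONCLUSIVE = "inconclusive; redundancy retained"


def trim_decision(own: Optional[Set[str]], others: Set[str]) -> Decision:
    """Apply the Steg 20 trim rule to one gate.

    own    -- defects the gate's fixture shows it catching, or None when
              the gate has no fixture (assumption: None = cannot measure).
    others -- union of the defects caught by every other gate.
    """
    if own is None:
        # No fixture: nothing to measure, so the trim is skipped.
        return Decision.INCONCLUSIVE
    if own - others:
        # At least one catch that no other gate makes: the gate stays.
        return Decision.KEEP
    # Everything it catches is caught elsewhere: candidate for trimming.
    return Decision.TRIM


# Hypothetical defect labels, for illustration only.
print(trim_decision({"anglicism", "repetition"}, {"repetition"}))  # Decision.KEEP
print(trim_decision({"repetition"}, {"repetition", "quote"}))      # Decision.TRIM
print(trim_decision(None, {"repetition"}))                         # Decision.INCONCLUSIVE
```

The `None` branch comes first on purpose: a missing fixture must never fall through to `TRIM`, which matches the On-failure = skip clause.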
## Method — and its honest limit
I measured the **documented catch-sets**: each agent's check taxonomy (the agent
`.md`) cross-referenced against its in-repo **fasit fixture**, i.e. its answer key
(`agents/fixtures/*-cases.md`). I did **not** run the agents live: every fixture's
own *live-run note* states that a live cold run needs (a) a session reload and (b)
read access to the frozen Del 4 draft in the cross-repo **Maskinrommet series
folder**, which is explicitly out of scope for this session. By each fixture's own
declaration the fasit is «the gold-standard of record» until both hold, so the
fasit catch-sets are the legitimate measurement surface.
**The lucky break that makes this more than taxonomy-reasoning:** four of the six
fixtures target the **same edition** — Del 4 (Security Champions, Maskinrommet).
`editorial-reviewer` reviewed v5 (2026-05-28, in-session); the cold trio
(`content`/`language`/`fact-reviewer`) re-read the **frozen/pivoted** version
(2026-05-29). That shared edition lets me compare what each gate *actually caught
on one piece* — a real cross-gate overlap measurement, not just a boundary
restatement.
| Fixture | Edition under review | Cases | Enables shared-edition compare? |
|---------|---------------------|-------|----------------------------------|
| `editorial-reviewer-cases.md` | **Del 4 v5** (28.05, in-session) | 8 | ✅ yes |
| `content-reviewer-cases.md` | **Del 4 frozen/pivoted** (29.05, cold) | 6 | ✅ yes |
| `language-reviewer-cases.md` | **Del 4 frozen** (29.05, cold) | 6 | ✅ yes |
| `fact-reviewer-cases.md` | **Del 4 frozen/pivoted** (29.05, cold) | 6 | ✅ yes |
| `persona-reviewer-cases.md` | separate jargon-wall sample (+ documented Del 4 behaviour) | 6 axes | partial |
| `fact-checker-cases.md` | 3 generic reference claims (not Del 4) | 3 | role only |
| `voice-scrubber` | **NO FIXTURE** | — | ❌ inconclusive |
## The seven agents — axis map
| Agent | Step | Axis (the one question it answers) | When | Fixture |
|-------|------|-------------------------------------|------|---------|
| `fact-checker` | 5 | factual truth — *is it true?* | in-session, **moving draft** | generic (3 claims) |
| `editorial-reviewer` | 5.5 | prose craft + narrative architecture — *is it well-made?* | in-session | Del 4 v5 |
| `persona-reviewer` | 2.5/6/9 | reader response — *does it land?* | in-session | sample + Del 4 behaviour |
| `voice-scrubber` | 4 | de-AI + chronicle voice drift — *does it sound like the author?* | in-session (**applies** edits) | none |
| `content-reviewer` | 6.5 | argument integrity — *does the reasoning hold?* | **cold/frozen** | Del 4 frozen |
| `language-reviewer` | 6.5 | Norwegian language — *does it read clean?* | **cold/frozen** | Del 4 frozen |
| `fact-reviewer` | 6.5 | factual truth, re-verified — *is every claim, incl. pivot, true?* | **cold/frozen+pivoted** | Del 4 frozen |
## Per-reviewer catch table (what each gate caught on the fixtures)
Legend: **U** = unique catch (no other gate's fixture surfaces this defect) ·
**O** = overlaps another gate's catch (overlap analysed in the matrix below).
### `editorial-reviewer` — Del 4 v5 (8 catches)
| # | Check | Defect caught | Sev | U/O |
|---|-------|---------------|-----|-----|
| 1 | A1 | abstract figure never instantiated (craft/vividness) | REWORK | O → content C4 (adjacent) |
| 2 | P3 | postulated number, no source/hedge — *flags absence, no search* | REWORK | O → fact-reviewer F3 |
| 3 | A2 | trust-effect hypothesis with no SDT/theory anchor | BLOCK | **U** |
| 4 | A3 | broken series-title symmetry (part floats free) | REWORK | **U** |
| 5 | A4 | small-business addressee stranded — no usable action | BLOCK | O → content C5 (adjacent) |
| 6 | P2 | verbatim repetition | REWORK | O → language L1 |
| 7 | P1 | em-dash over-density | REWORK | **U** |
| 8 | P4 | prose-level internal contradiction (two passages) | BLOCK | O → content C3 (adjacent) |
### `content-reviewer` — Del 4 frozen (6 catches) — argument-integritet
| # | Check | Defect caught | Sev | U/O |
|---|-------|---------------|-----|-----|
| 1 | C2 | Security-Champions **pivot premise** asserted unsupported | BLOCK | **U** |
| 2 | C5 | unanswered «what about small orgs?» objection | BLOCK | O → editorial A4 (adjacent) |
| 3 | C1 | logical hole «Champions finnes» → «dømmekraft bevart» | REWORK | **U** |
| 4 | C4 | role section needs **one concrete org** for the argument | REWORK | O → editorial A1 (adjacent) |
| 5 | C3 | recommendation **delegates the judgment** the series premise rules out | BLOCK | **U** |
| 6 | C2 | the claimed benefit (gevinst) assumes widespread org maturity | REWORK | **U** |
### `language-reviewer` — Del 4 frozen (6 catches) — norsk-språkkvalitet
| # | Check | Defect caught | Sev | U/O |
|---|-------|---------------|-----|-----|
| 1 | L4 | quote error «Vi» vs «Vi i Nav» (wording misrepresents source) | BLOCK | O → fact-reviewer F2 |
| 2 | L2 | anglicism «adressere problemet» | REWORK | **U** |
| 3 | L2 | anglicism «på en daglig basis» | REWORK | **U** |
| 4 | L1 | verbatim repetition 3× across §1/§4/§6 | REWORK | O → editorial P2 |
| 5 | L3 | «det vises til» kanselli-stil in a personal chronicle | REWORK | **U** |
| 6 | L5 | monotone cadence (5 same-length sentences) | NICE | **U** |
### `fact-reviewer` — Del 4 frozen/pivoted (6 catches) — faktisk-korrekthet (cold)
| # | Check | Defect caught | Verdict | U/O |
|---|-------|---------------|---------|-----|
| 1 | F1 | **pivot premise never met Step 5** (PIVOT-RISK headline) | 🔴 | **U** |
| 2 | F1+F2 | misattribution to wrong originator | 🔴 | **U** |
| 3 | F2 | quote precision «Vi» vs «Vi i Nav» (vs source) | 🟡 | O → language L4 |
| 4 | F3 | postulated number, no provenance — *searches, finds none* | 🟡 | O → editorial P3 |
| 5 | F1 | «Security Champions» as a settled standard that **varies per org** (PIVOT-RISK) | 🔴 | **U** |
| 6 | F4+F3 | secondary source for a precise figure («~a third» ≠ «37 %») | 🟡 | **U** |
### `fact-checker` — role on Del 4 (generic fixture, 3 claims)
Catches truth defects **cheaply and early, on the moving draft** (Step 5). Its
fixture is 3 generic ground-truth claims (EU AI Act 🟢 / GPT-4-by-Anthropic 🔴 /
unverifiable 37 % 🟡), not Del 4. Its measured **role** on Del 4 is documented by
the `fact-reviewer` fixture: the Security-Champions pivot arrived **after** the
Step 5 sweep, so `fact-checker` structurally **never saw** the pivot premise. It
is necessary (early/cheap truth gate) but **provably insufficient** — which is the
entire reason `fact-reviewer` exists. **U** by pipeline position.
### `persona-reviewer` — resonance/response
On Del 4 the persona sweep returned **15 flags across 3 personas and every
persona PASS / ready-to-publish** (per the editorial fixture). Its own fixture
(jargon-wall sample) shows the 6 response axes (Krok IKKE, Leder-takeaway IKKE,
…). Catches **reader-response** defects no other gate measures. **U** by axis.
### `voice-scrubber` — de-AI + chronicle voice drift
**No fixture exists.** Its axis (mechanical AI-tells + Norwegian-chronicle voice
drift, judged against approved Norwegian editions) is measured by no other gate,
and uniquely it **applies** edits (Pass 1) and maintains a drift-log — it is not
even part of the review-report package. Overlap **inconclusive from in-repo
fixtures**; see decision below.
## Cross-gate overlap matrix (the shared Del 4 edition)
Four genuine overlaps surface on Del 4. The decisive test for each: **does either
gate's catch-set subsume the other's?** In every case — **no**.
| # | Defect | Gates that catch it | Same defect or same symptom? | Subsumption? | Justification |
|---|--------|---------------------|------------------------------|--------------|---------------|
| O1 | verbatim repetition | editorial **P2** (in-session, v5) ↔ language **L1** (cold, frozen) | same defect | **neither** | **Cold re-take.** Editorial caught it in-session sharing the author's framing; language re-caught it cold on the frozen version. The agent prompts mandate this overlap explicitly. The value is the independent reading, not a second checklist. |
| O2 | quote «Vi» vs «Vi i Nav» | language **L4** (BLOCK) ↔ fact-reviewer **F2** (🟡) | same defect, **two operations** | **neither** | language flags the *wording* misrepresenting the source **without web access**; fact-reviewer *verifies against the actual source via web search*. Different tools, different severities — one catches it if the source is unreachable, the other if the wording reads clean but the source differs. |
| O3 | postulated number | editorial **P3** (REWORK) ↔ fact-reviewer **F3** (🟡) | same symptom, **two operations** | **neither** | editorial flags the **absence** of a source/hedge (no search); fact-reviewer **searches for provenance and finds none**. The prompts draw this boundary by hand. A bare number with a *findable* source is still flagged by editorial (it has none inline), yet it is exactly what fact-reviewer's search resolves. |
| O4 | small-orgs thread | editorial **A4** (stranded addressee) ↔ content **C5** (unanswered objection) | **adjacent — different defects** | n/a | Same surface topic (small orgs) decomposes into two genuinely different defects: A4 = «the small-business reader leaves with no *action*» (architecture); C5 = «the *argument* never meets the obvious counter and collapses for that class» (logic). Not redundancy — two gates needed to see both faces. |
Plus the **fact-checker ↔ fact-reviewer time-axis overlap** (deliberate, and not in
the matrix because it spans pipeline stages rather than one defect): Step 5 runs
in-session on the **moving** draft; Step 6.5 re-runs cold on the **frozen/pivoted**
draft. **`fact-reviewer` Case 1 (pivot premise) is the proof that it is
load-bearing**: the pivot arrived after Step 5, so only the cold re-run could catch
it. Collapsing the two would re-open the exact gap that motivated Endring 9.
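
The time-axis argument can be sketched as a visibility check: a gate can only catch a claim that already exists when the gate runs. This is a minimal illustration, not pipeline code. The `Gate` class and `gates_that_see` helper are hypothetical; the step numbers 5 and 6.5 come from the axis map, and `0.0` is an assumed stand-in for "present from the first draft".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Gate:
    name: str
    step: float  # pipeline step at which the gate runs


def gates_that_see(introduced_after_step: float, gates: List[Gate]) -> List[str]:
    """Names of the gates that run after the claim entered the draft."""
    return [g.name for g in gates if g.step > introduced_after_step]


TRUTH_GATES = [Gate("fact-checker", 5.0), Gate("fact-reviewer", 6.5)]

# A claim present from the start is seen by both truth gates.
early_claim = gates_that_see(0.0, TRUTH_GATES)

# The pivot premise arrived after the Step 5 sweep had already run.
pivot_premise = gates_that_see(5.0, TRUTH_GATES)

# Collapsing the two gates into Step 5 leaves the pivot premise unseen.
collapsed = gates_that_see(5.0, [Gate("fact-checker", 5.0)])

print(early_claim)    # ['fact-checker', 'fact-reviewer']
print(pivot_premise)  # ['fact-reviewer']
print(collapsed)      # []
```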
Adjacent (not overlap) pairs the prompts separate by design and the Del 4 cases
confirm as distinct defects: editorial **P4** (prose contradiction) vs content
**C3** (argument-logic contradiction); editorial **A1** (vividness) vs content
**C4** (a load-bearing claim a skeptic won't *believe* abstractly).
## Unique catch per gate — none is a subset of another
Every one of the seven has **≥1 catch no other gate's fixture surfaces**:
- **fact-checker** — early/cheap truth on the moving draft; provably *insufficient*
alone (never saw the pivot), which is the case for keeping `fact-reviewer`.
- **editorial-reviewer** — **A2 theory-anchor** and **A3 series-title symmetry**
are pure blind spots no other gate measures (and were persona-blind on Del 4).
- **persona-reviewer** — reader response (Krok/resonans/takeaway); the only gate on
that axis. The «PASS yet 8 editorial + 6 argument + 6 language points» result is
the whole motivation for the stack.
- **content-reviewer** — argument logic (C1, C2 and C3 are unique; C5 is only
adjacent to editorial A4, a different defect); the only gate that asks *does the
reasoning hold?*
- **language-reviewer** — anglicisms, kanselli-stil, cadence; the only gate on
Norwegian idiom/register/rhythm.
- **fact-reviewer** — the **pivot-risk** catches (Cases 1, 5); the only cold
post-pivot truth re-run.
- **voice-scrubber** — de-AI tells + chronicle voice drift; the only gate that
*applies* edits and keeps a drift-log.
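
The "no gate is a subset of another" claim can be checked mechanically once each gate's catches are written as a set. The sketch below does this for the four gates that share the Del 4 edition. The defect labels are hypothetical shorthand transcribed from the catch tables, not identifiers from the fixtures: a label shared by two gates marks a matrix overlap (O1 to O3), while adjacent pairs such as A4 and C5 get separate labels because the matrix treats them as different defects.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

# Hypothetical labels; one entry per case in the Del 4 catch tables.
CATCHES: Dict[str, Set[str]] = {
    "editorial-reviewer": {
        "abstract-figure", "postulated-number", "no-theory-anchor",
        "title-symmetry", "stranded-addressee", "verbatim-repetition",
        "em-dash-density", "prose-contradiction",
    },
    "content-reviewer": {
        "pivot-premise-unsupported", "small-orgs-objection", "logic-hole",
        "needs-concrete-org", "delegated-judgment", "assumes-maturity",
    },
    "language-reviewer": {
        "quote-vi-i-nav", "anglicism-adressere", "anglicism-daglig-basis",
        "verbatim-repetition", "kanselli-stil", "monotone-cadence",
    },
    "fact-reviewer": {
        "pivot-premise-unverified", "misattribution", "quote-vi-i-nav",
        "postulated-number", "settled-standard", "secondary-source",
    },
}


def unique_catches(gate: str, catches: Dict[str, Set[str]]) -> Set[str]:
    """Defects that only this gate catches."""
    others = set().union(*(s for g, s in catches.items() if g != gate))
    return catches[gate] - others


def subsumed_pairs(catches: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Ordered pairs (a, b) where everything a catches is also caught by b."""
    return [(a, b) for a, b in permutations(catches, 2)
            if catches[a] <= catches[b]]


for gate in CATCHES:
    print(gate, sorted(unique_catches(gate, CATCHES)))
print("subsumed pairs:", subsumed_pairs(CATCHES))
```

Under this labelling every gate keeps a non-empty unique set and the subsumption list comes back empty. The result depends on how the cases are labelled, so it illustrates the test rather than replacing the fixture-by-fixture reading above.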
## Trim decision — NO TRIM
**No gate catches nothing the others don't.** Every gate has ≥1 unique catch on the
fixtures, and every one of the four genuine overlaps (O1–O4) is justified: a cold
re-take (O1), the same symptom via a different operation (O2, O3), or two distinct
defects sharing a surface topic (O4), with **no subsumption in any direction**.
The fact-checker ↔ fact-reviewer overlap is load-bearing by construction (proven by
the pivot-premise catch). Per the Steg 20 rule this is the **«redundancy is
justified — record and keep»** case for all measurable gates.
**`voice-scrubber` specifically:** no in-repo fixture, so its overlap cannot be
*measured* here → **«measurement inconclusive; redundancy retained pending a real
edition»** (Step 20 On-failure = skip the trim). Its axis is orthogonal by design
and it is not part of the review-report package, so there is no redundancy claim to
adjudicate even in principle.
**Consequence for the count contract:** **no gate removed → counts unchanged.**
| Count | Value | Touched? |
|-------|-------|----------|
| Agents | **19** | no |
| Commands | **27** | no |
The count contract (`EXPECT_AGENTS=19`, the CLAUDE.md/README agent tables) is **not
modified** this step — there is nothing to update because nothing was trimmed.
Steg 21 (version bump + count recompute) inherits an unchanged 19/27 baseline.
## Verification
- `test -f docs/remediation/overlap-measurement.md` → present (this file).
- Per-reviewer **catch** table present (one per gate) + cross-gate overlap matrix.
- No gate removed → count contract untouched; `EXPECT_AGENTS` stays 19. (The trim
branch's `test-runner.sh exit 0 + same-commit count update` is N/A — no trim.)
- `bash scripts/test-runner.sh` run for hygiene regardless → expect exit 0 (repo
green, nothing changed but a doc).