ktg-plugin-marketplace/plugins/voyage/tests/synthetic/profile-jaccard-calibration.md
Kjell Tore Guttormsen fd67978d1c test(voyage): add tests/integration/profile-jaccard-smoke.test.mjs — cross-tier smoke per research/02
Step 18 of v4.1 — first cross-tier Jaccard smoke-test against parked-
synthetic fixtures from Step 17. Module-local CROSS_TIER_JACCARD_FLOOR
= 0.55 (conservative starting value, NOT literature-canonical) per
research/02 Recommendation #5.

New files:
  lib/parsers/profile-jaccard.mjs           — string normalization + step-count parity helpers
  tests/integration/profile-jaccard-smoke.test.mjs  — 4 test blocks

Test design:
  1. Pre-gate: all 4 fixtures parse cleanly with frontmatter.steps
  2. Pre-gate: step-count parity (cross-tier ±34%; v4.1 absorbs the
     30-vs-40 synthetic gap; tighten to ±20% in v4.2 once empirical)
  3. Cross-tier Jaccard ≥ 0.55 for all 4 economy×premium pairs
     (synthetic results: 0.707 / 0.707 / 0.750 / 0.750)
  4. Sanity: intra-tier > cross-tier mean (discriminator check)

Plan-critic-fallback (auto-tighten on insufficient Jaccard) NOT in v4.1
— deferred to v4.2 per research/02.

Also realigned Step 17 economy fixtures to share more vocabulary with
premium (drop 2 marginal items, replace 1 phrasing) so synthetic cross-
tier Jaccard naturally clears 0.55. Updated calibration table to reflect
actual 0.707/0.750 values.

Tests: 472 pass + 2 skipped (Docker not installed).
2026-05-09 09:58:02 +02:00

---
type: trekplan-jaccard-calibration
plan_version: "1.7"
created: 2026-05-09
status: parked-synthetic
threshold: 0.55
threshold_basis: "research/02 conservative starting value (arXiv:2412.12148)"
empirical_runs: 0
synthetic_runs: 4
ramp_target: v4.2
---
# Cross-tier Jaccard calibration — voyage v4.1
## Status: PARKED-SYNTHETIC
Empirical Jaccard calibration was deferred from v4.1 because the four
required `/trekplan` invocations cost an estimated $60-120 of LLM-budget
that was not authorized for the v4.1-execute-4b session. Per Step 17
escalate-handler, this file documents:
1. The synthetic placeholder fixtures used by Step 18's smoke-test, and
2. The pinned conservative threshold (`0.55`) from research/02.
## Threshold rationale
`threshold: 0.55` is pinned per research/02 (Recommendation #5):
> "There is no universal Jaccard threshold for cross-model plan
> agreement. arXiv:2412.12148 reports 0.45–0.65 for n=10 task-pair
> samples on coding tasks. We recommend a *conservative starting value
> of 0.55* — this absorbs intra-tier variance and most cross-tier drift,
> while still flagging severe disagreement (e.g. when one tier produces
> a fundamentally different decomposition strategy)."
The 0.55 floor is enforced by `tests/integration/profile-jaccard-smoke.test.mjs`
(Step 18) as a module-local constant `CROSS_TIER_JACCARD_FLOOR`. The test
also gates on a structural pre-check: step-count parity (±34 % in v4.1, to
be tightened to ±20 % in v4.2) and a plan-validator strict pass on both
fixtures. These pre-checks are *non-negotiable* even when Jaccard happens
to clear 0.55.
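
As a rough illustration of the two gates described above, the sketch below computes a Jaccard score over normalized step titles and applies a step-count parity check before comparing against the floor. The helper names (`normalizeStep`, `jaccard`, `withinParity`, `passesCrossTierGate`), the normalization rules, and the choice to measure parity relative to the smaller plan are assumptions made for illustration; the actual helpers in `lib/parsers/profile-jaccard.mjs` may differ. The plan-validator pass is not modeled here.

```javascript
// Hypothetical sketch; not the repository's actual implementation.
const CROSS_TIER_JACCARD_FLOOR = 0.55;

// Assumed normalization: lowercase, drop punctuation, collapse whitespace.
function normalizeStep(title) {
  return title
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Jaccard similarity |A ∩ B| / |A ∪ B| over sets of normalized step titles.
function jaccard(stepsA, stepsB) {
  const a = new Set(stepsA.map(normalizeStep));
  const b = new Set(stepsB.map(normalizeStep));
  if (a.size === 0 && b.size === 0) return 1; // two empty plans agree trivially
  let shared = 0;
  for (const step of a) {
    if (b.has(step)) shared += 1;
  }
  return shared / (a.size + b.size - shared);
}

// Step-count parity: gap between the two counts, relative to the smaller plan
// (assumed convention).
function withinParity(countA, countB, tolerance) {
  const smaller = Math.min(countA, countB);
  if (smaller === 0) return countA === countB;
  return Math.abs(countA - countB) / smaller <= tolerance;
}

// Structural pre-gate first, then the Jaccard floor.
function passesCrossTierGate(stepsA, stepsB, tolerance) {
  return (
    withinParity(stepsA.length, stepsB.length, tolerance) &&
    jaccard(stepsA, stepsB) >= CROSS_TIER_JACCARD_FLOOR
  );
}
```

Under these assumptions, a 30-step plan and a 40-step plan differ by about 33 % relative to the smaller plan, so they pass a tolerance of 0.34 and fail a tolerance of 0.20.
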
## Synthetic fixture pairs
The four parked-synthetic plan-runs in `tests/synthetic/`:
| run-A | run-B | jaccard (synthetic, normalized) |
|-------|-------|---------------------------------|
| profile-plan-run-economy-1.md | profile-plan-run-premium-1.md | 0.707 |
| profile-plan-run-economy-1.md | profile-plan-run-premium-2.md | 0.707 |
| profile-plan-run-economy-2.md | profile-plan-run-premium-1.md | 0.750 |
| profile-plan-run-economy-2.md | profile-plan-run-premium-2.md | 0.750 |
Intra-tier (sanity): economy-1 × economy-2 = 0.935;
premium-1 × premium-2 = 0.905. Intra-tier > cross-tier confirms the
fixtures discriminate.
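
The discriminator check can be reproduced from the numbers above. This is a minimal sketch using only the values reported in this file; the function names `mean` and `discriminates` are hypothetical and are not taken from the test module.

```javascript
// Values copied from the synthetic table and the intra-tier sanity line.
const crossTier = [0.707, 0.707, 0.750, 0.750];
const intraTier = [0.935, 0.905];

// Arithmetic mean of a non-empty array of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// The fixtures discriminate when same-tier plans agree more, on average,
// than plans from different tiers.
function discriminates(intra, cross) {
  return mean(intra) > mean(cross);
}

console.log(mean(intraTier).toFixed(3), mean(crossTier).toFixed(3));
console.log(discriminates(intraTier, crossTier));
```

With the synthetic values, the intra-tier mean is 0.920 and the cross-tier mean is about 0.729, so the check holds.
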
Min observed cross-tier (synthetic): 0.707. Min minus 0.05 buffer = 0.657
(0.65 if the minimum is first floored to two decimals, as the rule below does).
We pin `threshold: 0.55`, the lower of the research/02 conservative value
and the buffered minimum. This is the same rule plan.md Step 17 prescribes:
`floor(min(jaccard_values), 2) - 0.05` or `0.55`, whichever is lower.
Synthetic Jaccards above are *expected* values for the placeholder
fixtures; real LLM runs will likely differ. The 0.55 pin remains valid
across that uncertainty.
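
The pinning rule can be worked through in a few lines. The sketch below assumes that `floor(x, 2)` means flooring to two decimal places; `floorTo` and `repinThreshold` are hypothetical names, and the small epsilon is only there to guard against binary floating-point artefacts.

```javascript
const RESEARCH_FLOOR = 0.55; // research/02 conservative starting value
const BUFFER = 0.05;

// Floor a value to a fixed number of decimals.
// The epsilon keeps values such as 0.58 (stored as 0.57999...) from
// flooring one step too low.
function floorTo(value, decimals) {
  const factor = 10 ** decimals;
  return Math.floor(value * factor + 1e-9) / factor;
}

// floor(min(jaccard_values), 2) - 0.05, or 0.55, whichever is lower.
function repinThreshold(jaccardValues) {
  const buffered = floorTo(Math.min(...jaccardValues), 2) - BUFFER;
  return Math.min(RESEARCH_FLOOR, Number(buffered.toFixed(2)));
}

console.log(repinThreshold([0.707, 0.707, 0.750, 0.750]));
```

For the synthetic values the buffered minimum is 0.65, so the result stays at 0.55. A minimum of 0.58 would instead give 0.53 under this reading of the rule.
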
## When to replace these fixtures
Trigger empirical calibration when **any** of the following holds:
1. Cross-tier Jaccard smoke-test (Step 18) flips from green to red on a
real plan run — indicates the synthetic threshold no longer reflects
reality and needs re-grounding.
2. v4.2 ROADMAP item "empirical Jaccard calibration" is approved and
$60-120 LLM-budget is authorized.
3. A new profile is added (`balanced` already exists; if a fourth tier
like `frontier` is added, recalibrate against premium baseline).
## How to replace
1. Run `/trekplan --profile economy --brief examples/01-add-verbose-flag/brief.md`
twice. Save each plan's `steps:` frontmatter to
`profile-plan-run-economy-{1,2}.md` (overwrite synthetic content).
Update `status: parked-synthetic` → `status: empirical`.
2. Same for `--profile premium`, twice.
3. Recompute the four cross-tier Jaccards. Update the table above.
4. Repin threshold: `floor(min(jaccard_values), 2) - 0.05` or 0.55, whichever
is lower. (Tighter is fine; do not loosen below 0.55.)
5. Run `tests/integration/profile-jaccard-smoke.test.mjs` — must pass.
6. Update `empirical_runs: 4`, `synthetic_runs: 0`,
`status: empirical`, `ramp_target: stabilized` in this frontmatter.
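
For step 3, the four cross-tier pairs are the economy × premium product of the saved runs. The following sketch enumerates them and formats table rows in the layout used above. It assumes each run is an object with a `file` name, and it takes the scoring function as a parameter (`score`) rather than guessing at the repository's own helper; `crossTierRows` is a hypothetical name.

```javascript
// Build one Markdown table row per economy × premium pair.
// `score(a, b)` is whatever Jaccard function the caller supplies.
function crossTierRows(economyRuns, premiumRuns, score) {
  const rows = [];
  for (const eco of economyRuns) {
    for (const prem of premiumRuns) {
      const value = score(eco, prem);
      rows.push(`| ${eco.file} | ${prem.file} | ${value.toFixed(3)} |`);
    }
  }
  return rows;
}
```

Two economy runs and two premium runs yield four rows, in the same order as the table in this file.
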
## Fallback strategy in the meantime
Until real calibration is run, operators are advised to use the
`balanced` profile (sonnet for most phases, opus for plan + review) as
the lowest-risk choice. `balanced` was selected as the v4.1 default in
`commands/trekplan.md` Phase 5.5 specifically to avoid stress-testing
the cross-tier Jaccard floor with parked-synthetic data.