ktg-plugin-marketplace/plugins/voyage/tests/synthetic/profile-jaccard-calibration.md
Kjell Tore Guttormsen fd67978d1c test(voyage): add tests/integration/profile-jaccard-smoke.test.mjs — cross-tier smoke per research/02
Step 18 of v4.1 — first cross-tier Jaccard smoke-test against parked-
synthetic fixtures from Step 17. Module-local CROSS_TIER_JACCARD_FLOOR
= 0.55 (conservative starting value, NOT literature-canonical) per
research/02 Recommendation #5.

New files:
  lib/parsers/profile-jaccard.mjs                   - string normalization + step-count parity helpers
  tests/integration/profile-jaccard-smoke.test.mjs  - 4 test blocks

Test design:
  1. Pre-gate: all 4 fixtures parse cleanly with frontmatter.steps
  2. Pre-gate: step-count parity (cross-tier ±34%; v4.1 absorbs the
     30-vs-40 synthetic gap; tighten to ±20% in v4.2 once empirical)
  3. Cross-tier Jaccard ≥ 0.55 for all 4 economy×premium pairs
     (synthetic results: 0.707 / 0.707 / 0.750 / 0.750)
  4. Sanity: intra-tier > cross-tier mean (discriminator check)

Plan-critic-fallback (auto-tighten on insufficient Jaccard) NOT in v4.1
— deferred to v4.2 per research/02.

Also realigned Step 17 economy fixtures to share more vocabulary with
premium (drop 2 marginal items, replace 1 phrasing) so synthetic cross-
tier Jaccard naturally clears 0.55. Updated calibration table to reflect
actual 0.707/0.750 values.

Tests: 472 pass + 2 skipped (Docker not installed).
2026-05-09 09:58:02 +02:00

| type | plan_version | created | status | threshold | threshold_basis | empirical_runs | synthetic_runs | ramp_target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| trekplan-jaccard-calibration | 1.7 | 2026-05-09 | parked-synthetic | 0.55 | research/02 conservative starting value (arXiv:2412.12148) | 0 | 4 | v4.2 |

Cross-tier Jaccard calibration — voyage v4.1

Status: PARKED-SYNTHETIC

Empirical Jaccard calibration was deferred from v4.1 because the four required /trekplan invocations cost an estimated $60-120 of LLM-budget that was not authorized for the v4.1-execute-4b session. Per Step 17 escalate-handler, this file documents:

  1. The synthetic placeholder fixtures used by Step 18's smoke-test, and
  2. The pinned conservative threshold (0.55) from research/02.

Threshold rationale

threshold: 0.55 is pinned per research/02 (Recommendation #5):

"There is no universal Jaccard threshold for cross-model plan agreement. arXiv:2412.12148 reports 0.45-0.65 for n=10 task-pair samples on coding tasks. We recommend a conservative starting value of 0.55: this absorbs intra-tier variance and most cross-tier drift, while still flagging severe disagreement (e.g. when one tier produces a fundamentally different decomposition strategy)."

The 0.55 floor is enforced by tests/integration/profile-jaccard-smoke.test.mjs (Step 18) as a module-local constant CROSS_TIER_JACCARD_FLOOR. The test also gates on a structural pre-check: step-count parity (±34% in v4.1, to be tightened to ±20% in v4.2 once empirical data exists) and a plan-validator strict pass on both fixtures. These checks are non-negotiable even when Jaccard happens to clear 0.55.
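
The ordering of the two gates can be sketched as below. This is a minimal illustration, not the contents of the smoke test: `stepCountParity` and `passesGates` are hypothetical names, the tolerance is passed in rather than hard-coded, and measuring the gap relative to the smaller step count is an assumption. Only `CROSS_TIER_JACCARD_FLOOR` and its value come from this document, and the plan-validator pass is left out.

```javascript
// Value pinned in this document; the real constant lives in the smoke test.
const CROSS_TIER_JACCARD_FLOOR = 0.55;

// Hypothetical helper: true when two plans' step counts are within
// `tolerance` of each other. ASSUMPTION: the gap is measured relative to
// the smaller count; the actual helper may define parity differently.
function stepCountParity(countA, countB, tolerance) {
  const smaller = Math.min(countA, countB);
  if (smaller <= 0) return false; // an empty plan never passes the pre-gate
  return Math.abs(countA - countB) / smaller <= tolerance;
}

// Hypothetical gate: structure is checked first, so a high Jaccard score
// cannot rescue a pair that fails the parity pre-check.
function passesGates({ countA, countB, jaccard, tolerance }) {
  if (!stepCountParity(countA, countB, tolerance)) return false;
  return jaccard >= CROSS_TIER_JACCARD_FLOOR;
}

// The 30-vs-40 step gap is the synthetic gap named in the commit message.
console.log(passesGates({ countA: 30, countB: 40, jaccard: 0.707, tolerance: 0.34 })); // true
console.log(passesGates({ countA: 30, countB: 40, jaccard: 0.707, tolerance: 0.20 })); // false
```

Under this assumed definition the same pair passes at a 0.34 tolerance and fails at 0.20, regardless of its Jaccard score.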

Synthetic fixture pairs

The four parked-synthetic plan-runs in tests/synthetic/:

| run-A | run-B | jaccard (synthetic, normalized) |
| --- | --- | --- |
| profile-plan-run-economy-1.md | profile-plan-run-premium-1.md | 0.707 |
| profile-plan-run-economy-1.md | profile-plan-run-premium-2.md | 0.707 |
| profile-plan-run-economy-2.md | profile-plan-run-premium-1.md | 0.750 |
| profile-plan-run-economy-2.md | profile-plan-run-premium-2.md | 0.750 |
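
For readers unfamiliar with the metric, a normalized Jaccard score over two step lists might be computed as follows. This is a sketch under stated assumptions: `normalizeStep` and `jaccard` are hypothetical names, the normalization rule is a guess at what "normalized" means here, and the step lists are illustrative rather than taken from the fixtures. The helpers in lib/parsers/profile-jaccard.mjs may differ.

```javascript
// ASSUMPTION: normalization lowercases, replaces anything that is not a
// letter or digit with a space, and trims the result.
function normalizeStep(step) {
  return step.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Jaccard similarity |A ∩ B| / |A ∪ B| over the sets of normalized steps.
function jaccard(stepsA, stepsB) {
  const a = new Set(stepsA.map(normalizeStep));
  const b = new Set(stepsB.map(normalizeStep));
  if (a.size === 0 && b.size === 0) return 1; // two empty plans are identical
  let shared = 0;
  for (const step of a) if (b.has(step)) shared += 1;
  return shared / (a.size + b.size - shared);
}

// Illustrative step lists, not taken from the fixtures.
const economySteps = ["Add --verbose flag", "Write tests", "Update docs"];
const premiumSteps = ["add --verbose flag!", "Write tests", "Run linter"];
console.log(jaccard(economySteps, premiumSteps)); // 0.5
```

Two of the four distinct normalized steps are shared, which gives 2 / 4 = 0.5.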

Intra-tier (sanity): economy-1 × economy-2 = 0.935; premium-1 × premium-2 = 0.905. Intra-tier > cross-tier confirms the fixtures discriminate.
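
The discriminator check can be reproduced from the numbers above. The sketch assumes that "discriminate" means every intra-tier score exceeds the cross-tier mean, which is how the Step 18 commit message words the sanity test; `discriminates` is a hypothetical name.

```javascript
// Values copied from the table and the intra-tier line in this document.
const crossTier = [0.707, 0.707, 0.750, 0.750];
const intraTier = [0.935, 0.905];

const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;

// ASSUMPTION: the fixtures discriminate when every intra-tier score is
// strictly greater than the mean of the cross-tier scores.
function discriminates(intra, cross) {
  const crossMean = mean(cross);
  return intra.every((score) => score > crossMean);
}

console.log(mean(crossTier).toFixed(4)); // "0.7285"
console.log(discriminates(intraTier, crossTier)); // true
```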

Min observed cross-tier (synthetic): 0.707. Min minus the 0.05 buffer is 0.657 (0.65 once the minimum is first floored to two decimals). We pin threshold: 0.55, the lower of the research/02 conservative value and the buffered minimum. This is the same rule plan.md Step 17 prescribes: floor(min(jaccard_values), 2) - 0.05 or 0.55, whichever is lower.
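
The pin rule is small enough to work through in code. The following is a sketch of the rule as worded above, not the project's implementation: `floorTo` and `pinThreshold` are hypothetical names, and rounding the result to two decimals is an added assumption to keep floating-point noise out of the pinned value.

```javascript
// Floor pinned from research/02 and the buffer, both as stated in this document.
const RESEARCH_FLOOR = 0.55;
const BUFFER = 0.05;

// Floor `value` to `digits` decimals. The small epsilon guards against
// binary floating-point products that land just below a whole number.
function floorTo(value, digits) {
  const scale = 10 ** digits;
  return Math.floor(value * scale + 1e-9) / scale;
}

// floor(min(jaccard_values), 2) - 0.05 or 0.55, whichever is lower.
function pinThreshold(jaccardValues) {
  const buffered = floorTo(Math.min(...jaccardValues), 2) - BUFFER;
  const pinned = Math.min(buffered, RESEARCH_FLOOR);
  return Number(pinned.toFixed(2)); // ASSUMPTION: report two decimals
}

// Synthetic cross-tier values from the table: 0.70 - 0.05 = 0.65, so 0.55 wins.
console.log(pinThreshold([0.707, 0.707, 0.750, 0.750])); // 0.55
```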

Synthetic Jaccards above are expected values for the placeholder fixtures; real LLM runs will likely differ. The 0.55 pin remains valid across that uncertainty.

When to replace these fixtures

Trigger empirical calibration when any of the following holds:

  1. Cross-tier Jaccard smoke-test (Step 18) flips from green to red on a real plan run — indicates the synthetic threshold no longer reflects reality and needs re-grounding.
  2. v4.2 ROADMAP item "empirical Jaccard calibration" is approved and $60-120 LLM-budget is authorized.
  3. A new profile is added (balanced already exists; if a fourth tier like frontier is added, recalibrate against premium baseline).

How to replace

  1. Run /trekplan --profile economy --brief examples/01-add-verbose-flag/brief.md twice. Save each plan's steps: frontmatter to profile-plan-run-economy-{1,2}.md (overwrite synthetic content). Update status: parked-synthetic → status: empirical.
  2. Same for --profile premium, twice.
  3. Recompute the four cross-tier Jaccards. Update the table above.
  4. Repin threshold: floor(min(jaccard_values), 2) - 0.05 or 0.55, whichever is lower. (Tighter is fine; do not loosen below 0.55.)
  5. Run tests/integration/profile-jaccard-smoke.test.mjs — must pass.
  6. Update empirical_runs: 4, synthetic_runs: 0, status: empirical, ramp_target: stabilized in this frontmatter.
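
Step 3 above amounts to scoring every economy run against every premium run. A minimal sketch of that pairing follows; `crossTierMatrix` and `exactJaccard` are hypothetical names, the scoring function is injected so any similarity measure can be plugged in, and the single-letter steps are placeholders rather than fixture content.

```javascript
// Hypothetical helper: score every economy run against every premium run.
// Each `runs` object maps a fixture file name to its list of steps; `score`
// is any similarity function returning a number in [0, 1].
function crossTierMatrix(economyRuns, premiumRuns, score) {
  const rows = [];
  for (const [runA, stepsA] of Object.entries(economyRuns)) {
    for (const [runB, stepsB] of Object.entries(premiumRuns)) {
      rows.push({ runA, runB, jaccard: score(stepsA, stepsB) });
    }
  }
  return rows;
}

// Stand-in scorer for the example: Jaccard on exact string matches.
const exactJaccard = (a, b) => {
  const setA = new Set(a);
  const setB = new Set(b);
  const shared = [...setA].filter((s) => setB.has(s)).length;
  return shared / (setA.size + setB.size - shared);
};

// Placeholder steps, not real plan content.
const economyRuns = {
  "profile-plan-run-economy-1.md": ["a", "b", "c"],
  "profile-plan-run-economy-2.md": ["a", "b"],
};
const premiumRuns = {
  "profile-plan-run-premium-1.md": ["a", "b", "d"],
  "profile-plan-run-premium-2.md": ["a", "b", "c", "d"],
};

const matrixRows = crossTierMatrix(economyRuns, premiumRuns, exactJaccard);
console.table(matrixRows);
console.log(Math.min(...matrixRows.map((row) => row.jaccard))); // 0.5
```

The minimum of the four scores is the value that feeds the repin rule in step 4.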

Fallback strategy in the meantime

Until real calibration is run, operators are advised to use the balanced profile (sonnet for most phases, opus for plan + review) as the lowest-risk choice. balanced was selected as the v4.1 default in commands/trekplan.md Phase 5.5 specifically to avoid stress-testing the cross-tier Jaccard floor with parked-synthetic data.