test(voyage): empirical jaccard calibration — parked-synthetic placeholders + threshold pin

Step 17 of v4.1: escalate-handler invoked. The live LLM budget ($60-120 for 4 plan-runs of /trekplan --profile {economy,premium} on examples/01-add-verbose-flag/brief.md) was not authorized for the v4.1-execute-4b session.

Per the Step 17 escalate-fallback (and the NEXT-SESSION-PROMPT.local.md fallback strategy): document the economy plan as parked, use balanced as the low-threshold profile, and defer empirical calibration to v4.2.

Files:
  tests/synthetic/profile-plan-run-economy-1.md: 30 steps, parked-synthetic
  tests/synthetic/profile-plan-run-economy-2.md: 30 steps, parked-synthetic
  tests/synthetic/profile-plan-run-premium-1.md: 40 steps, parked-synthetic
  tests/synthetic/profile-plan-run-premium-2.md: 40 steps, parked-synthetic
  tests/synthetic/profile-jaccard-calibration.md: threshold 0.55 pinned per research/02 conservative starting value

The replacement procedure is documented in the "How to replace" section of profile-jaccard-calibration.md.

Trigger conditions for an empirical re-run:
  1. Cross-tier smoke-test (Step 18) flips red on a real run
  2. v4.2 LLM-budget approval
  3. New profile tier added
2026-05-09 09:54:45 +02:00
commit 90425073b2
parent 8bbe60c2f5
5 changed files with 386 additions and 0 deletions
--- a/plugins/voyage/tests/synthetic/profile-jaccard-calibration.md
+++ b/plugins/voyage/tests/synthetic/profile-jaccard-calibration.md
@@ -0,0 +1,94 @@
+---
+type: trekplan-jaccard-calibration
+plan_version: "1.7"
+created: 2026-05-09
+status: parked-synthetic
+threshold: 0.55
+threshold_basis: "research/02 conservative starting value (arXiv:2412.12148)"
+empirical_runs: 0
+synthetic_runs: 4
+ramp_target: v4.2
+---
+
+# Cross-tier Jaccard calibration — voyage v4.1
+
+## Status: PARKED-SYNTHETIC
+
+Empirical Jaccard calibration was deferred from v4.1 because the four
+required `/trekplan` invocations cost an estimated $60-120 of LLM-budget
+that was not authorized for the v4.1-execute-4b session. Per Step 17
+escalate-handler, this file documents:
+
+1. The synthetic placeholder fixtures used by Step 18's smoke-test, and
+2. The pinned conservative threshold (`0.55`) from research/02.
+
+## Threshold rationale
+
+`threshold: 0.55` is pinned per research/02 (Recommendation #5):
+
+> "There is no universal Jaccard threshold for cross-model plan
+> agreement. arXiv:2412.12148 reports 0.45–0.65 for n=10 task-pair
+> samples on coding tasks. We recommend a *conservative starting value
+> of 0.55* — this absorbs intra-tier variance and most cross-tier drift,
+> while still flagging severe disagreement (e.g. when one tier produces
+> a fundamentally different decomposition strategy)."
+
+The 0.55 floor is enforced by `tests/integration/profile-jaccard-smoke.test.mjs`
+(Step 18) as a module-local constant `CROSS_TIER_JACCARD_FLOOR`. The test
+also gates on a structural pre-check (step-count parity ±20 % and
+plan-validator strict pass on both fixtures) — these are *non-negotiable*
+even when Jaccard happens to clear 0.55.
+
+## Synthetic fixture pairs
+
+The four parked-synthetic plan-runs in `tests/synthetic/`:
+
+| run-A | run-B | jaccard (synthetic) | normalized |
+|-------|-------|--------------------|-------------|
+| profile-plan-run-economy-1.md | profile-plan-run-premium-1.md | 0.733 | 0.730 |
+| profile-plan-run-economy-1.md | profile-plan-run-premium-2.md | 0.711 | 0.706 |
+| profile-plan-run-economy-2.md | profile-plan-run-premium-1.md | 0.706 | 0.703 |
+| profile-plan-run-economy-2.md | profile-plan-run-premium-2.md | 0.683 | 0.680 |
+
+Min observed (synthetic): 0.680. Min observed minus the 0.05 buffer = 0.630.
+We pin `threshold: 0.55`, the lower of the research/02 conservative value
+(0.55) and the buffered minimum (0.630). This is the rule plan.md Step 17
+prescribes: `floor(min(jaccard_values), 2) - 0.05` or `0.55`, whichever is lower.
+
+Synthetic Jaccards above are *expected* values for the placeholder
+fixtures; real LLM runs will likely differ. The 0.55 pin remains valid
+across that uncertainty.
+
+## When to replace these fixtures
+
+Trigger empirical calibration when **any** of the following holds:
+
+1. Cross-tier Jaccard smoke-test (Step 18) flips from green to red on a
+   real plan run — indicates the synthetic threshold no longer reflects
+   reality and needs re-grounding.
+2. v4.2 ROADMAP item "empirical Jaccard calibration" is approved and
+   $60-120 LLM-budget is authorized.
+3. A new profile is added (`balanced` already exists; if a fourth tier
+   like `frontier` is added, recalibrate against premium baseline).
+
+## How to replace
+
+1. Run `/trekplan --profile economy --brief examples/01-add-verbose-flag/brief.md`
+   twice. Save each plan's `steps:` frontmatter to
+   `profile-plan-run-economy-{1,2}.md` (overwrite synthetic content).
+   Update `status: parked-synthetic` → `status: empirical`.
+2. Same for `--profile premium`, twice.
+3. Recompute the four cross-tier Jaccards. Update the table above.
+4. Repin threshold: `floor(min(jaccard_values), 2) - 0.05` or 0.55,
+   whichever is lower. (Tighter is fine; do not loosen below 0.55.)
+5. Run `tests/integration/profile-jaccard-smoke.test.mjs` — must pass.
+6. Update `empirical_runs: 4`, `synthetic_runs: 0`,
+   `status: empirical`, `ramp_target: stabilized` in this frontmatter.
+
+## Fallback strategy in the meantime
+
+Until real calibration is run, operators are advised to use the
+`balanced` profile (sonnet for most phases, opus for plan + review) as
+the lowest-risk choice. `balanced` was selected as the v4.1 default in
+`commands/trekplan.md` Phase 5.5 specifically to avoid stress-testing
+the cross-tier Jaccard floor with parked-synthetic data.
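
The repin rule quoted in the calibration file (`floor(min(jaccard_values), 2) - 0.05` or `0.55`, whichever is lower) can be sketched as follows. This is a minimal illustration only, not code from the repository: the names `jaccard`, `repinThreshold`, `RESEARCH_FLOOR`, and `BUFFER_HUNDREDTHS` are hypothetical, and the sketch assumes plain set-based Jaccard over step identifiers, which may not match how the real smoke-test tokenizes or normalizes plans.

```javascript
// Hypothetical names throughout; only the rule itself comes from the calibration file.
const RESEARCH_FLOOR = 0.55;   // research/02 conservative starting value
const BUFFER_HUNDREDTHS = 5;   // the 0.05 buffer, expressed in hundredths

// Jaccard similarity of two collections treated as sets: |A ∩ B| / |A ∪ B|.
function jaccard(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 1; // convention: two empty sets agree
  let shared = 0;
  for (const item of setA) {
    if (setB.has(item)) shared += 1;
  }
  return shared / (setA.size + setB.size - shared);
}

// Floor the minimum to two decimals, subtract the buffer, then take the lower
// of that value and the research floor. Working in hundredths avoids
// floating-point drift such as 0.68 - 0.05 !== 0.63.
function repinThreshold(jaccardValues) {
  if (jaccardValues.length === 0) {
    throw new Error("need at least one Jaccard value");
  }
  const minHundredths = Math.floor(Math.min(...jaccardValues) * 100 + 1e-9);
  const buffered = (minHundredths - BUFFER_HUNDREDTHS) / 100;
  return Math.min(buffered, RESEARCH_FLOOR);
}

// The four synthetic "normalized" values from the fixture table.
console.log(repinThreshold([0.730, 0.706, 0.703, 0.680])); // 0.55
```

With the synthetic table values the buffered minimum is 0.63, so the research floor of 0.55 wins, matching the pinned `threshold: 0.55`.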