--- type: trekplan-jaccard-calibration plan_version: "1.7" created: 2026-05-09 status: parked-synthetic threshold: 0.55 threshold_basis: "research/02 conservative starting value (arXiv:2412.12148)" empirical_runs: 0 synthetic_runs: 4 ramp_target: v4.2 --- # Cross-tier Jaccard calibration — voyage v4.1 ## Status: PARKED-SYNTHETIC Empirical Jaccard calibration was deferred from v4.1 because the four required `/trekplan` invocations cost an estimated $60-120 of LLM-budget that was not authorized for the v4.1-execute-4b session. Per Step 17 escalate-handler, this file documents: 1. The synthetic placeholder fixtures used by Step 18's smoke-test, and 2. The pinned conservative threshold (`0.55`) from research/02. ## Threshold rationale `threshold: 0.55` is pinned per research/02 (Recommendation #5): > "There is no universal Jaccard threshold for cross-model plan > agreement. arXiv:2412.12148 reports 0.45–0.65 for n=10 task-pair > samples on coding tasks. We recommend a *conservative starting value > of 0.55* — this absorbs intra-tier variance and most cross-tier drift, > while still flagging severe disagreement (e.g. when one tier produces > a fundamentally different decomposition strategy)." The 0.55 floor is enforced by `tests/integration/profile-jaccard-smoke.test.mjs` (Step 18) as a module-local constant `CROSS_TIER_JACCARD_FLOOR`. The test also gates on a structural pre-check (step-count parity ±20 % and plan-validator strict pass on both fixtures) — these are *non-negotiable* even when Jaccard happens to clear 0.55. ## Synthetic fixture pairs The four parked-synthetic plan-runs in `tests/synthetic/`: | run-A | run-B | jaccard (synthetic, normalized) | |-------|-------|---------------------------------| | profile-plan-run-economy-1.md | profile-plan-run-premium-1.md | 0.707 | | profile-plan-run-economy-1.md | profile-plan-run-premium-2.md | 0.707 | | profile-plan-run-economy-2.md | profile-plan-run-premium-1.md | 0.750 | | profile-plan-run-economy-2.md | profile-plan-run-premium-2.md | 0.750 | Intra-tier (sanity): economy-1 × economy-2 = 0.935; premium-1 × premium-2 = 0.905. Intra-tier > cross-tier confirms the fixtures discriminate. Min observed cross-tier (synthetic): 0.707. Min minus 0.05 buffer = 0.657. We pin `threshold: 0.55` — the lower of (research/02 conservative value) vs (min - 0.05 buffer). This is the same rule plan.md Step 17 prescribes: `floor(min(jaccard_values), 2) - 0.05` or `0.55`, whichever is lower. Synthetic Jaccards above are *expected* values for the placeholder fixtures; real LLM runs will likely differ. The 0.55 pin remains valid across that uncertainty. ## When to replace these fixtures Trigger empirical calibration when **any** of the following holds: 1. Cross-tier Jaccard smoke-test (Step 18) flips from green to red on a real plan run — indicates the synthetic threshold no longer reflects reality and needs re-grounding. 2. v4.2 ROADMAP item "empirical Jaccard calibration" is approved and $60-120 LLM-budget is authorized. 3. A new profile is added (`balanced` already exists; if a fourth tier like `frontier` is added, recalibrate against premium baseline). ## How to replace 1. Run `/trekplan --profile economy --brief examples/01-add-verbose-flag/brief.md` twice. Save each plan's `steps:` frontmatter to `profile-plan-run-economy-{1,2}.md` (overwrite synthetic content). Update `status: parked-synthetic` → `status: empirical`. 2. Same for `--profile premium`, twice. 3. Recompute the four cross-tier Jaccards. Update the table above. 4. Repin threshold: `min(jaccard_values, 2) - 0.05` or 0.55, whichever lower. (Tighter is fine; do not loosen below 0.55.) 5. Run `tests/integration/profile-jaccard-smoke.test.mjs` — must pass. 6. Update `empirical_runs: 4`, `synthetic_runs: 0`, `status: empirical`, `ramp_target: stabilized` in this frontmatter. ## Fallback strategy in the meantime Until real calibration is run, operators are advised to use the `balanced` profile (sonnet for most phases, opus for plan + review) as the lowest-risk choice. `balanced` was selected as the v4.1 default in `commands/trekplan.md` Phase 5.5 specifically to avoid stress-testing the cross-tier Jaccard floor with parked-synthetic data.