Kjell Tore Guttormsen 90425073b2 test(voyage): empirical jaccard calibration — parked-synthetic placeholders + threshold pin

Step 17 of v4.1: escalate-handler invoked. The live LLM budget ($60-120
for 4 plan runs of /trekplan --profile {economy,premium} on
examples/01-add-verbose-flag/brief.md) was not authorized for the
v4.1-execute-4b session.

Per Step 17 escalate-fallback (and NEXT-SESSION-PROMPT.local.md
fallback-strategy): document economy-Plan as parked, use balanced as
low-threshold profile, defer empirical calibration to v4.2.

Files:
  tests/synthetic/profile-plan-run-economy-1.md   — 30 steps, parked-synthetic
  tests/synthetic/profile-plan-run-economy-2.md   — 30 steps, parked-synthetic
  tests/synthetic/profile-plan-run-premium-1.md   — 40 steps, parked-synthetic
  tests/synthetic/profile-plan-run-premium-2.md   — 40 steps, parked-synthetic
  tests/synthetic/profile-jaccard-calibration.md  — threshold 0.55 pinned per
                                                    research/02 conservative starting value

Replacement procedure documented in calibration.md "How to replace"
section. Trigger conditions for empirical re-run:
  1. Cross-tier smoke-test (Step 18) flips red on a real run
  2. v4.2 LLM-budget approval
  3. New profile tier added

2026-05-09 09:54:45 +02:00

4.1 KiB


type: trekplan-jaccard-calibration
plan_version: 1.7
created: 2026-05-09
status: parked-synthetic
threshold: 0.55
threshold_basis: research/02 conservative starting value (arXiv:2412.12148)
empirical_runs: 0
synthetic_runs: 4
ramp_target: v4.2

Cross-tier Jaccard calibration — voyage v4.1

Status: PARKED-SYNTHETIC

Empirical Jaccard calibration was deferred from v4.1 because the four required /trekplan invocations cost an estimated $60-120 of LLM-budget that was not authorized for the v4.1-execute-4b session. Per Step 17 escalate-handler, this file documents:

  1. The synthetic placeholder fixtures used by Step 18's smoke-test, and
  2. The pinned conservative threshold (0.55) from research/02.

Threshold rationale

threshold: 0.55 is pinned per research/02 (Recommendation #5):

"There is no universal Jaccard threshold for cross-model plan agreement. arXiv:2412.12148 reports 0.45–0.65 for n=10 task-pair samples on coding tasks. We recommend a conservative starting value of 0.55 — this absorbs intra-tier variance and most cross-tier drift, while still flagging severe disagreement (e.g. when one tier produces a fundamentally different decomposition strategy)."

The 0.55 floor is enforced by tests/integration/profile-jaccard-smoke.test.mjs (Step 18) as a module-local constant CROSS_TIER_JACCARD_FLOOR. The test also gates on a structural pre-check (step-count parity within ±20% and a strict plan-validator pass on both fixtures); these checks are non-negotiable even when Jaccard happens to clear 0.55.
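As a rough illustration of what clearing the floor means, the sketch below computes a plain set-based Jaccard score and compares it to the pinned value. It is not the code in profile-jaccard-smoke.test.mjs: the helper names `jaccard` and `clearsFloor`, the choice of step titles as set elements, and the sample plans are all assumptions made for this example, and the structural pre-check is not modeled.

```javascript
// Hypothetical sketch, not the real smoke-test.
// Assumption: each plan is reduced to a set of step titles before comparison.
const CROSS_TIER_JACCARD_FLOOR = 0.55; // value pinned in this file

// Jaccard similarity: |A ∩ B| / |A ∪ B|.
function jaccard(stepsA, stepsB) {
  const a = new Set(stepsA);
  const b = new Set(stepsB);
  if (a.size === 0 && b.size === 0) return 1; // two empty plans agree trivially
  let shared = 0;
  for (const step of a) {
    if (b.has(step)) shared += 1;
  }
  return shared / (a.size + b.size - shared);
}

// True when the pair meets or exceeds the floor.
function clearsFloor(stepsA, stepsB, floor = CROSS_TIER_JACCARD_FLOOR) {
  return jaccard(stepsA, stepsB) >= floor;
}

// Made-up step titles, for illustration only.
const economyPlan = ["add flag", "parse args", "wire logger", "write tests"];
const premiumPlan = ["add flag", "parse args", "wire logger", "update docs"];

console.log(jaccard(economyPlan, premiumPlan)); // 3 shared of 5 distinct steps: 0.6
console.log(clearsFloor(economyPlan, premiumPlan)); // true
```

With this definition, two plans that share three of five distinct steps score 0.6 and pass, while plans with no steps in common score 0 and fail.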

Synthetic fixture pairs

The four parked-synthetic plan-runs in tests/synthetic/ form four cross-tier pairs:

run-A	run-B	jaccard (synthetic)	normalized
profile-plan-run-economy-1.md	profile-plan-run-premium-1.md	0.733	0.730
profile-plan-run-economy-1.md	profile-plan-run-premium-2.md	0.711	0.706
profile-plan-run-economy-2.md	profile-plan-run-premium-1.md	0.706	0.703
profile-plan-run-economy-2.md	profile-plan-run-premium-2.md	0.683	0.680

Min observed (synthetic, normalized column): 0.680. Min observed minus the 0.05 buffer = 0.630. We pin threshold: 0.55, the lower of the research/02 conservative value (0.55) and min minus buffer (0.630). This is the same rule plan.md Step 17 prescribes: floor(min(jaccard_values), 2) - 0.05 or 0.55, whichever is lower.
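The pin rule quoted above can be written out as a small function. This is a sketch under assumptions: the name `pinThreshold` and its parameters are illustrative and do not come from plan.md, and floor(x, 2) is read as "round down to two decimals".

```javascript
// Sketch of the pin rule; function and parameter names are illustrative.
function pinThreshold(jaccardValues, buffer = 0.05, ceiling = 0.55) {
  if (jaccardValues.length === 0) {
    throw new Error("need at least one Jaccard value");
  }
  // Work in integer hundredths to sidestep float noise
  // (for example, 0.58 * 100 evaluates to 57.99999999999999).
  const minHundredths = Math.floor(Math.min(...jaccardValues) * 100 + 1e-9);
  const candidate = minHundredths - Math.round(buffer * 100);
  // "Whichever is lower": never pin above the research/02 value.
  return Math.min(candidate, Math.round(ceiling * 100)) / 100;
}

// Normalized values from the synthetic table above.
console.log(pinThreshold([0.730, 0.706, 0.703, 0.680])); // 0.55
```

For the synthetic table the candidate is 0.68 - 0.05 = 0.63, which is higher than 0.55, so the research/02 value wins. A run whose minimum fell to 0.58 would yield 0.53 under the same rule.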

Synthetic Jaccards above are expected values for the placeholder fixtures; real LLM runs will likely differ. The 0.55 pin does not depend on these values, because it comes from research/02 rather than from the fixtures.

When to replace these fixtures

Trigger empirical calibration when any of the following holds:

  1. Cross-tier Jaccard smoke-test (Step 18) flips from green to red on a real plan run. This indicates the synthetic threshold no longer reflects reality and needs re-grounding.
  2. v4.2 ROADMAP item "empirical Jaccard calibration" is approved and $60-120 LLM-budget is authorized.
  3. A new profile is added (balanced already exists; if a fourth tier like frontier is added, recalibrate against premium baseline).

How to replace

  1. Run /trekplan --profile economy --brief examples/01-add-verbose-flag/brief.md twice. Save each plan's steps: frontmatter to profile-plan-run-economy-{1,2}.md (overwrite synthetic content). Update status: parked-synthetic → status: empirical.
  2. Same for --profile premium, twice.
  3. Recompute the four cross-tier Jaccards. Update the table above.
  4. Repin threshold: floor(min(jaccard_values), 2) - 0.05 or 0.55, whichever is lower. (Tighter is fine; do not loosen below 0.55.)
  5. Run tests/integration/profile-jaccard-smoke.test.mjs, which must pass.
  6. Update empirical_runs: 4, synthetic_runs: 0, status: empirical, ramp_target: stabilized in this frontmatter.
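The "recompute the four cross-tier Jaccards" step pairs every economy run with every premium run. A minimal sketch of that enumeration follows. It assumes each fixture has already been parsed into an array of step identifiers; the parsing itself, the `crossTierTable` helper, and the toy step ids are hypothetical and stand in for whatever the project actually uses.

```javascript
// Hypothetical sketch: rebuild the cross-tier table after replacing the fixtures.
// Assumption: each run is already parsed into an array of step identifiers.
function jaccard(stepsA, stepsB) {
  const a = new Set(stepsA);
  const b = new Set(stepsB);
  const shared = [...a].filter((step) => b.has(step)).length;
  const union = a.size + b.size - shared;
  return union === 0 ? 1 : shared / union;
}

// Pair every economy run with every premium run (2 x 2 = 4 rows).
function crossTierTable(economyRuns, premiumRuns) {
  const rows = [];
  for (const [runA, stepsA] of Object.entries(economyRuns)) {
    for (const [runB, stepsB] of Object.entries(premiumRuns)) {
      rows.push({ runA, runB, jaccard: jaccard(stepsA, stepsB) });
    }
  }
  return rows;
}

// Toy step ids for illustration; real runs would supply the parsed plan steps.
const economyRuns = {
  "profile-plan-run-economy-1.md": ["s1", "s2", "s3"],
  "profile-plan-run-economy-2.md": ["s1", "s2", "s4"],
};
const premiumRuns = {
  "profile-plan-run-premium-1.md": ["s1", "s2", "s3", "s5"],
  "profile-plan-run-premium-2.md": ["s1", "s2", "s5"],
};

const rows = crossTierTable(economyRuns, premiumRuns);
const minJaccard = Math.min(...rows.map((row) => row.jaccard));
console.log(rows.length); // 4
console.log(minJaccard);
```

The minimum across the four rows is the value that feeds the repin rule in the following step.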

Fallback strategy in the meantime

Until real calibration is run, operators are advised to use the balanced profile (sonnet for most phases, opus for plan + review) as the lowest-risk choice. balanced was selected as the v4.1 default in commands/trekplan.md Phase 5.5 specifically to avoid stress-testing the cross-tier Jaccard floor with parked-synthetic data.
