# Value-First Modification (VFM) Scoring

Scoring rubric for evaluating proposed self-modifications. Any change to agent config, prompts, behavior, or pipeline structure must score > 50 to be implemented.

## Dimensions

### Frequency (0-25 points)

How often does the issue this change addresses occur?

| Score | Criteria |
|-------|----------|
| 0-5 | Happened once, may not recur |
| 6-10 | Happens occasionally (1-2x per week) |
| 11-15 | Happens regularly (daily) |
| 16-20 | Happens frequently (multiple times per day) |
| 21-25 | Happens on nearly every run |

### Failure Reduction (0-25 points)

Does this change fix real failures?

| Score | Criteria |
|-------|----------|
| 0-5 | Cosmetic improvement, no failures prevented |
| 6-10 | Prevents occasional warnings or non-critical errors |
| 11-15 | Prevents errors that require manual intervention |
| 16-20 | Prevents errors that cause pipeline failure |
| 21-25 | Prevents errors that cause data loss or system damage |

### Burden Reduction (0-25 points)

Does this reduce human effort?

| Score | Criteria |
|-------|----------|
| 0-5 | Saves less than 1 minute per occurrence |
| 6-10 | Saves 1-5 minutes per occurrence |
| 11-15 | Saves 5-30 minutes per occurrence |
| 16-20 | Eliminates a manual step entirely |
| 21-25 | Eliminates multiple manual steps or a recurring task |

### Cost Savings (0-25 points)

Does this reduce API/compute costs?
| Score | Criteria |
|-------|----------|
| 0-5 | Negligible cost difference |
| 6-10 | Saves <10% on affected operations |
| 11-15 | Saves 10-25% on affected operations |
| 16-20 | Saves 25-50% on affected operations |
| 21-25 | Saves >50% or eliminates unnecessary API calls entirely |

## Decision threshold

| Total score | Decision |
|-------------|----------|
| > 50 | **Implement** — change is worth the risk |
| 26-50 | **Defer** — log for future consideration |
| <= 25 | **Reject** — not worth pursuing |

## Logging format

Every VFM evaluation must be logged, whether implemented or not:

```
VFM EVALUATION
Date: [timestamp]
Proposed change: [description]
Scores:
  Frequency: [score] — [justification]
  Failure reduction: [score] — [justification]
  Burden reduction: [score] — [justification]
  Cost savings: [score] — [justification]
Total: [sum]/100
Decision: implement / defer / reject
```

## Worked examples

### Example 1: Add retry logic to web search (Implement)

- Frequency: 18 (search fails ~3x daily due to timeouts)
- Failure reduction: 15 (prevents pipeline stall requiring manual restart)
- Burden reduction: 16 (eliminates manual re-run)
- Cost savings: 8 (slight cost from retry, but saves failed run cost)
- **Total: 57 → Implement**

### Example 2: Refactor prompt to use XML tags (Defer)

- Frequency: 25 (every run)
- Failure reduction: 3 (current format works fine)
- Burden reduction: 2 (no human effort saved)
- Cost savings: 5 (maybe slightly fewer tokens)
- **Total: 35 → Defer** (improvement is real but marginal)

### Example 3: Switch to experimental model (Reject)

- Frequency: 25 (every run)
- Failure reduction: 0 (current model has no failures)
- Burden reduction: 0 (no human effort saved)
- Cost savings: 10 (newer model might be cheaper)
- **Total: 35 → Reject** (the score alone would say Defer, but stability > novelty per ADL)
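The threshold logic above can be sketched in Python. This is an illustrative sketch, not part of the rubric: the `VFMScores` class, `vfm_decision` function, and field names are assumptions chosen here for clarity.

```python
from dataclasses import dataclass


@dataclass
class VFMScores:
    """One score per rubric dimension, each capped at 0-25."""
    frequency: int
    failure_reduction: int
    burden_reduction: int
    cost_savings: int

    def total(self) -> int:
        return (self.frequency + self.failure_reduction
                + self.burden_reduction + self.cost_savings)


def vfm_decision(scores: VFMScores) -> str:
    """Map a total VFM score to the decision threshold table."""
    # Validate each dimension against the 0-25 range from the rubric.
    for name, score in vars(scores).items():
        if not 0 <= score <= 25:
            raise ValueError(f"{name} must be 0-25, got {score}")
    total = scores.total()
    if total > 50:
        return "implement"   # change is worth the risk
    if total > 25:
        return "defer"       # log for future consideration
    return "reject"          # not worth pursuing


# Worked example 1: 18 + 15 + 16 + 8 = 57 → implement
print(vfm_decision(VFMScores(18, 15, 16, 8)))  # → implement
```

Note that a hard override like Example 3's "stability > novelty per ADL" sits outside this numeric mapping; the function only encodes the threshold table.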