# Value-First Modification (VFM) Scoring

Scoring rubric for evaluating proposed self-modifications. Any change to agent config, prompts, behavior, or pipeline structure must score above 50 out of 100 to be implemented.

## Dimensions

### Frequency (0-25 points)

How often does the issue this change addresses occur?

| Score | Criteria |
| --- | --- |
| 0-5 | Happened once, may not recur |
| 6-10 | Happens occasionally (1-2x per week) |
| 11-15 | Happens regularly (daily) |
| 16-20 | Happens frequently (multiple times per day) |
| 21-25 | Happens on nearly every run |

### Failure Reduction (0-25 points)

Does this change fix real failures?

| Score | Criteria |
| --- | --- |
| 0-5 | Cosmetic improvement, no failures prevented |
| 6-10 | Prevents occasional warnings or non-critical errors |
| 11-15 | Prevents errors that require manual intervention |
| 16-20 | Prevents errors that cause pipeline failure |
| 21-25 | Prevents errors that cause data loss or system damage |

### Burden Reduction (0-25 points)

Does this reduce human effort?

| Score | Criteria |
| --- | --- |
| 0-5 | Saves less than 1 minute per occurrence |
| 6-10 | Saves 1-5 minutes per occurrence |
| 11-15 | Saves 5-30 minutes per occurrence |
| 16-20 | Eliminates a manual step entirely |
| 21-25 | Eliminates multiple manual steps or a recurring task |

### Cost Savings (0-25 points)

Does this reduce API/compute costs?

| Score | Criteria |
| --- | --- |
| 0-5 | Negligible cost difference |
| 6-10 | Saves <10% on affected operations |
| 11-15 | Saves 10-25% on affected operations |
| 16-20 | Saves 25-50% on affected operations |
| 21-25 | Saves >50% or eliminates unnecessary API calls entirely |

## Decision threshold

| Total score | Decision |
| --- | --- |
| > 50 | Implement — change is worth the risk |
| 26-50 | Defer — log for future consideration |
| <= 25 | Reject — not worth pursuing |
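The rubric and thresholds can be sketched as a small scoring helper. This is a minimal sketch in Python; the class and function names are illustrative assumptions, not part of the template:

```python
from dataclasses import dataclass


@dataclass
class VFMScores:
    """One score per rubric dimension, each in the 0-25 range."""
    frequency: int
    failure_reduction: int
    burden_reduction: int
    cost_savings: int

    @property
    def total(self) -> int:
        scores = (self.frequency, self.failure_reduction,
                  self.burden_reduction, self.cost_savings)
        if any(not 0 <= s <= 25 for s in scores):
            raise ValueError("each dimension must score 0-25")
        return sum(scores)


def vfm_decision(scores: VFMScores) -> str:
    """Apply the thresholds: >50 implement, 26-50 defer, <=25 reject."""
    total = scores.total
    if total > 50:
        return "implement"
    if total > 25:
        return "defer"
    return "reject"
```

For instance, scores of 18 + 15 + 16 + 8 sum to 57, which crosses the 50-point bar and yields "implement".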

## Logging format

Every VFM evaluation must be logged, whether implemented or not:

```
VFM EVALUATION
Date: [timestamp]
Proposed change: [description]
Scores:
  Frequency: [score] — [justification]
  Failure reduction: [score] — [justification]
  Burden reduction: [score] — [justification]
  Cost savings: [score] — [justification]
Total: [sum]/100
Decision: implement / defer / reject
```
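A log entry in this format could be rendered with a small helper. The template above is authoritative; the function below is only a sketch, and its name and signature are assumptions:

```python
from datetime import datetime, timezone

# Dimension labels exactly as they appear in the log template.
DIMENSIONS = ("Frequency", "Failure reduction", "Burden reduction", "Cost savings")


def format_vfm_log(change: str, scores: dict[str, int],
                   justifications: dict[str, str], decision: str) -> str:
    """Render one VFM evaluation in the template format."""
    lines = [
        "VFM EVALUATION",
        f"Date: {datetime.now(timezone.utc).isoformat()}",
        f"Proposed change: {change}",
        "Scores:",
    ]
    for dim in DIMENSIONS:
        lines.append(f"  {dim}: {scores[dim]} — {justifications[dim]}")
    lines.append(f"Total: {sum(scores.values())}/100")
    lines.append(f"Decision: {decision}")
    return "\n".join(lines)
```

Emitting the entry for every evaluation, including deferred and rejected ones, is what makes the decision history auditable later.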

## Worked examples

### Example 1: Add retry logic to web search (Implement)

- Frequency: 18 (search fails ~3x daily due to timeouts)
- Failure reduction: 15 (prevents pipeline stall requiring manual restart)
- Burden reduction: 16 (eliminates manual re-run)
- Cost savings: 8 (slight cost from retry, but saves failed run cost)
- Total: 57 → Implement

### Example 2: Refactor prompt to use XML tags (Defer)

- Frequency: 25 (every run)
- Failure reduction: 3 (current format works fine)
- Burden reduction: 2 (no human effort saved)
- Cost savings: 5 (maybe slightly fewer tokens)
- Total: 35 → Defer (improvement is real but marginal)

### Example 3: Switch to experimental model (Reject)

- Frequency: 25 (every run)
- Failure reduction: 0 (current model has no failures)
- Burden reduction: 0 (no human effort saved)
- Cost savings: 0 (speculative; cheaper pricing is unverified)
- Total: 25 → Reject (stability > novelty per ADL)