# Value-First Modification (VFM) Scoring

Scoring rubric for evaluating proposed self-modifications. Any change to agent config, prompts, behavior, or pipeline structure must score above 50 out of 100 to be implemented.

## Dimensions

### Frequency (0-25 points)

How often does the issue this change addresses occur?

| Score | Criteria |
| --- | --- |
| 0-5 | Happened once, may not recur |
| 6-10 | Happens occasionally (1-2x per week) |
| 11-15 | Happens regularly (daily) |
| 16-20 | Happens frequently (multiple times per day) |
| 21-25 | Happens on nearly every run |

### Failure Reduction (0-25 points)

Does this change fix real failures?

| Score | Criteria |
| --- | --- |
| 0-5 | Cosmetic improvement, no failures prevented |
| 6-10 | Prevents occasional warnings or non-critical errors |
| 11-15 | Prevents errors that require manual intervention |
| 16-20 | Prevents errors that cause pipeline failure |
| 21-25 | Prevents errors that cause data loss or system damage |

### Burden Reduction (0-25 points)

Does this reduce human effort?

| Score | Criteria |
| --- | --- |
| 0-5 | Saves less than 1 minute per occurrence |
| 6-10 | Saves 1-5 minutes per occurrence |
| 11-15 | Saves 5-30 minutes per occurrence |
| 16-20 | Eliminates a manual step entirely |
| 21-25 | Eliminates multiple manual steps or a recurring task |

### Cost Savings (0-25 points)

Does this reduce API/compute costs?

| Score | Criteria |
| --- | --- |
| 0-5 | Negligible cost difference |
| 6-10 | Saves <10% on affected operations |
| 11-15 | Saves 10-25% on affected operations |
| 16-20 | Saves 25-50% on affected operations |
| 21-25 | Saves >50% or eliminates unnecessary API calls entirely |

## Decision threshold

| Total score | Decision |
| --- | --- |
| > 50 | Implement — change is worth the risk |
| 26-50 | Defer — log for future consideration |
| <= 25 | Reject — not worth pursuing |
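The rubric and thresholds can be sketched as a small scoring helper. This is a minimal sketch in Python; the class and function names are illustrative assumptions, not part of the template:

```python
from dataclasses import dataclass


@dataclass
class VFMScores:
    """One score per rubric dimension, each in the 0-25 range."""
    frequency: int
    failure_reduction: int
    burden_reduction: int
    cost_savings: int

    @property
    def total(self) -> int:
        scores = (self.frequency, self.failure_reduction,
                  self.burden_reduction, self.cost_savings)
        if any(not 0 <= s <= 25 for s in scores):
            raise ValueError("each dimension must score 0-25")
        return sum(scores)


def vfm_decision(scores: VFMScores) -> str:
    """Apply the thresholds: >50 implement, 26-50 defer, <=25 reject."""
    total = scores.total
    if total > 50:
        return "implement"
    if total > 25:
        return "defer"
    return "reject"
```

For instance, scores of 18 + 15 + 16 + 8 sum to 57, which crosses the 50-point bar and yields "implement".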

## Logging format

Every VFM evaluation must be logged, whether implemented or not:

```
VFM EVALUATION
Date: [timestamp]
Proposed change: [description]
Scores:
  Frequency: [score] — [justification]
  Failure reduction: [score] — [justification]
  Burden reduction: [score] — [justification]
  Cost savings: [score] — [justification]
Total: [sum]/100
Decision: implement / defer / reject
```
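A log entry in this format could be rendered with a small helper. The template above is authoritative; the function below is only a sketch, and its name and signature are assumptions:

```python
from datetime import datetime, timezone

# Dimension labels exactly as they appear in the log template.
DIMENSIONS = ("Frequency", "Failure reduction", "Burden reduction", "Cost savings")


def format_vfm_log(change: str, scores: dict[str, int],
                   justifications: dict[str, str], decision: str) -> str:
    """Render one VFM evaluation in the template format."""
    lines = [
        "VFM EVALUATION",
        f"Date: {datetime.now(timezone.utc).isoformat()}",
        f"Proposed change: {change}",
        "Scores:",
    ]
    for dim in DIMENSIONS:
        lines.append(f"  {dim}: {scores[dim]} — {justifications[dim]}")
    lines.append(f"Total: {sum(scores.values())}/100")
    lines.append(f"Decision: {decision}")
    return "\n".join(lines)
```

Emitting the entry for every evaluation, including deferred and rejected ones, is what makes the decision history auditable later.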

## Worked examples

### Example 1: Add retry logic to web search (Implement)

- Frequency: 18 (search fails ~3x daily due to timeouts)
- Failure reduction: 15 (prevents pipeline stall requiring manual restart)
- Burden reduction: 16 (eliminates manual re-run)
- Cost savings: 8 (slight cost from retry, but saves failed run cost)
- Total: 57 → Implement

### Example 2: Refactor prompt to use XML tags (Defer)

- Frequency: 25 (every run)
- Failure reduction: 3 (current format works fine)
- Burden reduction: 2 (no human effort saved)
- Cost savings: 5 (maybe slightly fewer tokens)
- Total: 35 → Defer (improvement is real but marginal)

### Example 3: Switch to experimental model (Reject)

- Frequency: 25 (every run)
- Failure reduction: 0 (current model has no failures)
- Burden reduction: 0 (no human effort saved)
- Cost savings: 0 (speculative; cheaper pricing is unverified)
- Total: 25 → Reject (stability > novelty per ADL)