feat(templates): add proactive agent templates with ADL/VFM guardrails

2026-04-12 06:47:27 +02:00 · 2026-04-12 06:47:27 +02:00 · 195fcc2517
commit 195fcc2517
parent ea3ff53d2c
4 changed files with 284 additions and 0 deletions
--- a/scripts/templates/proactive/VFM-SCORING.md
+++ b/scripts/templates/proactive/VFM-SCORING.md
@@ -0,0 +1,99 @@
+# Value-First Modification (VFM) Scoring
+
+Scoring rubric for evaluating proposed self-modifications. Any change to agent
+config, prompts, behavior, or pipeline structure must total more than 50 of
+the 100 possible points (four dimensions, 0-25 each) to be implemented.
+
+## Dimensions
+
+### Frequency (0-25 points)
+How often does the issue this change addresses occur?
+
+| Score | Criteria |
+|-------|----------|
+| 0-5 | Happened once, may not recur |
+| 6-10 | Happens occasionally (1-2x per week) |
+| 11-15 | Happens regularly (daily) |
+| 16-20 | Happens frequently (multiple times per day) |
+| 21-25 | Happens on nearly every run |
+
+### Failure Reduction (0-25 points)
+Does this change fix real failures?
+
+| Score | Criteria |
+|-------|----------|
+| 0-5 | Cosmetic improvement, no failures prevented |
+| 6-10 | Prevents occasional warnings or non-critical errors |
+| 11-15 | Prevents errors that require manual intervention |
+| 16-20 | Prevents errors that cause pipeline failure |
+| 21-25 | Prevents errors that cause data loss or system damage |
+
+### Burden Reduction (0-25 points)
+Does this reduce human effort?
+
+| Score | Criteria |
+|-------|----------|
+| 0-5 | Saves less than 1 minute per occurrence |
+| 6-10 | Saves 1-5 minutes per occurrence |
+| 11-15 | Saves over 5 and up to 30 minutes per occurrence |
+| 16-20 | Eliminates a manual step entirely |
+| 21-25 | Eliminates multiple manual steps or a recurring task |
+
+### Cost Savings (0-25 points)
+Does this reduce API/compute costs?
+
+| Score | Criteria |
+|-------|----------|
+| 0-5 | Negligible cost difference |
+| 6-10 | Saves <10% on affected operations |
+| 11-15 | Saves 10-25% on affected operations |
+| 16-20 | Saves 25-50% on affected operations |
+| 21-25 | Saves >50% or eliminates unnecessary API calls entirely |
+
+## Decision threshold
+
+| Total score | Decision |
+|-------------|----------|
+| > 50 | **Implement** — change is worth the risk |
+| 26-50 | **Defer** — log for future consideration |
+| <= 25 | **Reject** — not worth pursuing |
+
+## Logging format
+
+Every VFM evaluation must be logged, whether implemented or not:
+
+```
+VFM EVALUATION
+Date: [timestamp]
+Proposed change: [description]
+Scores:
+  Frequency: [score] — [justification]
+  Failure reduction: [score] — [justification]
+  Burden reduction: [score] — [justification]
+  Cost savings: [score] — [justification]
+Total: [sum]/100
+Decision: implement / defer / reject
+```
+
+## Worked examples
+
+### Example 1: Add retry logic to web search (Implement)
+- Frequency: 18 (search fails ~3x daily due to timeouts)
+- Failure reduction: 15 (prevents pipeline stall requiring manual restart)
+- Burden reduction: 16 (eliminates manual re-run)
+- Cost savings: 8 (slight cost from retry, but saves failed run cost)
+- **Total: 57 → Implement**
+
+### Example 2: Refactor prompt to use XML tags (Defer)
+- Frequency: 25 (every run)
+- Failure reduction: 3 (current format works fine)
+- Burden reduction: 2 (no human effort saved)
+- Cost savings: 5 (maybe slightly fewer tokens)
+- **Total: 35 → Defer** (improvement is real but marginal)
+
+### Example 3: Switch to experimental model (Defer)
+- Frequency: 25 (every run)
+- Failure reduction: 0 (current model has no failures)
+- Burden reduction: 0 (no human effort saved)
+- Cost savings: 10 (newer model might be cheaper)
+- **Total: 35 → Defer** (stability > novelty per ADL)
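Note on the rubric above: the totals and decision bands can be checked mechanically. The following is a minimal Python sketch, not part of this commit. The names `VfmScores` and `decide` are hypothetical, and the only rules assumed are the ones stated in VFM-SCORING.md (four dimensions scored 0-25, implement above 50, defer for 26-50, reject at 25 or below).

```python
from dataclasses import asdict, dataclass

MAX_PER_DIMENSION = 25   # each dimension is scored 0-25
IMPLEMENT_ABOVE = 50     # totals strictly above 50 are implemented
REJECT_AT_OR_BELOW = 25  # totals of 25 or less are rejected


@dataclass(frozen=True)
class VfmScores:
    """Hypothetical container for the four rubric dimensions."""
    frequency: int
    failure_reduction: int
    burden_reduction: int
    cost_savings: int

    def __post_init__(self) -> None:
        # Reject scores outside the 0-25 range the rubric allows.
        for name, value in asdict(self).items():
            if not 0 <= value <= MAX_PER_DIMENSION:
                raise ValueError(
                    f"{name} must be in 0-{MAX_PER_DIMENSION}, got {value}"
                )

    @property
    def total(self) -> int:
        # The total is the plain sum of the four dimensions (max 100).
        return sum(asdict(self).values())


def decide(scores: VfmScores) -> str:
    """Map a total onto the decision bands from the threshold table."""
    if scores.total > IMPLEMENT_ABOVE:
        return "implement"
    if scores.total > REJECT_AT_OR_BELOW:
        return "defer"
    return "reject"


# Scores taken from worked examples 1 and 2.
retry_logic = VfmScores(18, 15, 16, 8)
xml_refactor = VfmScores(25, 3, 2, 5)
print(retry_logic.total, decide(retry_logic))    # 57 implement
print(xml_refactor.total, decide(xml_refactor))  # 35 defer
```

Under these bands a total of exactly 50 is deferred, and the scores listed for Example 3 (25, 0, 0, 10) also sum to 35 and land in the defer band.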