feat(templates): add proactive agent templates with ADL/VFM guardrails
parent ea3ff53d2c
commit 195fcc2517
4 changed files with 284 additions and 0 deletions
99
scripts/templates/proactive/VFM-SCORING.md
Normal file
@@ -0,0 +1,99 @@
# Value-First Modification (VFM) Scoring

Scoring rubric for evaluating proposed self-modifications. Any change to
agent config, prompts, behavior, or pipeline structure must score above 50
(out of a possible 100) to be implemented.

## Dimensions

### Frequency (0-25 points)
How often does the issue this change addresses occur?

| Score | Criteria |
|-------|----------|
| 0-5 | Happened once, may not recur |
| 6-10 | Happens occasionally (1-2x per week) |
| 11-15 | Happens regularly (daily) |
| 16-20 | Happens frequently (multiple times per day) |
| 21-25 | Happens on nearly every run |

### Failure Reduction (0-25 points)
Does this change fix real failures?

| Score | Criteria |
|-------|----------|
| 0-5 | Cosmetic improvement, no failures prevented |
| 6-10 | Prevents occasional warnings or non-critical errors |
| 11-15 | Prevents errors that require manual intervention |
| 16-20 | Prevents errors that cause pipeline failure |
| 21-25 | Prevents errors that cause data loss or system damage |

### Burden Reduction (0-25 points)
Does this reduce human effort?

| Score | Criteria |
|-------|----------|
| 0-5 | Saves less than 1 minute per occurrence |
| 6-10 | Saves 1-5 minutes per occurrence |
| 11-15 | Saves 5-30 minutes per occurrence |
| 16-20 | Eliminates a manual step entirely |
| 21-25 | Eliminates multiple manual steps or a recurring task |

### Cost Savings (0-25 points)
Does this reduce API/compute costs?

| Score | Criteria |
|-------|----------|
| 0-5 | Negligible cost difference |
| 6-10 | Saves <10% on affected operations |
| 11-15 | Saves 10-25% on affected operations |
| 16-20 | Saves 25-50% on affected operations |
| 21-25 | Saves >50% or eliminates unnecessary API calls entirely |

## Decision threshold

| Total score | Decision |
|-------------|----------|
| > 50 | **Implement** — change is worth the risk |
| 26-50 | **Defer** — log for future consideration |
| <= 25 | **Reject** — not worth pursuing |

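The threshold table can be read as a small function over the four dimension scores. The following is a minimal sketch rather than part of the templates themselves; the names `VfmScores`, `MAX_PER_DIMENSION`, and `decide` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

MAX_PER_DIMENSION = 25  # each of the four dimensions is scored 0-25


@dataclass(frozen=True)
class VfmScores:
    """Hypothetical container for the four VFM dimension scores."""
    frequency: int
    failure_reduction: int
    burden_reduction: int
    cost_savings: int

    def total(self) -> int:
        values = (
            self.frequency,
            self.failure_reduction,
            self.burden_reduction,
            self.cost_savings,
        )
        # Reject out-of-range scores instead of silently clamping them.
        for value in values:
            if not 0 <= value <= MAX_PER_DIMENSION:
                raise ValueError(
                    f"score {value} is outside 0-{MAX_PER_DIMENSION}"
                )
        return sum(values)


def decide(total: int) -> str:
    """Map a total score to a decision using the threshold table."""
    if total > 50:
        return "implement"
    if total >= 26:
        return "defer"
    return "reject"
```

Under these assumptions, `decide(VfmScores(18, 15, 16, 8).total())` returns `"implement"`, and a total of exactly 50 still returns `"defer"` because the rubric requires a score strictly above 50.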
## Logging format

Every VFM evaluation must be logged, whether implemented or not:

```
VFM EVALUATION
Date: [timestamp]
Proposed change: [description]
Scores:
Frequency: [score] — [justification]
Failure reduction: [score] — [justification]
Burden reduction: [score] — [justification]
Cost savings: [score] — [justification]
Total: [sum]/100
Decision: implement / defer / reject
```

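One way to produce entries in this format consistently is to generate them from structured data. The sketch below is an illustration only: `format_vfm_log` is a hypothetical helper, and the ISO 8601 timestamp is an assumption, since the template does not say what `[timestamp]` should look like.

```python
from datetime import datetime, timezone

SEPARATOR = "\u2014"  # the dash used between score and justification in the template


def format_vfm_log(change, scores, decision, when=None):
    """Render one VFM log entry.

    `scores` maps a dimension label to a (score, justification) pair,
    in the order the lines should appear.
    """
    when = when or datetime.now(timezone.utc)
    lines = [
        "VFM EVALUATION",
        f"Date: {when.isoformat()}",  # assumed timestamp format
        f"Proposed change: {change}",
        "Scores:",
    ]
    total = 0
    for label, (score, justification) in scores.items():
        lines.append(f"{label}: {score} {SEPARATOR} {justification}")
        total += score
    lines.append(f"Total: {total}/100")
    lines.append(f"Decision: {decision}")
    return "\n".join(lines)


if __name__ == "__main__":
    entry = format_vfm_log(
        "Add retry logic to web search",
        {
            "Frequency": (18, "search fails ~3x daily due to timeouts"),
            "Failure reduction": (15, "prevents pipeline stall"),
            "Burden reduction": (16, "eliminates manual re-run"),
            "Cost savings": (8, "saves failed run cost"),
        },
        "implement",
    )
    print(entry)
```

Because the total is computed from the scores rather than typed by hand, the `Total:` line cannot drift from the individual dimension lines.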
## Worked examples

### Example 1: Add retry logic to web search (Implement)
- Frequency: 18 (search fails ~3x daily due to timeouts)
- Failure reduction: 15 (prevents pipeline stall requiring manual restart)
- Burden reduction: 16 (eliminates manual re-run)
- Cost savings: 8 (slight cost from retry, but saves failed run cost)
- **Total: 57 → Implement**

### Example 2: Refactor prompt to use XML tags (Defer)
- Frequency: 25 (every run)
- Failure reduction: 3 (current format works fine)
- Burden reduction: 2 (no human effort saved)
- Cost savings: 5 (maybe slightly fewer tokens)
- **Total: 35 → Defer** (improvement is real but marginal)

### Example 3: Switch to experimental model (Defer)
- Frequency: 25 (every run)
- Failure reduction: 0 (current model has no failures)
- Burden reduction: 0 (no human effort saved)
- Cost savings: 10 (newer model might be cheaper)
- **Total: 35 → Defer** (stability > novelty per ADL)
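The totals in the worked examples can be re-derived mechanically, which is a cheap guard against arithmetic slips when new examples are added. This is a small sketch under the thresholds stated earlier in this file; `EXAMPLES`, `decision_for`, and `summarize` are hypothetical names.

```python
# Dimension scores in the order: frequency, failure reduction,
# burden reduction, cost savings (copied from the worked examples).
EXAMPLES = {
    "Add retry logic to web search": (18, 15, 16, 8),
    "Refactor prompt to use XML tags": (25, 3, 2, 5),
    "Switch to experimental model": (25, 0, 0, 10),
}


def decision_for(total):
    """Apply the decision thresholds: > 50, 26-50, <= 25."""
    if total > 50:
        return "implement"
    if total >= 26:
        return "defer"
    return "reject"


def summarize(examples):
    """Return {name: (total, decision)} for each scored example."""
    return {
        name: (sum(scores), decision_for(sum(scores)))
        for name, scores in examples.items()
    }


if __name__ == "__main__":
    for name, (total, decision) in summarize(EXAMPLES).items():
        print(f"{name}: {total} -> {decision}")
```

Run as a script, this prints totals of 57, 35, and 35, with decisions `implement`, `defer`, and `defer` respectively.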