feat(templates): add pipeline optimization and self-healing templates

Session 5 step 21 — pipeline-optimizer writes RECOMMENDATIONS.md with
VFM pre-scores (never modifies pipeline files directly). self-healing
categorizes errors and applies recovery strategies with 5-attempt hard
cap, logging to healing-log.jsonl.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Kjell Tore Guttormsen 2026-04-12 06:51:38 +02:00
commit fa8bc86897
3 changed files with 456 additions and 0 deletions

View file

@ -0,0 +1,88 @@
# Pipeline Optimization and Self-Healing
Two tools for making agent pipelines more efficient and resilient over time.
## pipeline-optimizer.sh
Analyzes FEEDBACK.md and cost-events.jsonl to identify:
| Issue | Detection | Recommendation |
|-------|-----------|----------------|
| Bottleneck agent | Top-2 by cost event count, 1.5x+ avg | Batch tool calls or narrow task scope |
| Unnecessary revision loops | 3+ `loop-excess` pattern rows | Tighten acceptance criteria, add max-iterations guard |
| Underutilized agent | Appears in < 10% of pipeline runs | Remove from pipeline or combine with another agent |
| Cost outlier | Single run >= 3x average | Add per-run budget cap via budget-hook.sh |
Output is written to `RECOMMENDATIONS.md` with a VFM pre-score for each
recommendation. Higher VFM pre-scores mean more value per implementation effort.
**This script does not auto-implement anything.** All changes require
manual review and explicit approval. This is intentional — pipeline
restructuring is a high-stakes operation.
## self-healing.sh
Categorizes errors and applies targeted recovery strategies:
| Error Type | Recovery | Max Retries |
|------------|----------|-------------|
| `timeout` | Retry with shorter scope | 5 (hard cap) |
| `permission-denied` | Log and skip | 0 (no retry) |
| `tool-not-found` | Alert operator | 0 (no retry) |
| `api-error` | Exponential backoff (2^n seconds) | 3 |
| `content-quality` | Retry with stricter prompt | 2 |
**Hard cap: 5 total attempts regardless of category.** This follows the
OpenClaw pattern — unbounded retry loops are the most common cause of
runaway agent costs. The cap is non-negotiable.
After the hard cap is reached, the script exits with code 2 (escalate).
The caller is responsible for deciding whether to pause, alert a human,
or abort the pipeline run.
## Connection to feedback and VFM
```
feedback-collector.sh -> FEEDBACK.md -> performance-scorer.sh -> flagged agents
|
pipeline-optimizer.sh -> RECOMMENDATIONS.md
|
(manual review + approval)
|
prompt/pipeline update
|
new runs -> new feedback
```
VFM pre-scores in RECOMMENDATIONS.md use the same 0100 scale as
`scripts/templates/proactive/VFM-SCORING.md` (Step 11). They are
pre-scores, not final scores — the VFM evaluation still needs to run
when the task is scheduled. The pre-scores help prioritize which
recommendations to tackle first.
## Safety limits
- `pipeline-optimizer.sh`: read-only analysis — never modifies pipeline files
- `self-healing.sh`: max 5 attempts hard cap, permission errors never retried
- All events logged to `healing-log.jsonl` for audit trail
- No auto-escalation to external systems — exit codes only
## Usage
```bash
# Run optimizer for all pipelines
./optimization/pipeline-optimizer.sh
# Run optimizer for a specific pipeline
./optimization/pipeline-optimizer.sh --pipeline doc-pipeline
# Handle an error in a pipeline step
./optimization/self-healing.sh \
--error-type api-error \
--agent agent-writer \
--attempt 1 \
--context "OpenAI timeout on summarize call"
# Check healing log
cat healing-log.jsonl | python3 -m json.tool
```