Kjell Tore Guttormsen fa8bc86897 feat(templates): add pipeline optimization and self-healing templates

Session 5 step 21 — pipeline-optimizer writes RECOMMENDATIONS.md with
VFM pre-scores (never modifies pipeline files directly). self-healing
categorizes errors and applies recovery strategies with 5-attempt hard
cap, logging to healing-log.jsonl.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-12 06:51:38 +02:00


Pipeline Optimization and Self-Healing

Two tools for making agent pipelines more efficient and resilient over time.

pipeline-optimizer.sh

Analyzes FEEDBACK.md and cost-events.jsonl to identify:

| Issue | Detection | Recommendation |
| --- | --- | --- |
| Bottleneck agent | Top-2 by cost event count, 1.5x+ avg | Batch tool calls or narrow task scope |
| Unnecessary revision loops | 3+ `loop-excess` pattern rows | Tighten acceptance criteria, add max-iterations guard |
| Underutilized agent | Appears in < 10% of pipeline runs | Remove from pipeline or combine with another agent |
| Cost outlier | Single run >= 3x average | Add per-run budget cap via budget-hook.sh |
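
The thresholds in the table above can be sketched as small shell predicates. This is a minimal illustration, not the logic inside pipeline-optimizer.sh: the function names `is_cost_outlier` and `is_underutilized` are hypothetical, the inputs are plain numbers rather than the real cost-events.jsonl schema, and the average is assumed to include the run being tested.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating two detection rules from the table.

# is_cost_outlier RUN_COST COST...
# Succeeds when RUN_COST is at least 3x the average of the listed costs.
# Assumption: the average is taken over all listed runs, including the outlier.
is_cost_outlier() {
  local run_cost="$1"
  shift
  awk -v run="$run_cost" 'BEGIN {
    n = ARGC - 1
    if (n == 0) exit 1                  # no data, nothing to flag
    for (i = 1; i < ARGC; i++) sum += ARGV[i]
    avg = sum / n
    if (run >= 3 * avg) exit 0
    exit 1
  }' "$@"
}

# is_underutilized APPEARANCES TOTAL_RUNS
# Succeeds when the agent appears in fewer than 10% of runs.
# Integer math avoids floating point: a/t < 0.10  <=>  a*10 < t.
is_underutilized() {
  [ $(( $1 * 10 )) -lt "$2" ]
}

if is_cost_outlier 16 1 1 1 1 16; then
  echo "cost outlier: consider a per-run budget cap"
fi
```

Keeping each rule as a separate predicate makes the thresholds easy to review one at a time, which fits the manual-approval workflow described here.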

Output is written to RECOMMENDATIONS.md with a VFM pre-score for each recommendation. Higher VFM pre-scores mean more value per implementation effort.

This script does not auto-implement anything. All changes require manual review and explicit approval. This is intentional — pipeline restructuring is a high-stakes operation.

self-healing.sh

Categorizes errors and applies targeted recovery strategies:

| Error Type | Recovery | Max Retries |
| --- | --- | --- |
| `timeout` | Retry with shorter scope | 5 (hard cap) |
| `permission-denied` | Log and skip | 0 (no retry) |
| `tool-not-found` | Alert operator | 0 (no retry) |
| `api-error` | Exponential backoff (2^n seconds) | 3 |
| `content-quality` | Retry with stricter prompt | 2 |

Hard cap: 5 total attempts regardless of category; the per-category limits above apply within that cap. This follows the OpenClaw pattern: unbounded retry loops are the most common cause of runaway agent costs. The cap is non-negotiable.
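
One way to read the interaction between per-category limits and the hard cap is sketched below. The function names are hypothetical, and two points are assumptions rather than documented behavior: that `--attempt` is 1-based and counts the first try, and that `n` in the backoff formula is the attempt number.

```shell
#!/usr/bin/env bash
# Sketch of the retry budget; not the actual self-healing.sh source.
HARD_CAP=5  # total attempts across all categories

# Per-category retry limits, copied from the table above.
max_retries_for() {
  case "$1" in
    timeout)         echo 5 ;;
    api-error)       echo 3 ;;
    content-quality) echo 2 ;;
    *)               echo 0 ;;  # permission-denied, tool-not-found; unknown types assumed 0
  esac
}

# may_retry ERROR_TYPE ATTEMPT
# ATTEMPT is the 1-based number of the attempt that just failed (assumption).
# Succeeds only if both the category limit and the hard cap leave room.
may_retry() {
  local retries_used limit
  retries_used=$(( $2 - 1 ))
  limit="$(max_retries_for "$1")"
  [ "$retries_used" -lt "$limit" ] && [ "$2" -lt "$HARD_CAP" ]
}

# backoff_seconds N: 2^N seconds, computed with a bit shift.
backoff_seconds() {
  echo $(( 1 << $1 ))
}

if may_retry api-error 2; then
  echo "retrying api-error after $(backoff_seconds 2)s"
fi
```

Under this reading, `timeout` gets at most four retries in practice, because the fifth attempt already exhausts the hard cap.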

After the hard cap is reached, the script exits with code 2 (escalate). The caller is responsible for deciding whether to pause, alert a human, or abort the pipeline run.
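
A caller might translate the exit code into a decision along these lines. Only exit code 2 (escalate) is documented here; treating 0 as "continue" and every other code as "abort" is an assumption, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of caller-side handling; the mapping for codes other than 2 is assumed.

# decide_next_step EXIT_CODE: prints the action the caller should take.
decide_next_step() {
  case "$1" in
    0) echo "continue" ;;  # assumption: recovery was handled
    2) echo "escalate" ;;  # documented: hard cap reached
    *) echo "abort" ;;     # assumption: anything else is unexpected
  esac
}

# next_step_after COMMAND [ARGS...]
# Runs the command, captures its exit code without tripping `set -e`,
# and prints the resulting decision.
next_step_after() {
  local rc=0
  "$@" || rc=$?
  decide_next_step "$rc"
}

# Example with the invocation from the Usage section (commented out because
# it needs the real script):
# next_step_after ./optimization/self-healing.sh --error-type api-error \
#   --agent agent-writer --attempt 5 --context "..."
```

Because the script only signals through exit codes, whether "escalate" means pausing, alerting a human, or aborting the run stays in the caller.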

Connection to feedback and VFM

feedback-collector.sh -> FEEDBACK.md -> performance-scorer.sh -> flagged agents
                                     |
                              pipeline-optimizer.sh -> RECOMMENDATIONS.md
                                     |
                           (manual review + approval)
                                     |
                            prompt/pipeline update
                                     |
                           new runs -> new feedback

VFM pre-scores in RECOMMENDATIONS.md use the same 0–100 scale as scripts/templates/proactive/VFM-SCORING.md (Step 11). They are pre-scores, not final scores: the VFM evaluation still runs when the task is scheduled. Use the pre-scores to decide which recommendations to tackle first.

Safety limits

pipeline-optimizer.sh: read-only analysis — never modifies pipeline files
self-healing.sh: max 5 attempts hard cap, permission errors never retried
All events logged to healing-log.jsonl for audit trail
No auto-escalation to external systems — exit codes only
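
For the audit trail, appending one JSON object per line could look like the sketch below. The field names (`timestamp`, `error_type`, `agent`, `attempt`, `context`) are assumptions, not the actual healing-log.jsonl schema, and the escaping covers only backslashes and double quotes, not control characters or newlines.

```shell
#!/usr/bin/env bash
# Sketch of a JSONL audit-log writer; field names are hypothetical.

# json_escape STRING: escapes backslashes and double quotes for a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# log_healing_event LOG_FILE ERROR_TYPE AGENT ATTEMPT CONTEXT
# Appends a single line so the log stays valid JSON Lines.
log_healing_event() {
  printf '{"timestamp":"%s","error_type":"%s","agent":"%s","attempt":%d,"context":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$(json_escape "$2")" \
    "$(json_escape "$3")" \
    "$4" \
    "$(json_escape "$5")" >> "$1"
}

# Example (commented out so the sketch has no side effects when sourced):
# log_healing_event healing-log.jsonl api-error agent-writer 1 "timeout on summarize call"
```

Append-only writes keep earlier entries untouched, which is what an audit trail needs.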

Usage

# Run optimizer for all pipelines
./optimization/pipeline-optimizer.sh

# Run optimizer for a specific pipeline
./optimization/pipeline-optimizer.sh --pipeline doc-pipeline

# Handle an error in a pipeline step
./optimization/self-healing.sh \
  --error-type api-error \
  --agent agent-writer \
  --attempt 1 \
  --context "OpenAI timeout on summarize call"

# Check healing log (one JSON object per line; --json-lines needs Python 3.8+)
python3 -m json.tool --json-lines healing-log.jsonl
