agent-builder/scripts/templates/optimization/README.md
Kjell Tore Guttormsen fa8bc86897 feat(templates): add pipeline optimization and self-healing templates
Session 5 step 21 — pipeline-optimizer writes RECOMMENDATIONS.md with
VFM pre-scores (never modifies pipeline files directly). self-healing
categorizes errors and applies recovery strategies with 5-attempt hard
cap, logging to healing-log.jsonl.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 06:51:38 +02:00


Pipeline Optimization and Self-Healing

Two tools for making agent pipelines more efficient and resilient over time.

pipeline-optimizer.sh

Analyzes FEEDBACK.md and cost-events.jsonl to identify:

| Issue | Detection | Recommendation |
|---|---|---|
| Bottleneck agent | Top-2 by cost event count, 1.5x+ avg | Batch tool calls or narrow task scope |
| Unnecessary revision loops | 3+ loop-excess pattern rows | Tighten acceptance criteria, add max-iterations guard |
| Underutilized agent | Appears in < 10% of pipeline runs | Remove from pipeline or combine with another agent |
| Cost outlier | Single run >= 3x average | Add per-run budget cap via budget-hook.sh |
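
The cost-outlier rule (single run >= 3x average) can be illustrated with a small sketch. This is not the code in pipeline-optimizer.sh: the function name `flag_cost_outliers` and the input format (one `run_id cost` pair per line) are assumptions made for the example, because the schema of cost-events.jsonl is not documented here. The average is taken over all runs, including the outlier itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of pipeline-optimizer.sh.
# Reads "run_id cost" pairs on stdin (assumed format) and prints the IDs of
# runs whose cost is at least 3x the average cost of all runs read.
flag_cost_outliers() {
  awk '
    { id[NR] = $1; cost[NR] = $2 + 0; sum += cost[NR] }
    END {
      if (NR == 0) exit
      avg = sum / NR
      for (i = 1; i <= NR; i++)
        if (cost[i] >= 3 * avg) print id[i]
    }
  '
}
```

For example, `printf 'run-a 1.00\nrun-b 1.00\nrun-c 1.00\nrun-d 9.00\n' | flag_cost_outliers` prints `run-d`: the average is 3.00, so only the 9.00 run reaches the 3x threshold.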

Output is written to RECOMMENDATIONS.md with a VFM pre-score for each recommendation. Higher VFM pre-scores mean more value per implementation effort.

This script does not auto-implement anything. All changes require manual review and explicit approval. This is intentional — pipeline restructuring is a high-stakes operation.

self-healing.sh

Categorizes errors and applies targeted recovery strategies:

| Error Type | Recovery | Max Retries |
|---|---|---|
| timeout | Retry with shorter scope | 5 (hard cap) |
| permission-denied | Log and skip | 0 (no retry) |
| tool-not-found | Alert operator | 0 (no retry) |
| api-error | Exponential backoff (2^n seconds) | 3 |
| content-quality | Retry with stricter prompt | 2 |

Hard cap: 5 total attempts regardless of category. This follows the OpenClaw pattern — unbounded retry loops are the most common cause of runaway agent costs. The cap is non-negotiable.
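
One way to read the per-category limits together with the hard cap is sketched below. The function names (`max_retries_for`, `should_retry`, `backoff_seconds`) are hypothetical and the logic is an interpretation of the table, not the actual contents of self-healing.sh. It assumes `--attempt` is 1-based, that a retry budget of N allows N attempts after the first, and that unknown error types are not retried.

```shell
#!/usr/bin/env bash
# Sketch only: names and semantics here are assumptions, not self-healing.sh.
HARD_CAP=5  # total attempts allowed, regardless of category

# Per-category retry limits, copied from the table above.
max_retries_for() {
  case "$1" in
    timeout)          echo 5 ;;
    api-error)        echo 3 ;;
    content-quality)  echo 2 ;;
    permission-denied|tool-not-found) echo 0 ;;
    *)                echo 0 ;;  # assumption: unknown types are never retried
  esac
}

# should_retry ERROR_TYPE ATTEMPT
# ATTEMPT is the 1-based number of the attempt that just failed.
# Succeeds (status 0) only if both the category limit and the hard cap
# still allow another attempt.
should_retry() {
  local error_type=$1 attempt=$2 limit
  limit=$(max_retries_for "$error_type")
  (( attempt - 1 < limit )) && (( attempt < HARD_CAP ))
}

# backoff_seconds N: the 2^n delay used for api-error.
backoff_seconds() {
  echo $(( 2 ** $1 ))
}
```

Under this reading, `api-error` gets one initial attempt plus three retries, while `timeout` is stopped by the hard cap after its fifth attempt even though its own limit is 5.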

After the hard cap is reached, the script exits with code 2 (escalate). The caller is responsible for deciding whether to pause, alert a human, or abort the pipeline run.
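
On the caller side, handling the escalate code might look like the sketch below. Only exit code 2 is documented above; treating 0 as "continue" and any other status as "abort" is an assumption made for illustration, and `run_with_escalation` is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash
# Sketch only: a hypothetical caller-side wrapper around self-healing.sh.
# Runs the given command and prints the decision the caller should take.
run_with_escalation() {
  local status=0
  "$@" || status=$?
  case "$status" in
    0) echo "continue" ;;  # assumption: recovery handled, caller may proceed
    2) echo "escalate" ;;  # documented: hard cap reached
    *) echo "abort" ;;     # assumption: any other failure stops the run
  esac
}
```

A pipeline step could then call `decision=$(run_with_escalation ./optimization/self-healing.sh --error-type timeout --agent agent-writer --attempt 5 --context "...")` and branch on `$decision` to pause, alert a human, or abort.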

Connection to feedback and VFM

feedback-collector.sh -> FEEDBACK.md -> performance-scorer.sh -> flagged agents
                                     |
                              pipeline-optimizer.sh -> RECOMMENDATIONS.md
                                     |
                           (manual review + approval)
                                     |
                            prompt/pipeline update
                                     |
                           new runs -> new feedback

VFM pre-scores in RECOMMENDATIONS.md use the same 0-100 scale as scripts/templates/proactive/VFM-SCORING.md (Step 11). They are pre-scores, not final scores: the VFM evaluation still needs to run when the task is scheduled. The pre-scores help prioritize which recommendations to tackle first.

Safety limits

  • pipeline-optimizer.sh: read-only analysis — never modifies pipeline files
  • self-healing.sh: max 5 attempts hard cap, permission errors never retried
  • All events logged to healing-log.jsonl for audit trail
  • No auto-escalation to external systems — exit codes only
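
As an illustration of the audit trail, appending one event per line to healing-log.jsonl could be sketched as follows. The field names (`timestamp`, `error_type`, `agent`, `attempt`, `action`) and the function `log_healing_event` are assumptions; the real log schema is not documented here. The sketch does no JSON escaping, so it assumes the values contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical logger with an assumed schema.
# log_healing_event ERROR_TYPE AGENT ATTEMPT ACTION [LOG_FILE]
# Appends a single JSON object as one line (JSONL) to LOG_FILE.
log_healing_event() {
  local error_type=$1 agent=$2 attempt=$3 action=$4
  local log_file=${5:-healing-log.jsonl}
  local ts
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)  # UTC timestamp in ISO 8601 form
  printf '{"timestamp":"%s","error_type":"%s","agent":"%s","attempt":%d,"action":"%s"}\n' \
    "$ts" "$error_type" "$agent" "$attempt" "$action" >> "$log_file"
}
```

Appending with `>>` keeps earlier events intact, which is the property an audit log needs.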

Usage

# Run optimizer for all pipelines
./optimization/pipeline-optimizer.sh

# Run optimizer for a specific pipeline
./optimization/pipeline-optimizer.sh --pipeline doc-pipeline

# Handle an error in a pipeline step
./optimization/self-healing.sh \
  --error-type api-error \
  --agent agent-writer \
  --attempt 1 \
  --context "OpenAI timeout on summarize call"

# Check healing log (one JSON object per line; --json-lines needs Python 3.8+)
python3 -m json.tool --json-lines healing-log.jsonl