# Pipeline Optimization and Self-Healing

Two tools for making agent pipelines more efficient and resilient over time.

## pipeline-optimizer.sh

Analyzes `FEEDBACK.md` and `cost-events.jsonl` to identify:

| Issue | Detection | Recommendation |
|-------|-----------|----------------|
| Bottleneck agent | Top-2 by cost event count, 1.5x+ avg | Batch tool calls or narrow task scope |
| Unnecessary revision loops | 3+ `loop-excess` pattern rows | Tighten acceptance criteria, add max-iterations guard |
| Underutilized agent | Appears in < 10% of pipeline runs | Remove from pipeline or combine with another agent |
| Cost outlier | Single run >= 3x average | Add per-run budget cap via `budget-hook.sh` |

Output is written to `RECOMMENDATIONS.md` with a VFM pre-score for each
recommendation. Higher VFM pre-scores mean more value per implementation effort.

**This script does not auto-implement anything.** All changes require
manual review and explicit approval. This is intentional: pipeline
restructuring is a high-stakes operation.

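Two of the detection thresholds in the table above can be sketched as small shell predicates. This is only an illustration under stated assumptions: the function names `is_cost_outlier` and `is_underutilized` are hypothetical, the inputs are passed as plain numbers, and nothing here reflects how `pipeline-optimizer.sh` actually parses `cost-events.jsonl`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; the real script may compute these differently.

# is_cost_outlier RUN_COST AVG_COST
# Succeeds (exit 0) when a single run costs at least 3x the average.
is_cost_outlier() {
  awk -v run="$1" -v avg="$2" 'BEGIN { exit !(avg > 0 && run >= 3 * avg) }'
}

# is_underutilized APPEARANCES TOTAL_RUNS
# Succeeds when the agent appears in fewer than 10% of pipeline runs.
is_underutilized() {
  awk -v seen="$1" -v total="$2" 'BEGIN { exit !(total > 0 && seen * 10 < total) }'
}

# Example values are made up.
is_cost_outlier 0.90 0.25 && echo "cost outlier: suggest a per-run budget cap"
is_underutilized 3 50 && echo "underutilized: suggest removing or merging the agent"
```

`awk` is used because shell arithmetic is integer-only and costs are likely fractional.
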
## self-healing.sh

Categorizes errors and applies targeted recovery strategies:

| Error Type | Recovery | Max Retries |
|------------|----------|-------------|
| `timeout` | Retry with shorter scope | 5 (hard cap) |
| `permission-denied` | Log and skip | 0 (no retry) |
| `tool-not-found` | Alert operator | 0 (no retry) |
| `api-error` | Exponential backoff (2^n seconds) | 3 |
| `content-quality` | Retry with stricter prompt | 2 |

**Hard cap: 5 total attempts regardless of category.** This follows the
OpenClaw pattern: unbounded retry loops are the most common cause of
runaway agent costs. The cap is non-negotiable.

After the hard cap is reached, the script exits with code 2 (escalate).
The caller is responsible for deciding whether to pause, alert a human,
or abort the pipeline run.

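The interaction between the per-category retry budget and the hard cap can be sketched as follows. This is a minimal sketch, not the script's actual code: the function names are hypothetical, `ATTEMPT` is assumed to be the 1-based number of the attempt that just failed, unknown error types are assumed to be non-retryable, and `n` in the backoff is assumed to be the retry number.

```bash
#!/usr/bin/env bash
HARD_CAP=5  # total attempts, regardless of category

# max_retries_for ERROR_TYPE
# Per-category retry budget, taken from the table above.
max_retries_for() {
  case "$1" in
    timeout)         echo 5 ;;
    api-error)       echo 3 ;;
    content-quality) echo 2 ;;
    *)               echo 0 ;;  # permission-denied, tool-not-found; unknown types assumed non-retryable
  esac
}

# should_retry ERROR_TYPE ATTEMPT
# Succeeds when another attempt is allowed by both the category budget and the hard cap.
should_retry() {
  local retries_used=$(( $2 - 1 ))
  [ "$2" -lt "$HARD_CAP" ] && [ "$retries_used" -lt "$(max_retries_for "$1")" ]
}

# backoff_seconds N
# Exponential backoff of 2^N seconds, as listed for api-error.
backoff_seconds() {
  echo $(( 2 ** $1 ))
}

if should_retry api-error 2; then
  echo "api-error: retry allowed, wait $(backoff_seconds 2)s first"
fi
```

Under this reading, a `timeout` never reaches its nominal budget of 5 retries: the hard cap stops it after the fifth attempt, which is the fourth retry.
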
## Connection to feedback and VFM

```
feedback-collector.sh -> FEEDBACK.md -> performance-scorer.sh -> flagged agents
                                       |
                  pipeline-optimizer.sh -> RECOMMENDATIONS.md
                                       |
                          (manual review + approval)
                                       |
                            prompt/pipeline update
                                       |
                           new runs -> new feedback
```

VFM pre-scores in `RECOMMENDATIONS.md` use the same 0–100 scale as
`scripts/templates/proactive/VFM-SCORING.md` (Step 11). They are
pre-scores, not final scores: the VFM evaluation still needs to run
when the task is scheduled. The pre-scores help prioritize which
recommendations to tackle first.

## Safety limits

- `pipeline-optimizer.sh`: analysis only; it writes `RECOMMENDATIONS.md` and never modifies pipeline files
- `self-healing.sh`: hard cap of 5 total attempts; permission errors are never retried
- All events are logged to `healing-log.jsonl` as an audit trail
- No auto-escalation to external systems; the scripts signal through exit codes only

## Usage

```bash
# Run optimizer for all pipelines
./optimization/pipeline-optimizer.sh

# Run optimizer for a specific pipeline
./optimization/pipeline-optimizer.sh --pipeline doc-pipeline

# Handle an error in a pipeline step
./optimization/self-healing.sh \
  --error-type api-error \
  --agent agent-writer \
  --attempt 1 \
  --context "OpenAI timeout on summarize call"

# Check healing log (JSON Lines: one object per line, so json.tool
# needs --json-lines, available in Python 3.8+)
python3 -m json.tool --json-lines healing-log.jsonl
```

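Because escalation is signalled only through exit codes, the caller has to wrap each pipeline step itself. The sketch below shows one possible wrapper. Only exit code 2 (escalate) and the four flags come from this document. Everything else is an assumption made for illustration: the `run_with_healing` name, the overridable `HEALER` variable, the reading of exit code 0 as "retry may proceed", and the treatment of any other status as fatal.

```bash
#!/usr/bin/env bash
# HEALER is overridable so the wrapper can be exercised with a stub.
HEALER="${HEALER:-./optimization/self-healing.sh}"

# run_with_healing AGENT ERROR_TYPE STEP_CMD...
# Runs STEP_CMD; on failure asks the healer whether to retry.
# Returns 0 on success, 2 on escalation, or the healer's status otherwise.
run_with_healing() {
  local agent="$1" error_type="$2" attempt=1 rc
  shift 2
  while true; do
    "$@" && return 0  # step succeeded

    rc=0
    "$HEALER" --error-type "$error_type" --agent "$agent" \
      --attempt "$attempt" --context "step failed: $*" || rc=$?

    if [ "$rc" -eq 2 ] || [ "$attempt" -ge 5 ]; then
      # Escalate: pause, alert a human, or abort is the caller's decision.
      echo "escalate: $agent after attempt $attempt" >&2
      return 2
    elif [ "$rc" -ne 0 ]; then
      return "$rc"  # assumed: any other non-zero status means do not retry
    fi
    attempt=$(( attempt + 1 ))
  done
}
```

A call might look like `run_with_healing agent-writer api-error ./run-step.sh`, where `./run-step.sh` is a placeholder for the real step command. The caller-side check on `attempt` duplicates the script's hard cap on purpose, so a misbehaving healer cannot cause an unbounded loop.
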