# Session 5: Self-Learning Systems

> Steps 20, 21 | Wave 1 | Depends on: none

## Dependencies

Entry condition: none (independent; creates new template directories only)

## Scope Fence

**Touch:**
- `scripts/templates/feedback/FEEDBACK.md` (new)
- `scripts/templates/feedback/feedback-collector.sh` (new)
- `scripts/templates/feedback/performance-scorer.sh` (new)
- `scripts/templates/feedback/README.md` (new)
- `scripts/templates/optimization/pipeline-optimizer.sh` (new)
- `scripts/templates/optimization/self-healing.sh` (new)
- `scripts/templates/optimization/README.md` (new)

**Never touch:**
- `commands/`
- `agents/`
- `skills/`
- `scripts/templates/heartbeat/`
- `scripts/templates/memory/`
- `scripts/templates/proactive/`
- `scripts/templates/cron/`
- `scripts/templates/goals/`
- `scripts/templates/budget/`
- `scripts/templates/governance/`
- `scripts/templates/org-chart/`
- `.claude-plugin/`, `CLAUDE.md`, `README.md`

---

## Step 20: Create feedback loop templates

### Files to create

**`scripts/templates/feedback/FEEDBACK.md`** (feedback tracking file):

```markdown
# Feedback Log: {{PROJECT_NAME}}

> Append-only. One row per pipeline run. Reviewed by performance-scorer.sh.

## Feedback Table

| Date | Pipeline | Agent | Score | Issue | Resolution | Pattern |
|------|----------|-------|-------|-------|------------|---------|
| {{DATE}} | {{PIPELINE_NAME}} | {{AGENT_NAME}} | {{SCORE}}/100 | {{ISSUE_DESCRIPTION}} | {{RESOLUTION}} | {{PATTERN_TAG}} |

## Pattern Tags

Use consistent tags so performance-scorer.sh can detect recurring issues:

- `quality-low`: output below acceptance threshold
- `loop-excess`: more revision iterations than expected
- `timeout`: agent exceeded time budget
- `tool-fail`: tool call failed or returned unexpected result
- `cost-spike`: single run cost exceeded 3x average
- `scope-drift`: agent worked outside defined scope
- `hallucination`: output contained factual errors

## Notes

Scores are 0–100 as assigned by the reviewer agent or human reviewer.
A single score below 60 is reported by feedback-collector.sh; an agent whose
average score falls below 60 is flagged by performance-scorer.sh.
Three or more rows with the same Pattern tag = recurring issue.
Recurring issues should drive prompt iteration or pipeline redesign.
```
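
The recurrence rule in the notes (three or more rows sharing a Pattern tag) can be sketched in a few lines of Python. This is only an illustration of the rule, not part of the templates: `parse_row` and `recurring_patterns` are hypothetical helper names, and the sample rows are made up.

```python
from collections import Counter

RECURRING_THRESHOLD = 3  # "three or more rows with the same Pattern tag"


def parse_row(line):
    """Split one FEEDBACK.md table row into its seven columns (hypothetical helper)."""
    cols = [c.strip() for c in line.strip().strip("|").split("|")]
    if len(cols) != 7:
        return None  # not a data row
    date, pipeline, agent, score, issue, resolution, pattern = cols
    return {
        "date": date,
        "pipeline": pipeline,
        "agent": agent,
        "score": int(score.split("/")[0]),  # "45/100" -> 45
        "issue": issue,
        "resolution": resolution,
        "pattern": pattern,
    }


def recurring_patterns(rows):
    """Return {tag: count} for tags that appear RECURRING_THRESHOLD times or more."""
    counts = Counter(r["pattern"] for r in rows if r and r["pattern"])
    return {tag: n for tag, n in counts.items() if n >= RECURRING_THRESHOLD}


# Made-up sample rows, for illustration only.
sample = [
    "| 2025-01-10 | doc-pipeline | agent-writer | 45/100 | Too brief | Added detail rule | quality-low |",
    "| 2025-01-11 | doc-pipeline | agent-writer | 42/100 | Still brief | Repeated rule | quality-low |",
    "| 2025-01-12 | doc-pipeline | agent-writer | 48/100 | Slightly better | none | quality-low |",
    "| 2025-01-12 | doc-pipeline | agent-editor | 80/100 | Slow run | none | timeout |",
]
rows = [parse_row(line) for line in sample]
print(recurring_patterns(rows))
```

Only `quality-low` is reported here, because `timeout` appears once. Inconsistent spellings of a tag would be counted separately, which is why the tag list is fixed.
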

**`scripts/templates/feedback/feedback-collector.sh`** (PostToolUse hook variant that appends feedback after pipeline completion):

```bash
#!/bin/bash
# PostToolUse hook: Collect feedback after pipeline completion.
# Bash 3.2 compatible. Uses python3 for JSON parsing and Markdown append.
#
# Triggered after a designated "review" tool call completes.
# Reads pipeline output and reviewer score, appends to FEEDBACK.md,
# and detects recurring patterns (3+ rows with same tag = recurring).
#
# Placeholders:
#   {{WORKING_DIR}}   - absolute path to project directory
#   {{PIPELINE_NAME}} - name of the pipeline being tracked
#
# Environment:
#   SCORE_THRESHOLD - minimum acceptable score (default: 60)
#   AGENT_NAME      - fallback agent name when the review result has none

WORKING_DIR="{{WORKING_DIR}}"
PIPELINE_NAME="{{PIPELINE_NAME}}"
SCORE_THRESHOLD="${SCORE_THRESHOLD:-60}"
FEEDBACK_FILE="$WORKING_DIR/FEEDBACK.md"
HOOK_INPUT=$(cat)

# Only act on review tool calls
TOOL_NAME=$(printf '%s' "$HOOK_INPUT" | python3 -c "
import sys, json
try:
    data = json.load(sys.stdin)
    print(data.get('tool_name', ''))
except Exception:
    print('')
" 2>/dev/null)

if [ "$TOOL_NAME" != "review_pipeline" ] && [ "$TOOL_NAME" != "score_output" ]; then
  exit 0
fi

# Extract score, agent, issue, resolution, pattern from hook input.
# Values are passed through the environment and the heredoc is quoted, so
# quotes or backslashes in the hook JSON cannot break the Python source.
export HOOK_INPUT FEEDBACK_FILE PIPELINE_NAME SCORE_THRESHOLD
python3 << 'PYEOF'
import sys, json, re, os
from datetime import datetime, timezone

hook_input = os.environ.get('HOOK_INPUT', '')
feedback_file = os.environ['FEEDBACK_FILE']
pipeline_name = os.environ['PIPELINE_NAME']
try:
    score_threshold = int(os.environ.get('SCORE_THRESHOLD', '60'))
except ValueError:
    score_threshold = 60

try:
    data = json.loads(hook_input)
except Exception:
    sys.exit(0)

# Claude Code documents the PostToolUse result field as 'tool_response';
# 'tool_result' is kept as a fallback for other hook runners.
tool_result = data.get('tool_response', data.get('tool_result', ''))
if not isinstance(tool_result, str):
    tool_result = json.dumps(tool_result)

# Parse structured fields from tool result (expects JSON or key:value)
agent_name = os.environ.get('AGENT_NAME', 'unknown')
score = 0
issue = ''
resolution = ''
pattern = ''

try:
    result_data = json.loads(tool_result)
    agent_name = result_data.get('agent', agent_name)
    score = int(result_data.get('score', 0))
    issue = result_data.get('issue', '')
    resolution = result_data.get('resolution', '')
    pattern = result_data.get('pattern', '')
except Exception:
    # Fallback: look for "score: N" and "pattern: tag" in plain text
    m = re.search(r'score[:\s]+(\d+)', tool_result, re.IGNORECASE)
    if m:
        score = int(m.group(1))
    m = re.search(r'pattern[:\s]+(\S+)', tool_result, re.IGNORECASE)
    if m:
        pattern = m.group(1)

# Nothing to record if the review produced neither a score nor an issue
if score == 0 and not issue:
    sys.exit(0)

date_str = datetime.now(timezone.utc).strftime('%Y-%m-%d')
row = f"| {date_str} | {pipeline_name} | {agent_name} | {score}/100 | {issue} | {resolution} | {pattern} |"

# Append to feedback table
if not os.path.exists(feedback_file):
    print(f"Warning: {feedback_file} not found; skipping feedback append")
    sys.exit(0)

with open(feedback_file, 'r') as f:
    content = f.read()

separator = '|------|----------|-------|-------|-------|------------|---------|'
placeholder_row = '| {{DATE}} | {{PIPELINE_NAME}} | {{AGENT_NAME}} | {{SCORE}}/100 | {{ISSUE_DESCRIPTION}} | {{RESOLUTION}} | {{PATTERN_TAG}} |'

if placeholder_row in content:
    # Insert the real row above the placeholder and keep the placeholder for next time
    content = content.replace(placeholder_row, row + '\n' + placeholder_row, 1)
elif separator in content:
    content = content.replace(separator, separator + '\n' + row, 1)
else:
    content += '\n' + row + '\n'

with open(feedback_file, 'w') as f:
    f.write(content)

print(f"Feedback recorded: score={score}, pattern={pattern}")

# Detect recurring patterns: count table rows whose last cell is this tag
if pattern:
    suffix = f'| {pattern} |'
    pattern_count = sum(1 for line in content.splitlines() if line.rstrip().endswith(suffix))
    if pattern_count >= 3:
        print(f"RECURRING PATTERN DETECTED: '{pattern}' appears {pattern_count} times")
        print(f"Action required: review prompt or pipeline for '{pipeline_name}'")

# Flag low scores
if 0 < score < score_threshold:
    print(f"LOW SCORE: {score} < threshold {score_threshold} for agent {agent_name}")
PYEOF

exit 0
```
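
The collector writes `issue` and `resolution` into the table as received. A `|` or a line break inside either value would shift or split the row that `performance-scorer.sh` later parses. One way to guard against that is sketched below; `sanitize_cell` and `format_row` are hypothetical helpers, not part of the template, and replacing `|` with `/` is an arbitrary choice made for this example.

```python
def sanitize_cell(value):
    """Make a value safe to place in one Markdown table cell (hypothetical helper)."""
    text = str(value).replace("|", "/")  # a raw pipe would start a new column
    return " ".join(text.split())        # collapse newlines, tabs and repeated spaces


def format_row(date, pipeline, agent, score, issue, resolution, pattern):
    """Build one table row in the FEEDBACK.md column order."""
    cells = [date, pipeline, agent, f"{int(score)}/100", issue, resolution, pattern]
    return "| " + " | ".join(sanitize_cell(c) for c in cells) + " |"


row = format_row(
    "2025-01-10", "doc-pipeline", "agent-writer", 45,
    "Output too brief | missing examples\nsecond line", "", "quality-low",
)
print(row)
```

The resulting row always has exactly eight `|` characters, so a parser that splits on `|` sees seven columns regardless of what the reviewer typed.
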

**`scripts/templates/feedback/performance-scorer.sh`** (standalone scoring script that reads FEEDBACK.md and cost-events.jsonl):

```bash
#!/bin/bash
# Performance scorer: per-agent metrics from FEEDBACK.md + cost-events.jsonl.
# Bash 3.2 compatible. Uses python3 for all metrics computation.
#
# Metrics per agent:
#   - Average score (0-100)
#   - Error rate (rows with score < threshold / total rows)
#   - Cost per run (event count from cost-events.jsonl, rough proxy)
#   - Improvement trend: avg of last 10 scores vs. previous 10
#
# Flags agents whose average score is below the threshold (default 60/100).
#
# Usage:
#   ./performance-scorer.sh                    # Score all agents
#   ./performance-scorer.sh --agent {{AGENT}}  # Score specific agent
#   ./performance-scorer.sh --threshold 70     # Custom threshold
#
# Placeholders:
#   {{WORKING_DIR}} - absolute path to project directory

WORKING_DIR="{{WORKING_DIR}}"
FEEDBACK_FILE="$WORKING_DIR/FEEDBACK.md"
COST_LOG="$WORKING_DIR/budget/cost-events.jsonl"
THRESHOLD=60
AGENT_FILTER=""

# Parse arguments (bash 3.2 compatible, no associative arrays)
while [ "$#" -gt 0 ]; do
  case "$1" in
    --agent|--threshold)
      # "shift 2" fails without shifting when the value is missing, which
      # would loop forever, so check for the value first.
      if [ "$#" -lt 2 ]; then
        echo "Missing value for $1" >&2
        exit 1
      fi
      if [ "$1" = "--agent" ]; then AGENT_FILTER="$2"; else THRESHOLD="$2"; fi
      shift 2 ;;
    *) shift ;;
  esac
done

case "$THRESHOLD" in
  ''|*[!0-9]*) echo "--threshold must be a whole number (got '$THRESHOLD')" >&2; exit 1 ;;
esac

if [ ! -f "$FEEDBACK_FILE" ]; then
  echo "No feedback file found at $FEEDBACK_FILE"
  exit 0
fi

# Values are passed through the environment; the quoted heredoc keeps the
# shell from expanding anything inside the Python source.
export FEEDBACK_FILE COST_LOG THRESHOLD AGENT_FILTER
python3 << 'PYEOF'
import re, json, os, sys
from collections import defaultdict

feedback_file = os.environ['FEEDBACK_FILE']
cost_log = os.environ['COST_LOG']
threshold = int(os.environ['THRESHOLD'])
agent_filter = os.environ.get('AGENT_FILTER', '')

# Parse FEEDBACK.md table rows
# Expected columns: Date, Pipeline, Agent, Score, Issue, Resolution, Pattern
# Rows are assumed to be in file order, oldest first (needed for the trend).
feedback_rows = []
with open(feedback_file) as f:
    in_table = False
    for line in f:
        line = line.strip()
        if '| Date |' in line:
            in_table = True
            continue
        if not in_table:
            continue
        if not line.startswith('|'):
            in_table = False  # first non-table line ends the table
            continue
        if line.startswith('|---') or '{{' in line:
            continue  # separator or placeholder row
        cols = [c.strip() for c in line.strip('|').split('|')]
        if len(cols) < 7:
            continue
        # Parse score: "75/100" or "75"
        score_m = re.match(r'(\d+)', cols[3])
        feedback_rows.append({
            'date': cols[0],
            'pipeline': cols[1],
            'agent': cols[2],
            'score': int(score_m.group(1)) if score_m else 0,
            'issue': cols[4],
            'pattern': cols[6]
        })

# Filter by agent if specified
if agent_filter:
    feedback_rows = [r for r in feedback_rows if r['agent'] == agent_filter]

if not feedback_rows:
    print("No feedback rows found.")
    sys.exit(0)

# Read cost events if available
cost_by_agent = defaultdict(int)
if os.path.exists(cost_log):
    with open(cost_log) as f:
        for line in f:
            line = line.strip()
            if line:
                try:
                    event = json.loads(line)
                    agent = event.get('agent', 'unknown')
                    cost_by_agent[agent] += 1  # event count as proxy
                except Exception:
                    pass

# Compute per-agent metrics
agents = sorted(set(r['agent'] for r in feedback_rows))

print("PERFORMANCE SCORECARD")
print("=" * 60)
print(f"Threshold: {threshold}/100")
print(f"Total feedback rows: {len(feedback_rows)}")
print()

flagged = []

for agent in agents:
    rows = [r for r in feedback_rows if r['agent'] == agent]
    scores = [r['score'] for r in rows]

    avg_score = sum(scores) / len(scores)
    error_rate = len([s for s in scores if s < threshold]) / len(scores)
    cost_events = cost_by_agent.get(agent, 0)
    cost_per_run = cost_events / len(rows)

    # Improvement trend: last 10 vs. prev 10
    trend_str = "n/a (fewer than 20 runs)"
    if len(scores) >= 20:
        prev10 = scores[-20:-10]
        last10 = scores[-10:]
        delta = sum(last10) / len(last10) - sum(prev10) / len(prev10)
        if delta > 5:
            trend_str = f"improving (+{delta:.1f})"
        elif delta < -5:
            trend_str = f"declining ({delta:.1f})"
        else:
            trend_str = f"stable ({delta:+.1f})"
    elif len(scores) >= 10:
        last10 = scores[-10:]
        trend_str = f"recent avg: {sum(last10)/len(last10):.1f} (need 20 runs for trend)"

    # Pattern frequency
    patterns = defaultdict(int)
    for r in rows:
        if r['pattern']:
            patterns[r['pattern']] += 1
    top_patterns = sorted(patterns.items(), key=lambda x: -x[1])[:3]

    print(f"Agent: {agent}")
    print(f"  Runs: {len(rows)}")
    print(f"  Avg score: {avg_score:.1f}/100")
    print(f"  Error rate: {error_rate*100:.0f}% (score < {threshold})")
    print(f"  Cost/run: ~{cost_per_run:.1f} events (rough proxy)")
    print(f"  Trend: {trend_str}")
    if top_patterns:
        summary = ', '.join(f'{p}({c})' for p, c in top_patterns)
        print(f"  Top patterns: {summary}")
    print()

    if avg_score < threshold:
        flagged.append((agent, avg_score))

# Summary of flagged agents
if flagged:
    print("FLAGGED AGENTS (below threshold)")
    print("-" * 40)
    for agent, avg in flagged:
        print(f"  {agent}: avg {avg:.1f} < {threshold}")
    print()
    print("Recommended actions:")
    print("  1. Review feedback rows for top patterns")
    print("  2. Iterate on agent system prompt")
    print("  3. Consider pipeline redesign if pattern is structural")
    print("  4. Run pipeline-optimizer.sh for bottleneck analysis")
else:
    print("All agents above threshold.")
PYEOF
```
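
The trend rule used by the scorer (average of the last 10 scores against the previous 10, with a 5-point dead band) is easier to check in isolation. The sketch below restates it as a standalone function; `trend`, `window` and `tolerance` are names chosen for this example, and the intermediate "recent avg" message for 10 to 19 runs is left out for brevity.

```python
def trend(scores, window=10, tolerance=5):
    """Classify the change between the last two windows of scores.

    scores must be ordered oldest first.
    """
    if len(scores) < 2 * window:
        return f"n/a (fewer than {2 * window} runs)"
    previous = scores[-2 * window:-window]
    latest = scores[-window:]
    delta = sum(latest) / window - sum(previous) / window
    if delta > tolerance:
        return f"improving (+{delta:.1f})"
    if delta < -tolerance:
        return f"declining ({delta:.1f})"
    return f"stable ({delta:+.1f})"


print(trend([50] * 10 + [70] * 10))  # improving (+20.0)
print(trend([70] * 10 + [68] * 10))  # stable (-2.0)
print(trend([60] * 12))              # n/a (fewer than 20 runs)
```

The dead band keeps small fluctuations from being reported as improvement or decline; whether 5 points is the right width depends on how noisy the reviewer scores are.
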

**`scripts/templates/feedback/README.md`** (explains the feedback loop pattern):

````markdown
# Feedback Loop

Systematic feedback collection and performance scoring for agent pipelines.

## How it works

1. After each pipeline run, a reviewer agent (or human) assigns a score (0–100)
   and categorizes any issues with a pattern tag.
2. `feedback-collector.sh` runs as a PostToolUse hook on `review_pipeline` or
   `score_output` tool calls. It appends a row to `FEEDBACK.md`.
3. When 3+ rows share the same pattern tag, `feedback-collector.sh` prints a
   recurring-pattern alert.
4. `performance-scorer.sh` reads `FEEDBACK.md` and `budget/cost-events.jsonl`
   to compute per-agent metrics: average score, error rate, cost per run,
   improvement trend (last 10 vs. previous 10 runs).
5. Agents whose average score is below the threshold (default 60/100) are
   flagged for review.

## Pattern tags

Consistent tags are required for pattern detection to work. Use the tags
defined in `FEEDBACK.md`. Add project-specific tags as needed, but be
consistent. Inconsistent tagging produces false negatives.

## Scoring → self-improvement connection

Feedback scores are the input to VFM (Value-for-Money) pre-scoring
defined in `scripts/templates/proactive/VFM-SCORING.md` (Step 11).
A low-scoring agent gets a lower VFM pre-score for future pipeline tasks,
making it less likely to be selected until its performance improves.

The feedback loop closes the improvement cycle:
1. Pipeline runs → reviewer assigns score + pattern tag
2. `feedback-collector.sh` appends to FEEDBACK.md
3. `performance-scorer.sh` flags underperforming agents
4. Developer reviews top patterns → iterates on agent prompt
5. New runs produce new feedback → trend shows improvement
6. VFM scores update automatically on next pipeline selection

## Example: prompt iteration driven by feedback

Suppose `agent-writer` repeatedly scores around 45/100 with pattern `quality-low`:

```
| 2025-01-10 | doc-pipeline | agent-writer | 45/100 | Output too brief | Added detail requirement | quality-low |
| 2025-01-11 | doc-pipeline | agent-writer | 42/100 | Still too brief | Repeated instruction | quality-low |
| 2025-01-12 | doc-pipeline | agent-writer | 48/100 | Slightly better | none | quality-low |
```

After 3 rows: feedback-collector.sh prints the recurring-pattern alert.
performance-scorer.sh shows avg 45.0/100, error rate 100%.
Action: update agent-writer's system prompt with explicit length and
depth requirements. The trend line needs 20 recorded runs; once they exist
it compares the last 10 against the previous 10 and might read
"improving (+18.3)".

## Integration

Add feedback-collector.sh as a PostToolUse hook in `.claude/settings.json`.
The matcher covers both tool names the script accepts:

```json
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "review_pipeline|score_output",
      "hooks": [{"type": "command", "command": "bash feedback/feedback-collector.sh"}]
    }]
  }
}
```

Run performance-scorer.sh on demand or as a scheduled report:

```bash
./feedback/performance-scorer.sh
./feedback/performance-scorer.sh --agent agent-writer --threshold 70
```
````

### Verify

```bash
bash -n /Users/ktg/repos/agent-builder/scripts/templates/feedback/feedback-collector.sh && bash -n /Users/ktg/repos/agent-builder/scripts/templates/feedback/performance-scorer.sh && echo "VALID"
```
Expected: `VALID`

### On failure: retry (fix bash syntax), then revert

### Checkpoint
```bash
git commit -m "feat(templates): add feedback loop and performance scoring templates"
```

---

## Step 21: Create pipeline optimization templates

### Files to create

**`scripts/templates/optimization/pipeline-optimizer.sh`** (analyzes pipeline performance and generates recommendations):

```bash
#!/bin/bash
# Pipeline optimizer: identify bottlenecks, excess loops, cost outliers.
# Bash 3.2 compatible. Uses python3 for all analysis.
# Does NOT auto-implement any changes. Produces RECOMMENDATIONS.md only.
#
# Analysis covers:
#   - Bottleneck agents (top 2 by cost event count, above 1.5x the per-agent average)
#   - Unnecessary revision loops (agents with 3+ rows tagged loop-excess)
#   - Underutilized agents (present in < 10% of pipeline runs)
#   - Cost outliers (single run cost >= 3x average run cost)
#
# Cost is approximated by event count; a "run" is one agent on one day.
#
# Output: RECOMMENDATIONS.md with VFM pre-scores for each recommendation.
#
# Usage:
#   ./pipeline-optimizer.sh
#   ./pipeline-optimizer.sh --pipeline {{PIPELINE_NAME}}
#
# Placeholders:
#   {{WORKING_DIR}} - absolute path to project directory

WORKING_DIR="{{WORKING_DIR}}"
FEEDBACK_FILE="$WORKING_DIR/FEEDBACK.md"
COST_LOG="$WORKING_DIR/budget/cost-events.jsonl"
RECOMMENDATIONS_FILE="$WORKING_DIR/RECOMMENDATIONS.md"
PIPELINE_FILTER=""

# Parse arguments (bash 3.2 compatible)
while [ "$#" -gt 0 ]; do
  case "$1" in
    --pipeline)
      # "shift 2" fails without shifting when the value is missing
      if [ "$#" -lt 2 ]; then
        echo "Missing value for $1" >&2
        exit 1
      fi
      PIPELINE_FILTER="$2"; shift 2 ;;
    *) shift ;;
  esac
done

# Values are passed through the environment; the quoted heredoc keeps the
# shell from expanding anything inside the Python source.
export FEEDBACK_FILE COST_LOG RECOMMENDATIONS_FILE PIPELINE_FILTER
python3 << 'PYEOF'
import re, json, os
from collections import defaultdict
from datetime import datetime, timezone

feedback_file = os.environ['FEEDBACK_FILE']
cost_log = os.environ['COST_LOG']
recommendations_file = os.environ['RECOMMENDATIONS_FILE']
pipeline_filter = os.environ.get('PIPELINE_FILTER', '')

# Parse FEEDBACK.md
feedback_rows = []
if os.path.exists(feedback_file):
    with open(feedback_file) as f:
        in_table = False
        for line in f:
            line = line.strip()
            if '| Date |' in line:
                in_table = True
                continue
            if in_table and line.startswith('|---'):
                continue
            if in_table and line.startswith('|') and '{{' not in line:
                cols = [c.strip() for c in line.strip('|').split('|')]
                if len(cols) >= 7:
                    score_m = re.match(r'(\d+)', cols[3])
                    feedback_rows.append({
                        'date': cols[0],
                        'pipeline': cols[1],
                        'agent': cols[2],
                        'score': int(score_m.group(1)) if score_m else 0,
                        'issue': cols[4],
                        'pattern': cols[6]
                    })

# Filter by pipeline
if pipeline_filter:
    feedback_rows = [r for r in feedback_rows if r['pipeline'] == pipeline_filter]

# Parse cost events
cost_events = []
if os.path.exists(cost_log):
    with open(cost_log) as f:
        for line in f:
            line = line.strip()
            if line:
                try:
                    event = json.loads(line)
                except Exception:
                    continue
                if isinstance(event, dict):
                    cost_events.append(event)

# Per-agent event counts (cost proxy), plus per-run counts grouped by agent + date
events_by_agent = defaultdict(int)
run_costs = defaultdict(lambda: defaultdict(int))
for e in cost_events:
    agent = e.get('agent', 'unknown')
    date = str(e.get('timestamp', ''))[:10]
    events_by_agent[agent] += 1
    run_costs[agent][date] += 1

# Build recommendations
recommendations = []

# 1. Bottleneck agents: top 2 by event count, above 1.5x the per-agent average
if events_by_agent:
    avg_cost = sum(events_by_agent.values()) / len(events_by_agent)
    agent_totals = sorted(events_by_agent.items(), key=lambda x: -x[1])
    for agent, total in agent_totals[:2]:
        if total > avg_cost * 1.5:
            recommendations.append({
                'type': 'bottleneck',
                'agent': agent,
                'description': (
                    f"Agent '{agent}' accounts for {total} events vs avg {avg_cost:.0f}. "
                    "Consider batching its tool calls or reducing its task scope."
                ),
                'vfm_prescore': 70
            })

# 2. Unnecessary revision loops: agents with loop-excess pattern >= 3 times
pattern_by_agent = defaultdict(lambda: defaultdict(int))
for r in feedback_rows:
    if r['pattern']:
        pattern_by_agent[r['agent']][r['pattern']] += 1

for agent, patterns in pattern_by_agent.items():
    count = patterns.get('loop-excess', 0)
    if count >= 3:
        recommendations.append({
            'type': 'loop-excess',
            'agent': agent,
            'description': (
                f"Agent '{agent}' has {count} feedback rows tagged 'loop-excess'. "
                "Review pipeline revision criteria: tighten acceptance conditions "
                "or add a max-iterations guard (see self-healing.sh)."
            ),
            'vfm_prescore': 80
        })

# 3. Underutilized agents: invoked in < 10% of pipeline runs
if feedback_rows:
    all_runs = set(r['date'] + ':' + r['pipeline'] for r in feedback_rows)
    total_runs = len(all_runs)
    agent_runs = defaultdict(set)
    for r in feedback_rows:
        agent_runs[r['agent']].add(r['date'] + ':' + r['pipeline'])
    for agent, runs in agent_runs.items():
        utilization = len(runs) / total_runs
        if utilization < 0.1 and total_runs >= 10:
            recommendations.append({
                'type': 'underutilized',
                'agent': agent,
                'description': (
                    f"Agent '{agent}' appears in only {utilization*100:.0f}% of pipeline runs. "
                    "Consider removing from the pipeline or combining with another agent."
                ),
                'vfm_prescore': 60
            })

# 4. Cost outliers: single-run cost >= 3x average run cost
all_run_totals = [n for runs in run_costs.values() for n in runs.values()]
if all_run_totals:
    avg_run = sum(all_run_totals) / len(all_run_totals)
    for agent, runs in run_costs.items():
        for run_cost in runs.values():
            if run_cost >= avg_run * 3:
                recommendations.append({
                    'type': 'cost-outlier',
                    'agent': agent,
                    'description': (
                        f"Agent '{agent}' had a run costing {run_cost} events "
                        f"vs avg {avg_run:.1f} (3x+ threshold). "
                        "Add per-run budget cap with budget-hook.sh."
                    ),
                    'vfm_prescore': 75
                })
                break  # one recommendation per agent

# Write RECOMMENDATIONS.md
timestamp = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
pipeline_label = pipeline_filter if pipeline_filter else "all pipelines"

lines = [
    "# Pipeline Optimization Recommendations",
    "",
    f"Generated: {timestamp}",
    f"Scope: {pipeline_label}",
    "",
    "> These are recommendations only. No changes have been made.",
    "> Review each item and implement manually or with team approval.",
    "",
]

if recommendations:
    lines.append(f"## Recommendations ({len(recommendations)} found)")
    lines.append("")
    for i, rec in enumerate(recommendations, 1):
        lines.append(f"### R{i}: {rec['type'].upper()} - {rec['agent']}")
        lines.append("")
        lines.append(rec['description'])
        lines.append("")
        lines.append(f"**VFM pre-score:** {rec['vfm_prescore']}/100")
        lines.append("")
else:
    lines.append("## No recommendations")
    lines.append("")
    lines.append("No bottlenecks, excess loops, underutilized agents, or cost outliers detected.")
    lines.append("")

lines.append("## Next steps")
lines.append("")
lines.append("1. Review each recommendation with the team")
lines.append("2. Prioritize by VFM pre-score (higher = more value per effort)")
lines.append("3. Implement approved changes one at a time")
lines.append("4. Run feedback-collector.sh for 10+ runs after each change")
lines.append("5. Re-run pipeline-optimizer.sh to confirm improvement")

with open(recommendations_file, 'w') as f:
    f.write('\n'.join(lines) + '\n')

print(f"Recommendations written to {recommendations_file}")
print(f"  Found: {len(recommendations)} recommendations")
for rec in recommendations:
    print(f"  - [{rec['type']}] {rec['agent']}: VFM pre-score {rec['vfm_prescore']}")
PYEOF
```
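
The cost-outlier check depends on grouping events into runs before comparing against the average. A minimal sketch of that grouping follows, assuming (as the script does) that each event carries an `agent` field and an ISO-style `timestamp` whose first ten characters are the date. `cost_outliers` is a hypothetical function name and the events are made up.

```python
from collections import defaultdict


def cost_outliers(events, factor=3):
    """Return (agent, date, count) for runs with at least factor x the average event count."""
    runs = defaultdict(int)
    for event in events:
        agent = event.get("agent", "unknown")
        date = str(event.get("timestamp", ""))[:10]  # YYYY-MM-DD prefix
        runs[(agent, date)] += 1
    if not runs:
        return []
    average = sum(runs.values()) / len(runs)
    return sorted(
        (agent, date, count)
        for (agent, date), count in runs.items()
        if count >= factor * average
    )


# Made-up events: four quiet runs and one run with 12 events.
events = [
    {"agent": "agent-editor", "timestamp": f"2025-01-0{day}T09:00:00Z"}
    for day in range(1, 5)
]
events += [{"agent": "agent-writer", "timestamp": "2025-01-10T12:00:00Z"}] * 12
print(cost_outliers(events))
```

With five runs totalling 16 events the average is 3.2, so the 12-event run crosses the 3x line (9.6) and is the only one reported. Treating one agent-day as a run is a coarse approximation; a run identifier in the cost log, if one exists, would be a better grouping key.
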

**`scripts/templates/optimization/self-healing.sh`** (error recovery after agent/pipeline failures):

```bash
#!/bin/bash
# Self-healing: categorize errors and apply recovery strategies.
# Bash 3.2 compatible. Uses python3 for JSON logging.
#
# Error categories and recovery strategies:
#   timeout           → retry with shorter task scope
#   permission-denied → log and skip (do not retry)
#   tool-not-found    → log and alert, do not retry
#   api-error         → exponential backoff, max 3 retries
#   content-quality   → re-run with stricter prompt, max 2 retries
#
# Max total attempts: 5 (OpenClaw pattern: hard cap regardless of category).
# All recovery events logged to healing-log.jsonl.
#
# Usage:
#   ./self-healing.sh --error-type <type> --agent <name> --attempt <n> --context <msg>
#
# Exit codes:
#   0 - recovery action taken (caller should retry)
#   1 - no recovery possible (caller should abort)
#   2 - max attempts reached (caller should escalate)
#
# Placeholders:
#   {{WORKING_DIR}} - absolute path to project directory

WORKING_DIR="{{WORKING_DIR}}"
HEALING_LOG="$WORKING_DIR/healing-log.jsonl"
MAX_ATTEMPTS=5

ERROR_TYPE=""
AGENT_NAME=""
ATTEMPT=1
CONTEXT_MSG=""

# Parse arguments (bash 3.2 compatible)
while [ "$#" -gt 0 ]; do
  case "$1" in
    --error-type|--agent|--attempt|--context)
      # "shift 2" fails without shifting when the value is missing
      if [ "$#" -lt 2 ]; then
        echo "Missing value for $1" >&2
        exit 1
      fi
      case "$1" in
        --error-type) ERROR_TYPE="$2" ;;
        --agent)      AGENT_NAME="$2" ;;
        --attempt)    ATTEMPT="$2" ;;
        --context)    CONTEXT_MSG="$2" ;;
      esac
      shift 2 ;;
    *) shift ;;
  esac
done

if [ -z "$ERROR_TYPE" ]; then
  echo "Usage: $0 --error-type <type> --agent <name> --attempt <n> --context <msg>"
  exit 1
fi

case "$ATTEMPT" in
  ''|*[!0-9]*) echo "--attempt must be a whole number (got '$ATTEMPT')" >&2; exit 1 ;;
esac

# Append one JSON event to the healing log and echo it.
# Usage: log_event <action> <text> [key]   (key defaults to "detail")
# Values travel through the environment, so quotes in the context or
# detail text cannot break the Python source.
log_event() {
  LOG_ACTION="$1" LOG_TEXT="$2" LOG_KEY="${3:-detail}" \
  HEALING_LOG="$HEALING_LOG" AGENT_NAME="$AGENT_NAME" ERROR_TYPE="$ERROR_TYPE" \
  ATTEMPT="$ATTEMPT" CONTEXT_MSG="$CONTEXT_MSG" python3 -c '
import json, os, time
event = {
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "agent": os.environ["AGENT_NAME"],
    "error_type": os.environ["ERROR_TYPE"],
    "attempt": int(os.environ["ATTEMPT"]),
    "action": os.environ["LOG_ACTION"],
    os.environ["LOG_KEY"]: os.environ["LOG_TEXT"],
    "context": os.environ["CONTEXT_MSG"],
}
with open(os.environ["HEALING_LOG"], "a") as f:
    f.write(json.dumps(event) + "\n")
print(json.dumps(event, indent=2))
'
}

# Hard cap: max 5 attempts total
if [ "$ATTEMPT" -gt "$MAX_ATTEMPTS" ]; then
  echo "MAX ATTEMPTS REACHED ($MAX_ATTEMPTS) for $AGENT_NAME. Escalating."
  log_event "escalate" "max_attempts_reached" "reason"
  exit 2
fi

# Determine recovery action per category
RECOVERY_ACTION=""
RECOVERY_DETAIL=""
EXIT_CODE=0

case "$ERROR_TYPE" in
  timeout)
    RECOVERY_ACTION="retry_shorter"
    RECOVERY_DETAIL="Re-run with reduced task scope. Split task if attempt >= 3."
    if [ "$ATTEMPT" -ge 3 ]; then
      RECOVERY_DETAIL="Attempt $ATTEMPT: recommend splitting task before retry."
    fi
    EXIT_CODE=0
    ;;
  permission-denied)
    RECOVERY_ACTION="skip"
    RECOVERY_DETAIL="Permission errors cannot be auto-resolved. Log and skip. Notify operator."
    EXIT_CODE=1
    ;;
  tool-not-found)
    RECOVERY_ACTION="alert"
    RECOVERY_DETAIL="Tool not found: check agent config and hook registrations. Do not retry."
    EXIT_CODE=1
    ;;
  api-error)
    # Exponential backoff: 2^(attempt-1) seconds (1s, 2s, 4s), max 3 retries
    if [ "$ATTEMPT" -le 3 ]; then
      BACKOFF_SECS=$(( 1 << (ATTEMPT - 1) ))
      RECOVERY_ACTION="retry_backoff"
      RECOVERY_DETAIL="API error: wait ${BACKOFF_SECS}s then retry (attempt $ATTEMPT/3)."
      sleep "$BACKOFF_SECS"
      EXIT_CODE=0
    else
      RECOVERY_ACTION="abort"
      RECOVERY_DETAIL="API error persists after 3 retries. Aborting."
      EXIT_CODE=1
    fi
    ;;
  content-quality)
    # Max 2 retries for quality issues
    if [ "$ATTEMPT" -le 2 ]; then
      RECOVERY_ACTION="retry_strict"
      RECOVERY_DETAIL="Re-run with stricter prompt. Add explicit quality criteria (attempt $ATTEMPT/2)."
      EXIT_CODE=0
    else
      RECOVERY_ACTION="escalate_quality"
      RECOVERY_DETAIL="Content quality below threshold after 2 retries. Escalate to human review."
      EXIT_CODE=2
    fi
    ;;
  *)
    RECOVERY_ACTION="unknown"
    RECOVERY_DETAIL="Unknown error type '$ERROR_TYPE'. Logging and aborting."
    EXIT_CODE=1
    ;;
esac

# Log recovery event
log_event "$RECOVERY_ACTION" "$RECOVERY_DETAIL"

echo "Recovery: $RECOVERY_ACTION - $RECOVERY_DETAIL"
exit $EXIT_CODE
```
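
The interaction between the per-category limits and the hard cap is easy to misread from the `case` statement alone. The sketch below restates the decision logic as a pure function so the combinations can be checked quickly; `decide` and `backoff_seconds` are hypothetical names, and the function mirrors the script above rather than replacing it.

```python
MAX_ATTEMPTS = 5  # hard cap across all categories

# Categories that are never retried, mapped to the action the script logs.
NO_RETRY = {"permission-denied": "skip", "tool-not-found": "alert"}


def backoff_seconds(attempt):
    """Delay before retrying an api-error: 1s, 2s, 4s for attempts 1 to 3."""
    return min(2 ** (attempt - 1), 16)


def decide(error_type, attempt):
    """Return (action, exit_code) in the same way as the script's case statement."""
    if attempt > MAX_ATTEMPTS:
        return ("escalate", 2)
    if error_type in NO_RETRY:
        return (NO_RETRY[error_type], 1)
    if error_type == "timeout":
        return ("retry_shorter", 0)
    if error_type == "api-error":
        return ("retry_backoff", 0) if attempt <= 3 else ("abort", 1)
    if error_type == "content-quality":
        return ("retry_strict", 0) if attempt <= 2 else ("escalate_quality", 2)
    return ("unknown", 1)


for error_type, attempt in [("api-error", 3), ("api-error", 4), ("timeout", 5), ("timeout", 6)]:
    print(error_type, attempt, decide(error_type, attempt))
print([backoff_seconds(n) for n in (1, 2, 3)])
```

Note that `timeout` is the only category that can actually reach the hard cap: the other retryable categories stop earlier through their own limits.
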
|
||
|
||
**`scripts/templates/optimization/README.md`** — Explains optimization and self-healing:

````markdown
# Pipeline Optimization and Self-Healing

Two tools for making agent pipelines more efficient and resilient over time.

## pipeline-optimizer.sh

Analyzes FEEDBACK.md and cost-events.jsonl to identify:

| Issue | Detection | Recommendation |
|-------|-----------|----------------|
| Bottleneck agent | Top 2 by cost-event count, at least 1.5x the average | Batch tool calls or narrow task scope |
| Unnecessary revision loops | 3+ `loop-excess` pattern rows | Tighten acceptance criteria, add max-iterations guard |
| Underutilized agent | Appears in < 10% of pipeline runs | Remove from pipeline or combine with another agent |
| Cost outlier | Single run >= 3x average | Add per-run budget cap via budget-hook.sh |

Output is written to `RECOMMENDATIONS.md` with a VFM pre-score for each
recommendation. Higher VFM pre-scores mean more value per implementation effort.

**This script does not auto-implement anything.** All changes require
manual review and explicit approval. This is intentional: pipeline
restructuring is a high-stakes operation.

## self-healing.sh

Categorizes errors and applies targeted recovery strategies:

| Error Type | Recovery | Max Retries |
|------------|----------|-------------|
| `timeout` | Retry with shorter scope | 5 (hard cap) |
| `permission-denied` | Log and skip | 0 (no retry) |
| `tool-not-found` | Alert operator | 0 (no retry) |
| `api-error` | Exponential backoff (2^n seconds) | 3 |
| `content-quality` | Retry with stricter prompt | 2 |

**Hard cap: 5 total attempts regardless of category.** This follows the
OpenClaw pattern: unbounded retry loops are the most common cause of
runaway agent costs. The cap is non-negotiable.

After the hard cap is reached, the script exits with code 2 (escalate).
The caller is responsible for deciding whether to pause, alert a human,
or abort the pipeline run.

## Connection to feedback and VFM

```
feedback-collector.sh → FEEDBACK.md → performance-scorer.sh → flagged agents
                                                                    ↓
                                               pipeline-optimizer.sh → RECOMMENDATIONS.md
                                                                    ↓
                                                       (manual review + approval)
                                                                    ↓
                                                         prompt/pipeline update
                                                                    ↓
                                                         new runs → new feedback
```

VFM pre-scores in RECOMMENDATIONS.md use the same 0–100 scale as
`scripts/templates/proactive/VFM-SCORING.md` (Step 11). They are
pre-scores, not final scores; the VFM evaluation still needs to run
when the task is scheduled. The pre-scores help prioritize which
recommendations to tackle first.

## Safety limits

- `pipeline-optimizer.sh`: read-only analysis, never modifies pipeline files
- `self-healing.sh`: hard cap of 5 attempts, permission errors never retried
- All events logged to `healing-log.jsonl` for audit trail
- No auto-escalation to external systems: exit codes only

## Usage

```bash
# Run optimizer for all pipelines
./optimization/pipeline-optimizer.sh

# Run optimizer for a specific pipeline
./optimization/pipeline-optimizer.sh --pipeline doc-pipeline

# Handle an error in a pipeline step
./optimization/self-healing.sh \
  --error-type api-error \
  --agent agent-writer \
  --attempt 1 \
  --context "OpenAI timeout on summarize call"

# Check healing log (one JSON object per line; --json-lines needs Python 3.8+)
python3 -m json.tool --json-lines healing-log.jsonl
```
````
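
The retry limits, backoff schedule, and exit codes described in the README can be sketched as three small bash 3.2 compatible helpers. This is a minimal illustration under stated assumptions, not the contents of `self-healing.sh`: the function names (`max_retries_for`, `backoff_seconds`, `recovery_exit_code`) are hypothetical, and printing 0 to mean "retry" is an assumption, since the plan only specifies exit 1 for abort and exit 2 for escalation.

```shell
#!/bin/bash
# Hypothetical helpers; names are illustrative and not taken from self-healing.sh.

HARD_CAP=5  # total attempts allowed regardless of error category

# Print the per-category retry limit from the README table.
max_retries_for() {
    case "$1" in
        timeout)           echo 5 ;;
        api-error)         echo 3 ;;
        content-quality)   echo 2 ;;
        permission-denied) echo 0 ;;
        tool-not-found)    echo 0 ;;
        *)                 echo 0 ;;  # unknown types are not retried
    esac
}

# Print the backoff delay in seconds for a 1-based attempt number: 1, 2, 4, ...
backoff_seconds() {
    echo $(( 1 << ($1 - 1) ))
}

# Print the outcome for a failed attempt.
# Assumption: 0 = retry, 1 = abort, 2 = escalate to a human.
recovery_exit_code() {
    local error_type="$1" attempt="$2" limit
    if [ "$attempt" -gt "$HARD_CAP" ]; then
        echo 2
        return
    fi
    limit=$(max_retries_for "$error_type")
    if [ "$attempt" -le "$limit" ]; then
        echo 0
    elif [ "$error_type" = "content-quality" ]; then
        echo 2
    else
        echo 1
    fi
}
```

Under these assumptions, `recovery_exit_code api-error 4` prints `1` (abort after three backoff retries) and `recovery_exit_code timeout 6` prints `2` (hard cap reached).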

### Verify

```bash
bash -n /Users/ktg/repos/agent-builder/scripts/templates/optimization/pipeline-optimizer.sh \
  && bash -n /Users/ktg/repos/agent-builder/scripts/templates/optimization/self-healing.sh \
  && echo "VALID"
```
Expected: `VALID`

### On failure: retry — fix bash syntax, then revert

### Checkpoint

```bash
git commit -m "feat(templates): add pipeline optimization and self-healing templates"
```

---

## Exit Condition

- [ ] `ls /Users/ktg/repos/agent-builder/scripts/templates/feedback/ | wc -l` → 4
- [ ] `ls /Users/ktg/repos/agent-builder/scripts/templates/optimization/ | wc -l` → 3
- [ ] `bash -n /Users/ktg/repos/agent-builder/scripts/templates/feedback/feedback-collector.sh` → no errors
- [ ] `bash -n /Users/ktg/repos/agent-builder/scripts/templates/feedback/performance-scorer.sh` → no errors
- [ ] `bash -n /Users/ktg/repos/agent-builder/scripts/templates/optimization/pipeline-optimizer.sh` → no errors
- [ ] `bash -n /Users/ktg/repos/agent-builder/scripts/templates/optimization/self-healing.sh` → no errors
- [ ] FEEDBACK.md contains `| Date | Pipeline | Agent | Score | Issue | Resolution | Pattern |` header row
- [ ] performance-scorer.sh computes improvement trend (last 10 vs. prev 10)
- [ ] pipeline-optimizer.sh writes to RECOMMENDATIONS.md and does NOT modify any pipeline files
- [ ] self-healing.sh exits 2 when attempt > 5 (hard cap enforced)
- [ ] self-healing.sh references healing-log.jsonl
- [ ] All scripts are bash 3.2 compatible (no associative arrays, no mapfile, no `|&`)
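
The bash 3.2 compatibility item above can be partly automated with a text search for the three constructs it names. The helper below is a rough heuristic sketch, not part of the planned templates: the function name `check_bash32_compat` and the regular expression are assumptions, and a match inside a comment or string will be reported as a false positive.

```shell
#!/bin/bash
# Hypothetical helper; the name and the pattern are illustrative assumptions.

# Return 0 when the file shows none of the bash 4+ constructs named in the
# exit condition (associative arrays, mapfile, |&). Otherwise print the
# matching lines with their line numbers and return 1.
check_bash32_compat() {
    # grep exits 0 when it finds a match, so its status is inverted here.
    if grep -nE 'declare[[:space:]]+-[a-zA-Z]*A|mapfile|\|&' "$1"; then
        return 1
    fi
    return 0
}

# Check every template script that exists relative to the current directory.
for script in scripts/templates/feedback/*.sh scripts/templates/optimization/*.sh; do
    [ -f "$script" ] || continue
    check_bash32_compat "$script" || echo "bash 4+ construct found in $script"
done
```

A clean result here does not replace `bash -n`; it only looks for the three patterns listed in the checklist.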

## Quality Criteria

- Feedback table columns match the 7-column spec (Date, Pipeline, Agent, Score, Issue, Resolution, Pattern)
- Pattern detection fires at exactly 3 occurrences (not 2, not 4)
- performance-scorer.sh improvement trend correctly compares the last 10 scores with the previous 10
- pipeline-optimizer.sh detects all 4 issue types: bottleneck, loop-excess, underutilized, cost-outlier
- VFM pre-scores in RECOMMENDATIONS.md use the same 0–100 scale as VFM-SCORING.md (Step 11)
- self-healing.sh hard cap is exactly 5 (OpenClaw pattern) and is not configurable
- permission-denied and tool-not-found errors are never retried (exit 1 immediately)
- api-error uses exponential backoff: 1s, 2s, 4s (2^0, 2^1, 2^2) before aborting at attempt 4
- content-quality escalates to human review (exit 2) after 2 retries rather than aborting
- All scripts use `#!/bin/bash` shebang and are bash 3.2 compatible
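
The improvement-trend criterion above (last 10 scores vs. the previous 10) amounts to a difference of two averages. The sketch below shows one way to compute it and is not the contents of `performance-scorer.sh`: the function name `improvement_trend`, the one-score-per-line input format, and the `insufficient-data` result for fewer than 20 scores are all assumptions. Extracting the Score column from FEEDBACK.md is left out.

```shell
#!/bin/bash
# Hypothetical helper; the name, input format, and fallback are assumptions.

# Read one numeric score per line on stdin, oldest first, and print
# avg(last 10) - avg(previous 10) with one decimal place.
# Prints "insufficient-data" when fewer than 20 scores are available.
improvement_trend() {
    awk '
        { score[NR] = $1 }
        END {
            if (NR < 20) { print "insufficient-data"; exit }
            for (i = NR - 19; i <= NR - 10; i++) prev += score[i]
            for (i = NR - 9;  i <= NR;      i++) last += score[i]
            printf "%.1f\n", (last - prev) / 10
        }
    '
}
```

A positive result means the most recent 10 runs scored higher on average than the 10 before them; a negative result means scores dropped.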