agent-builder/.claude/plans/blueprints/session-5-selflearning.md
Kjell Tore Guttormsen 1a776bdeb2 docs(plans): create session blueprints for Agent Factory execution
8 session blueprints covering all 27 steps across 3 waves:
- Session 1: Foundation (rename + commands, Steps 1-5)
- Session 2: Skills and templates (Steps 6-7)
- Session 3: OpenClaw patterns (memory/heartbeat/proactive/cron, Steps 9-12)
- Session 4: Paperclip patterns (context/goals/budget/governance/org-chart, Steps 14-18)
- Session 5: Self-learning (feedback/optimization, Steps 20-21)
- Session 6: Integration (Docker/transfer/5 more domains, Steps 22-24)
- Session 7: Skill updates (memory/autonomy/orchestration/governance/MCP refs, Steps 13,19,25)
- Session 8: Finalization (build command integration + v1.0, Steps 8,26,27)

Also updates plan assumptions table with verified findings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-11 11:21:17 +02:00


# Session 5: Self-Learning Systems

Steps 20, 21 | Wave 1 | Depends on: none

## Dependencies

Entry condition: none (independent — creates new template directories only)

## Scope Fence

Touch:

- `scripts/templates/feedback/FEEDBACK.md` (new)
- `scripts/templates/feedback/feedback-collector.sh` (new)
- `scripts/templates/feedback/performance-scorer.sh` (new)
- `scripts/templates/feedback/README.md` (new)
- `scripts/templates/optimization/pipeline-optimizer.sh` (new)
- `scripts/templates/optimization/self-healing.sh` (new)
- `scripts/templates/optimization/README.md` (new)

Never touch:

- `commands/`
- `agents/`
- `skills/`
- `scripts/templates/heartbeat/`
- `scripts/templates/memory/`
- `scripts/templates/proactive/`
- `scripts/templates/cron/`
- `scripts/templates/goals/`
- `scripts/templates/budget/`
- `scripts/templates/governance/`
- `scripts/templates/org-chart/`
- `.claude-plugin/`, `CLAUDE.md`, `README.md`

## Step 20: Create feedback loop templates

### Files to create

scripts/templates/feedback/FEEDBACK.md — Feedback tracking file:

# Feedback Log: {{PROJECT_NAME}}

> Append-only. One row per pipeline run. Reviewed by performance-scorer.sh.

## Feedback Table

| Date | Pipeline | Agent | Score | Issue | Resolution | Pattern |
|------|----------|-------|-------|-------|------------|---------|
| {{DATE}} | {{PIPELINE_NAME}} | {{AGENT_NAME}} | {{SCORE}}/100 | {{ISSUE_DESCRIPTION}} | {{RESOLUTION}} | {{PATTERN_TAG}} |

## Pattern Tags

Use consistent tags so performance-scorer.sh can detect recurring issues:

- `quality-low` — output below acceptance threshold
- `loop-excess` — more revision iterations than expected
- `timeout` — agent exceeded time budget
- `tool-fail` — tool call failed or returned unexpected result
- `cost-spike` — single run cost exceeded 3x average
- `scope-drift` — agent worked outside defined scope
- `hallucination` — output contained factual errors

## Notes

Scores are 0–100 as assigned by the reviewer agent or human reviewer.
A score below 60 triggers a flag in performance-scorer.sh.
Three or more rows with the same Pattern tag = recurring issue.
Recurring issues should drive prompt iteration or pipeline redesign.

scripts/templates/feedback/feedback-collector.sh — PostToolUse hook variant that appends feedback after pipeline completion:

#!/bin/bash
# PostToolUse hook: Collect feedback after pipeline completion.
# Bash 3.2 compatible. Uses python3 for JSON parsing and CSV/MD append.
#
# Triggered after a designated "review" tool call completes.
# Reads pipeline output and reviewer score, appends to FEEDBACK.md,
# and detects recurring patterns (3+ rows with same tag = recurring).
#
# Placeholders:
#   {{WORKING_DIR}} - absolute path to project directory
#   {{PIPELINE_NAME}} - name of the pipeline being tracked
#   {{SCORE_THRESHOLD}} - minimum acceptable score (default: 60)

WORKING_DIR="{{WORKING_DIR}}"
PIPELINE_NAME="{{PIPELINE_NAME}}"
SCORE_THRESHOLD="${SCORE_THRESHOLD:-60}"
FEEDBACK_FILE="$WORKING_DIR/FEEDBACK.md"
HOOK_INPUT=$(cat)
export HOOK_INPUT

# Only act on review tool calls
TOOL_NAME=$(echo "$HOOK_INPUT" | python3 -c "
import sys, json
try:
    data = json.load(sys.stdin)
    print(data.get('tool_name', ''))
except Exception:
    print('')
" 2>/dev/null)

if [ "$TOOL_NAME" != "review_pipeline" ] && [ "$TOOL_NAME" != "score_output" ]; then
  exit 0
fi

# Extract score, agent, issue, resolution, pattern from hook input
python3 << PYEOF
import sys, json, re, os
from datetime import datetime

# Read the raw hook JSON from the environment instead of interpolating it into
# the heredoc, so quotes and backslashes in the payload cannot break the script.
hook_input = os.environ.get('HOOK_INPUT', '')
feedback_file = "$FEEDBACK_FILE"
pipeline_name = "$PIPELINE_NAME"
score_threshold = int("$SCORE_THRESHOLD")

try:
    data = json.loads(hook_input)
except Exception:
    sys.exit(0)

tool_result = data.get('tool_result', '')
if isinstance(tool_result, dict):
    tool_result = json.dumps(tool_result)

# Parse structured fields from tool result (expects JSON or key:value)
agent_name = os.environ.get('AGENT_NAME', 'unknown')
score = 0
issue = ''
resolution = ''
pattern = ''

try:
    result_data = json.loads(tool_result)
    agent_name = result_data.get('agent', agent_name)
    score = int(result_data.get('score', 0))
    issue = result_data.get('issue', '')
    resolution = result_data.get('resolution', '')
    pattern = result_data.get('pattern', '')
except Exception:
    # Fallback: look for score: N in plain text
    m = re.search(r'score[:\s]+(\d+)', tool_result, re.IGNORECASE)
    if m:
        score = int(m.group(1))
    m = re.search(r'pattern[:\s]+(\S+)', tool_result, re.IGNORECASE)
    if m:
        pattern = m.group(1)

if score == 0 and not issue:
    sys.exit(0)

date_str = datetime.utcnow().strftime('%Y-%m-%d')
row = f"| {date_str} | {pipeline_name} | {agent_name} | {score}/100 | {issue} | {resolution} | {pattern} |"

# Append to feedback table
if not os.path.exists(feedback_file):
    print(f"Warning: {feedback_file} not found — skipping feedback append")
    sys.exit(0)

with open(feedback_file, 'r') as f:
    content = f.read()

# Insert row after the header row of the table
table_header = '| Date | Pipeline | Agent | Score | Issue | Resolution | Pattern |'
separator = '|------|----------|-------|-------|-------|------------|---------|'
placeholder_row = '| {{DATE}} | {{PIPELINE_NAME}} | {{AGENT_NAME}} | {{SCORE}}/100 | {{ISSUE_DESCRIPTION}} | {{RESOLUTION}} | {{PATTERN_TAG}} |'

if placeholder_row in content:
    # Replace placeholder with real row + keep placeholder for next time
    content = content.replace(placeholder_row, row + '\n' + placeholder_row)
elif separator in content:
    content = content.replace(separator, separator + '\n' + row)
else:
    content += '\n' + row + '\n'

with open(feedback_file, 'w') as f:
    f.write(content)

print(f"Feedback recorded: score={score}, pattern={pattern}")

# Detect recurring patterns
if pattern:
    pattern_count = content.count(f'| {pattern} |')
    if pattern_count >= 3:
        print(f"RECURRING PATTERN DETECTED: '{pattern}' appears {pattern_count} times")
        print(f"Action required: review prompt or pipeline for '{pipeline_name}'")

# Flag low scores
if score < score_threshold and score > 0:
    print(f"LOW SCORE: {score} < threshold {score_threshold} for agent {agent_name}")
PYEOF

exit 0

scripts/templates/feedback/performance-scorer.sh — Standalone scoring script that reads FEEDBACK.md and cost-events.jsonl:

#!/bin/bash
# Performance scorer: per-agent metrics from FEEDBACK.md + cost-events.jsonl.
# Bash 3.2 compatible. Uses python3 for all metrics computation.
#
# Metrics per agent:
#   - Average score (0-100)
#   - Error rate (rows with score < threshold / total rows)
#   - Cost per run (from cost-events.jsonl, rough proxy)
#   - Improvement trend: avg of last 10 scores vs. previous 10
#
# Flags agents below threshold (default 60/100).
#
# Usage:
#   ./performance-scorer.sh                     # Score all agents
#   ./performance-scorer.sh --agent {{AGENT}}   # Score specific agent
#   ./performance-scorer.sh --threshold 70      # Custom threshold
#
# Placeholders:
#   {{WORKING_DIR}} - absolute path to project directory

WORKING_DIR="{{WORKING_DIR}}"
FEEDBACK_FILE="$WORKING_DIR/FEEDBACK.md"
COST_LOG="$WORKING_DIR/budget/cost-events.jsonl"
THRESHOLD=60  # default; overridden by --threshold below
AGENT_FILTER=""

# Parse arguments (bash 3.2 compatible — no associative arrays)
while [ "$#" -gt 0 ]; do
  case "$1" in
    --agent) AGENT_FILTER="$2"; shift 2 ;;
    --threshold) THRESHOLD="$2"; shift 2 ;;
    *) shift ;;
  esac
done

if [ ! -f "$FEEDBACK_FILE" ]; then
  echo "No feedback file found at $FEEDBACK_FILE"
  exit 0
fi

python3 << PYEOF
import re, json, os, sys
from collections import defaultdict

feedback_file = "$FEEDBACK_FILE"
cost_log = "$COST_LOG"
threshold = int("$THRESHOLD")
agent_filter = "$AGENT_FILTER"

# Parse FEEDBACK.md table rows
# Expected columns: Date, Pipeline, Agent, Score, Issue, Resolution, Pattern
feedback_rows = []
with open(feedback_file) as f:
    in_table = False
    header_seen = False
    for line in f:
        line = line.strip()
        if '| Date |' in line:
            in_table = True
            header_seen = True
            continue
        if in_table and line.startswith('|---'):
            continue
        if in_table and line.startswith('|') and '{{' not in line and header_seen:
            cols = [c.strip() for c in line.strip('|').split('|')]
            if len(cols) >= 7:
                try:
                    date = cols[0]
                    pipeline = cols[1]
                    agent = cols[2]
                    score_str = cols[3]
                    issue = cols[4]
                    resolution = cols[5]
                    pattern = cols[6]
                    # Parse score: "75/100" or "75"
                    score_m = re.match(r'(\d+)', score_str)
                    score = int(score_m.group(1)) if score_m else 0
                    feedback_rows.append({
                        'date': date,
                        'pipeline': pipeline,
                        'agent': agent,
                        'score': score,
                        'issue': issue,
                        'pattern': pattern
                    })
                except (ValueError, IndexError):
                    pass

# Filter by agent if specified
if agent_filter:
    feedback_rows = [r for r in feedback_rows if r['agent'] == agent_filter]

if not feedback_rows:
    print("No feedback rows found.")
    sys.exit(0)

# Read cost events if available
cost_by_agent = defaultdict(int)
if os.path.exists(cost_log):
    with open(cost_log) as f:
        for line in f:
            line = line.strip()
            if line:
                try:
                    event = json.loads(line)
                    agent = event.get('agent', 'unknown')
                    cost_by_agent[agent] += 1  # event count as proxy
                except Exception:
                    pass

# Compute per-agent metrics
agents = list(set(r['agent'] for r in feedback_rows))

print("PERFORMANCE SCORECARD")
print("=" * 60)
print(f"Threshold: {threshold}/100")
print(f"Total feedback rows: {len(feedback_rows)}")
print()

flagged = []

for agent in sorted(agents):
    rows = [r for r in feedback_rows if r['agent'] == agent]
    scores = [r['score'] for r in rows]

    avg_score = sum(scores) / len(scores) if scores else 0
    error_rate = len([s for s in scores if s < threshold]) / len(scores) if scores else 0
    cost_events = cost_by_agent.get(agent, 0)
    cost_per_run = cost_events / len(rows) if rows else 0

    # Improvement trend: last 10 vs. prev 10
    trend_str = "n/a (fewer than 20 runs)"
    if len(scores) >= 20:
        prev10 = scores[-20:-10]
        last10 = scores[-10:]
        prev_avg = sum(prev10) / len(prev10)
        last_avg = sum(last10) / len(last10)
        delta = last_avg - prev_avg
        if delta > 5:
            trend_str = f"improving (+{delta:.1f})"
        elif delta < -5:
            trend_str = f"declining ({delta:.1f})"
        else:
            trend_str = f"stable ({delta:+.1f})"
    elif len(scores) >= 10:
        last10 = scores[-10:]
        trend_str = f"recent avg: {sum(last10)/len(last10):.1f} (need 20 runs for trend)"

    # Pattern frequency
    patterns = defaultdict(int)
    for r in rows:
        if r['pattern']:
            patterns[r['pattern']] += 1
    top_patterns = sorted(patterns.items(), key=lambda x: -x[1])[:3]

    print(f"Agent: {agent}")
    print(f"  Runs:          {len(rows)}")
    print(f"  Avg score:     {avg_score:.1f}/100")
    print(f"  Error rate:    {error_rate*100:.0f}% (score < {threshold})")
    print(f"  Cost/run:      ~{cost_per_run:.1f} events (rough proxy)")
    print(f"  Trend:         {trend_str}")
    if top_patterns:
        print(f"  Top patterns:  {', '.join(f'{p}({c})' for p, c in top_patterns)}")
    print()

    if avg_score < threshold:
        flagged.append((agent, avg_score))

# Summary of flagged agents
if flagged:
    print("FLAGGED AGENTS (below threshold)")
    print("-" * 40)
    for agent, avg in flagged:
        print(f"  {agent}: avg {avg:.1f} < {threshold}")
    print()
    print("Recommended actions:")
    print("  1. Review feedback rows for top patterns")
    print("  2. Iterate on agent system prompt")
    print("  3. Consider pipeline redesign if pattern is structural")
    print("  4. Run pipeline-optimizer.sh for bottleneck analysis")
else:
    print("All agents above threshold.")
PYEOF

scripts/templates/feedback/README.md — Explains the feedback loop pattern:

# Feedback Loop

Systematic feedback collection and performance scoring for agent pipelines.

## How it works

1. After each pipeline run, a reviewer agent (or human) assigns a score (0–100)
   and categorizes any issues with a pattern tag.
2. `feedback-collector.sh` runs as a PostToolUse hook on `review_pipeline` or
   `score_output` tool calls. It appends a row to `FEEDBACK.md`.
3. When 3+ rows share the same pattern tag, a recurring-pattern alert fires.
4. `performance-scorer.sh` reads `FEEDBACK.md` and `budget/cost-events.jsonl`
   to compute per-agent metrics: average score, error rate, cost per run,
   improvement trend (last 10 vs. previous 10 runs).
5. Agents scoring below the threshold (default 60/100) are flagged for review.
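
For reference, here is a minimal hook payload that would drive one feedback row. The exact payload shape depends on your runtime; the field names below are simply the ones `feedback-collector.sh` parses (`tool_name`, `tool_result` with `agent`/`score`/`issue`/`resolution`/`pattern`):

```json
{
  "tool_name": "review_pipeline",
  "tool_result": "{\"agent\": \"agent-writer\", \"score\": 45, \"issue\": \"Output too brief\", \"resolution\": \"\", \"pattern\": \"quality-low\"}"
}
```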

## Pattern tags

Consistent tags are required for pattern detection to work. Use the tags
defined in `FEEDBACK.md`. Add project-specific tags as needed — but be
consistent. Inconsistent tagging produces false negatives.
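
For a quick manual check outside the scripts, recurring tags can be counted straight from the table's Pattern column. A rough sketch (`count_recurring` is a hypothetical helper, not one of the templates):

```shell
# count_recurring: read FEEDBACK.md-style table rows on stdin,
# print any pattern tag that appears 3 or more times.
count_recurring() {
  awk -F'|' '
    /^\|/ && $0 !~ /\{\{/ && !/^\| Date/ && !/^\|-/ {
      tag = $(NF-1); gsub(/^ +| +$/, "", tag)   # Pattern is the last real column
      if (tag != "") count[tag]++
    }
    END { for (t in count) if (count[t] >= 3) print t "=" count[t] }
  '
}

if [ -f FEEDBACK.md ]; then count_recurring < FEEDBACK.md; fi
```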

## Scoring → self-improvement connection

Feedback scores are the input to VFM (Value-for-Money) pre-scoring
defined in `scripts/templates/proactive/VFM-SCORING.md` (Step 11).
A low-scoring agent gets a lower VFM pre-score for future pipeline tasks,
making it less likely to be selected until its performance improves.

The feedback loop closes the improvement cycle:
1. Pipeline runs → reviewer assigns score + pattern tag
2. `feedback-collector.sh` appends to FEEDBACK.md
3. `performance-scorer.sh` flags underperforming agents
4. Developer reviews top patterns → iterates on agent prompt
5. New runs produce new feedback → trend shows improvement
6. VFM scores update automatically on next pipeline selection

## Example: prompt iteration driven by feedback

Suppose `agent-writer` repeatedly scores 45/100 with pattern `quality-low`:

| Date | Pipeline | Agent | Score | Issue | Resolution | Pattern |
|------|----------|-------|-------|-------|------------|---------|
| 2025-01-10 | doc-pipeline | agent-writer | 45/100 | Output too brief | Added detail requirement | quality-low |
| 2025-01-11 | doc-pipeline | agent-writer | 42/100 | Still too brief | Repeated instruction | quality-low |
| 2025-01-12 | doc-pipeline | agent-writer | 48/100 | Slightly better | — | quality-low |


After 3 rows: feedback-collector.sh fires the recurring-pattern alert.
performance-scorer.sh shows avg 45/100, error rate 100%.
Action: update agent-writer's system prompt with explicit length and
depth requirements. Next 10 runs show trend "improving (+18.3)".

## Integration

Add feedback-collector.sh as a PostToolUse hook in `.claude/settings.json`:

```json
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "review_pipeline",
      "hooks": [{"type": "command", "command": "bash feedback/feedback-collector.sh"}]
    }]
  }
}
```

Run performance-scorer.sh on demand or as a scheduled report:

```bash
./feedback/performance-scorer.sh
./feedback/performance-scorer.sh --agent agent-writer --threshold 70
```

### Verify

```bash
bash -n /Users/ktg/repos/agent-builder/scripts/templates/feedback/feedback-collector.sh && bash -n /Users/ktg/repos/agent-builder/scripts/templates/feedback/performance-scorer.sh && echo "VALID"
```

Expected: VALID

On failure: retry — fix bash syntax, then revert

### Checkpoint

git commit -m "feat(templates): add feedback loop and performance scoring templates"

## Step 21: Create pipeline optimization templates

### Files to create

scripts/templates/optimization/pipeline-optimizer.sh — Analyzes pipeline performance and generates recommendations:

#!/bin/bash
# Pipeline optimizer: identify bottlenecks, excess loops, cost outliers.
# Bash 3.2 compatible. Uses python3 for all analysis.
# Does NOT auto-implement any changes — produces RECOMMENDATIONS.md only.
#
# Analysis covers:
#   - Bottleneck agents (highest avg duration or cost per run)
#   - Unnecessary revision loops (agents that loop 3+ times on average)
#   - Underutilized agents (invoked < 10% of pipeline runs)
#   - Cost outliers (single run cost >= 3x average)
#
# Output: RECOMMENDATIONS.md with VFM pre-scores for each recommendation.
#
# Usage:
#   ./pipeline-optimizer.sh
#   ./pipeline-optimizer.sh --pipeline {{PIPELINE_NAME}}
#
# Placeholders:
#   {{WORKING_DIR}} - absolute path to project directory

WORKING_DIR="{{WORKING_DIR}}"
FEEDBACK_FILE="$WORKING_DIR/FEEDBACK.md"
COST_LOG="$WORKING_DIR/budget/cost-events.jsonl"
RECOMMENDATIONS_FILE="$WORKING_DIR/RECOMMENDATIONS.md"
PIPELINE_FILTER=""

# Parse arguments (bash 3.2 compatible)
while [ "$#" -gt 0 ]; do
  case "$1" in
    --pipeline) PIPELINE_FILTER="$2"; shift 2 ;;
    *) shift ;;
  esac
done

python3 << PYEOF
import re, json, os, sys
from collections import defaultdict
from datetime import datetime

feedback_file = "$FEEDBACK_FILE"
cost_log = "$COST_LOG"
recommendations_file = "$RECOMMENDATIONS_FILE"
pipeline_filter = "$PIPELINE_FILTER"

# Parse FEEDBACK.md
feedback_rows = []
if os.path.exists(feedback_file):
    with open(feedback_file) as f:
        in_table = False
        for line in f:
            line = line.strip()
            if '| Date |' in line:
                in_table = True
                continue
            if in_table and line.startswith('|---'):
                continue
            if in_table and line.startswith('|') and '{{' not in line:
                cols = [c.strip() for c in line.strip('|').split('|')]
                if len(cols) >= 7:
                    try:
                        score_m = re.match(r'(\d+)', cols[3])
                        score = int(score_m.group(1)) if score_m else 0
                        feedback_rows.append({
                            'date': cols[0],
                            'pipeline': cols[1],
                            'agent': cols[2],
                            'score': score,
                            'issue': cols[4],
                            'pattern': cols[6]
                        })
                    except (ValueError, IndexError):
                        pass

# Filter by pipeline
if pipeline_filter:
    feedback_rows = [r for r in feedback_rows if r['pipeline'] == pipeline_filter]

# Parse cost events
cost_events = []
if os.path.exists(cost_log):
    with open(cost_log) as f:
        for line in f:
            line = line.strip()
            if line:
                try:
                    cost_events.append(json.loads(line))
                except Exception:
                    pass

# Per-agent event counts (cost proxy)
cost_by_agent = defaultdict(list)
# Group events by agent+date to approximate per-run cost
run_costs = defaultdict(list)
for e in cost_events:
    agent = e.get('agent', 'unknown')
    date = e.get('timestamp', '')[:10]
    run_key = f"{agent}:{date}"
    cost_by_agent[agent].append(1)
    run_costs[run_key].append(1)

# Build recommendations
recommendations = []

# 1. Bottleneck agents: top 2 by event count, at 1.5x+ the per-agent average
if cost_by_agent:
    agent_totals = [(a, len(events)) for a, events in cost_by_agent.items()]
    agent_totals.sort(key=lambda x: -x[1])
    avg_total = sum(t for _, t in agent_totals) / len(agent_totals)
    for agent, total in agent_totals[:2]:
        if total > avg_total * 1.5:
            recommendations.append({
                'type': 'bottleneck',
                'agent': agent,
                'description': f"Agent '{agent}' accounts for {total} events vs avg {avg_total:.0f} per agent. "
                               f"Consider batching its tool calls or reducing its task scope.",
                'vfm_prescore': 70
            })

# 2. Unnecessary revision loops: agents with loop-excess pattern >= 3 times
pattern_by_agent = defaultdict(lambda: defaultdict(int))
for r in feedback_rows:
    if r['pattern']:
        pattern_by_agent[r['agent']][r['pattern']] += 1

for agent, patterns in pattern_by_agent.items():
    if patterns.get('loop-excess', 0) >= 3:
        count = patterns['loop-excess']
        recommendations.append({
            'type': 'loop-excess',
            'agent': agent,
            'description': f"Agent '{agent}' has {count} feedback rows tagged 'loop-excess'. "
                           f"Review pipeline revision criteria — tighten acceptance conditions "
                           f"or add a max-iterations guard (see self-healing.sh).",
            'vfm_prescore': 80
        })

# 3. Underutilized agents: invoked in < 10% of pipeline runs
if feedback_rows:
    all_runs = set(r['date'] + ':' + r['pipeline'] for r in feedback_rows)
    total_runs = len(all_runs) if all_runs else 1
    agent_runs = defaultdict(set)
    for r in feedback_rows:
        agent_runs[r['agent']].add(r['date'] + ':' + r['pipeline'])
    for agent, runs in agent_runs.items():
        utilization = len(runs) / total_runs
        if utilization < 0.1 and total_runs >= 10:
            recommendations.append({
                'type': 'underutilized',
                'agent': agent,
                'description': f"Agent '{agent}' appears in only {utilization*100:.0f}% of pipeline runs. "
                               f"Consider removing from the pipeline or combining with another agent.",
                'vfm_prescore': 60
            })

# 4. Cost outliers: single-run cost >= 3x the average run
if run_costs:
    run_totals = {k: len(v) for k, v in run_costs.items()}
    avg_run = sum(run_totals.values()) / len(run_totals)
    flagged_outliers = set()
    for run_key in sorted(run_totals):
        agent = run_key.split(':')[0]
        total = run_totals[run_key]
        if total >= avg_run * 3 and agent not in flagged_outliers:
            flagged_outliers.add(agent)  # one recommendation per agent
            recommendations.append({
                'type': 'cost-outlier',
                'agent': agent,
                'description': f"Agent '{agent}' had a run costing {total} events "
                               f"vs avg {avg_run:.1f} (3x+ threshold). "
                               f"Add per-run budget cap with budget-hook.sh.",
                'vfm_prescore': 75
            })

# Write RECOMMENDATIONS.md
timestamp = datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ')
pipeline_label = pipeline_filter if pipeline_filter else "all pipelines"

lines = [
    f"# Pipeline Optimization Recommendations",
    f"",
    f"Generated: {timestamp}",
    f"Scope: {pipeline_label}",
    f"",
    f"> These are recommendations only. No changes have been made.",
    f"> Review each item and implement manually or with team approval.",
    f"",
]

if recommendations:
    lines.append(f"## Recommendations ({len(recommendations)} found)")
    lines.append("")
    for i, rec in enumerate(recommendations, 1):
        lines.append(f"### R{i}: {rec['type'].upper()} — {rec['agent']}")
        lines.append("")
        lines.append(rec['description'])
        lines.append("")
        lines.append(f"**VFM pre-score:** {rec['vfm_prescore']}/100")
        lines.append("")
else:
    lines.append("## No recommendations")
    lines.append("")
    lines.append("No bottlenecks, excess loops, underutilized agents, or cost outliers detected.")
    lines.append("")

lines.append("## Next steps")
lines.append("")
lines.append("1. Review each recommendation with the team")
lines.append("2. Prioritize by VFM pre-score (higher = more value per effort)")
lines.append("3. Implement approved changes one at a time")
lines.append("4. Run feedback-collector.sh for 10+ runs after each change")
lines.append("5. Re-run pipeline-optimizer.sh to confirm improvement")

with open(recommendations_file, 'w') as f:
    f.write('\n'.join(lines) + '\n')

print(f"Recommendations written to {recommendations_file}")
print(f"  Found: {len(recommendations)} recommendations")
for rec in recommendations:
    print(f"  - [{rec['type']}] {rec['agent']}: VFM pre-score {rec['vfm_prescore']}")
PYEOF

scripts/templates/optimization/self-healing.sh — Error recovery after agent/pipeline failures:

#!/bin/bash
# Self-healing: categorize errors and apply recovery strategies.
# Bash 3.2 compatible. Uses python3 for JSON/log parsing.
#
# Error categories and recovery strategies:
#   timeout          → retry with shorter task scope
#   permission-denied → log and skip (do not retry)
#   tool-not-found   → log and alert, do not retry
#   api-error        → exponential backoff, max 3 retries
#   content-quality  → re-run with stricter prompt, max 2 retries
#
# Max total attempts: 5 (OpenClaw pattern — hard cap regardless of category).
# All recovery events logged to healing-log.jsonl.
#
# Usage:
#   ./self-healing.sh --error-type <type> --agent <name> --attempt <n> --context <msg>
#
# Exit codes:
#   0 — recovery action taken (caller should retry)
#   1 — no recovery possible (caller should abort)
#   2 — max attempts reached (caller should escalate)
#
# Placeholders:
#   {{WORKING_DIR}} - absolute path to project directory

WORKING_DIR="{{WORKING_DIR}}"
HEALING_LOG="$WORKING_DIR/healing-log.jsonl"
MAX_ATTEMPTS=5

ERROR_TYPE=""
AGENT_NAME=""
ATTEMPT=1
CONTEXT_MSG=""

# Parse arguments (bash 3.2 compatible)
while [ "$#" -gt 0 ]; do
  case "$1" in
    --error-type) ERROR_TYPE="$2"; shift 2 ;;
    --agent) AGENT_NAME="$2"; shift 2 ;;
    --attempt) ATTEMPT="$2"; shift 2 ;;
    --context) CONTEXT_MSG="$2"; shift 2 ;;
    *) shift ;;
  esac
done

if [ -z "$ERROR_TYPE" ]; then
  echo "Usage: $0 --error-type <type> --agent <name> --attempt <n> --context <msg>"
  exit 1
fi

# Hard cap: max 5 attempts total
if [ "$ATTEMPT" -gt "$MAX_ATTEMPTS" ]; then
  echo "MAX ATTEMPTS REACHED ($MAX_ATTEMPTS) for $AGENT_NAME. Escalating."
  # Values are passed via the environment so quotes in them cannot break the JSON
  AGENT_NAME="$AGENT_NAME" ERROR_TYPE="$ERROR_TYPE" ATTEMPT="$ATTEMPT" \
  CONTEXT_MSG="$CONTEXT_MSG" HEALING_LOG="$HEALING_LOG" python3 -c "
import json, time, os
event = {
    'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
    'agent': os.environ.get('AGENT_NAME', ''),
    'error_type': os.environ.get('ERROR_TYPE', ''),
    'attempt': int(os.environ.get('ATTEMPT', '1') or 1),
    'action': 'escalate',
    'reason': 'max_attempts_reached',
    'context': os.environ.get('CONTEXT_MSG', '')
}
with open(os.environ['HEALING_LOG'], 'a') as f:
    f.write(json.dumps(event) + '\n')
print(json.dumps(event))
"
  exit 2
fi

# Determine recovery action per category
RECOVERY_ACTION=""
RECOVERY_DETAIL=""
EXIT_CODE=0

case "$ERROR_TYPE" in
  timeout)
    RECOVERY_ACTION="retry_shorter"
    RECOVERY_DETAIL="Re-run with reduced task scope. Split task if attempt >= 3."
    if [ "$ATTEMPT" -ge 3 ]; then
      RECOVERY_DETAIL="Attempt $ATTEMPT: recommend splitting task before retry."
    fi
    EXIT_CODE=0
    ;;
  permission-denied)
    RECOVERY_ACTION="skip"
    RECOVERY_DETAIL="Permission errors cannot be auto-resolved. Log and skip. Notify operator."
    EXIT_CODE=1
    ;;
  tool-not-found)
    RECOVERY_ACTION="alert"
    RECOVERY_DETAIL="Tool not found — check agent config and hook registrations. Do not retry."
    EXIT_CODE=1
    ;;
  api-error)
    # Exponential backoff: 2^(attempt-1) seconds, max 3 retries
    if [ "$ATTEMPT" -le 3 ]; then
      BACKOFF_SECS=$(python3 -c "print(min(2 ** ($ATTEMPT - 1), 16))")
      RECOVERY_ACTION="retry_backoff"
      RECOVERY_DETAIL="API error — wait ${BACKOFF_SECS}s then retry (attempt $ATTEMPT/3)."
      sleep "$BACKOFF_SECS"
      EXIT_CODE=0
    else
      RECOVERY_ACTION="abort"
      RECOVERY_DETAIL="API error persists after 3 retries. Aborting."
      EXIT_CODE=1
    fi
    ;;
  content-quality)
    # Max 2 retries for quality issues
    if [ "$ATTEMPT" -le 2 ]; then
      RECOVERY_ACTION="retry_strict"
      RECOVERY_DETAIL="Re-run with stricter prompt. Add explicit quality criteria (attempt $ATTEMPT/2)."
      EXIT_CODE=0
    else
      RECOVERY_ACTION="escalate_quality"
      RECOVERY_DETAIL="Content quality below threshold after 2 retries. Escalate to human review."
      EXIT_CODE=2
    fi
    ;;
  *)
    RECOVERY_ACTION="unknown"
    RECOVERY_DETAIL="Unknown error type '$ERROR_TYPE'. Logging and aborting."
    EXIT_CODE=1
    ;;
esac

# Log recovery event (values passed via the environment to avoid quoting issues)
AGENT_NAME="$AGENT_NAME" ERROR_TYPE="$ERROR_TYPE" ATTEMPT="$ATTEMPT" \
RECOVERY_ACTION="$RECOVERY_ACTION" RECOVERY_DETAIL="$RECOVERY_DETAIL" \
CONTEXT_MSG="$CONTEXT_MSG" HEALING_LOG="$HEALING_LOG" python3 -c "
import json, time, os
event = {
    'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
    'agent': os.environ.get('AGENT_NAME', ''),
    'error_type': os.environ.get('ERROR_TYPE', ''),
    'attempt': int(os.environ.get('ATTEMPT', '1') or 1),
    'action': os.environ.get('RECOVERY_ACTION', ''),
    'detail': os.environ.get('RECOVERY_DETAIL', ''),
    'context': os.environ.get('CONTEXT_MSG', '')
}
with open(os.environ['HEALING_LOG'], 'a') as f:
    f.write(json.dumps(event) + '\n')
print(json.dumps(event, indent=2))
"

echo "Recovery: $RECOVERY_ACTION - $RECOVERY_DETAIL"
exit $EXIT_CODE

scripts/templates/optimization/README.md — Explains optimization and self-healing:

# Pipeline Optimization and Self-Healing

Two tools for making agent pipelines more efficient and resilient over time.

## pipeline-optimizer.sh

Analyzes FEEDBACK.md and cost-events.jsonl to identify:

| Issue | Detection | Recommendation |
|-------|-----------|----------------|
| Bottleneck agent | Top-2 by cost event count, 1.5x+ avg | Batch tool calls or narrow task scope |
| Unnecessary revision loops | 3+ `loop-excess` pattern rows | Tighten acceptance criteria, add max-iterations guard |
| Underutilized agent | Appears in < 10% of pipeline runs | Remove from pipeline or combine with another agent |
| Cost outlier | Single run >= 3x average | Add per-run budget cap via budget-hook.sh |

Output is written to `RECOMMENDATIONS.md` with a VFM pre-score for each
recommendation. Higher VFM pre-scores mean more value per implementation effort.

**This script does not auto-implement anything.** All changes require
manual review and explicit approval. This is intentional — pipeline
restructuring is a high-stakes operation.

## self-healing.sh

Categorizes errors and applies targeted recovery strategies:

| Error Type | Recovery | Max Retries |
|------------|----------|-------------|
| `timeout` | Retry with shorter scope | 5 (hard cap) |
| `permission-denied` | Log and skip | 0 (no retry) |
| `tool-not-found` | Alert operator | 0 (no retry) |
| `api-error` | Exponential backoff (2^n seconds) | 3 |
| `content-quality` | Retry with stricter prompt | 2 |
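
The delays in the `api-error` row follow 2^(attempt - 1) seconds (capped at 16 in the script). A quick sketch of the schedule for the three permitted retries:

```shell
# Backoff schedule used for api-error: 2^(attempt - 1) seconds per retry.
for attempt in 1 2 3; do
  secs=$(( 1 << (attempt - 1) ))
  echo "attempt $attempt: wait ${secs}s"
done
```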

**Hard cap: 5 total attempts regardless of category.** This follows the
OpenClaw pattern — unbounded retry loops are the most common cause of
runaway agent costs. The cap is non-negotiable.

After the hard cap is reached, the script exits with code 2 (escalate).
The caller is responsible for deciding whether to pause, alert a human,
or abort the pipeline run.
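
A caller loop consuming these exit codes might look like the sketch below. `run_with_healing` and the step/healer commands are hypothetical placeholders, not part of the templates; in a real pipeline the healer would be an invocation of `self-healing.sh`:

```shell
# Drive a step with healing: exit code 0 = retry, 1 = abort, 2 = escalate.
run_with_healing() {
  step_cmd="$1"; heal_cmd="$2"; attempt=1
  while ! $step_cmd; do
    $heal_cmd "$attempt"
    case $? in
      0) attempt=$((attempt + 1)) ;;       # recovery action taken: retry the step
      1) echo "abort"; return 1 ;;         # no recovery possible
      2) echo "escalate"; return 2 ;;      # hard cap reached or human review needed
    esac
  done
  echo "ok"
}

# Real healer invocation would be something like:
#   ./optimization/self-healing.sh --error-type api-error \
#     --agent agent-writer --attempt "$attempt" --context "step failed"
```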

## Connection to feedback and VFM

```
feedback-collector.sh → FEEDBACK.md → performance-scorer.sh → flagged agents
                ↓
pipeline-optimizer.sh → RECOMMENDATIONS.md
                ↓
(manual review + approval)
                ↓
prompt/pipeline update
                ↓
new runs → new feedback
```


VFM pre-scores in RECOMMENDATIONS.md use the same 0–100 scale as
`scripts/templates/proactive/VFM-SCORING.md` (Step 11). They are
pre-scores, not final scores — the VFM evaluation still needs to run
when the task is scheduled. The pre-scores help prioritize which
recommendations to tackle first.
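
To triage by pre-score, the numbers can be pulled out of `RECOMMENDATIONS.md` with a small sketch. It assumes the `**VFM pre-score:** N/100` line format the optimizer writes; `vfm_scores` is a hypothetical helper:

```shell
# vfm_scores: print VFM pre-scores found on stdin, highest first.
# Matches lines of the form: **VFM pre-score:** 75/100
vfm_scores() {
  grep '\*\*VFM pre-score:\*\*' | sed 's|[^0-9]*\([0-9][0-9]*\)/100.*|\1|' | sort -rn
}

if [ -f RECOMMENDATIONS.md ]; then vfm_scores < RECOMMENDATIONS.md; fi
```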

## Safety limits

- `pipeline-optimizer.sh`: read-only analysis — never modifies pipeline files
- `self-healing.sh`: max 5 attempts hard cap, permission errors never retried
- All events logged to `healing-log.jsonl` for audit trail
- No auto-escalation to external systems — exit codes only

## Usage

```bash
# Run optimizer for all pipelines
./optimization/pipeline-optimizer.sh

# Run optimizer for a specific pipeline
./optimization/pipeline-optimizer.sh --pipeline doc-pipeline

# Handle an error in a pipeline step
./optimization/self-healing.sh \
  --error-type api-error \
  --agent agent-writer \
  --attempt 1 \
  --context "OpenAI timeout on summarize call"

# Check healing log
python3 -m json.tool --json-lines healing-log.jsonl
```

### Verify

```bash
bash -n /Users/ktg/repos/agent-builder/scripts/templates/optimization/pipeline-optimizer.sh && bash -n /Users/ktg/repos/agent-builder/scripts/templates/optimization/self-healing.sh && echo "VALID"
```

Expected: VALID

On failure: retry — fix bash syntax, then revert

### Checkpoint

git commit -m "feat(templates): add pipeline optimization and self-healing templates"

## Exit Condition

- `ls /Users/ktg/repos/agent-builder/scripts/templates/feedback/ | wc -l` → 4
- `ls /Users/ktg/repos/agent-builder/scripts/templates/optimization/ | wc -l` → 3
- `bash -n /Users/ktg/repos/agent-builder/scripts/templates/feedback/feedback-collector.sh` → no errors
- `bash -n /Users/ktg/repos/agent-builder/scripts/templates/feedback/performance-scorer.sh` → no errors
- `bash -n /Users/ktg/repos/agent-builder/scripts/templates/optimization/pipeline-optimizer.sh` → no errors
- `bash -n /Users/ktg/repos/agent-builder/scripts/templates/optimization/self-healing.sh` → no errors
- FEEDBACK.md contains the `| Date | Pipeline | Agent | Score | Issue | Resolution | Pattern |` header row
- performance-scorer.sh computes the improvement trend (last 10 vs. prev 10)
- pipeline-optimizer.sh writes to RECOMMENDATIONS.md and does NOT modify any pipeline files
- self-healing.sh exits 2 when attempt > 5 (hard cap enforced)
- healing-log.jsonl referenced in self-healing.sh
- All bash scripts are 3.2 compatible (no associative arrays, no `mapfile`, no `|&`)

## Quality Criteria

- Feedback table columns match the 7-column spec (Date, Pipeline, Agent, Score, Issue, Resolution, Pattern)
- Pattern detection fires at exactly 3 occurrences (not 2, not 4)
- performance-scorer.sh improvement trend correctly computes last 10 vs. previous 10 scores
- pipeline-optimizer.sh detects all 4 issue types: bottleneck, loop-excess, underutilized, cost-outlier
- VFM pre-scores in RECOMMENDATIONS.md use the same 0–100 scale as VFM-SCORING.md (Step 11)
- self-healing.sh hard cap is exactly 5 (OpenClaw pattern) — not configurable
- permission-denied and tool-not-found errors are never retried (exit 1 immediately)
- api-error uses exponential backoff: 1s, 2s, 4s (2^0, 2^1, 2^2) before aborting at attempt 4
- content-quality escalates to human review (exit 2) after 2 retries, not abort
- All scripts use `#!/bin/bash` shebang and are bash 3.2 compatible