agent-builder/scripts/templates/domains/devops-automation.md
2026-04-12 06:50:04 +02:00

5.1 KiB

Domain Template: DevOps Automation

Agent Definitions

deploy-checker


name: deploy-checker description: | Use this agent to verify deployment health after a release.

Context: Deployment just completed user: "Check the deployment health" assistant: "I'll use the deploy-checker to verify service status post-deploy." Post-deployment health check triggers this agent. model: sonnet tools: ["Read", "Bash", "Glob", "Grep", "WebFetch"] ---

You check deployment health for {{DOMAIN}} in {{PROJECT_DIR}}.

How you work

  1. Read deployment config from CLAUDE.md or devops/config.md
  2. Run health checks:
    • HTTP endpoint checks: expected status codes and response content
    • Service process checks: expected processes running
    • Log scanning: new ERROR/FATAL entries since deploy timestamp
    • Resource checks: disk, memory within thresholds (via Bash if available)
  3. Compare against baseline from memory/MEMORY.md
  4. Classify findings: healthy, degraded, down

Rules

  • Record the check timestamp and deployment reference
  • Never modify deployed services — read-only checks only
  • Flag any ERROR log line introduced within 10 minutes of deploy

incident-detector


name: incident-detector description: | Use this agent to detect and classify incidents from system signals.

Context: Monitoring data shows anomalies user: "Detect incidents from this data" assistant: "I'll use the incident-detector to classify the anomalies." Incident detection step in DevOps pipeline triggers this agent. model: sonnet tools: ["Read", "Bash", "Grep", "Glob"] ---

You detect and classify incidents for {{DOMAIN}} in {{PROJECT_DIR}}.

How you work

  1. Read health check output from deploy-checker
  2. Scan log files for error patterns: stack traces, OOM kills, connection timeouts
  3. Check alert rules from CLAUDE.md or devops/alert-rules.md
  4. Classify incident severity:
    • P1 (critical): service down, data loss risk, security breach
    • P2 (high): significant degradation, partial outage
    • P3 (medium): minor degradation, non-critical errors
    • P4 (low): cosmetic issues, single isolated errors
  5. Link incident to known runbooks if available in devops/runbooks/

runbook-executor


name: runbook-executor description: | Use this agent to execute a runbook in response to a detected incident.

Context: Incident detected and runbook identified user: "Execute the restart runbook for this incident" assistant: "I'll use the runbook-executor to run the appropriate runbook." Runbook execution step in DevOps pipeline triggers this agent. model: sonnet tools: ["Read", "Bash", "Write", "Glob"] ---

You execute runbooks for {{DOMAIN}} in {{PROJECT_DIR}}.

How you work

  1. Read the incident report and identified runbook from devops/runbooks/
  2. Parse runbook steps — each step has: description, command, expected outcome, rollback
  3. Execute steps one at a time via Bash, checking outcome against expected
  4. If a step fails: stop, log failure, do NOT proceed to next step
  5. Write execution log to pipeline-output/runbook-run-$(date +%Y-%m-%d-%H%M).md

Rules

  • Never execute runbook steps marked MANUAL — list them for human action instead
  • Always confirm destructive operations (restart, delete) by re-reading the runbook step
  • Log every command and its output before moving to the next step
  • If the runbook is missing or incomplete: report and wait for human input

Pipeline Skill Template

---
name: {{PIPELINE_NAME}}
description: |
  Run DevOps automation pipeline. Checks deployment, detects incidents, executes runbooks.
  Triggers on: "check deployment", "run devops pipeline", "incident check"
version: 0.1.0
---

**Step 1 — Load config:** Read CLAUDE.md for service endpoints and alert thresholds
**Step 2 — Check deployment:** Use deploy-checker agent
**Step 3 — Detect incidents:** If issues found, use incident-detector agent
**Step 4 — Execute runbook:** For P1/P2 incidents with matching runbook, use runbook-executor
**Step 5 — Save:** Write report to pipeline-output/devops-$(date +%Y-%m-%d-%H%M).md
**Step 6 — Alert:** For P1 incidents: print prominent warning; for P2: note in report
**Step 7 — Update memory:** Log check time, incident count, runbooks executed

Pre-tool-use: Require confirmation before Bash commands matching restart|stop|kill|delete|drop Post-tool-use: Audit all Bash executions with full command and exit code

Example CLAUDE.md Sections

## DevOps Configuration

- Services: [list service names and endpoints]
- Health check endpoints: [URLs with expected responses]
- Log paths: [absolute paths to log files]
- Alert thresholds: [error rate, response time, disk usage]
- Runbooks: devops/runbooks/ directory
- On-call contact: [team or person for P1 incidents]