From 7e60b28c8da7eed5f792b7a663a2bb71314e52c3 Mon Sep 17 00:00:00 2001
From: Kjell Tore Guttormsen
Date: Sat, 9 May 2026 09:51:44 +0200
Subject: [PATCH] =?UTF-8?q?docs(voyage):=20add=20docs/observability.md=20?=
 =?UTF-8?q?=E2=80=94=20operator=20quickstart=20for=20v4.1=20OTel=20export?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Step 15 of v4.1 — operator-facing observability docs (151 lines, target ≥80).

Sections:
- Overview (JSONL is default, OTel is opt-in)
- Activating OTel export (VOYAGE_EXPORT_MODE)
- Output formats (Prometheus textfile vs OTLP/HTTP)
- Environment variables matrix
- Docker Compose quickstart (cross-link to examples/observability/)
- Stats schema (cross-link to tests/fixtures/jsonl-schemas.md)
- Security (CWE-22, CWE-918, CWE-212 mitigations + min-versions per CVE)
- Limitations (Stop-hook normal-exit only, no retry, NFR best-effort)
- Cost-estimation disclaimer (per the brief's risk table)
---
 plugins/voyage/docs/observability.md | 151 +++++++++++++++++++++++++++
 1 file changed, 151 insertions(+)
 create mode 100644 plugins/voyage/docs/observability.md

diff --git a/plugins/voyage/docs/observability.md b/plugins/voyage/docs/observability.md
new file mode 100644
index 0000000..1eea08b
--- /dev/null
+++ b/plugins/voyage/docs/observability.md
@@ -0,0 +1,151 @@
+# Observability — voyage v4.1
+
+This document describes the *opt-in* OpenTelemetry / Prometheus export
+path added in v4.1. The default JSONL stats stream
+(`${CLAUDE_PLUGIN_DATA}/trek*-stats.jsonl`) remains unchanged — it is the
+canonical event log and continues to be written regardless of OTel mode.
+
+## Overview
+
+Voyage v4.0 wrote per-command stats to JSONL files only. Operators who
+wanted dashboards had to roll their own log pipeline. v4.1 adds a Stop-hook
+called `hooks/scripts/otel-export.mjs` that, when activated via
+`VOYAGE_EXPORT_MODE`, transforms the JSONL records into either a
+Prometheus textfile or an OTLP/HTTP push at session end.
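
As a rough illustration of the JSONL-to-textfile transformation described above, the sketch below counts records per command and renders them in Prometheus exposition format. It is not the code in `otel-export.mjs`: the function name `renderTextfile`, the metric name `voyage_events_total`, and the `command` field are assumptions made for the example.

```javascript
// Hypothetical sketch of JSONL -> Prometheus exposition text.
// The metric name and the `command` field are illustrative assumptions,
// not the names used by hooks/scripts/otel-export.mjs.
function renderTextfile(jsonl) {
  const counts = new Map();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let record;
    try {
      record = JSON.parse(line);
    } catch {
      continue; // skip malformed lines rather than abort the export
    }
    if (record === null || typeof record !== "object") continue;
    const command = String(record.command ?? "unknown");
    counts.set(command, (counts.get(command) ?? 0) + 1);
  }

  const lines = [
    "# HELP voyage_events_total Number of JSONL stats records seen.",
    "# TYPE voyage_events_total counter",
  ];
  // Sort by command so the output is stable between runs.
  const sorted = [...counts].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  for (const [command, n] of sorted) {
    // Escape backslash, double quote, and newline in the label value.
    const label = command
      .replace(/\\/g, "\\\\")
      .replace(/"/g, '\\"')
      .replace(/\n/g, "\\n");
    lines.push(`voyage_events_total{command="${label}"} ${n}`);
  }
  return lines.join("\n") + "\n";
}

console.log(renderTextfile('{"command":"plan"}\n{"command":"brief"}\n'));
```

Skipping malformed lines instead of throwing keeps the sketch in line with a hook that must not block session shutdown.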
+
+The hook is *additive*. With `VOYAGE_EXPORT_MODE=off` (default), the
+hook exits silently and no work is done — your existing JSONL workflow
+is untouched.
+
+## Activating OTel export
+
+Set `VOYAGE_EXPORT_MODE` in the shell before invoking any voyage command:
+
+```bash
+# Default — no export, JSONL only
+unset VOYAGE_EXPORT_MODE  # equivalent to VOYAGE_EXPORT_MODE=off
+
+# Path A — Prometheus textfile (recommended for local dashboards)
+export VOYAGE_EXPORT_MODE=textfile
+export VOYAGE_TEXTFILE_DIR=/var/lib/node_exporter/textfile
+
+# Path B — OTLP/HTTP push (recommended for centralized telemetry)
+export VOYAGE_EXPORT_MODE=otlp
+export VOYAGE_OTEL_ENDPOINT=https://otel.example.com/v1/metrics
+```
+
+`hooks/hooks.json` wires the Stop event to `otel-export.mjs`, so the
+export runs automatically when Claude Code finishes a session. No manual
+invocation is required.
+
+## Output formats
+
+| Mode | Wire format | Endpoint shape | Cardinality cap |
+|------|-------------|----------------|-----------------|
+| `textfile` | Prometheus exposition format (text) | local file: `${VOYAGE_TEXTFILE_DIR}/voyage.prom` | low — voyage controls labels |
+| `otlp` | OTLP/JSON v1.0 metric ResourceMetrics | HTTPS POST: `${VOYAGE_OTEL_ENDPOINT}` | low — same allowlist as textfile |
+| `off` | (none) | — | — |
+
+Both formats apply the **same field allowlist** — see
+`lib/exporters/field-allowlist.mjs` for the per-schema list. Fields not in
+the allowlist are dropped before export. This is a CWE-212 mitigation:
+operator-defined endpoints must never receive accidentally-leaked
+operator-private data (paths, prompts, brief content).
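
The allowlist behaviour described above can be sketched as a plain field filter. This is a minimal illustration under stated assumptions, not the contents of `lib/exporters/field-allowlist.mjs`: the schema id `trek-stats`, the field names, and the function name `filterRecord` are all hypothetical.

```javascript
// Hypothetical allowlist: schema id -> exportable field names.
// The real per-schema list lives in lib/exporters/field-allowlist.mjs
// and may look different.
const ALLOWLIST = {
  "trek-stats": ["command", "duration_ms", "exit_code"],
};

// Return a copy of `record` that contains only allowlisted fields.
// Unknown schemas map to an empty list, so nothing is exported by default.
function filterRecord(schemaId, record) {
  const allowed = ALLOWLIST[schemaId] ?? [];
  const out = {};
  for (const key of allowed) {
    if (Object.prototype.hasOwnProperty.call(record, key)) {
      out[key] = record[key];
    }
  }
  return out;
}

// `cwd` and `prompt` stand in for operator-private data; both are dropped.
const raw = {
  command: "plan",
  duration_ms: 1234,
  exit_code: 0,
  cwd: "/home/operator/project",
  prompt: "private text",
};
console.log(JSON.stringify(filterRecord("trek-stats", raw)));
```

Building the output from the allowlist, rather than deleting known-bad keys from the record, means a field that nobody anticipated is dropped instead of exported.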
+
+## Environment variables
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `VOYAGE_EXPORT_MODE` | `off` | One of `off` / `textfile` / `otlp` |
+| `VOYAGE_TEXTFILE_DIR` | `${CLAUDE_PLUGIN_DATA}` | Directory for `voyage.prom` (textfile mode) |
+| `VOYAGE_OTEL_ENDPOINT` | _(none)_ | HTTPS URL for OTLP/HTTP POST |
+| `VOYAGE_OTEL_ALLOW_PRIVATE` | _(unset)_ | Set to `1` to allow loopback / RFC1918 endpoints |
+
+## Docker Compose quickstart
+
+A pre-pinned local stack lives at `examples/observability/`:
+
+```bash
+cd examples/observability
+mkdir -p voyage-textfile
+docker compose up -d
+```
+
+This brings up Prometheus, Grafana, node-exporter (textfile mode), and
+otel-collector (OTLP mode) on `localhost`. See
+`examples/observability/README.md` for endpoint URLs and version pins.
+
+## Stats schema
+
+Each Voyage command emits one JSONL record per significant event. Schemas
+are documented in `tests/fixtures/jsonl-schemas.md` (Step 1 of v4.1) and
+locked by `tests/lib/profile-stats-fields.test.mjs`.
+
+The exporter applies the field allowlist defined in
+`lib/exporters/field-allowlist.mjs`. Adding a new field to the JSONL
+schema does **not** automatically expose it in OTel — you must add it to
+the allowlist explicitly. This is intentional: `${CLAUDE_PLUGIN_DATA}` is
+trusted local storage; OTel endpoints are operator-controlled and may be
+external.
+
+## Security
+
+The exporter is hardened against three CWE classes:
+
+- **CWE-22 (path traversal)** — `lib/exporters/path-validator.mjs`
+  rejects relative paths, symlinks, and paths outside `allowedRoots`
+  (`VOYAGE_TEXTFILE_DIR` and `CLAUDE_PLUGIN_DATA`). Tested in
+  `tests/hooks/otel-export-validators.test.mjs`.
+- **CWE-918 (SSRF)** — `lib/exporters/endpoint-validator.mjs` requires
+  HTTPS, blocks loopback (127.0.0.0/8) and RFC1918 (10/8, 172.16/12,
+  192.168/16), unless `VOYAGE_OTEL_ALLOW_PRIVATE=1` is set explicitly.
+  Cloud metadata endpoints (169.254.169.254) are permanently blocked.
+- **CWE-212 (improper removal of sensitive information before storage or
+  transfer)** — every record passes through
+  `lib/exporters/field-allowlist.mjs` before any I/O. Adding a field to
+  the JSONL stream does not expose it externally; operators must update
+  the allowlist intentionally.
+
+### Minimum versions per CVE history
+
+| Component | Minimum version | Reason |
+|-----------|-----------------|--------|
+| `otel/opentelemetry-collector-contrib` | `0.115.0` | post-CVE-2024-42368 |
+| `prom/prometheus` | `3.0.1` | OOM regression fix in 2.x |
+| `prom/node-exporter` | `1.10.2` | textfile collector path normalization |
+| `grafana/grafana` | `11.4.0` | datasource provisioning hardening |
+
+## Limitations
+
+- **Stop-hook is normal-exit only.** If Claude Code crashes or is killed
+  with SIGKILL, the final session's metrics are not flushed. Use
+  `--resume` on next start to recover plan/progress state; the missing
+  session will not appear in dashboards.
+- **Tail-latency NFR is best-effort.** Textfile mode targets <5 ms p99,
+  OTLP <1500 ms (AbortController guards). If the network endpoint is
+  slow, the timeout fires and stats for that session are dropped — the
+  hook always exits 0 to avoid blocking session shutdown.
+- **No retry on transport failure.** Stop-hook runs at most once per
+  session. If the OTLP endpoint is unreachable, that session's metrics
+  are lost. Production deployments should use `textfile` + a robust
+  scrape pipeline (node-exporter, vector, otel-collector with persistent
+  queue) to handle delivery semantics.
+- **No per-tenant labelling.** v4.1 emits flat metrics with command and
+  schema_id labels only. Multi-tenant deployments needing per-user or
+  per-project segmentation should layer a relabel stage in their
+  collector or use external metadata.
+
+## Cost-estimation disclaimer
+
+The ROUGE-L / Jaccard / character n-gram thresholds in v4.1 are
+*starting points*, not contractual SLAs.
+The brief's risk table explicitly
+flags these as estimates; they were calibrated against synthetic plan
+runs (Step 17) using `economy` and `premium` profiles. Real cross-tier
+agreement varies by task complexity. Treat the thresholds as smoke-test
+floors; tighten them in v4.2 once you have ≥10 production runs of data.
+
+## See also
+
+- `examples/observability/` — local Docker Compose stack
+- `tests/fixtures/jsonl-schemas.md` — canonical record shapes
+- `lib/exporters/field-allowlist.mjs` — per-schema allowed fields
+- `hooks/scripts/otel-export.mjs` — Stop-hook orchestrator
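
The endpoint rules listed under Security (HTTPS only, loopback and RFC1918 blocked unless `VOYAGE_OTEL_ALLOW_PRIVATE=1`, metadata address always blocked) can be sketched roughly as follows. This is a simplified illustration, not the code in `lib/exporters/endpoint-validator.mjs`: the names `classifyHost` and `checkEndpoint` are hypothetical, and the sketch only inspects literal IPv4 hosts and `localhost`, without resolving DNS names or handling IPv6.

```javascript
// Hypothetical sketch; not the actual lib/exporters/endpoint-validator.mjs.
// Only literal IPv4 hosts and "localhost" are classified. DNS names are not
// resolved and IPv6 is not handled, which a real validator would need.
function classifyHost(hostname) {
  if (hostname === "localhost") return "loopback";
  if (hostname === "169.254.169.254") return "metadata";
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(hostname);
  if (!m) return "public";
  const a = Number(m[1]);
  const b = Number(m[2]);
  if (a === 127) return "loopback";                      // 127.0.0.0/8
  if (a === 10) return "private";                        // 10/8
  if (a === 172 && b >= 16 && b <= 31) return "private"; // 172.16/12
  if (a === 192 && b === 168) return "private";          // 192.168/16
  return "public";
}

// `allowPrivate` models VOYAGE_OTEL_ALLOW_PRIVATE=1.
function checkEndpoint(rawUrl, allowPrivate = false) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return { ok: false, reason: "not a valid URL" };
  }
  if (url.protocol !== "https:") {
    return { ok: false, reason: "HTTPS required" };
  }
  const kind = classifyHost(url.hostname);
  if (kind === "metadata") {
    // Never overridable, even with allowPrivate.
    return { ok: false, reason: "cloud metadata endpoint is always blocked" };
  }
  if ((kind === "loopback" || kind === "private") && !allowPrivate) {
    return { ok: false, reason: kind + " address requires VOYAGE_OTEL_ALLOW_PRIVATE=1" };
  }
  return { ok: true };
}

console.log(checkEndpoint("https://otel.example.com/v1/metrics"));
console.log(checkEndpoint("https://127.0.0.1:4318/v1/metrics"));
```

Checking the metadata address before the `allowPrivate` override is what keeps that block permanent in this sketch.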