# Observability: voyage v4.1

This document describes the *opt-in* OpenTelemetry / Prometheus export path added in v4.1. The default JSONL stats stream (`${CLAUDE_PLUGIN_DATA}/trek*-stats.jsonl`) remains unchanged: it is the canonical event log and continues to be written regardless of OTel mode.

## Overview

Voyage v4.0 wrote per-command stats to JSONL files only. Operators who wanted dashboards had to build their own log pipeline. v4.1 adds a Stop-hook, `hooks/scripts/otel-export.mjs`, that, when activated via `VOYAGE_EXPORT_MODE`, transforms the JSONL records into either a Prometheus textfile or an OTLP/HTTP push at session end.

The hook is *additive*. With `VOYAGE_EXPORT_MODE=off` (the default), the hook exits silently and does no work, so your existing JSONL workflow is untouched.

## Activating OTel export

Set `VOYAGE_EXPORT_MODE` in the shell before invoking any voyage command:

```bash
# Default: no export, JSONL only
unset VOYAGE_EXPORT_MODE   # equivalent to VOYAGE_EXPORT_MODE=off

# Path A: Prometheus textfile (recommended for local dashboards)
export VOYAGE_EXPORT_MODE=textfile
export VOYAGE_TEXTFILE_DIR=/var/lib/node_exporter/textfile

# Path B: OTLP/HTTP push (recommended for centralized telemetry)
export VOYAGE_EXPORT_MODE=otlp
export VOYAGE_OTEL_ENDPOINT=https://otel.example.com/v1/metrics
```

`hooks/hooks.json` wires the Stop event to `otel-export.mjs`, so the export runs automatically when Claude Code finishes a session. No manual invocation is required.
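
To make the mode selection concrete, the following is a minimal sketch of how a Stop-hook might turn these environment variables into an export plan. The helper names `resolveExportMode` and `planExport` are hypothetical and are not taken from `otel-export.mjs`; falling back to `off` for unrecognised values is likewise an assumption, chosen so that a typo never triggers an export.

```javascript
// Illustrative sketch only; the real otel-export.mjs may be structured differently.
const EXPORT_MODES = ['off', 'textfile', 'otlp'];

// Hypothetical helper: maps VOYAGE_EXPORT_MODE to one of the three documented modes.
// Assumption: unset or unrecognised values are treated as 'off'.
function resolveExportMode(env) {
  const raw = env.VOYAGE_EXPORT_MODE ?? 'off';
  return EXPORT_MODES.includes(raw) ? raw : 'off';
}

// Hypothetical helper: decides where the export would go for the resolved mode.
function planExport(env) {
  const mode = resolveExportMode(env);
  if (mode === 'textfile') {
    // VOYAGE_TEXTFILE_DIR defaults to CLAUDE_PLUGIN_DATA when unset.
    const dir = env.VOYAGE_TEXTFILE_DIR ?? env.CLAUDE_PLUGIN_DATA;
    return { mode, target: `${dir}/voyage.prom` };
  }
  if (mode === 'otlp') {
    // No default endpoint exists, so the target may be null.
    return { mode, target: env.VOYAGE_OTEL_ENDPOINT ?? null };
  }
  return { mode: 'off', target: null };
}
```

Calling `planExport(process.env)` at the top of a hook would let it return immediately when the mode is `off`, which matches the additive behaviour described in the overview.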
## Output formats

| Mode | Wire format | Endpoint shape | Cardinality cap |
|------|-------------|----------------|-----------------|
| `textfile` | Prometheus exposition format (text) | local file: `${VOYAGE_TEXTFILE_DIR}/voyage.prom` | low (voyage controls labels) |
| `otlp` | OTLP/JSON v1.0 metric ResourceMetrics | HTTPS POST: `${VOYAGE_OTEL_ENDPOINT}` | low (same allowlist as textfile) |
| `off` | (none) | n/a | n/a |

Both formats apply the **same field allowlist**; see `lib/exporters/field-allowlist.mjs` for the per-schema list. Fields not in the allowlist are dropped before export. This is a CWE-212 mitigation: operator-defined endpoints must never receive accidentally-leaked operator-private data (paths, prompts, brief content).

## Environment variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `VOYAGE_EXPORT_MODE` | `off` | One of `off` / `textfile` / `otlp` |
| `VOYAGE_TEXTFILE_DIR` | `${CLAUDE_PLUGIN_DATA}` | Directory for `voyage.prom` (textfile mode) |
| `VOYAGE_OTEL_ENDPOINT` | _(none)_ | HTTPS URL for OTLP/HTTP POST |
| `VOYAGE_OTEL_ALLOW_PRIVATE` | _(unset)_ | Set to `1` to allow loopback / RFC1918 endpoints |

## Docker Compose quickstart

A pre-pinned local stack lives at `examples/observability/`:

```bash
cd examples/observability
mkdir -p voyage-textfile
docker compose up -d
```

This brings up Prometheus, Grafana, node-exporter (textfile mode), and otel-collector (OTLP mode) on `localhost`. See `examples/observability/README.md` for endpoint URLs and version pins.

## Stats schema

Each Voyage command emits one JSONL record per significant event. Schemas are documented in `tests/fixtures/jsonl-schemas.md` (Step 1 of v4.1) and locked by `tests/lib/profile-stats-fields.test.mjs`.

The exporter applies the field allowlist defined in `lib/exporters/field-allowlist.mjs`. Adding a new field to the JSONL schema does **not** automatically expose it in OTel; you must add it to the allowlist explicitly.
This is intentional: `${CLAUDE_PLUGIN_DATA}` is trusted local storage; OTel endpoints are operator-controlled and may be external.

## Security

The exporter is hardened against three CWE classes:

- **CWE-22 (path traversal)**: `lib/exporters/path-validator.mjs` rejects relative paths, symlinks, and paths outside `allowedRoots` (`VOYAGE_TEXTFILE_DIR` and `CLAUDE_PLUGIN_DATA`). Tested in `tests/hooks/otel-export-validators.test.mjs`.
- **CWE-918 (SSRF)**: `lib/exporters/endpoint-validator.mjs` requires HTTPS and blocks loopback (127.0.0.0/8) and RFC1918 (10/8, 172.16/12, 192.168/16) addresses unless `VOYAGE_OTEL_ALLOW_PRIVATE=1` is set explicitly. Cloud metadata endpoints (169.254.169.254) are permanently blocked.
- **CWE-212 (improper removal of sensitive information before storage or transfer)**: every record passes through `lib/exporters/field-allowlist.mjs` before any I/O. Adding a field to the JSONL stream does not expose it externally; operators must update the allowlist intentionally.

### Minimum versions per CVE history

| Component | Minimum version | Reason |
|-----------|-----------------|--------|
| `otel/opentelemetry-collector-contrib` | `0.115.0` | post-CVE-2024-42368 |
| `prom/prometheus` | `3.0.1` | OOM regression fix in 2.x |
| `prom/node-exporter` | `1.10.2` | textfile collector path normalization |
| `grafana/grafana` | `11.4.0` | datasource provisioning hardening |

## Limitations

- **Stop-hook is normal-exit only.** If Claude Code crashes or is killed with SIGKILL, the final session's metrics are not flushed. Use `--resume` on next start to recover plan/progress state; the missing session will not appear in dashboards.
- **Tail-latency NFR is best-effort.** Textfile mode targets <5 ms p99, OTLP <1500 ms (AbortController guards). If the network endpoint is slow, the timeout fires and stats for that session are dropped; the hook always exits 0 to avoid blocking session shutdown.
- **No retry on transport failure.** Stop-hook runs at most once per session.
  If the OTLP endpoint is unreachable, that session's metrics are lost. Production deployments should use `textfile` + a robust scrape pipeline (node-exporter, vector, otel-collector with persistent queue) to handle delivery semantics.
- **No per-tenant labelling.** v4.1 emits flat metrics with command and schema_id labels only. Multi-tenant deployments needing per-user or per-project segmentation should layer a relabel stage in their collector or use external metadata.

## Cost-estimation disclaimer

The ROUGE-L / Jaccard / character n-gram thresholds in v4.1 are *starting points*, not contractual SLAs. The brief's Risk table explicitly flags these as estimates: they were calibrated against synthetic plan runs (Step 17) using `economy` and `premium` profiles. Real cross-tier agreement varies by task complexity. Treat the thresholds as smoke-test floors; tighten them in v4.2 once you have ≥10 production runs of data.

## See also

- `examples/observability/`: local Docker Compose stack
- `tests/fixtures/jsonl-schemas.md`: canonical record shapes
- `lib/exporters/field-allowlist.mjs`: per-schema allowed fields
- `hooks/scripts/otel-export.mjs`: Stop-hook orchestrator
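
As a companion to the Security section, the endpoint rules listed there (HTTPS only, loopback and RFC1918 addresses blocked unless `VOYAGE_OTEL_ALLOW_PRIVATE=1`, the metadata address always blocked) can be sketched roughly as follows. The function name `checkEndpoint`, its return shape, and the reason strings are hypothetical and are not taken from `lib/exporters/endpoint-validator.mjs`. This sketch only inspects IPv4 literals: it does not resolve DNS names, treat `localhost` specially, or handle IPv6, all of which a production validator would likely need.

```javascript
// Illustrative sketch only; not the contents of lib/exporters/endpoint-validator.mjs.
const METADATA_HOST = '169.254.169.254';

// Returns true for loopback (127.0.0.0/8) and RFC1918 IPv4 literals.
function isPrivateIPv4(host) {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false; // not an IPv4 literal (DNS name or IPv6)
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 127 ||
    a === 10 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

// Hypothetical validator: returns { ok, reason } instead of throwing.
function checkEndpoint(rawUrl, env) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return { ok: false, reason: 'invalid-url' };
  }
  if (url.protocol !== 'https:') {
    return { ok: false, reason: 'https-required' };
  }
  // The metadata address is rejected before the opt-in is consulted,
  // so VOYAGE_OTEL_ALLOW_PRIVATE cannot unblock it.
  if (url.hostname === METADATA_HOST) {
    return { ok: false, reason: 'metadata-blocked' };
  }
  const allowPrivate = env.VOYAGE_OTEL_ALLOW_PRIVATE === '1';
  if (isPrivateIPv4(url.hostname) && !allowPrivate) {
    return { ok: false, reason: 'private-address' };
  }
  return { ok: true, reason: null };
}
```

Under these assumptions, `checkEndpoint('https://127.0.0.1:4318/v1/metrics', {})` is rejected, while the same URL passes once `VOYAGE_OTEL_ALLOW_PRIVATE` is `'1'`. Checking the metadata address first is the design choice that keeps it blocked regardless of the opt-in.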