Observability — voyage v4.1
This document describes the opt-in OpenTelemetry / Prometheus export
path added in v4.1. The default JSONL stats stream
(${CLAUDE_PLUGIN_DATA}/trek*-stats.jsonl) remains unchanged — it is the
canonical event log and continues to be written regardless of OTel mode.
Overview
Voyage v4.0 wrote per-command stats to JSONL files only. Operators who
wanted dashboards had to roll their own log-pipeline. v4.1 adds a Stop-hook
called hooks/scripts/otel-export.mjs that, when activated via
VOYAGE_EXPORT_MODE, transforms the JSONL records into either a
Prometheus textfile or an OTLP/HTTP push at session end.
The hook is additive. With VOYAGE_EXPORT_MODE=off (the default), the
hook script exits silently and does no work, so your existing JSONL
workflow is untouched.
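As a rough illustration of the first step, reading JSONL records before any transformation, here is a minimal sketch. The function name `parseJsonl`, the `duration_ms` field, and the choice to skip malformed lines are assumptions made for this example; the real hook may behave differently.

```javascript
// Parse a JSONL stats stream into an array of record objects.
// Assumption: blank and malformed lines are skipped rather than aborting
// the whole export. The actual hook may handle them differently.
function parseJsonl(text) {
  const records = [];
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      const value = JSON.parse(trimmed);
      // Keep only plain objects; a bare number or string is not a record.
      if (value !== null && typeof value === "object" && !Array.isArray(value)) {
        records.push(value);
      }
    } catch {
      // Malformed line: ignore it and keep going.
    }
  }
  return records;
}

// Hypothetical sample: two valid records, one blank line, one broken line.
const sample =
  '{"command":"plan","duration_ms":1200}\n\nnot json\n{"command":"review","duration_ms":300}\n';
const parsed = parseJsonl(sample);
console.log(parsed.length); // 2
```

Skipping bad lines keeps one corrupted record from suppressing the rest of the session's metrics, which seems consistent with the best-effort design described here.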
Activating OTel export
Set VOYAGE_EXPORT_MODE in the shell before invoking any voyage command:
# Default — no export, JSONL only
unset VOYAGE_EXPORT_MODE # equivalent to VOYAGE_EXPORT_MODE=off
# Path A — Prometheus textfile (recommended for local dashboards)
export VOYAGE_EXPORT_MODE=textfile
export VOYAGE_TEXTFILE_DIR=/var/lib/node_exporter/textfile
# Path B — OTLP/HTTP push (recommended for centralized telemetry)
export VOYAGE_EXPORT_MODE=otlp
export VOYAGE_OTEL_ENDPOINT=https://otel.example.com/v1/metrics
hooks/hooks.json wires the Stop event to otel-export.mjs, so the
export runs automatically when Claude Code finishes a session. No manual
invocation is required.
Output formats
| Mode | Wire format | Endpoint shape | Cardinality cap |
|---|---|---|---|
| textfile | Prometheus exposition format (text) | local file: ${VOYAGE_TEXTFILE_DIR}/voyage.prom | low (voyage controls labels) |
| otlp | OTLP/JSON v1.0 metric ResourceMetrics | HTTPS POST: ${VOYAGE_OTEL_ENDPOINT} | low (same allowlist as textfile) |
| off | (none) | n/a | n/a |
Both formats apply the same field allowlist — see
lib/exporters/field-allowlist.mjs for the per-schema list. Fields not in
the allowlist are dropped before export. This is a CWE-212 mitigation:
operator-defined endpoints must never receive accidentally-leaked
operator-private data (paths, prompts, brief content).
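The allowlist step and the textfile output can be sketched together. Everything named here is hypothetical: the `ALLOWED_FIELDS` contents, the `voyage_duration_ms` metric name, and the sample record are illustrative only, and the real list in lib/exporters/field-allowlist.mjs is per-schema and may be shaped differently. The label escaping follows the Prometheus text exposition format.

```javascript
// Hypothetical allowlist for illustration. The real per-schema list lives
// in lib/exporters/field-allowlist.mjs and may look different.
const ALLOWED_FIELDS = new Set(["command", "schema_id", "duration_ms"]);

// Return a copy of the record containing only allowlisted fields.
function pickAllowed(record, allowed) {
  const out = {};
  for (const [key, value] of Object.entries(record)) {
    if (allowed.has(key)) out[key] = value;
  }
  return out;
}

// Escape a label value for the Prometheus text exposition format:
// backslash, double quote, and newline must be escaped.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Render one sample line: name{label="value",...} number
function toPrometheusLine(name, labels, value) {
  const pairs = Object.entries(labels)
    .map(([k, v]) => `${k}="${escapeLabelValue(v)}"`)
    .join(",");
  return `${name}{${pairs}} ${value}`;
}

// Hypothetical record with two private fields that must not be exported.
const record = {
  command: "plan",
  schema_id: "trekplan",
  duration_ms: 1200,
  prompt: "secret text",
  cwd: "/home/alice/project",
};
const safe = pickAllowed(record, ALLOWED_FIELDS);
const line = toPrometheusLine(
  "voyage_duration_ms",
  { command: safe.command, schema_id: safe.schema_id },
  safe.duration_ms
);
console.log(line);
// voyage_duration_ms{command="plan",schema_id="trekplan"} 1200
```

Filtering before rendering means the private `prompt` and `cwd` fields never reach the code that formats output, whichever export mode is active.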
Environment variables
| Variable | Default | Purpose |
|---|---|---|
| VOYAGE_EXPORT_MODE | off | One of off / textfile / otlp |
| VOYAGE_TEXTFILE_DIR | ${CLAUDE_PLUGIN_DATA} | Directory for voyage.prom (textfile mode) |
| VOYAGE_OTEL_ENDPOINT | (none) | HTTPS URL for OTLP/HTTP POST |
| VOYAGE_OTEL_ALLOW_PRIVATE | (unset) | Set to 1 to allow loopback / RFC 1918 endpoints |
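The defaults in the table can be expressed as a small config loader. This is a sketch, not the hook's actual code: `loadExportConfig` and the returned property names are hypothetical, and falling back to `off` for an unrecognized mode is an assumption (the table does not say what happens in that case).

```javascript
const VALID_MODES = ["off", "textfile", "otlp"];

// Read the four variables from an env-like object (e.g. process.env).
function loadExportConfig(env) {
  const rawMode = env.VOYAGE_EXPORT_MODE || "off";
  // Assumption: an unrecognized mode is treated as "off" so a typo never
  // triggers an export. The real hook may report an error instead.
  const mode = VALID_MODES.includes(rawMode) ? rawMode : "off";
  return {
    mode,
    // Textfile directory defaults to ${CLAUDE_PLUGIN_DATA}.
    textfileDir: env.VOYAGE_TEXTFILE_DIR || env.CLAUDE_PLUGIN_DATA || null,
    // No default endpoint: OTLP mode needs one set explicitly.
    otelEndpoint: env.VOYAGE_OTEL_ENDPOINT || null,
    // Only the literal string "1" opts in to private endpoints.
    allowPrivate: env.VOYAGE_OTEL_ALLOW_PRIVATE === "1",
  };
}

const config = loadExportConfig({
  VOYAGE_EXPORT_MODE: "textfile",
  CLAUDE_PLUGIN_DATA: "/tmp/plugin-data",
});
console.log(config);
```

Passing the environment in as an argument, rather than reading `process.env` inside the function, keeps the loader easy to exercise in tests.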
Docker Compose quickstart
A pre-pinned local stack lives at examples/observability/:
cd examples/observability
mkdir -p voyage-textfile
docker compose up -d
This brings up Prometheus, Grafana, node-exporter (textfile mode), and
otel-collector (OTLP mode) on localhost. See
examples/observability/README.md for endpoint URLs and version pins.
Stats schema
Each Voyage command emits one JSONL record per significant event. Schemas
are documented in tests/fixtures/jsonl-schemas.md (Step 1 of v4.1) and
locked by tests/lib/profile-stats-fields.test.mjs.
The exporter applies the field allowlist defined in
lib/exporters/field-allowlist.mjs. Adding a new field to the JSONL
schema does not automatically expose it in OTel — you must add it to
the allowlist explicitly. This is intentional: ${CLAUDE_PLUGIN_DATA} is
trusted local storage; OTel endpoints are operator-controlled and may be
external.
Security
The exporter is hardened against three CWE classes:
- CWE-22 (path traversal): lib/exporters/path-validator.mjs rejects
  relative paths, symlinks, and paths outside allowedRoots
  (VOYAGE_TEXTFILE_DIR and CLAUDE_PLUGIN_DATA). Tested in
  tests/hooks/otel-export-validators.test.mjs.
- CWE-918 (SSRF): lib/exporters/endpoint-validator.mjs requires HTTPS and
  blocks loopback (127.0.0.0/8) and RFC 1918 ranges (10/8, 172.16/12,
  192.168/16) unless VOYAGE_OTEL_ALLOW_PRIVATE=1 is set explicitly. Cloud
  metadata endpoints (169.254.169.254) are permanently blocked.
- CWE-212 (improper removal of sensitive information before storage or
  transfer): every record passes through lib/exporters/field-allowlist.mjs
  before any I/O. Adding a field to the JSONL stream does not expose it
  externally; operators must update the allowlist intentionally.
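The CWE-918 rules above can be sketched as follows. This is not the code in lib/exporters/endpoint-validator.mjs: `validateEndpoint` and its return shape are hypothetical, the treatment of `localhost` and `[::1]` as loopback is an assumption, and so is requiring HTTPS even when private endpoints are allowed. The sketch only inspects literal hosts; it does not resolve DNS names, so a hostname that resolves to a private address is outside its scope.

```javascript
// Parse a dotted-quad IPv4 literal into four octets, or return null.
function parseIpv4(host) {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return null;
  const octets = m.slice(1).map(Number);
  return octets.every((o) => o <= 255) ? octets : null;
}

// Classify by the ranges listed above: 127/8, 10/8, 172.16/12, 192.168/16.
function classifyIpv4([a, b]) {
  if (a === 127) return "loopback";
  if (a === 10) return "private";
  if (a === 172 && b >= 16 && b <= 31) return "private";
  if (a === 192 && b === 168) return "private";
  return "public";
}

function validateEndpoint(rawUrl, { allowPrivate = false } = {}) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return { ok: false, reason: "invalid URL" };
  }
  // Assumption: HTTPS is required even when allowPrivate is set.
  if (url.protocol !== "https:") return { ok: false, reason: "HTTPS required" };

  const host = url.hostname;
  // The metadata address is blocked regardless of allowPrivate.
  if (host === "169.254.169.254") return { ok: false, reason: "metadata endpoint blocked" };
  // Assumption: these names are treated the same as 127.0.0.0/8.
  if (host === "localhost" || host === "[::1]") {
    return allowPrivate ? { ok: true } : { ok: false, reason: "loopback blocked" };
  }
  const octets = parseIpv4(host);
  if (octets) {
    const kind = classifyIpv4(octets);
    if (kind !== "public" && !allowPrivate) return { ok: false, reason: kind + " blocked" };
  }
  return { ok: true };
}

console.log(validateEndpoint("https://otel.example.com/v1/metrics")); // { ok: true }
console.log(validateEndpoint("https://10.0.0.5/v1/metrics").reason);  // private blocked
```

Checking the metadata address before the `allowPrivate` branch is what makes that block unconditional in this sketch.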
Minimum versions per CVE history
| Component | Minimum version | Reason |
|---|---|---|
| otel/opentelemetry-collector-contrib | 0.115.0 | post-CVE-2024-42368 |
| prom/prometheus | 3.0.1 | OOM regression fix in 2.x |
| prom/node-exporter | 1.10.2 | textfile collector path normalization |
| grafana/grafana | 11.4.0 | datasource provisioning hardening |
Limitations
- Stop-hook is normal-exit only. If Claude Code crashes or is killed
  with SIGKILL, the final session's metrics are not flushed. Use
  --resume on next start to recover plan/progress state; the missing
  session will not appear in dashboards.
- Tail-latency NFR is best-effort. Textfile mode targets <5 ms p99,
  OTLP <1500 ms (AbortController guards). If the network endpoint is
  slow, the timeout fires and stats for that session are dropped; the
  hook always exits 0 to avoid blocking session shutdown.
- No retry on transport failure. Stop-hook runs at most once per
  session. If the OTLP endpoint is unreachable, that session's metrics
  are lost. Production deployments should use textfile + a robust scrape
  pipeline (node-exporter, vector, otel-collector with persistent queue)
  to handle delivery semantics.
- No per-tenant labelling. v4.1 emits flat metrics with command and
  schema_id labels only. Multi-tenant deployments needing per-user or
  per-project segmentation should layer a relabel stage in their
  collector or use external metadata.
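The timeout and no-retry behaviour described above can be sketched as a single guarded POST. The names `buildOtlpPayload` and `postWithTimeout`, the `voyage_duration_ms` metric, the `service.name` value, and the scope name are all hypothetical; only the 1500 ms budget, the AbortController guard, and the "never block shutdown" rule come from the list above. The payload follows the general OTLP/JSON metrics shape (resourceMetrics, scopeMetrics, gauge data points).

```javascript
// Build a minimal OTLP/JSON gauge payload carrying one data point.
function buildOtlpPayload(name, value, labels, timeUnixNano) {
  return {
    resourceMetrics: [{
      resource: {
        attributes: [{ key: "service.name", value: { stringValue: "voyage" } }],
      },
      scopeMetrics: [{
        scope: { name: "voyage-otel-export" }, // hypothetical scope name
        metrics: [{
          name,
          gauge: {
            dataPoints: [{
              asDouble: value,
              timeUnixNano: String(timeUnixNano),
              attributes: Object.entries(labels).map(([key, v]) => ({
                key,
                value: { stringValue: String(v) },
              })),
            }],
          },
        }],
      }],
    }],
  };
}

// POST once with a timeout. Never throws and never retries: resolves to
// true on a 2xx response, false on timeout or any transport error.
async function postWithTimeout(url, payload, timeoutMs = 1500, fetchImpl = globalThis.fetch) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
      signal: controller.signal,
    });
    return res.ok;
  } catch {
    return false; // dropped: the caller still exits 0
  } finally {
    clearTimeout(timer);
  }
}

const payload = buildOtlpPayload(
  "voyage_duration_ms", 1200, { command: "plan" }, "1700000000000000000"
);
// Stub fetch so the example runs without a network connection.
const stubFetch = async () => ({ ok: true });
postWithTimeout("https://otel.example.com/v1/metrics", payload, 1500, stubFetch)
  .then((delivered) => console.log("delivered:", delivered));
```

Because the function swallows every failure, a caller can await it and exit 0 unconditionally, which matches the rule that the hook must not block session shutdown.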
Cost-estimation disclaimer
The ROUGE-L / Jaccard / character n-gram thresholds in v4.1 are
starting points, not contractual SLAs. The brief's risk table explicitly
flags these as estimates: they were calibrated against synthetic plan
runs (Step 17) using economy and premium profiles. Real cross-tier
agreement varies by task complexity. Treat the thresholds as smoke-test
floors; tighten them in v4.2 once you have ≥10 production runs of data.
See also
- examples/observability/: local Docker Compose stack
- tests/fixtures/jsonl-schemas.md: canonical record shapes
- lib/exporters/field-allowlist.mjs: per-schema allowed fields
- hooks/scripts/otel-export.mjs: Stop-hook orchestrator