Step 15 of v4.1: operator-facing observability docs (151 lines, target ≥80). Sections:

- Overview (JSONL is default, OTel is opt-in)
- Activating OTel export (VOYAGE_EXPORT_MODE)
- Output formats (Prometheus textfile vs OTLP/HTTP)
- Environment variables matrix
- Docker Compose quickstart (cross-link to examples/observability/)
- Stats schema (cross-link to tests/fixtures/jsonl-schemas.md)
- Security (CWE-22, CWE-918, CWE-212 mitigations + min-versions per CVE)
- Limitations (Stop-hook normal-exit only, no retry, NFR best-effort)
- Cost-estimation disclaimer (per the brief's risk table)
# Observability — voyage v4.1

This document describes the *opt-in* OpenTelemetry / Prometheus export
path added in v4.1. The default JSONL stats stream
(`${CLAUDE_PLUGIN_DATA}/trek*-stats.jsonl`) remains unchanged: it is the
canonical event log and continues to be written regardless of OTel mode.

## Overview

Voyage v4.0 wrote per-command stats to JSONL files only. Operators who
wanted dashboards had to build their own log pipeline. v4.1 adds a Stop
hook, `hooks/scripts/otel-export.mjs`, that, when activated via
`VOYAGE_EXPORT_MODE`, transforms the JSONL records into either a
Prometheus textfile or an OTLP/HTTP push at session end.

The hook is *additive*. With `VOYAGE_EXPORT_MODE=off` (the default), the
script exits silently and does no work, so your existing JSONL workflow
is untouched.

## Activating OTel export

Set `VOYAGE_EXPORT_MODE` in the shell before invoking any voyage command:

```bash
# Default: no export, JSONL only
unset VOYAGE_EXPORT_MODE # equivalent to VOYAGE_EXPORT_MODE=off

# Path A: Prometheus textfile (recommended for local dashboards)
export VOYAGE_EXPORT_MODE=textfile
export VOYAGE_TEXTFILE_DIR=/var/lib/node_exporter/textfile

# Path B: OTLP/HTTP push (recommended for centralized telemetry)
export VOYAGE_EXPORT_MODE=otlp
export VOYAGE_OTEL_ENDPOINT=https://otel.example.com/v1/metrics
```

`hooks/hooks.json` wires the Stop event to `otel-export.mjs`, so the
export runs automatically when Claude Code finishes a session. No manual
invocation is required.

## Output formats

| Mode | Wire format | Endpoint shape | Cardinality cap |
|------|-------------|----------------|-----------------|
| `textfile` | Prometheus exposition format (text) | local file: `${VOYAGE_TEXTFILE_DIR}/voyage.prom` | low (voyage controls labels) |
| `otlp` | OTLP/JSON v1.0 metric ResourceMetrics | HTTPS POST: `${VOYAGE_OTEL_ENDPOINT}` | low (same allowlist as textfile) |
| `off` | (none) | n/a | n/a |
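
To make the `textfile` wire format concrete, here is a minimal sketch of rendering one sample in the Prometheus text exposition format. The metric name, help text, and label values are hypothetical, and the functions are illustrative rather than the exporter's actual code; only the `command` and `schema_id` label names come from this document.

```javascript
// Escape a label value per the Prometheus text exposition format:
// backslash, double quote, and newline must be escaped.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Render one gauge sample preceded by its HELP and TYPE metadata lines.
// The metric name and help text are supplied by the caller (hypothetical here).
function renderGauge(name, help, labels, value) {
  const labelText = Object.entries(labels)
    .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
    .join(",");
  return [
    `# HELP ${name} ${help}`,
    `# TYPE ${name} gauge`,
    `${name}{${labelText}} ${value}`,
    "", // the exposition format expects a trailing newline
  ].join("\n");
}

const sample = renderGauge(
  "voyage_command_duration_seconds", // hypothetical metric name
  "Wall-clock duration of the last voyage command.",
  { command: "plan", schema_id: "example-stats" }, // hypothetical label values
  12.5,
);
console.log(sample);
```

Keeping the label set small and fixed, as in this sketch, is what keeps cardinality low in either mode.
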

Both formats apply the **same field allowlist**; see
`lib/exporters/field-allowlist.mjs` for the per-schema list. Fields not in
the allowlist are dropped before export. This is a CWE-212 mitigation:
operator-defined endpoints must never receive accidentally leaked
operator-private data (paths, prompts, brief content).
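
The allowlist idea can be sketched as a plain "copy only what is named" filter. This is an illustration under stated assumptions, not the contents of `lib/exporters/field-allowlist.mjs`: the schema id, the field names, and the `applyAllowlist` helper are all hypothetical.

```javascript
// Hypothetical allowlist keyed by schema id. The real per-schema lists
// live in lib/exporters/field-allowlist.mjs and may look different.
const FIELD_ALLOWLIST = {
  "example-stats": ["command", "schema_id", "duration_ms"],
};

// Return a copy of `record` containing only allowlisted fields.
// Unknown schemas yield an empty object, so nothing is exported by default.
function applyAllowlist(record, allowlist = FIELD_ALLOWLIST) {
  const allowed = Object.hasOwn(allowlist, record.schema_id)
    ? allowlist[record.schema_id]
    : [];
  const out = {};
  for (const field of allowed) {
    if (Object.hasOwn(record, field)) out[field] = record[field];
  }
  return out;
}

const exported = applyAllowlist({
  schema_id: "example-stats",
  command: "plan",
  duration_ms: 1250,
  brief_path: "/home/me/private-brief.md", // not allowlisted, so it is dropped
});
console.log(exported);
```

Building a new object from the allowlist, rather than deleting known-sensitive keys from the record, means a field added to the JSONL schema later stays private until someone lists it.
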

## Environment variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `VOYAGE_EXPORT_MODE` | `off` | One of `off` / `textfile` / `otlp` |
| `VOYAGE_TEXTFILE_DIR` | `${CLAUDE_PLUGIN_DATA}` | Directory for `voyage.prom` (textfile mode) |
| `VOYAGE_OTEL_ENDPOINT` | _(none)_ | HTTPS URL for OTLP/HTTP POST |
| `VOYAGE_OTEL_ALLOW_PRIVATE` | _(unset)_ | Set to `1` to allow loopback / RFC1918 endpoints |
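
The matrix above maps naturally onto a small config reader. The sketch below is one possible reading of the table, not the hook's actual code: `readExportConfig` and the shape of its return value are hypothetical, and treating an unrecognized mode as `off` is an assumption.

```javascript
const EXPORT_MODES = ["off", "textfile", "otlp"];

// Build an export config from an env-like object (for example process.env).
// Assumption: any unrecognized or empty mode falls back to "off".
function readExportConfig(env) {
  const rawMode = env.VOYAGE_EXPORT_MODE ?? "off";
  const mode = EXPORT_MODES.includes(rawMode) ? rawMode : "off";
  return {
    mode,
    // Defaults to the plugin data directory, as in the table above.
    textfileDir: env.VOYAGE_TEXTFILE_DIR ?? env.CLAUDE_PLUGIN_DATA ?? null,
    otelEndpoint: env.VOYAGE_OTEL_ENDPOINT ?? null,
    // Only the literal string "1" enables private endpoints.
    allowPrivate: env.VOYAGE_OTEL_ALLOW_PRIVATE === "1",
  };
}

const config = readExportConfig({
  VOYAGE_EXPORT_MODE: "textfile",
  CLAUDE_PLUGIN_DATA: "/home/me/.claude/plugin-data", // hypothetical path
});
console.log(config);
```

Passing the environment in as an argument, instead of reading `process.env` inside the function, keeps the reader easy to test.
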

## Docker Compose quickstart

A pre-pinned local stack lives at `examples/observability/`:

```bash
cd examples/observability
mkdir -p voyage-textfile
docker compose up -d
```

This brings up Prometheus, Grafana, node-exporter (textfile mode), and
otel-collector (OTLP mode) on `localhost`. See
`examples/observability/README.md` for endpoint URLs and version pins.

## Stats schema

Each Voyage command emits one JSONL record per significant event. Schemas
are documented in `tests/fixtures/jsonl-schemas.md` (Step 1 of v4.1) and
locked by `tests/lib/profile-stats-fields.test.mjs`.

The exporter applies the field allowlist defined in
`lib/exporters/field-allowlist.mjs`. Adding a new field to the JSONL
schema does **not** automatically expose it in OTel; you must add it to
the allowlist explicitly. This is intentional: `${CLAUDE_PLUGIN_DATA}` is
trusted local storage; OTel endpoints are operator-controlled and may be
external.

## Security

The exporter is hardened against three CWE classes:

- **CWE-22 (path traversal)**: `lib/exporters/path-validator.mjs`
  rejects relative paths, symlinks, and paths outside `allowedRoots`
  (`VOYAGE_TEXTFILE_DIR` and `CLAUDE_PLUGIN_DATA`). Tested in
  `tests/hooks/otel-export-validators.test.mjs`.
- **CWE-918 (SSRF)**: `lib/exporters/endpoint-validator.mjs` requires
  HTTPS and blocks loopback (127.0.0.0/8) and RFC1918 (10/8, 172.16/12,
  192.168/16) addresses unless `VOYAGE_OTEL_ALLOW_PRIVATE=1` is set
  explicitly. Cloud metadata endpoints (169.254.169.254) are permanently
  blocked.
- **CWE-212 (improper removal of sensitive information before storage or
  transfer)**: every record passes through
  `lib/exporters/field-allowlist.mjs` before any I/O. Adding a field to
  the JSONL stream does not expose it externally; operators must update
  the allowlist intentionally.
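
The SSRF rules in the list above can be sketched as a small URL check. This is a simplified illustration, not the code in `lib/exporters/endpoint-validator.mjs`: `validateEndpoint` and its return shape are hypothetical. It inspects only literal IPv4 hosts plus the name `localhost`, does not resolve DNS or handle IPv6, and assumes HTTPS stays mandatory even when private addresses are allowed.

```javascript
// Parse a dotted-quad IPv4 literal into four octets, or return null.
function parseIPv4(host) {
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!match) return null;
  const octets = match.slice(1).map(Number);
  return octets.every((octet) => octet <= 255) ? octets : null;
}

// Hypothetical validator: returns { ok: true } or { ok: false, reason }.
function validateEndpoint(rawUrl, { allowPrivate = false } = {}) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return { ok: false, reason: "invalid URL" };
  }
  if (url.protocol !== "https:") return { ok: false, reason: "HTTPS required" };

  const host = url.hostname;
  // The metadata address is blocked even when private endpoints are allowed.
  if (host === "169.254.169.254") return { ok: false, reason: "metadata endpoint" };

  const ip = parseIPv4(host);
  const isLoopback = host === "localhost" || (ip !== null && ip[0] === 127);
  const isRfc1918 =
    ip !== null &&
    (ip[0] === 10 ||
      (ip[0] === 172 && ip[1] >= 16 && ip[1] <= 31) ||
      (ip[0] === 192 && ip[1] === 168));
  if ((isLoopback || isRfc1918) && !allowPrivate) {
    return { ok: false, reason: "private address" };
  }
  return { ok: true };
}

console.log(validateEndpoint("https://otel.example.com/v1/metrics"));
console.log(validateEndpoint("https://10.0.0.5/v1/metrics"));
console.log(validateEndpoint("https://10.0.0.5/v1/metrics", { allowPrivate: true }));
```

A literal-only check like this one does not cover a public hostname that resolves to a private address; a stricter validator would also need to check the resolved address.
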

### Minimum versions per CVE history

| Component | Minimum version | Reason |
|-----------|-----------------|--------|
| `otel/opentelemetry-collector-contrib` | `0.115.0` | post-CVE-2024-42368 |
| `prom/prometheus` | `3.0.1` | OOM regression fix in 2.x |
| `prom/node-exporter` | `1.10.2` | textfile collector path normalization |
| `grafana/grafana` | `11.4.0` | datasource provisioning hardening |

## Limitations

- **Stop-hook is normal-exit only.** If Claude Code crashes or is killed
  with SIGKILL, the final session's metrics are not flushed. Use
  `--resume` on next start to recover plan/progress state; the missing
  session will not appear in dashboards.
- **The tail-latency target (a non-functional requirement) is
  best-effort.** Textfile mode targets <5 ms p99; OTLP mode targets
  <1500 ms, guarded by an `AbortController` timeout. If the network
  endpoint is slow, the timeout fires and stats for that session are
  dropped. The hook always exits 0 to avoid blocking session shutdown.
- **No retry on transport failure.** The Stop hook runs at most once per
  session. If the OTLP endpoint is unreachable, that session's metrics
  are lost. Production deployments should use `textfile` mode plus a
  robust scrape pipeline (node-exporter, vector, otel-collector with
  persistent queue) to handle delivery semantics.
- **No per-tenant labelling.** v4.1 emits flat metrics with `command` and
  `schema_id` labels only. Multi-tenant deployments needing per-user or
  per-project segmentation should layer a relabel stage in their
  collector or use external metadata.
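
The timeout and no-retry behaviour described in the list above can be sketched with `AbortController` and `fetch`. This is a minimal illustration under stated assumptions, not the hook's implementation: `postWithTimeout` and its `fetchImpl` parameter are hypothetical, and the 1500 ms default mirrors the OTLP target quoted above.

```javascript
// POST a JSON payload once, aborting after `timeoutMs`. Never throws:
// resolves to true on a 2xx response and to false on any failure.
async function postWithTimeout(url, payload, { timeoutMs = 1500, fetchImpl = fetch } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(url, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(payload),
      signal: controller.signal,
    });
    return response.ok;
  } catch {
    return false; // timeout, DNS failure, TLS error, and so on: no retry
  } finally {
    clearTimeout(timer);
  }
}

// Demo with a stand-in fetch that settles only when the request is aborted.
const hangingFetch = (_url, { signal }) =>
  new Promise((_resolve, reject) => {
    signal.addEventListener("abort", () => reject(new Error("aborted")));
  });

postWithTimeout("https://otel.example.com/v1/metrics", {}, {
  timeoutMs: 20,
  fetchImpl: hangingFetch,
}).then((delivered) => console.log("delivered:", delivered)); // prints "delivered: false"
```

Because the function resolves to a boolean instead of throwing, a caller can ignore the result and still exit 0, which matches the "never block session shutdown" behaviour described above.
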

## Cost-estimation disclaimer

The ROUGE-L / Jaccard / character n-gram thresholds in v4.1 are
*starting points*, not contractual SLAs. The brief's risk table explicitly
flags these as estimates: they were calibrated against synthetic plan
runs (Step 17) using `economy` and `premium` profiles. Real cross-tier
agreement varies by task complexity. Treat the thresholds as smoke-test
floors; tighten them in v4.2 once you have ≥10 production runs of data.

## See also

- `examples/observability/`: local Docker Compose stack
- `tests/fixtures/jsonl-schemas.md`: canonical record shapes
- `lib/exporters/field-allowlist.mjs`: per-schema allowed fields
- `hooks/scripts/otel-export.mjs`: Stop-hook orchestrator