ktg-plugin-marketplace/plugins/voyage/docs/observability.md
Kjell Tore Guttormsen 7e60b28c8d docs(voyage): add docs/observability.md — operator quickstart for v4.1 OTel export
Step 15 of v4.1 — operator-facing observability docs (151 lines, target ≥80).
Sections:
  - Overview (JSONL is default, OTel is opt-in)
  - Activating OTel export (VOYAGE_EXPORT_MODE)
  - Output formats (Prometheus textfile vs OTLP/HTTP)
  - Environment variables matrix
  - Docker Compose quickstart (cross-link to examples/observability/)
  - Stats schema (cross-link to tests/fixtures/jsonl-schemas.md)
  - Security (CWE-22, CWE-918, CWE-212 mitigations + min-versions per CVE)
  - Limitations (Stop-hook normal-exit only, no retry, NFR best-effort)
  - Cost-estimation disclaimer (per the brief's risk table)
2026-05-09 09:51:44 +02:00


# Observability — voyage v4.1
This document describes the *opt-in* OpenTelemetry / Prometheus export
path added in v4.1. The default JSONL stats stream
(`${CLAUDE_PLUGIN_DATA}/trek*-stats.jsonl`) remains unchanged — it is the
canonical event log and continues to be written regardless of OTel mode.
## Overview
Voyage v4.0 wrote per-command stats to JSONL files only, so operators who
wanted dashboards had to build their own log pipeline. v4.1 adds a Stop-hook,
`hooks/scripts/otel-export.mjs`, that, when activated via
`VOYAGE_EXPORT_MODE`, transforms the JSONL records into either a
Prometheus textfile or an OTLP/HTTP push at session end.

The hook is *additive*. With `VOYAGE_EXPORT_MODE=off` (the default), the
script exits silently and does no work, so your existing JSONL workflow
is untouched.
## Activating OTel export
Set `VOYAGE_EXPORT_MODE` in the shell before invoking any voyage command:
```bash
# Default — no export, JSONL only
unset VOYAGE_EXPORT_MODE # equivalent to VOYAGE_EXPORT_MODE=off
# Path A — Prometheus textfile (recommended for local dashboards)
export VOYAGE_EXPORT_MODE=textfile
export VOYAGE_TEXTFILE_DIR=/var/lib/node_exporter/textfile
# Path B — OTLP/HTTP push (recommended for centralized telemetry)
export VOYAGE_EXPORT_MODE=otlp
export VOYAGE_OTEL_ENDPOINT=https://otel.example.com/v1/metrics
```
`hooks/hooks.json` wires the Stop event to `otel-export.mjs`, so the
export runs automatically when Claude Code finishes a session. No manual
invocation is required.
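
As an illustration of how these variables could be resolved at hook start-up, here is a minimal sketch. The variable names and defaults are the ones documented under "Environment variables"; the function name `loadExportConfig`, its return shape, and the fallback to `off` for unrecognised values are assumptions, not the actual behaviour of `otel-export.mjs`.

```javascript
// Hypothetical helper: resolve exporter settings from an env-like object.
// Only the variable names and documented defaults come from this document.
const VALID_MODES = new Set(["off", "textfile", "otlp"]);

function loadExportConfig(env) {
  const raw = env.VOYAGE_EXPORT_MODE ?? "off";
  // Assumption: an unrecognised value is treated as "off" rather than an error.
  const mode = VALID_MODES.has(raw) ? raw : "off";
  return {
    mode,
    // Documented default: fall back to the plugin data directory.
    textfileDir: env.VOYAGE_TEXTFILE_DIR ?? env.CLAUDE_PLUGIN_DATA ?? null,
    otlpEndpoint: env.VOYAGE_OTEL_ENDPOINT ?? null,
    // Documented: only the literal value "1" enables private endpoints.
    allowPrivate: env.VOYAGE_OTEL_ALLOW_PRIVATE === "1",
  };
}

console.log(loadExportConfig(process.env).mode);
```

Passing the environment in as an argument, rather than reading `process.env` inside the function, keeps a helper like this easy to exercise in tests.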
## Output formats
| Mode | Wire format | Endpoint shape | Cardinality cap |
|------|-------------|----------------|-----------------|
| `textfile` | Prometheus exposition format (text) | local file: `${VOYAGE_TEXTFILE_DIR}/voyage.prom` | low (voyage controls labels) |
| `otlp` | OTLP/JSON v1.0 `ResourceMetrics` payload | HTTPS POST: `${VOYAGE_OTEL_ENDPOINT}` | low (same allowlist as textfile) |
| `off` | (none) | — | — |

Both formats apply the **same field allowlist**; see
`lib/exporters/field-allowlist.mjs` for the per-schema list. Fields not in
the allowlist are dropped before export. This is a CWE-212 mitigation:
operator-defined endpoints must never receive accidentally leaked
operator-private data (paths, prompts, brief content).
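
To make the `textfile` wire format concrete, the sketch below renders one counter family in the Prometheus text exposition format. The `command` and `schema_id` label names appear in this document; the metric name `voyage_events_total`, the label values, and both function names are hypothetical and do not reflect what the exporter actually emits.

```javascript
// Escape a label value per the Prometheus text exposition format:
// backslash, double quote, and newline must be escaped.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Render one counter family as HELP/TYPE lines followed by its samples.
// The metric name and sample shape are illustrative assumptions.
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const pairs = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(",");
    lines.push(`${name}{${pairs}} ${value}`);
  }
  // The exposition format expects a trailing newline.
  return lines.join("\n") + "\n";
}

const text = renderCounter(
  "voyage_events_total",
  "Events per command (hypothetical metric).",
  [{ labels: { command: "trekplan", schema_id: "plan-v1" }, value: 3 }]
);
console.log(text);
```

When writing such a file for node-exporter's textfile collector, the usual advice is to write to a temporary file and rename it into place, so the collector never reads a partially written file.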
## Environment variables
| Variable | Default | Purpose |
|----------|---------|---------|
| `VOYAGE_EXPORT_MODE` | `off` | One of `off` / `textfile` / `otlp` |
| `VOYAGE_TEXTFILE_DIR` | `${CLAUDE_PLUGIN_DATA}` | Directory for `voyage.prom` (textfile mode) |
| `VOYAGE_OTEL_ENDPOINT` | _(none)_ | HTTPS URL for OTLP/HTTP POST |
| `VOYAGE_OTEL_ALLOW_PRIVATE` | _(unset)_ | Set to `1` to allow loopback / RFC1918 endpoints |
## Docker Compose quickstart
A pre-pinned local stack lives at `examples/observability/`:
```bash
cd examples/observability
mkdir -p voyage-textfile
docker compose up -d
```
This brings up Prometheus, Grafana, node-exporter (textfile mode), and
otel-collector (OTLP mode) on `localhost`. See
`examples/observability/README.md` for endpoint URLs and version pins.
## Stats schema
Each Voyage command emits one JSONL record per significant event. Schemas
are documented in `tests/fixtures/jsonl-schemas.md` (Step 1 of v4.1) and
locked by `tests/lib/profile-stats-fields.test.mjs`.

The exporter applies the field allowlist defined in
`lib/exporters/field-allowlist.mjs`. Adding a new field to the JSONL
schema does **not** automatically expose it through either export mode;
you must add it to the allowlist explicitly. This is intentional:
`${CLAUDE_PLUGIN_DATA}` is trusted local storage, whereas export endpoints
are operator-controlled and may be external.
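
The default-deny behaviour described above can be sketched as follows. This is not the contents of `lib/exporters/field-allowlist.mjs`: the schema id, field names, record shape, and the function `pickAllowedFields` are all hypothetical and only illustrate the idea that unlisted fields never leave the process.

```javascript
// Hypothetical allowlist keyed by schema id. The real per-schema lists live in
// lib/exporters/field-allowlist.mjs and are not reproduced here.
const EXAMPLE_ALLOWLIST = {
  "example-schema": ["command", "schema_id", "duration_ms"],
};

// Copy only allowlisted fields into a fresh object; everything else is dropped.
function pickAllowedFields(record, allowlist) {
  // Assumption: a record with an unknown schema exports nothing at all.
  const allowed = allowlist[record.schema_id] ?? [];
  const out = {};
  for (const key of allowed) {
    if (Object.prototype.hasOwnProperty.call(record, key)) {
      out[key] = record[key];
    }
  }
  return out;
}

const record = {
  schema_id: "example-schema",
  command: "trekplan",
  duration_ms: 1200,
  brief_path: "/home/me/brief.md", // private: not in the allowlist
};
console.log(pickAllowedFields(record, EXAMPLE_ALLOWLIST));
```

Building a new object from the allowlist, instead of deleting known-sensitive keys from the record, means a field added to the JSONL schema later stays private until someone lists it on purpose.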
## Security
The exporter is hardened against three CWE classes:
- **CWE-22 (path traversal)**: `lib/exporters/path-validator.mjs`
rejects relative paths, symlinks, and paths outside `allowedRoots`
(`VOYAGE_TEXTFILE_DIR` and `CLAUDE_PLUGIN_DATA`). Tested in
`tests/hooks/otel-export-validators.test.mjs`.
- **CWE-918 (SSRF)**: `lib/exporters/endpoint-validator.mjs` requires
HTTPS and blocks loopback (127.0.0.0/8) and RFC 1918 ranges (10.0.0.0/8,
172.16.0.0/12, 192.168.0.0/16) unless `VOYAGE_OTEL_ALLOW_PRIVATE=1` is set
explicitly. The cloud metadata endpoint (169.254.169.254) is always
blocked, regardless of that setting.
- **CWE-212 (improper removal of sensitive information before storage or
transfer)**: every record passes through
`lib/exporters/field-allowlist.mjs` before any I/O. Adding a field to
the JSONL stream does not expose it externally; operators must update
the allowlist intentionally.
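
The SSRF rules in the list above can be sketched with the built-in `URL` parser. This is an illustration under stated assumptions, not the code in `lib/exporters/endpoint-validator.mjs`: the names `parseIPv4` and `validateEndpoint`, the return shape, the handling of `localhost` and `[::1]`, and the choice to require HTTPS even when private ranges are allowed are all assumptions.

```javascript
// Parse a dotted-quad IPv4 literal into four octets, or return null.
function parseIPv4(host) {
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!match) return null;
  const octets = match.slice(1).map(Number);
  return octets.every((octet) => octet <= 255) ? octets : null;
}

// Hypothetical validator: HTTPS only, metadata address always blocked,
// loopback and RFC 1918 ranges blocked unless allowPrivate is true.
function validateEndpoint(rawUrl, { allowPrivate = false } = {}) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return { ok: false, reason: "not a valid URL" };
  }
  if (url.protocol !== "https:") return { ok: false, reason: "HTTPS required" };

  const host = url.hostname;
  if (host === "169.254.169.254") {
    return { ok: false, reason: "metadata endpoint blocked" };
  }

  const ip = parseIPv4(host);
  const isLoopback =
    host === "localhost" || host === "[::1]" || (ip !== null && ip[0] === 127);
  const isPrivate =
    ip !== null &&
    (ip[0] === 10 ||
      (ip[0] === 172 && ip[1] >= 16 && ip[1] <= 31) ||
      (ip[0] === 192 && ip[1] === 168));
  if ((isLoopback || isPrivate) && !allowPrivate) {
    return { ok: false, reason: "loopback or private address blocked" };
  }
  return { ok: true, reason: null };
}

console.log(validateEndpoint("https://otel.example.com/v1/metrics"));
console.log(validateEndpoint("https://10.0.0.5/v1/metrics"));
```

This sketch inspects only literal hosts. A hostname that resolves to a private address through DNS would pass it, so a stricter validator would also need to check the resolved address.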
### Minimum versions per CVE history
| Component | Minimum version | Reason |
|-----------|-----------------|--------|
| `otel/opentelemetry-collector-contrib` | `0.115.0` | post-CVE-2024-42368 |
| `prom/prometheus` | `3.0.1` | OOM regression fix in 2.x |
| `prom/node-exporter` | `1.10.2` | textfile collector path normalization |
| `grafana/grafana` | `11.4.0` | datasource provisioning hardening |
## Limitations
- **Stop-hook is normal-exit only.** If Claude Code crashes or is killed
with SIGKILL, the final session's metrics are not flushed. Use
`--resume` on next start to recover plan/progress state; the missing
session will not appear in dashboards.
- **Tail-latency NFR (non-functional requirement) is best-effort.**
Textfile mode targets <5 ms p99 and OTLP mode <1500 ms (guarded by an
`AbortController` timeout). If the network endpoint is slow, the timeout
fires and stats for that session are dropped; the hook always exits 0 to
avoid blocking session shutdown.
- **No retry on transport failure.** Stop-hook runs at most once per
session. If the OTLP endpoint is unreachable, that session's metrics
are lost. Production deployments that need delivery guarantees should
prefer `textfile` mode plus a robust scrape pipeline (node-exporter,
vector, or otel-collector with a persistent queue).
- **No per-tenant labelling.** v4.1 emits flat metrics with command and
schema_id labels only. Multi-tenant deployments needing per-user or
per-project segmentation should layer a relabel stage in their
collector or use external metadata.
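
The timeout and no-retry behaviour described in this list can be sketched as a single guarded POST. The 1500 ms figure comes from this document; the helper name `pushWithTimeout`, its return shape, and the injectable `fetchImpl` parameter are assumptions made so the sketch can run without a network. The default relies on the global `fetch` available in Node 18 and later.

```javascript
// Hypothetical helper: POST a JSON payload with a hard deadline and never throw.
// One attempt only; on timeout or transport failure the batch is dropped.
async function pushWithTimeout(
  endpoint,
  payload,
  { timeoutMs = 1500, fetchImpl = fetch } = {}
) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(endpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
      signal: controller.signal,
    });
    return { ok: res.ok, status: res.status };
  } catch (err) {
    // Abort (timeout) or network error: report failure instead of throwing.
    return { ok: false, status: null, error: err.name };
  } finally {
    clearTimeout(timer);
  }
}

// Demo stub: never answers, but rejects when the abort signal fires.
const hangingFetch = (_url, { signal }) =>
  new Promise((_resolve, reject) => {
    signal.addEventListener("abort", () => {
      reject(Object.assign(new Error("aborted"), { name: "AbortError" }));
    });
  });

pushWithTimeout("https://otel.example.com/v1/metrics", {}, {
  timeoutMs: 50,
  fetchImpl: hangingFetch,
}).then((result) => console.log(result));
```

Because the helper resolves on every path, a caller can log the result and still exit 0, which matches the requirement that the hook never blocks session shutdown.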
## Cost-estimation disclaimer
The ROUGE-L / Jaccard / character n-gram thresholds in v4.1 are
*starting points*, not contractual SLAs. The brief's risk table explicitly
flags these as estimates: they were calibrated against synthetic plan
runs (Step 17) using the `economy` and `premium` profiles. Real cross-tier
agreement varies by task complexity. Treat the thresholds as smoke-test
floors, and tighten them in v4.2 once you have ≥10 production runs of data.
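
For readers unfamiliar with the metrics named above, the sketch below computes a Jaccard similarity between two texts and compares it to a floor. It is only an illustration: the whitespace tokenisation, the function name `jaccard`, and the 0.5 floor are assumptions, and this document does not state how voyage tokenises or which threshold values it uses.

```javascript
// Jaccard similarity over lower-cased whitespace tokens: |A ∩ B| / |A ∪ B|.
// The tokenisation is an assumption made for this example.
function jaccard(a, b) {
  const tokensA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tokensB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  // Convention chosen here: two empty texts count as identical.
  if (tokensA.size === 0 && tokensB.size === 0) return 1;
  let shared = 0;
  for (const token of tokensA) {
    if (tokensB.has(token)) shared += 1;
  }
  return shared / (tokensA.size + tokensB.size - shared);
}

const HYPOTHETICAL_FLOOR = 0.5; // placeholder, not a voyage default
const score = jaccard("add retry to exporter", "add timeout to exporter");
console.log(score, score >= HYPOTHETICAL_FLOOR ? "pass" : "fail");
```

Here the two texts share three of five distinct tokens, so the score is 0.6 and clears the placeholder floor.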
## See also
- `examples/observability/` — local Docker Compose stack
- `tests/fixtures/jsonl-schemas.md` — canonical record shapes
- `lib/exporters/field-allowlist.mjs` — per-schema allowed fields
- `hooks/scripts/otel-export.mjs` — Stop-hook orchestrator