ktg-plugin-marketplace/plugins/voyage/docs/observability.md
Kjell Tore Guttormsen 7e60b28c8d docs(voyage): add docs/observability.md — operator quickstart for v4.1 OTel export
Step 15 of v4.1 — operator-facing observability docs (151 lines, target ≥80).
Sections:
  - Overview (JSONL is default, OTel is opt-in)
  - Activating OTel export (VOYAGE_EXPORT_MODE)
  - Output formats (Prometheus textfile vs OTLP/HTTP)
  - Environment variables matrix
  - Docker Compose quickstart (cross-link to examples/observability/)
  - Stats schema (cross-link to tests/fixtures/jsonl-schemas.md)
  - Security (CWE-22, CWE-918, CWE-212 mitigations + min-versions per CVE)
  - Limitations (Stop-hook normal-exit only, no retry, NFR best-effort)
  - Cost-estimation disclaimer (per the brief's risk table)
2026-05-09 09:51:44 +02:00


Observability — voyage v4.1

This document describes the opt-in OpenTelemetry / Prometheus export path added in v4.1. The default JSONL stats stream (${CLAUDE_PLUGIN_DATA}/trek*-stats.jsonl) remains unchanged — it is the canonical event log and continues to be written regardless of OTel mode.

Overview

Voyage v4.0 wrote per-command stats to JSONL files only, so operators who wanted dashboards had to build their own log pipeline. v4.1 adds a Stop-hook, hooks/scripts/otel-export.mjs, that is activated via VOYAGE_EXPORT_MODE. At session end it transforms the JSONL records into either a Prometheus textfile or an OTLP/HTTP push.

The hook is additive. With VOYAGE_EXPORT_MODE=off (the default), the hook script exits silently and does no work, so your existing JSONL workflow is untouched.

Activating OTel export

Set VOYAGE_EXPORT_MODE in the shell before invoking any voyage command:

# Default — no export, JSONL only
unset VOYAGE_EXPORT_MODE                  # equivalent to VOYAGE_EXPORT_MODE=off

# Path A — Prometheus textfile (recommended for local dashboards)
export VOYAGE_EXPORT_MODE=textfile
export VOYAGE_TEXTFILE_DIR=/var/lib/node_exporter/textfile

# Path B — OTLP/HTTP push (recommended for centralized telemetry)
export VOYAGE_EXPORT_MODE=otlp
export VOYAGE_OTEL_ENDPOINT=https://otel.example.com/v1/metrics

hooks/hooks.json wires the Stop event to otel-export.mjs, so the export runs automatically when Claude Code finishes a session. No manual invocation is required.

Output formats

| Mode | Wire format | Endpoint shape | Cardinality cap |
|------|-------------|----------------|-----------------|
| textfile | Prometheus exposition format (text) | local file: ${VOYAGE_TEXTFILE_DIR}/voyage.prom | low (voyage controls labels) |
| otlp | OTLP/JSON v1.0 metric ResourceMetrics | HTTPS POST: ${VOYAGE_OTEL_ENDPOINT} | low (same allowlist as textfile) |
| off | (none) | | |
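
To make the two wire formats above concrete, the sketch below renders one sample both ways. The metric name, the sample shape, the label values, and the function names are hypothetical; only the command and schema_id label names come from this document. It illustrates the formats and is not the exporter's actual code.

```javascript
// Hypothetical sample; the real exporter's metric names and shapes may differ.
const sample = {
  name: "voyage_command_duration_seconds", // assumed metric name
  help: "Wall-clock duration of a voyage command",
  value: 12.5,
  labels: { command: "trekplan", schema_id: "plan-stats" }, // assumed label values
};

// Escape a label value as required by the Prometheus text exposition format.
function escapeLabel(value) {
  return String(value)
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Render one gauge sample as Prometheus exposition text (textfile mode).
function toPrometheusText(s) {
  const labels = Object.entries(s.labels)
    .map(([key, v]) => `${key}="${escapeLabel(v)}"`)
    .join(",");
  return [
    `# HELP ${s.name} ${s.help}`,
    `# TYPE ${s.name} gauge`,
    `${s.name}{${labels}} ${s.value}`,
    "", // the file must end with a newline
  ].join("\n");
}

// Render the same sample as an OTLP/JSON ResourceMetrics payload (otlp mode).
// nowMs must be an integer number of milliseconds, e.g. Date.now().
function toOtlpJson(s, nowMs) {
  return {
    resourceMetrics: [{
      resource: {
        attributes: [{ key: "service.name", value: { stringValue: "voyage" } }],
      },
      scopeMetrics: [{
        scope: { name: "voyage" },
        metrics: [{
          name: s.name,
          description: s.help,
          gauge: {
            dataPoints: [{
              asDouble: s.value,
              timeUnixNano: String(BigInt(nowMs) * 1000000n),
              attributes: Object.entries(s.labels).map(([key, v]) => ({
                key,
                value: { stringValue: String(v) },
              })),
            }],
          },
        }],
      }],
    }],
  };
}
```

In this sketch, toPrometheusText(sample) would be the content written to voyage.prom, and JSON.stringify(toOtlpJson(sample, Date.now())) would be the body of the HTTPS POST.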

Both formats apply the same field allowlist — see lib/exporters/field-allowlist.mjs for the per-schema list. Fields not in the allowlist are dropped before export. This is a CWE-212 mitigation: operator-defined endpoints must never receive accidentally-leaked operator-private data (paths, prompts, brief content).
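
As a rough illustration of the allowlist step, the sketch below keeps only listed fields and drops everything else. The schema id, the field names, and the default-deny behaviour for unknown schemas are assumptions made for this example; the authoritative lists are in lib/exporters/field-allowlist.mjs.

```javascript
// Hypothetical per-schema allowlist; real entries live in
// lib/exporters/field-allowlist.mjs and may differ.
const FIELD_ALLOWLIST = new Map([
  ["plan-stats", ["schema_id", "command", "duration_ms", "exit_code"]],
]);

// Return a copy of the record containing only allowlisted fields.
// Records with an unknown schema_id export nothing (assumed default-deny).
function applyAllowlist(record) {
  const allowed = FIELD_ALLOWLIST.get(record.schema_id);
  if (!allowed) return null;
  const out = {};
  for (const key of allowed) {
    if (Object.prototype.hasOwnProperty.call(record, key)) {
      out[key] = record[key];
    }
  }
  return out;
}
```

A record such as { schema_id: "plan-stats", command: "trekplan", duration_ms: 1200, prompt: "..." } would leave this function without its prompt field, which is the property the CWE-212 mitigation relies on.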

Environment variables

| Variable | Default | Purpose |
|----------|---------|---------|
| VOYAGE_EXPORT_MODE | off | One of off / textfile / otlp |
| VOYAGE_TEXTFILE_DIR | ${CLAUDE_PLUGIN_DATA} | Directory for voyage.prom (textfile mode) |
| VOYAGE_OTEL_ENDPOINT | (none) | HTTPS URL for OTLP/HTTP POST |
| VOYAGE_OTEL_ALLOW_PRIVATE | (unset) | Set to 1 to allow loopback / RFC1918 endpoints |
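
The defaults above could be resolved along these lines. This is a minimal sketch, not the hook's actual parsing code: the function name and the returned object shape are hypothetical, and treating an unrecognized mode as off is an assumption.

```javascript
const EXPORT_MODES = ["off", "textfile", "otlp"];

// Resolve the four documented variables from an environment object.
function readExportConfig(env) {
  const raw = env.VOYAGE_EXPORT_MODE;
  return {
    // Unset means off (documented); unrecognized values also map to off (assumed).
    mode: EXPORT_MODES.includes(raw) ? raw : "off",
    // Falls back to CLAUDE_PLUGIN_DATA when no textfile directory is given.
    textfileDir: env.VOYAGE_TEXTFILE_DIR || env.CLAUDE_PLUGIN_DATA,
    otelEndpoint: env.VOYAGE_OTEL_ENDPOINT || null,
    // Only the literal string "1" opts in to private endpoints.
    allowPrivate: env.VOYAGE_OTEL_ALLOW_PRIVATE === "1",
  };
}
```

A caller would typically pass process.env, for example readExportConfig(process.env).mode.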

Docker Compose quickstart

A pre-pinned local stack lives at examples/observability/:

cd examples/observability
mkdir -p voyage-textfile
docker compose up -d

This brings up Prometheus, Grafana, node-exporter (textfile mode), and otel-collector (OTLP mode) on localhost. See examples/observability/README.md for endpoint URLs and version pins.

Stats schema

Each Voyage command emits one JSONL record per significant event. Schemas are documented in tests/fixtures/jsonl-schemas.md (Step 1 of v4.1) and locked by tests/lib/profile-stats-fields.test.mjs.

The exporter applies the field allowlist defined in lib/exporters/field-allowlist.mjs. Adding a new field to the JSONL schema does not automatically expose it in OTel — you must add it to the allowlist explicitly. This is intentional: ${CLAUDE_PLUGIN_DATA} is trusted local storage; OTel endpoints are operator-controlled and may be external.

Security

The exporter is hardened against three CWE classes:

  • CWE-22 (path traversal): lib/exporters/path-validator.mjs rejects relative paths, symlinks, and paths outside allowedRoots (VOYAGE_TEXTFILE_DIR and CLAUDE_PLUGIN_DATA). Tested in tests/hooks/otel-export-validators.test.mjs.
  • CWE-918 (SSRF): lib/exporters/endpoint-validator.mjs requires HTTPS and blocks loopback (127.0.0.0/8) and RFC1918 (10/8, 172.16/12, 192.168/16) addresses unless VOYAGE_OTEL_ALLOW_PRIVATE=1 is set explicitly. Cloud metadata endpoints (169.254.169.254) are permanently blocked.
  • CWE-212 (improper removal of sensitive information before storage or transfer): every record passes through lib/exporters/field-allowlist.mjs before any I/O. Adding a field to the JSONL stream does not expose it externally; operators must update the allowlist intentionally.
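
The CWE-918 rules above can be sketched as follows. This is an illustration under stated assumptions, not the code in lib/exporters/endpoint-validator.mjs: the function names and the return shape are hypothetical, and the sketch inspects only literal hostnames (IPv4 literals, localhost, and [::1]) without resolving DNS, which a hardened validator would also need to consider.

```javascript
// Parse a dotted-quad IPv4 literal into four octets, or return null.
function parseIPv4(host) {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return null;
  const octets = m.slice(1).map(Number);
  return octets.every((o) => o <= 255) ? octets : null;
}

// Check an OTLP endpoint against the documented rules.
function validateEndpoint(rawUrl, allowPrivate) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return { ok: false, reason: "invalid URL" };
  }
  if (url.protocol !== "https:") {
    return { ok: false, reason: "HTTPS required" };
  }
  const host = url.hostname;
  // The cloud metadata address is blocked regardless of allowPrivate.
  if (host === "169.254.169.254") {
    return { ok: false, reason: "metadata endpoint blocked" };
  }
  const ip = parseIPv4(host);
  const isLoopback =
    host === "localhost" || host === "[::1]" || (ip !== null && ip[0] === 127);
  const isRfc1918 =
    ip !== null &&
    (ip[0] === 10 ||
      (ip[0] === 172 && ip[1] >= 16 && ip[1] <= 31) ||
      (ip[0] === 192 && ip[1] === 168));
  if ((isLoopback || isRfc1918) && !allowPrivate) {
    return { ok: false, reason: "private or loopback address" };
  }
  return { ok: true };
}
```

With these rules, https://127.0.0.1:4318/v1/metrics is accepted only when allowPrivate is true, while https://169.254.169.254/ is rejected in both cases.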

Minimum versions per CVE history

| Component | Minimum version | Reason |
|-----------|-----------------|--------|
| otel/opentelemetry-collector-contrib | 0.115.0 | post-CVE-2024-42368 |
| prom/prometheus | 3.0.1 | OOM regression fix in 2.x |
| prom/node-exporter | 1.10.2 | textfile collector path normalization |
| grafana/grafana | 11.4.0 | datasource provisioning hardening |

Limitations

  • Stop-hook is normal-exit only. If Claude Code crashes or is killed with SIGKILL, the final session's metrics are not flushed. Use --resume on next start to recover plan/progress state; the missing session will not appear in dashboards.
  • The tail-latency target (a non-functional requirement, NFR) is best-effort. Textfile mode targets <5 ms p99; OTLP mode targets <1500 ms, enforced by an AbortController timeout. If the network endpoint is slow, the timeout fires and stats for that session are dropped. The hook always exits 0 to avoid blocking session shutdown.
  • No retry on transport failure. Stop-hook runs at most once per session. If the OTLP endpoint is unreachable, that session's metrics are lost. Production deployments should use textfile + a robust scrape pipeline (node-exporter, vector, otel-collector with persistent queue) to handle delivery semantics.
  • No per-tenant labelling. v4.1 emits flat metrics with command and schema_id labels only. Multi-tenant deployments needing per-user or per-project segmentation should layer a relabel stage in their collector or use external metadata.
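
The timeout and no-retry behaviour described in the list above could look roughly like the sketch below. The function name, the "sent" / "dropped" return values, and the injectable fetchImpl parameter are assumptions for illustration; the global fetch used as the default requires Node 18 or later.

```javascript
// POST a payload once with a hard deadline. Never throws, so the caller
// can always exit 0. fetchImpl is injectable for testing (hypothetical).
async function pushWithTimeout(endpoint, payload, timeoutMs, fetchImpl = fetch) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(endpoint, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(payload),
      signal: controller.signal,
    });
    // A non-2xx response is treated like any other failure: no retry.
    return res.ok ? "sent" : "dropped";
  } catch {
    // Timeout (abort) or network error: this session's metrics are lost.
    return "dropped";
  } finally {
    clearTimeout(timer);
  }
}
```

A Stop-hook built this way would call pushWithTimeout(endpoint, payload, 1500) once and then exit with status 0 whatever the result, which is why delivery guarantees have to come from the scrape pipeline rather than from the hook.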

Cost-estimation disclaimer

The ROUGE-L / Jaccard / character n-gram thresholds in v4.1 are starting points, not contractual SLAs. The brief's risk table explicitly flags these as estimates: they were calibrated against synthetic plan runs (Step 17) using the economy and premium profiles. Real cross-tier agreement varies by task complexity. Treat the thresholds as smoke-test floors; tighten them in v4.2 once you have ≥10 production runs of data.

See also

  • examples/observability/ — local Docker Compose stack
  • tests/fixtures/jsonl-schemas.md — canonical record shapes
  • lib/exporters/field-allowlist.mjs — per-schema allowed fields
  • hooks/scripts/otel-export.mjs — Stop-hook orchestrator