feat(llm-security)!: v7.0.0 commit 6 — tests, docs, version bump

Final commit in the trustworthy-scoring series. Bundles verdict cutoff
alignment, the last suite of tests, and all documentation touch-points
that quote version numbers or describe v7.0.0 behaviour.

Verdict/band co-monotonicity
- `scanners/lib/severity.mjs` — verdict cutoffs moved from 61/21 to 65/15
  so that `BLOCK >= 65` and `WARNING >= 15` lock onto the v2 `riskBand()` boundaries.
  Prevents "BLOCK / Medium band" contradictions under the v2 formula.

Scanner hardening (bug fixes from v7.0.0 testing)
- `scanners/entropy-scanner.mjs` — `policy_source` now uses
  `existsSync('.llm-security/policy.json')` instead of a value-based check.
  The old heuristic always reported 'policy.json' because DEFAULT_POLICY now
  carries an `entropy.thresholds` section.
- `scanners/lib/file-discovery.mjs` — `.sass` and GPU shader extensions
  (`.glsl, .frag, .vert, .shader, .wgsl`) added to TEXT_EXTENSIONS. Without
  this, shader files were invisible to file-discovery, so they were never
  counted as skipped by the entropy-scanner extension filter.

Tests
- `tests/scanners/entropy-context.test.mjs` (new, 24 tests) — A. File-ext
  skip (4), B. Line-level rules 11-17 (8), C. Policy overrides (3).
  Fixtures generate 80-char base64 payloads at runtime via
  `crypto.randomBytes` to dodge the plugin's own pre-edit credential hook
  on the test source.
- `tests/lib/severity.test.mjs` — rewritten with v2 scoring table (70
  tests total, was 52).
- `tests/lib/output.test.mjs:243` — "1 critical = score 80" under v2
  (was 25 under v1).
- Full suite: 1485/1485 green (was 1461).

Docs
- `CHANGELOG.md` — v7.0.0 entry with BREAKING CHANGES section.
- `README.md` (plugin + marketplace root) — version badge, history table,
  plugin-card version string, test count.
- `CLAUDE.md` — header version, "v7.0.0 — Trustworthy scoring" summary
  paragraph at the top.
- `docs/security-hardening-guide.md` — new section 6 "Calibration & false
  positives" documenting v2 formula, context-aware entropy scanner,
  typosquat allowlist, and §6.4 tuning workflow. Existing "Recommended
  baseline" section renumbered to §7.

Version bump
- `6.6.0 -> 7.0.0` across package.json, .claude-plugin/plugin.json,
  scanners/ide-extension-scanner.mjs VERSION const, README badge,
  CLAUDE.md header, marketplace root README card.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Kjell Tore Guttormsen 2026-04-19 22:26:35 +02:00
commit 6f86de937a
14 changed files with 515 additions and 85 deletions

@@ -147,7 +147,103 @@ attacks but does not eliminate them.
---
-## 6. Recommended baseline for production
## 6. Calibration & false positives (v7.0.0+)
Security scanners live or die by their signal-to-noise ratio. A scanner that
cries "extreme" on every project destroys its own credibility — users learn
to ignore findings, and genuine threats slip past. v7.0.0 ships three
calibration layers to keep that from happening.
### 6.1 Risk-score v2 formula
The v1 formula was a sum-and-cap: `critical*25 + high*10 + medium*4 + low*1`,
capped at 100. Every non-trivial scan collapsed to 100/Extreme regardless of
actual distribution. A codebase with 2 mediums and 100 lows scored the same
as a codebase with 5 criticals.
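A minimal sketch of the v1 sum-and-cap behaviour described above, using the weights quoted in the text. The function name mirrors `riskScoreV1()`, but the `{ critical, high, medium, low }` argument shape is an assumption for illustration, not the signature in `scanners/lib/severity.mjs`.

```javascript
// Sketch of the v1 sum-and-cap formula quoted above.
// Assumption: severity counts arrive as a plain object; the real
// riskScoreV1() in scanners/lib/severity.mjs may take different input.
function riskScoreV1({ critical = 0, high = 0, medium = 0, low = 0 } = {}) {
  const raw = critical * 25 + high * 10 + medium * 4 + low * 1;
  return Math.min(raw, 100); // everything above 100 collapses to the cap
}

// 2 mediums + 100 lows: 2*4 + 100*1 = 108, capped to 100
const noisy = riskScoreV1({ medium: 2, low: 100 });
// 5 criticals: 5*25 = 125, capped to 100
const severe = riskScoreV1({ critical: 5 });

console.log(noisy, severe); // 100 100
```

Both inputs saturate the cap, which is why the two very different codebases were indistinguishable under v1.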
v2 (`scanners/lib/severity.mjs`) is severity-dominated and log-scaled within
tier:
| Finding mix | Score range | Band |
|-------------|-------------|------|
| Critical present | 70–95 (1=80, 2=86, 4=90, 10=95) | Critical/Extreme |
| High only | 40–65 (1=48, 5=60, 17=65) | High |
| Medium only | 15–35 (1=20, 5=28, 50=33) | Medium |
| Low only | 1–11 (1=4, 10=11) | Low |
| None | 0 | Low |
Verdict cutoffs (`BLOCK ≥65`, `WARNING ≥15`) are locked to the `riskBand()`
boundaries so you can't get a "BLOCK / Medium band" contradiction. The legacy
formula is kept as `riskScoreV1()` for reference only.
**CI impact:** Pipelines with `--fail-on high` keep working (the severity
gate is unaffected). Pipelines with score-based thresholds need recalibration
— old `score >= 21` corresponds roughly to new `score >= 15`.
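The cutoff change can be sketched as a small lookup. The cutoff values are the ones quoted in this section and in the commit message (61/21 before, 65/15 now); the function name `verdictFor` and the `"ALLOW"` label for scores below the WARNING cutoff are placeholders, since the text does not name them.

```javascript
// Verdict cutoffs as quoted in the text: v1 used 61/21, v2 uses 65/15.
const V1_CUTOFFS = { block: 61, warning: 21 };
const V2_CUTOFFS = { block: 65, warning: 15 };

// Assumption: "ALLOW" is a placeholder for the below-WARNING verdict;
// the real label is not stated in this document.
function verdictFor(score, cutoffs = V2_CUTOFFS) {
  if (score >= cutoffs.block) return "BLOCK";
  if (score >= cutoffs.warning) return "WARNING";
  return "ALLOW";
}

// One medium finding scores 20 under v2 (see the table above).
console.log(verdictFor(20));             // WARNING
// A pipeline that kept the old `score >= 21` gate would let it through.
console.log(verdictFor(20, V1_CUTOFFS)); // ALLOW
// Seventeen highs score 65, which sits exactly on the BLOCK cutoff.
console.log(verdictFor(65));             // BLOCK
```

The second call illustrates why score-based CI gates need recalibrating: v2 scores evaluated against the old cutoffs can drop a whole severity tier.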
### 6.2 Context-aware entropy scanner
The entropy scanner flags high-Shannon-entropy strings as possible
credentials. On codebases heavy with shader code, bundled JS, CSS-in-JS or
SQL it produced astronomical false-positive rates. v7.0.0 adds three
suppression layers:
1. **File-extension skip:** whole files with these extensions are never
inspected for entropy findings: `.glsl`, `.frag`, `.vert`, `.shader`,
`.wgsl`, `.css`, `.scss`, `.sass`, `.less`, `.svg`, plus compound `.min.js`,
`.min.css`, `.map`. A skip counter
(`calibration.files_skipped_by_extension`) is reported in the scanner
envelope.
2. **Line-level rules 11–17:** a line is suppressed when it contains any of:
GLSL keywords (`uniform`, `vec3`, `texture2D`…), CSS-in-JS templates
(`styled.…`), inline `<svg>` markup, ffmpeg `filter_complex` syntax,
browser `User-Agent` strings, a SQL statement on a dedicated line
(`^\s*(SELECT|INSERT|…)`), or `throw new Error(\`…\`)` templates.
3. **Per-project policy override:** the `entropy` section of
`.llm-security/policy.json` supports:
```json
{
"entropy": {
"thresholds": {
"critical": { "entropy": 5.4, "minLen": 128 },
"high": { "entropy": 5.1, "minLen": 64 },
"medium": { "entropy": 4.7, "minLen": 40 }
},
"suppress_extensions": [".custom"],
"suppress_line_patterns": ["MY_VENDOR_MARKER"],
"suppress_paths": ["vendored/", "generated/"]
}
}
```
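To make the threshold fields above concrete, here is a rough sketch of how a Shannon-entropy check could consume them. The entropy function is the standard bits-per-character definition; the `classify` helper, and the rule that a string must meet both `entropy` and `minLen` to reach a tier, are assumptions about how the scanner interprets the policy, not its actual code.

```javascript
// Shannon entropy in bits per character: -sum(p * log2(p)).
function shannonEntropy(text) {
  const counts = new Map();
  let n = 0;
  for (const ch of text) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
    n += 1;
  }
  let bits = 0;
  for (const c of counts.values()) {
    const p = c / n;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// Threshold values copied from the example policy above.
const thresholds = {
  critical: { entropy: 5.4, minLen: 128 },
  high: { entropy: 5.1, minLen: 64 },
  medium: { entropy: 4.7, minLen: 40 },
};

// Assumption: a tier matches when the string meets BOTH its entropy and
// its minimum length; tiers are tried from most to least severe.
function classify(text, t = thresholds) {
  const h = shannonEntropy(text);
  for (const level of ["critical", "high", "medium"]) {
    if (h >= t[level].entropy && text.length >= t[level].minLen) return level;
  }
  return null;
}

// 64 distinct characters give 6 bits/char at length 64.
const alphabet =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
console.log(classify(alphabet));            // high (too short for critical)
console.log(classify(alphabet + alphabet)); // critical
console.log(classify("a".repeat(200)));     // null (zero entropy)
```

Under this reading, raising `minLen` trims short, noisy matches without touching the entropy bar itself.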
The synthesizer agent reports calibration prominently if >80 % of files were
skipped (signals a policy so aggressive the scan is effectively bypassed)
and omits it silently if <5 % were skipped.
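The extension skip and the reporting rule above can be sketched together. The extension list is the one given in this section; the helper names, the `"normal"` visibility level for ratios between 5 % and 80 %, and the handling of an empty file set are assumptions, since the text only specifies the two extremes.

```javascript
// Extensions listed in the file-extension skip rule above.
const SKIP_EXTENSIONS = [
  ".glsl", ".frag", ".vert", ".shader", ".wgsl",
  ".css", ".scss", ".sass", ".less", ".svg",
  ".min.js", ".min.css", ".map",
];

// endsWith() also covers compound extensions such as ".min.js".
// `extra` stands in for policy-supplied `suppress_extensions`.
function isSkippedByExtension(filePath, extra = []) {
  return [...SKIP_EXTENSIONS, ...extra].some((ext) => filePath.endsWith(ext));
}

// Assumption: "normal" is a placeholder for the unspecified middle range,
// and an empty file set is treated as nothing to report.
function calibrationVisibility(skipped, total) {
  if (total === 0) return "omit";
  const ratio = skipped / total;
  if (ratio > 0.8) return "prominent"; // scan is effectively bypassed
  if (ratio < 0.05) return "omit";     // too small to be worth mentioning
  return "normal";
}

const files = [
  "src/app.js",
  "dist/app.min.js",
  "shaders/water.frag",
  "styles/site.scss",
  "src/db.sql",
];
const skipped = files.filter((f) => isSkippedByExtension(f)).length;
console.log(skipped, calibrationVisibility(skipped, files.length)); // 3 normal
```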
### 6.3 Typosquat allowlist
The DEP scanner flags package names that sit within a small Levenshtein
distance of an entry in a top-N list, to catch typosquats (`lod-ash`,
`expres`). On real codebases this tripped on legitimate short-name tools like
`knip`, `nx`, `tsx`, `uv`, `ruff`. v7.0.0 extends
`knowledge/typosquat-allowlist.json` with 22 npm + 5 PyPI entries for modern
tools.
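A rough sketch of why short names trip a Levenshtein check and how an allowlist sidesteps it. The `POPULAR` list, the distance cutoff of 2, and the in-memory `Set` are all illustrative assumptions; the real scanner's top-N list, cutoff, and the layout of `knowledge/typosquat-allowlist.json` are not described here.

```javascript
// Classic dynamic-programming Levenshtein distance, keeping one row at a time.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumptions for illustration only.
const POPULAR = ["lodash", "express", "next"];    // stand-in for the top-N list
const ALLOWLIST = new Set(["knip", "nx", "tsx"]); // a few names from the text
const MAX_DISTANCE = 2;                           // hypothetical cutoff

// Returns the popular package a name looks like, or null if it is fine.
function typosquatTarget(name, allowlist = ALLOWLIST) {
  if (allowlist.has(name) || POPULAR.includes(name)) return null;
  let best = null;
  for (const candidate of POPULAR) {
    const distance = levenshtein(name, candidate);
    if (distance <= MAX_DISTANCE && (best === null || distance < best.distance)) {
      best = { name: candidate, distance };
    }
  }
  return best ? best.name : null;
}

console.log(typosquatTarget("lod-ash"));       // lodash
console.log(typosquatTarget("expres"));        // express
console.log(typosquatTarget("nx"));            // null (allowlisted)
console.log(typosquatTarget("nx", new Set())); // next (2 edits away)
```

The last call shows the false positive: a two-character name is within two edits of many four-character packages, so an explicit allowlist is a narrower fix than loosening the distance cutoff.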
### 6.4 Tuning workflow
1. Run `/security deep-scan` on a representative codebase.
2. Read `calibration.files_skipped_by_extension` and `files_skipped_by_path`
from the envelope — are they reasonable?
3. Review the top 10 findings. For each false positive, pick the narrowest
suppression that catches it:
- Whole extension noisy → `suppress_extensions`
- One line pattern recurring → `suppress_line_patterns`
- Whole directory vendored → `suppress_paths`
4. Raise thresholds only as a last resort — you're hiding real signal.
5. Re-scan and verify verdict/band/score make sense relative to the finding
set.
---
## 7. Recommended baseline for production
1. Set `CLAUDE_CODE_EFFORT_LEVEL=xhigh` for audit and planning sessions.
2. Set `ENABLE_PROMPT_CACHING_1H=1` globally — reduces cost, does not weaken
@@ -155,9 +251,11 @@ attacks but does not eliminate them.
3. All three plugin hook modes: start at `warn`, promote to `block` after
baselining.
4. Keep sandbox wrappers enabled (default on macOS / Linux).
-5. Periodically run `/security posture` (13-category scorecard) and
5. Periodically run `/security posture` (16-category scorecard) and
`/security dashboard` (cross-project view) to catch drift.
6. After first `/security deep-scan`, run the §6.4 tuning workflow once to
calibrate the noise floor for your codebase.
---
-**Last updated:** 2026-04-17 for v6.2.0.
**Last updated:** 2026-04-19 for v7.0.0.