feat(llm-security)!: v7.0.0 commit 6 — tests, docs, version bump
Final commit in the trustworthy-scoring series. Bundles verdict cutoff
alignment, the last suite of tests, and all documentation touch-points
that quote version numbers or describe v7.0.0 behaviour.
Verdict/band co-monotonicity
- `scanners/lib/severity.mjs` — verdict cutoffs moved from 61/21 to 65/15
so `BLOCK >= 65` and `WARNING >= 15` lock onto the v2 riskBand() boundaries.
Prevents "BLOCK / Medium band" contradictions under the v2 formula.
Scanner hardening (bug fixes from v7.0.0 testing)
- `scanners/entropy-scanner.mjs` — `policy_source` now uses
`existsSync('.llm-security/policy.json')` instead of value-based check.
Old heuristic always reported 'policy.json' because DEFAULT_POLICY now
carries an `entropy.thresholds` section.
- `scanners/lib/file-discovery.mjs` — `.sass` and GPU shader extensions
(`.glsl, .frag, .vert, .shader, .wgsl`) added to TEXT_EXTENSIONS. Without
this, shader files were invisible to file-discovery, so they were never
counted as skipped by the entropy-scanner extension filter.
Tests
- `tests/scanners/entropy-context.test.mjs` (new, 24 tests) — A. File-ext
skip (4), B. Line-level rules 11-17 (8), C. Policy overrides (3).
Fixtures generate 80-char base64 payloads at runtime via
`crypto.randomBytes` to dodge the plugin's own pre-edit credential hook
on the test source.
- `tests/lib/severity.test.mjs` — rewritten with v2 scoring table (70
tests total, was 52).
- `tests/lib/output.test.mjs:243` — "1 critical = score 80" under v2
(was 25 under v1).
- Full suite: 1485/1485 green (was 1461).
Docs
- `CHANGELOG.md` — v7.0.0 entry with BREAKING CHANGES section.
- `README.md` (plugin + marketplace root) — version badge, history table,
plugin-card version string, test count.
- `CLAUDE.md` — header version, "v7.0.0 — Trustworthy scoring" summary
paragraph at the top.
- `docs/security-hardening-guide.md` — new section 6 "Calibration & false
positives" documenting v2 formula, context-aware entropy scanner,
typosquat allowlist, and §6.4 tuning workflow. Existing "Recommended
baseline" section renumbered to §7.
Version bump
- `6.6.0 -> 7.0.0` across package.json, .claude-plugin/plugin.json,
scanners/ide-extension-scanner.mjs VERSION const, README badge,
CLAUDE.md header, marketplace root README card.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Parent: 915aca69e4
Commit: 6f86de937a
14 changed files with 515 additions and 85 deletions
@@ -147,7 +147,103 @@ attacks but does not eliminate them.

---

-## 6. Recommended baseline for production
+## 6. Calibration & false positives (v7.0.0+)

Security scanners live or die by their signal-to-noise ratio. A scanner that
cries "extreme" on every project destroys its own credibility: users learn
to ignore findings, and genuine threats slip past. v7.0.0 ships three
calibration layers to keep that from happening.

### 6.1 Risk-score v2 formula

The v1 formula was a sum-and-cap: `critical*25 + high*10 + medium*4 + low*1`,
capped at 100. Every non-trivial scan collapsed to 100/Extreme regardless of
actual distribution. A codebase with 2 mediums and 100 lows scored the same
as a codebase with 5 criticals.
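
The saturation is easy to reproduce. The following is a minimal sketch of the v1 arithmetic using the weights quoted above; the function name `riskScoreV1Sketch` and the shape of its argument are illustrative assumptions, not the code in `scanners/lib/severity.mjs`.

```javascript
// Sketch of the legacy sum-and-cap score. Weights come from the text above;
// the function name and the `counts` object shape are assumptions.
function riskScoreV1Sketch({ critical = 0, high = 0, medium = 0, low = 0 }) {
  const raw = critical * 25 + high * 10 + medium * 4 + low * 1;
  return Math.min(raw, 100); // cap at 100
}

const noisy = riskScoreV1Sketch({ medium: 2, low: 100 }); // raw sum 108
const severe = riskScoreV1Sketch({ critical: 5 });        // raw sum 125
console.log(noisy, severe); // both saturate at the cap
```

Both calls hit the cap, so the score alone cannot tell a noisy-but-benign scan from one with five criticals.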

v2 (`scanners/lib/severity.mjs`) is severity-dominated and log-scaled within
each tier. The highest severity present selects the score range, and the
finding count moves the score within that range on a log scale:

| Finding mix | Score range (finding count = score) | Band |
|-------------|-------------|------|
| Critical present | 70–95 (1=80, 2=86, 4=90, 10=95) | Critical/Extreme |
| High only | 40–65 (1=48, 5=60, 17=65) | High |
| Medium only | 15–35 (1=20, 5=28, 50=33) | Medium |
| Low only | 1–11 (1=4, 10=11) | Low |
| None | 0 | Low |

Verdict cutoffs (`BLOCK ≥65`, `WARNING ≥15`) are locked to the `riskBand()`
boundaries so you can't get a "BLOCK / Medium band" contradiction. The legacy
formula is kept as `riskScoreV1()` for reference only.
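
As a rough illustration of how the cutoffs partition the score range, here is a sketch that applies them to sample scores from the table. Only the two cutoffs come from the text; `verdictSketch` is a hypothetical name, and `'ALLOW'` is a placeholder for the below-WARNING verdict, which this section does not name.

```javascript
// Cutoffs quoted in the text above.
const BLOCK_CUTOFF = 65;
const WARNING_CUTOFF = 15;

// Hypothetical helper; 'ALLOW' is a placeholder label, not a documented verdict.
function verdictSketch(score) {
  if (score >= BLOCK_CUTOFF) return 'BLOCK';
  if (score >= WARNING_CUTOFF) return 'WARNING';
  return 'ALLOW';
}

// Sample scores taken from the table above.
const samples = [
  ['1 critical', 80],
  ['17 highs', 65],
  ['1 high', 48],
  ['1 medium', 20],
  ['10 lows', 11],
];
for (const [label, score] of samples) {
  console.log(`${label}: score ${score} -> ${verdictSketch(score)}`);
}
```

Every Medium-only score in the table (15–35) stays below the BLOCK cutoff, which is the property the alignment is meant to guarantee.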

**CI impact:** Pipelines with `--fail-on high` keep working (the severity
gate is unaffected). Pipelines with score-based thresholds need recalibration:
the WARNING cutoff moved from 21 to 15 and the BLOCK cutoff from 61 to 65, so
old `score >= 21` corresponds roughly to new `score >= 15`.

### 6.2 Context-aware entropy scanner

The entropy scanner flags high-Shannon-entropy strings as possible
credentials. On codebases heavy with shader code, bundled JS, CSS-in-JS or
SQL it produced very high false-positive rates. v7.0.0 adds three
suppression layers:

1. **File-extension skip.** Whole files with these extensions are never
   inspected for entropy findings: `.glsl`, `.frag`, `.vert`, `.shader`,
   `.wgsl`, `.css`, `.scss`, `.sass`, `.less`, `.svg`, plus compound
   `.min.js`, `.min.css`, `.map`. A skip counter
   (`calibration.files_skipped_by_extension`) is reported in the scanner
   envelope.
2. **Line-level rules 11–17.** Applied when a line contains any of: GLSL
   keywords (`uniform`, `vec3`, `texture2D`…), CSS-in-JS templates
   (`styled.…`), inline `<svg>` markup, ffmpeg `filter_complex` syntax,
   browser `User-Agent` strings, SQL statements on a dedicated line
   (`^\s*(SELECT|INSERT|…)`), or ``throw new Error(`…`)`` templates.
3. **Per-project policy override.** The `entropy` section of
   `.llm-security/policy.json` supports:

```json
{
  "entropy": {
    "thresholds": {
      "critical": { "entropy": 5.4, "minLen": 128 },
      "high":     { "entropy": 5.1, "minLen": 64 },
      "medium":   { "entropy": 4.7, "minLen": 40 }
    },
    "suppress_extensions": [".custom"],
    "suppress_line_patterns": ["MY_VENDOR_MARKER"],
    "suppress_paths": ["vendored/", "generated/"]
  }
}
```
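
To make the layers concrete, here is a minimal sketch of how a file-extension skip and threshold tiers could be applied. It is not the scanner's implementation. It assumes that a string is flagged at a tier only when it meets both that tier's entropy floor and its `minLen`, that tiers are checked strictest first, and that `suppress_extensions` adds to the built-in list rather than replacing it. The names `shannonEntropy`, `classify`, `isSkippedByExtension`, `BUILTIN_SKIP` and `entropyPolicy` are hypothetical.

```javascript
// Shannon entropy in bits per character (standard definition).
function shannonEntropy(text) {
  const counts = new Map();
  let total = 0;
  for (const ch of text) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
    total += 1;
  }
  let bits = 0;
  for (const n of counts.values()) {
    const p = n / total;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// Example values copied from the policy fragment above.
const entropyPolicy = {
  thresholds: {
    critical: { entropy: 5.4, minLen: 128 },
    high: { entropy: 5.1, minLen: 64 },
    medium: { entropy: 4.7, minLen: 40 },
  },
  suppress_extensions: ['.custom'],
};

// Extension list from item 1 above.
const BUILTIN_SKIP = [
  '.glsl', '.frag', '.vert', '.shader', '.wgsl',
  '.css', '.scss', '.sass', '.less', '.svg',
  '.min.js', '.min.css', '.map',
];

// endsWith() also matches compound extensions such as ".min.js".
function isSkippedByExtension(filePath, policy) {
  const lower = filePath.toLowerCase();
  const extensions = [...BUILTIN_SKIP, ...(policy.suppress_extensions || [])];
  return extensions.some((ext) => lower.endsWith(ext));
}

// Assumption: strictest tier first; both floors must be met.
function classify(candidate, policy) {
  const bits = shannonEntropy(candidate);
  for (const tier of ['critical', 'high', 'medium']) {
    const { entropy, minLen } = policy.thresholds[tier];
    if (candidate.length >= minLen && bits >= entropy) return tier;
  }
  return null;
}

// 64 distinct characters, so 6 bits per character.
const alphabet =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
console.log(classify(alphabet.repeat(2), entropyPolicy)); // 'critical'
console.log(classify('a'.repeat(200), entropyPolicy));    // null
console.log(isSkippedByExtension('dist/app.min.js', entropyPolicy)); // true
```

Checking the extension before computing entropy means skipped files cost nothing beyond a string comparison.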

The synthesizer agent reports calibration prominently if more than 80% of
files were skipped (a sign that the policy is so aggressive the scan is
effectively bypassed) and omits it silently if fewer than 5% were skipped.
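
The reporting rule amounts to a ratio check. The sketch below assumes the ratio is skipped files over all discovered files; `calibrationNotice` and its three return labels are hypothetical names for the behaviours described, not identifiers from the plugin.

```javascript
// Hypothetical helper: decides how loudly to report calibration.
// Assumption: ratio = skipped files / total files discovered.
function calibrationNotice(filesSkipped, filesTotal) {
  if (filesTotal === 0) return 'omit'; // nothing scanned, nothing to report
  const ratio = filesSkipped / filesTotal;
  if (ratio > 0.8) return 'prominent'; // scan is effectively bypassed
  if (ratio < 0.05) return 'omit';     // negligible suppression
  return 'normal';
}

console.log(calibrationNotice(90, 100)); // 'prominent'
console.log(calibrationNotice(30, 100)); // 'normal'
console.log(calibrationNotice(2, 100));  // 'omit'
```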

### 6.3 Typosquat allowlist

The DEP scanner flags package names that are Levenshtein-close to entries in
a top-N list to catch typosquats (`lod-ash`, `expres`). On real codebases
this tripped on short-name tools like `knip`, `nx`, `tsx`, `uv`, `ruff`.
v7.0.0 extends `knowledge/typosquat-allowlist.json` with 22 npm + 5 PyPI
entries for modern tools.
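
The interaction between the distance check and the allowlist can be sketched as follows. The Levenshtein function is the standard dynamic-programming algorithm. `POPULAR` is a hypothetical stand-in for the scanner's top-N list, `ALLOWLIST` holds only the five names mentioned above, and the distance limit of 1 is an assumption.

```javascript
// Standard Levenshtein edit distance, keeping one DP row at a time.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical stand-ins; the real lists live in the plugin's knowledge files.
const POPULAR = ['lodash', 'express', 'tsc'];
const ALLOWLIST = new Set(['knip', 'nx', 'tsx', 'uv', 'ruff']);

// Returns the popular name a package resembles, or null if it looks fine.
// Assumption: a name within edit distance 1 counts as "close".
function typosquatTarget(name, maxDistance = 1) {
  if (ALLOWLIST.has(name) || POPULAR.includes(name)) return null;
  for (const popular of POPULAR) {
    if (levenshtein(name, popular) <= maxDistance) return popular;
  }
  return null;
}

console.log(typosquatTarget('lod-ash')); // 'lodash'
console.log(typosquatTarget('expres'));  // 'express'
console.log(typosquatTarget('tsx'));     // null (allowlisted)
```

Short names are the hard case: with three-letter packages a single edit reaches many legitimate names, which is why an explicit allowlist is checked before the distance test.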

### 6.4 Tuning workflow

1. Run `/security deep-scan` on a representative codebase.
2. Read `calibration.files_skipped_by_extension` and `files_skipped_by_path`
   from the envelope. Are they reasonable?
3. Review the top 10 findings. For each false positive, pick the narrowest
   suppression that catches it:
   - Whole extension noisy → `suppress_extensions`
   - One line pattern recurring → `suppress_line_patterns`
   - Whole directory vendored → `suppress_paths`
4. Raise thresholds only as a last resort, because doing so hides real signal.
5. Re-scan and verify verdict/band/score make sense relative to the finding
   set.
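
Step 3 can be kept disciplined with a small helper that only ever touches the three suppression lists. This is a sketch under assumptions: `addSuppression` and the scope labels are hypothetical, and only the three `suppress_*` key names come from the policy format in §6.2.

```javascript
// Maps the three scopes from step 3 to their policy keys.
const SUPPRESSION_KEYS = {
  extension: 'suppress_extensions',
  line: 'suppress_line_patterns',
  path: 'suppress_paths',
};

// Hypothetical helper: returns a new policy with one suppression added.
// Thresholds are never modified here (see step 4).
function addSuppression(policy, scope, value) {
  const key = SUPPRESSION_KEYS[scope];
  if (!key) throw new Error(`unknown suppression scope: ${scope}`);
  const entropy = policy.entropy || {};
  const existing = entropy[key] || [];
  if (existing.includes(value)) return policy; // already suppressed
  return { ...policy, entropy: { ...entropy, [key]: [...existing, value] } };
}

let tunedPolicy = { entropy: { suppress_paths: ['vendored/'] } };
tunedPolicy = addSuppression(tunedPolicy, 'path', 'generated/');
tunedPolicy = addSuppression(tunedPolicy, 'line', 'MY_VENDOR_MARKER');
console.log(JSON.stringify(tunedPolicy, null, 2));
```

The resulting object could be written to `.llm-security/policy.json` before the re-scan in step 5.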

---

## 7. Recommended baseline for production

1. Set `CLAUDE_CODE_EFFORT_LEVEL=xhigh` for audit and planning sessions.
2. Set `ENABLE_PROMPT_CACHING_1H=1` globally; reduces cost, does not weaken
@@ -155,9 +251,11 @@ attacks but does not eliminate them.
3. All three plugin hook modes: start at `warn`, promote to `block` after
   baselining.
4. Keep sandbox wrappers enabled (default on macOS / Linux).
-5. Periodically run `/security posture` (13-category scorecard) and
+5. Periodically run `/security posture` (16-category scorecard) and
   `/security dashboard` (cross-project view) to catch drift.
6. After first `/security deep-scan`, run the §6.4 tuning workflow once to
   calibrate the noise floor for your codebase.

---

-**Last updated:** 2026-04-17 for v6.2.0.
+**Last updated:** 2026-04-19 for v7.0.0.