fix(dep): B7 — token-overlap typosquat heuristic alongside Levenshtein

Critical-review §2 B7 finding: pure Levenshtein <=2 misses the most common
modern typosquat pattern — popular-name + token-injection suffix. Examples:
  lodash → lodash-utils    (edit distance 6, not flagged pre-B7)
  react  → react-helper    (edit distance 7, not flagged pre-B7)
  express → express-wrapper (edit distance 8, not flagged pre-B7)
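
The distances above can be reproduced with a small standalone sketch. The
two-row levenshtein() below is an illustrative stand-in, assumed to agree with
the repo's own levenshtein() in string-utils.mjs on these inputs; the <=2
threshold is the pre-B7 rule described in this message.

```javascript
// Illustrative two-row Levenshtein; the repo's own implementation may differ.
function levenshtein(a, b) {
  const m = a.length;
  const n = b.length;
  // prev[j] = distance between the first i-1 chars of a and first j chars of b
  let prev = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    const curr = [i];
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[n];
}

const pairs = [
  ['lodash', 'lodash-utils'],
  ['react', 'react-helper'],
  ['express', 'express-wrapper'],
];

for (const [popular, declared] of pairs) {
  const d = levenshtein(popular, declared);
  // Every pair is far above the <=2 threshold, so none is flagged.
  console.log(`${popular} -> ${declared}: distance ${d}, flagged: ${d <= 2}`);
}
```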

Three coordinated edits:

scanners/lib/string-utils.mjs
- Adds tokenize(name): string[]    splits on -/_, lowercases
- Adds tokenOverlap(a, b): number  intersection.size / min(|a|,|b|)
- Adds TYPOSQUAT_SUSPICIOUS_TOKENS frozen list of common typosquat
  suffixes. Excludes language-extension tokens (js, jsx, ts, tsx) — the
  v7.0.0 allowlist contains `tsx` as a legit package and including the
  same token in the suspicious set creates a contradiction. Caught by
  the new allowlist-intersection-guard test. Also excludes 'pro'
  (legitimate edition marker).
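
A rough sketch of what an allowlist-intersection guard could look like. The
two lists are shortened, hypothetical excerpts and allowlistConflicts() is an
invented name; the real test in tests/lib/string-utils-tokens.test.mjs may be
structured differently.

```javascript
// Hypothetical excerpts; the real lists live in the repo.
const SUSPICIOUS = Object.freeze(['utils', 'helper', 'cli', 'sdk']);
const ALLOWLIST = Object.freeze(['knip', 'tsx', 'nx', 'rimraf']);

// Same splitting rule as tokenize() in this commit.
function tokenize(name) {
  if (!name) return [];
  return name.toLowerCase().split(/[-_]+/).filter(t => t.length > 0);
}

// Returns every allowlisted package that contains a suspicious token.
// An empty result means the two lists do not contradict each other.
function allowlistConflicts(allowlist, suspicious) {
  const bad = new Set(suspicious);
  return allowlist.filter(pkg => tokenize(pkg).some(t => bad.has(t)));
}

console.log(allowlistConflicts(ALLOWLIST, SUSPICIOUS));
// Adding 'tsx' to the suspicious set reproduces the conflict described above.
console.log(allowlistConflicts(ALLOWLIST, [...SUSPICIOUS, 'tsx']));
```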

scanners/dep-auditor.mjs + scanners/supply-chain-recheck.mjs
- New checkTyposquatTokenOverlap() helper — fires AFTER Levenshtein 1/2
  branches, only when:
    1. popular package's tokens ⊊ declared name's tokens (declared is a strict superset)
    2. declared name has at least one suspicious suffix
    3. popular package is in topCutoff window
  All three conditions required — conservative by design. Allowlist
  precedence preserved (existing 22 npm + 13 PyPI entries always pass).
  MEDIUM severity, NOT block. New finding title prefix:
  "Possible typosquatting via token-overlap".

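The three conditions can be sketched roughly as follows. This is not the
helper from dep-auditor.mjs: the function name, signature, return shape, and
the topCutoff default are hypothetical, and it assumes the suspicious token
must be one of the tokens added on top of the popular name.

```javascript
const SUSPICIOUS_TOKENS = new Set([
  'utils', 'util', 'helper', 'helpers', 'core', 'plus', 'extra', 'extras',
  'bin', 'cli', 'tool', 'tools',
  'wrapper', 'wrappers', 'lib', 'libs', 'kit', 'sdk', 'shim',
]);

function tokenize(name) {
  if (!name) return [];
  return name.toLowerCase().split(/[-_]+/).filter(t => t.length > 0);
}

// Hypothetical sketch: popularNames is assumed to be ordered by popularity.
function checkTokenOverlapSketch(declared, popularNames, opts = {}) {
  const { topCutoff = 100, allowlist = new Set() } = opts;
  if (allowlist.has(declared)) return null; // allowlist takes precedence
  const declaredTokens = new Set(tokenize(declared));

  // Condition 3: only consider the topCutoff window of popular packages.
  for (const popular of popularNames.slice(0, topCutoff)) {
    const popTokens = new Set(tokenize(popular));
    // Condition 1: strict superset, so equal-sized token sets never match.
    if (popTokens.size === 0 || popTokens.size >= declaredTokens.size) continue;
    if (![...popTokens].every(t => declaredTokens.has(t))) continue;
    // Condition 2: at least one added token is a suspicious suffix.
    const extras = [...declaredTokens].filter(t => !popTokens.has(t));
    if (!extras.some(t => SUSPICIOUS_TOKENS.has(t))) continue;
    return {
      severity: 'MEDIUM',
      title: `Possible typosquatting via token-overlap: ${declared} vs ${popular}`,
    };
  }
  return null;
}

const popular = ['lodash', 'react', 'express'];
console.log(checkTokenOverlapSketch('lodash-utils', popular));
console.log(checkTokenOverlapSketch('react-router-dom', popular));
```

Requiring all three conditions keeps names such as react-router-dom quiet:
the superset check passes, but neither added token is in the suspicious set.
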
Tests: +21 cases across two new files
- tests/lib/string-utils-tokens.test.mjs (15) — tokenize, tokenOverlap,
  TYPOSQUAT_SUSPICIOUS_TOKENS frozen contract, allowlist-intersection
  guard (caught the tsx conflict on first run)
- tests/scanners/dep-token-overlap.test.mjs (7) — integration via
  in-memory tmpdir fixtures: lodash-utils flagged, react-helper flagged,
  express-wrapper flagged, lodash exact NOT flagged, allowlist tools
  (knip/tsx/nx/rimraf) NOT flagged, react-router-dom (no suspicious
  suffix) NOT flagged, react itself (equal token set, not superset)
  NOT flagged.

Existing dep.test.mjs and supply-chain-recheck.test.mjs unchanged —
all green (149 → 149 regression guard).

Suite: 1570 → 1591 (+21). All green.
Kjell Tore Guttormsen 2026-04-29 14:10:53 +02:00
commit 5f8f2d3c41
5 changed files with 438 additions and 2 deletions

@@ -54,6 +54,72 @@ export function levenshtein(a, b) {
  return prev[n];
}

/**
 * Split a package name into lowercase tokens on `-` and `_` boundaries.
 * Used by the B7 typosquat token-overlap heuristic. Empty tokens are
 * dropped. Single-character tokens are kept (some package names like
 * `a-b` are real).
 *
 * @param {string} name
 * @returns {string[]}
 */
export function tokenize(name) {
  if (!name) return [];
  return name
    .toLowerCase()
    .split(/[-_]+/)
    .filter(t => t.length > 0);
}

/**
 * Token-overlap ratio between two package names. Returns the size of the
 * intersection divided by the size of the smaller token set. Returns 0 if
 * either input is empty.
 *
 * Example: `tokenOverlap('lodash-utils', 'lodash')`       → 1.0
 *          `tokenOverlap('react-router-dom', 'react')`    → 1.0
 *          `tokenOverlap('react-helper', 'react-router')` → 0.5
 *          `tokenOverlap('foo', 'bar')`                   → 0.0
 *
 * Used by B7 (v7.2.0) as a complementary signal alongside Levenshtein:
 * Levenshtein <=2 catches small typos; token-overlap catches
 * popular-name-with-suffix typosquats.
 *
 * @param {string} a
 * @param {string} b
 * @returns {number} 0..1
 */
export function tokenOverlap(a, b) {
  const ta = new Set(tokenize(a));
  const tb = new Set(tokenize(b));
  if (ta.size === 0 || tb.size === 0) return 0;
  let intersection = 0;
  for (const t of ta) if (tb.has(t)) intersection++;
  return intersection / Math.min(ta.size, tb.size);
}

/**
 * Suspicious suffix tokens commonly used by typosquats to dress up a
 * popular package name. Module-level for B7 reuse.
 *
 * Excluded by design (would conflict with the v7.0.0 typosquat allowlist
 * or trigger false positives on legitimate packages):
 *   - `js`, `jsx`, `ts`, `tsx`: language-extension suffixes used by many
 *     legitimate packages (`react-jsx`, the `tsx` runtime, etc.). The
 *     v7.0.0 allowlist contains `tsx` directly; including the same token
 *     in the suspicious set would create an internal contradiction.
 *   - `pro`: too common as a legitimate edition marker (`vue-pro`,
 *     `tailwindcss-pro`).
 *
 * Kept tokens are the unambiguous typosquat suffixes: utility/helper
 * dressing, wrapper/shim packages, and tool/cli/sdk/kit qualifiers.
 */
export const TYPOSQUAT_SUSPICIOUS_TOKENS = Object.freeze([
  'utils', 'util', 'helper', 'helpers', 'core', 'plus', 'extra', 'extras',
  'bin', 'cli', 'tool', 'tools',
  'wrapper', 'wrappers', 'lib', 'libs', 'kit', 'sdk', 'shim',
]);

/**
 * Check if a string looks like base64-encoded data.
 * @param {string} s