Critical-review §4 E1 finding: pre-v7.2.0 the Unicode-stego detector
(`containsUnicodeTags`) covered only U+E0001-E007F (Tag block). Private
Use Areas — also invisible in most terminals and surviving normalization
— were not detected. Attackers could encode payloads in PUA codepoints
that pass through `scanForInjection` undetected.
Coverage extended to:
- U+E0001-E007F Unicode Tag block (existing — DeepMind kat. 1)
- U+F0000-FFFFD Supplementary PUA-A (NEW — E1)
- U+100000-10FFFD Supplementary PUA-B (NEW — E1)
Detection-only for PUA: PUA characters have NO standard ASCII mapping,
so `decodeUnicodeTags` leaves them unchanged. Detection alone is
sufficient — `scanForInjection` emits HIGH on any presence, regardless
of decoded content.
Function name `containsUnicodeTags` preserved for back-compat. All
existing call sites (injection-patterns.mjs:259, etc.) work unchanged.
Semantically the function is now "containsHiddenUnicode".
Tests: +21 cases in tests/lib/string-utils-hidden-unicode.test.mjs:
- 5 Tag-block regression guards
- 4 PUA-A range cases (start, just-inside, end, buried-in-ASCII)
- 3 PUA-B range cases
- 5 boundary cases (gap U+E0080-EFFFF, U+10FFFE noncharacter, emoji,
CJK, Latin Extended — all must be FALSE)
- 4 decodeUnicodeTags passthrough cases (PUA-A unchanged, PUA-B
unchanged, Tag block still decodes, mixed Tag+PUA)
Suite: 1596 → 1617 (+21). All green.
Critical-review §2 B7 finding: pure Levenshtein <=2 misses the most common
modern typosquat pattern — popular-name + token-injection suffix. Examples:
lodash → lodash-utils (edit distance 6, not flagged pre-B7)
react → react-helper (edit distance 7, not flagged pre-B7)
express → express-wrapper (edit distance 8, not flagged pre-B7)
Three coordinated edits:
scanners/lib/string-utils.mjs
- Adds tokenize(name): string[] splits on -/_, lowercases
- Adds tokenOverlap(a, b): number intersection.size / min(|a|,|b|)
- Adds TYPOSQUAT_SUSPICIOUS_TOKENS frozen list of common typosquat
suffixes. Excludes language-extension tokens (js, jsx, ts, tsx) — the
v7.0.0 allowlist contains `tsx` as a legit package and including the
same token in the suspicious set creates a contradiction. Caught by
the new allowlist-intersection-guard test. Also excludes 'pro'
(legitimate edition marker).
scanners/dep-auditor.mjs + scanners/supply-chain-recheck.mjs
- New checkTyposquatTokenOverlap() helper — fires AFTER Levenshtein 1/2
branches, only when:
1. popular package's tokens ⊆ declared name's tokens (strict superset)
2. declared name has at least one suspicious suffix
3. popular package is in topCutoff window
All three conditions required — conservative by design. Allowlist
precedence preserved (existing 22 npm + 13 PyPI entries always pass).
MEDIUM severity, NOT block. New finding title prefix:
"Possible typosquatting via token-overlap".
Tests: +21 cases across two new files
- tests/lib/string-utils-tokens.test.mjs (15) — tokenize, tokenOverlap,
TYPOSQUAT_SUSPICIOUS_TOKENS frozen contract, allowlist-intersection
guard (caught the tsx conflict on first run)
- tests/scanners/dep-token-overlap.test.mjs (7) — integration via
in-memory tmpdir fixtures: lodash-utils flagged, react-helper flagged,
express-wrapper flagged, lodash exact NOT flagged, allowlist tools
(knip/tsx/nx/rimraf) NOT flagged, react-router-dom (no suspicious
suffix) NOT flagged, react itself (equal token set, not superset)
NOT flagged.
Existing dep.test.mjs and supply-chain-recheck.test.mjs unchanged —
all green (149 → 149 regression guard).
Suite: 1570 → 1591 (+21). All green.