feat(injection): E16 — homoglyph NFKC fold before every pattern match

Critical-review §4 E16 finding: pre-v7.2.0 homoglyph normalization fired
ONLY for the MEDIUM-advisory "obfuscation present" signal. Pattern
matchers in scanForInjection compared against raw + decoded variants
only — they did NOT compare against a fold-normalized variant. As a
result, "ignоre previous instructions" (Cyrillic о, U+043E) bypassed
the CRITICAL "ignore previous" pattern.
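
The bypass can be reproduced in a few lines. This is a minimal sketch: the regex is a hypothetical stand-in for the CRITICAL "ignore previous" pattern, and the real one in injection-patterns.mjs may differ.

```javascript
// Hypothetical stand-in for the CRITICAL pattern; the real regex may differ.
const ignorePrevious = /ignore\s+previous/i;

const latin = 'ignore previous instructions';
// Same phrase, but the first "o" is Cyrillic U+043E.
const spoofed = 'ign\u043Ere previous instructions';

const latinMatches = ignorePrevious.test(latin);     // true
const spoofedMatches = ignorePrevious.test(spoofed); // false

console.log({ latinMatches, spoofedMatches });
```

The two strings render identically in most fonts, yet only the all-Latin one matches.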

Two coordinated edits:

scanners/lib/string-utils.mjs
- Adds HOMOGLYPH_MAP (frozen) — surgical Cyrillic/Greek → Latin map.
  ~25 entries focused on injection-vocabulary letters
  (a, e, o, c, p, x, y, i, j, s, l, t, v, A, E, O, C, P, X, Y, T).
- Adds foldHomoglyphs(s) — pipeline: NFKC → apply HOMOGLYPH_MAP.
  NFKC handles Mathematical Alphanumeric (U+1D400 block), fullwidth
  Latin (U+FF21 block), ligatures, width variants.
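
A short sketch of what the NFKC step covers on its own, and where it stops. It uses only the built-in String.prototype.normalize; the sample strings are illustrative.

```javascript
const nfkc = (s) => s.normalize('NFKC');

// "ignore" written in Mathematical Bold small letters (U+1D41A onward).
const mathBold = '\u{1D422}\u{1D420}\u{1D427}\u{1D428}\u{1D42B}\u{1D41E}';
// "ignore" written in fullwidth Latin small letters (U+FF41 onward).
const fullwidth = '\uFF49\uFF47\uFF4E\uFF4F\uFF52\uFF45';
// "file" spelled with the fi ligature U+FB01.
const ligature = '\uFB01le';

console.log(nfkc(mathBold));  // ignore
console.log(nfkc(fullwidth)); // ignore
console.log(nfkc(ligature));  // file

// Cyrillic U+043E is a distinct letter, not a compatibility variant,
// so NFKC leaves it alone. This is the gap the explicit map covers.
console.log(nfkc('\u043E') === 'o'); // false
```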

Excluded by design from HOMOGLYPH_MAP:
- Latin Extended (æ, ø, å, é, è, ñ, ü, ö, ä, ç, ß, þ, ð) — legitimate
  Norwegian/German/French/Spanish letters. Map them and we false-positive
  on every non-English source file.
- Greek letters not visually overlapping (β, γ, δ, ...)
- Cyrillic letters not visually overlapping (б, г, д, ж, ...)
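
The effect of these exclusions can be sketched with a three-entry subset of the map. SUBSET and fold are hypothetical names for illustration, not the real HOMOGLYPH_MAP or foldHomoglyphs.

```javascript
// Illustrative three-entry subset (Cyrillic a, e, o); not the real HOMOGLYPH_MAP.
const SUBSET = Object.freeze({ '\u0430': 'a', '\u0435': 'e', '\u043E': 'o' });

// NFKC first, then per-codepoint lookup; unmapped characters pass through.
const fold = (s) =>
  Array.from(s.normalize('NFKC'), (ch) => SUBSET[ch] ?? ch).join('');

const norwegian = 'bl\u00E5b\u00E6r og r\u00F8mme'; // "blåbær og rømme"
const spanish = 'ma\u00F1ana';                      // "mañana"
const spoofed = 'ign\u043Er\u0435';                 // "ignore" with Cyrillic o and e

console.log(fold(norwegian) === norwegian); // true
console.log(fold(spanish) === spanish);     // true
console.log(fold(spoofed));                 // ignore
```

Because the Latin Extended letters are absent from the map and already in composed form, they pass through unchanged.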

scanners/lib/injection-patterns.mjs
- scanForInjection now builds a 4-variant set: raw, normalized,
  folded(raw), folded(normalized). Set deduplication skips redundant
  identical variants. Existing dedup-by-label (seenLabels Set) prevents
  double-counts when the same pattern matches in multiple variants.
- foldHomoglyphs added to the imports.
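
The variant handling described above might look roughly like the sketch below. MINI_MAP, fold, decode, PATTERNS, and scanSketch are all hypothetical stand-ins; the real normalization, map, and pattern list in scanners/lib are more extensive.

```javascript
// Hypothetical stand-ins for the real helpers in scanners/lib.
const MINI_MAP = { '\u0430': 'a', '\u0435': 'e', '\u043E': 'o' };
const fold = (s) =>
  Array.from(s.normalize('NFKC'), (ch) => MINI_MAP[ch] ?? ch).join('');
// Stand-in for the decode step: URL-decoding only, falling back to the input.
const decode = (s) => { try { return decodeURIComponent(s); } catch { return s; } };

const PATTERNS = [
  { label: 'ignore-previous', severity: 'CRITICAL', regex: /ignore\s+previous/i },
];

function scanSketch(raw) {
  const normalized = decode(raw);
  // A Set drops variants that turn out identical, so plain ASCII is scanned once.
  const variants = new Set([raw, normalized, fold(raw), fold(normalized)]);
  const seenLabels = new Set();
  const findings = [];
  for (const text of variants) {
    for (const { label, severity, regex } of PATTERNS) {
      // Report each pattern at most once, whichever variant hits first.
      if (seenLabels.has(label) || !regex.test(text)) continue;
      seenLabels.add(label);
      findings.push({ label, severity });
    }
  }
  return { findings, variantCount: variants.size };
}

const plain = scanSketch('ignore previous instructions');
const spoofed = scanSketch('ign\u043Ere%20previous instructions');
console.log(plain.variantCount, plain.findings.length);     // 1 1
console.log(spoofed.variantCount, spoofed.findings.length); // 4 1
```

In the spoofed case only folded(normalized) matches, which is why all four variants are needed rather than folding the raw input alone.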

Tests: +27 cases in tests/lib/string-utils-homoglyph.test.mjs:
- 6 Cyrillic → Latin (lowercase, uppercase, multiple substitutions,
  Palochka U+04CF)
- 3 Greek → Latin
- 2 NFKC normalization (Math Bold, Fullwidth)
- 8 preserves-non-confusable (Norwegian æøå, German umlauts, French
  accents, Spanish ñ, emoji, CJK, Arabic/Hebrew)
- 3 edge cases (empty, null/undefined, idempotency)
- 5 scanForInjection integration (Cyrillic ignore, Cyrillic Assistant,
  Norwegian non-trigger, benign "ignore" comment, mixed Cyrillic+Greek)

Test-development found: U+1D5DC is "I" not "A" (test pin caught my
codepoint mistake — fixed during dev).
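
The codepoint can be checked directly. A small sketch using only built-ins:

```javascript
// Mathematical Sans-Serif Bold capitals start at U+1D5D4 ("A"),
// so U+1D5DC sits eight letters later.
const sansBoldA = String.fromCodePoint(0x1D5D4).normalize('NFKC');
const sansBoldI = String.fromCodePoint(0x1D5DC).normalize('NFKC');

console.log(sansBoldA, sansBoldI); // A I
```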

Suite: 1617 → 1644 (+27). All green.
Kjell Tore Guttormsen 2026-04-29 14:22:05 +02:00
commit ec4ae268da
3 changed files with 291 additions and 4 deletions


@@ -378,6 +378,92 @@ export function stripBidiOverrides(s) {
  return s.replace(/[\u202A-\u202E\u2066-\u2069]/g, '');
}
// ---------------------------------------------------------------------------
// Homoglyph folding (E16, v7.2.0)
// ---------------------------------------------------------------------------
/**
* Confusable mapping: characters that LOOK like Latin letters but are
* different codepoints (most commonly Cyrillic and Greek). Surgical map
* focused on letters that appear in injection vocabulary
* (`ignore`, `system`, `you are`, `assistant`, `tool`, `response`).
*
* Excluded by design:
* - Latin Extended characters (æ, ø, å, é, è, ñ, ü, ö, ä, ç, ß, þ, ð, etc.):
* these are legitimate letters in Norwegian, German, Danish, Spanish,
* French, Icelandic, etc., and would generate false positives in
* non-English source code or documentation.
* - Greek letters that don't visually overlap with Latin (`β`, `γ`, `δ`, ...)
* - Cyrillic letters that don't visually overlap (`б`, `г`, `д`, `ж`, ...)
* - Mathematical alphanumeric symbols (the U+1D400 block): covered by
* NFKC normalization in `foldHomoglyphs` itself.
*
* The map is deliberately small (~25 entries). Adding more risks
* false-positive escalation on benign multilingual content.
*/
const HOMOGLYPH_MAP = Object.freeze({
// Cyrillic → Latin (lowercase)
'а': 'a', // U+0430
'е': 'e', // U+0435
'о': 'o', // U+043E
'с': 'c', // U+0441
'р': 'p', // U+0440
'х': 'x', // U+0445
'у': 'y', // U+0443
'і': 'i', // U+0456 (Ukrainian)
'ј': 'j', // U+0458
'ѕ': 's', // U+0455
'ӏ': 'l', // U+04CF (Cyrillic Palochka)
// Cyrillic → Latin (uppercase)
'А': 'A', // U+0410
'Е': 'E', // U+0415
'О': 'O', // U+041E
'С': 'C', // U+0421
'Р': 'P', // U+0420
'Х': 'X', // U+0425
'У': 'Y', // U+0423
// Greek → Latin (only the unambiguous Latin-look-alikes)
'α': 'a', // U+03B1
'ο': 'o', // U+03BF
'ρ': 'p', // U+03C1
'ι': 'i', // U+03B9
'ν': 'v', // U+03BD
'τ': 't', // U+03C4
// Greek uppercase
'Α': 'A', // U+0391
'Ο': 'O', // U+039F
'Ρ': 'P', // U+03A1
'Τ': 'T', // U+03A4
});
/**
* Fold visually-confusable characters to their Latin look-alikes. Used by
* E16 (v7.2.0) to neutralize homoglyph-substitution injection attacks
* before pattern matching.
*
* Pipeline:
* 1. NFKC normalize: collapses Mathematical Alphanumeric (U+1D400),
* width variants, ligatures, and other compatibility decompositions.
* 2. Apply HOMOGLYPH_MAP: Cyrillic/Greek look-alikes → Latin.
*
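* Idempotent for input without combining marks:
* `foldHomoglyphs(foldHomoglyphs(s)) === foldHomoglyphs(s)`.
* A mapped letter followed by a combining mark (e.g. U+0430 U+0301) folds
* to `a` + U+0301, which a second NFKC pass composes into U+00E1.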
*
* Norwegian/Polish/German/etc. text is NOT affected: characters like
* æ, ø, å, é, ñ, ü, ö, ä are not in HOMOGLYPH_MAP.
*
* @param {string} s
* @returns {string}
*/
export function foldHomoglyphs(s) {
  if (!s) return s;
  const normalized = s.normalize('NFKC');
  let out = '';
  for (const ch of normalized) {
    out += HOMOGLYPH_MAP[ch] || ch;
  }
  return out;
}
/**
* Normalize a string by decoding all known obfuscation layers.
* Runs up to 3 iterations to catch multi-layered encoding (e.g., base64 of URL-encoded).