feat(injection): E16 — homoglyph NFKC fold before every pattern match

Critical-review §4 E16 finding: pre-v7.2.0 homoglyph normalization fired
ONLY for the MEDIUM-advisory "obfuscation present" signal. Pattern
matchers in scanForInjection compared against raw + decoded variants
only — they did NOT compare against a fold-normalized variant. As a
result, "ignоre previous instructions" (Cyrillic о, U+043E) bypassed
the CRITICAL "ignore previous" pattern.
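
The bypass is easy to reproduce in isolation. In this minimal sketch, the
regex is a hypothetical stand-in for the CRITICAL "ignore previous" pattern;
the actual CRITICAL_PATTERNS entry is not shown in this message.

```javascript
// Hypothetical stand-in for the CRITICAL "ignore previous" pattern.
const ignorePrevious = /ignore\s+previous/i;

const plain = 'ignore previous instructions';
// Same phrase with Latin "o" (U+006F) swapped for Cyrillic "о" (U+043E).
const spoofed = 'ign\u043Ere previous instructions';

const plainHit = ignorePrevious.test(plain);     // true
const spoofedHit = ignorePrevious.test(spoofed); // false: U+043E is not U+006F

// NFKC alone does not help here: Cyrillic о has no compatibility mapping
// to Latin o, which is why an explicit homoglyph map is needed on top.
const nfkcHit = ignorePrevious.test(spoofed.normalize('NFKC')); // false

console.log({ plainHit, spoofedHit, nfkcHit });
```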

Two coordinated edits:

scanners/lib/string-utils.mjs
- Adds HOMOGLYPH_MAP (frozen) — surgical Cyrillic/Greek → Latin map.
  ~25 entries focused on injection-vocabulary letters
  (a, e, o, c, p, x, y, i, j, s, l, A, E, O, C, P, X, Y, T).
- Adds foldHomoglyphs(s) — pipeline: NFKC → apply HOMOGLYPH_MAP.
  NFKC handles Mathematical Alphanumeric Symbols (block starting at
  U+1D400), fullwidth Latin (capitals start at U+FF21), ligatures, and
  other width variants.
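
The two-step pipeline can be sketched as follows. This is an illustration
under assumptions, not the committed code: the map below is a small subset
chosen for the example, and returning '' for non-string input is a guess at
the null/undefined behavior.

```javascript
// Illustrative subset only; the committed HOMOGLYPH_MAP has ~25 entries.
const HOMOGLYPH_MAP = Object.freeze({
  '\u0430': 'a', // Cyrillic а
  '\u0435': 'e', // Cyrillic е
  '\u043E': 'o', // Cyrillic о
  '\u0441': 'c', // Cyrillic с
  '\u0440': 'p', // Cyrillic р
  '\u0410': 'A', // Cyrillic А
  '\u03BF': 'o', // Greek ο (omicron)
  '\u0391': 'A', // Greek Α (Alpha)
});

function foldHomoglyphs(s) {
  // Assumption: non-string and empty input fold to the empty string.
  if (typeof s !== 'string' || s.length === 0) return '';
  let out = '';
  // Step 1: NFKC folds fullwidth, math-alphanumeric and ligature forms.
  // Step 2: the map replaces look-alike letters that NFKC leaves alone.
  for (const ch of s.normalize('NFKC')) {
    out += HOMOGLYPH_MAP[ch] ?? ch;
  }
  return out;
}

console.log(foldHomoglyphs('ign\u043Ere previous instructions')); // ignore previous instructions
console.log(foldHomoglyphs('\uFF29\uFF27\uFF2E\uFF2F\uFF32\uFF25')); // IGNORE
```

Iterating with for...of walks code points rather than UTF-16 units, so
astral-plane input is not split mid-character.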

Excluded by design from HOMOGLYPH_MAP:
- Accented and extended Latin letters (æ, ø, å, é, è, ñ, ü, ö, ä, ç, ß,
  þ, ð): legitimate Norwegian/German/French/Spanish/Icelandic letters.
  Mapping them would cause false positives on every non-English source
  file.
- Greek letters not visually overlapping (β, γ, δ, ...)
- Cyrillic letters not visually overlapping (б, г, д, ж, ...)
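
A quick check, using only the built-in String.prototype.normalize, suggests
the NFKC step does not disturb the excluded Latin letters on its own, so
leaving them out of the map is enough to preserve them.

```javascript
// Letters the commit excludes from HOMOGLYPH_MAP, written as escapes so the
// literals are guaranteed to be in precomposed form.
const excluded = ['\u00E6', '\u00F8', '\u00E5', '\u00E9', '\u00E8', '\u00F1',
                  '\u00FC', '\u00F6', '\u00E4', '\u00E7', '\u00DF', '\u00FE', '\u00F0'];

// NFKC leaves every one of them untouched.
const unchanged = excluded.every((ch) => ch.normalize('NFKC') === ch);

// A decomposed "e" + combining acute (U+0301) is recomposed, not stripped.
const composed = 'e\u0301'.normalize('NFKC') === '\u00E9';

// NFKD, by contrast, splits the accent off the base letter.
const nfkdSplits = '\u00E9'.normalize('NFKD') === 'e\u0301';

console.log({ unchanged, composed, nfkdSplits });
```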

scanners/lib/injection-patterns.mjs
- scanForInjection now builds a 4-variant set: raw, normalized,
  folded(raw), folded(normalized). Set deduplication skips redundant
  identical variants. Existing dedup-by-label (seenLabels Set) prevents
  double-counts when the same pattern matches in multiple variants.
- foldHomoglyphs added to the imports.
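
The variant-set and dedup-by-label behavior can be sketched on its own. All
helpers here are hypothetical stand-ins so the example runs without the
repository: normalizeForScan only strips zero-width spaces, foldHomoglyphs
maps a single letter, and PATTERNS is an invented one-entry table in the
{ pattern, label } shape.

```javascript
// Stand-ins; the real helpers live in scanners/lib/string-utils.mjs.
const normalizeForScan = (s) => s.replace(/\u200B/g, '');
const foldHomoglyphs = (s) => s.normalize('NFKC').replace(/\u043E/g, 'o');

// Hypothetical pattern table.
const PATTERNS = [{ pattern: /ignore\s+previous/i, label: 'ignore-previous' }];

function scanSketch(text) {
  const normalized = normalizeForScan(text);
  // A Set drops identical strings on insertion and keeps insertion order:
  // raw, normalized, folded(raw), folded(normalized).
  const variants = new Set([
    text,
    normalized,
    foldHomoglyphs(text),
    foldHomoglyphs(normalized),
  ]);
  const seenLabels = new Set();
  const findings = [];
  for (const variant of variants) {
    for (const { pattern, label } of PATTERNS) {
      // Report each label once, even if several variants match it.
      if (seenLabels.has(label) || !pattern.test(variant)) continue;
      seenLabels.add(label);
      findings.push(label);
    }
  }
  return { variantCount: variants.size, findings };
}

console.log(scanSketch('ignore previous instructions'));
console.log(scanSketch('ign\u043Ere previous instructions'));
console.log(scanSketch('ign\u043E\u200Bre previous instructions'));
```

The third input combines a Cyrillic о with a zero-width space, so only
folded(normalized) matches; that is the case the fourth variant exists for.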

Tests: +27 cases in tests/lib/string-utils-homoglyph.test.mjs:
- 6 Cyrillic → Latin (lowercase, uppercase, multiple substitutions,
  Palochka U+04CF)
- 3 Greek → Latin
- 2 NFKC normalization (Math Bold, Fullwidth)
- 8 preserves-non-confusable (Norwegian æøå, German umlauts, French
  accents, Spanish ñ, emoji, CJK, Arabic/Hebrew)
- 3 edge cases (empty, null/undefined, idempotency)
- 5 scanForInjection integration (Cyrillic ignore, Cyrillic Assistant,
  Norwegian non-trigger, benign "ignore" comment, mixed Cyrillic+Greek)

Found during test development: U+1D5DC is Mathematical Sans-Serif Bold "I",
not "A" (a pinned test caught my codepoint mistake; fixed during
development).
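
The codepoint can be verified directly with NFKC, which maps Mathematical
Alphanumeric Symbols to their ASCII letters. The helper name nfkcOf is
introduced only for this example.

```javascript
// Hypothetical helper: NFKC-normalize a single code point.
const nfkcOf = (cp) => String.fromCodePoint(cp).normalize('NFKC');

const sansBoldA = nfkcOf(0x1D5D4); // 'A' (Sans-Serif Bold capitals start here)
const sansBoldI = nfkcOf(0x1D5DC); // 'I' (eight positions after A)
const boldA = nfkcOf(0x1D400);     // 'A' (Mathematical Bold capital A)

console.log({ sansBoldA, sansBoldI, boldA });
```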

Suite: 1617 → 1644 (+27). All green.
Kjell Tore Guttormsen 2026-04-29 14:22:05 +02:00
commit ec4ae268da
3 changed files with 291 additions and 4 deletions

@@ -6,7 +6,7 @@
//
// Zero external dependencies beyond ./string-utils.mjs.
-import { normalizeForScan, containsUnicodeTags, decodeUnicodeTags } from './string-utils.mjs';
+import { normalizeForScan, containsUnicodeTags, decodeUnicodeTags, foldHomoglyphs } from './string-utils.mjs';
// ---------------------------------------------------------------------------
// Critical patterns — direct injection attempts (should be blocked)
@@ -207,16 +207,30 @@ export function checkCognitiveLoadTrap(text) {
*/
export function scanForInjection(text) {
const normalized = normalizeForScan(text);
-const isDifferent = normalized !== text;
+// E16 (v7.2.0): homoglyph fold every variant before pattern matching, so
+// attacks like "ignоre previous instructions" (Cyrillic о) trigger the
+// same patterns as plain "ignore previous instructions". Always-on, not
+// advisory-only — the existing MEDIUM_PATTERNS homoglyph-presence entry
+// remains separate (different signal: presence vs. normalization).
+const folded = foldHomoglyphs(text);
+const foldedNormalized = foldHomoglyphs(normalized);
const critical = [];
const high = [];
const medium = [];
-// Deduplicate by label (same pattern may match in both raw and normalized)
+// Deduplicate by label (same pattern may match in multiple variants)
const seenLabels = new Set();
-const variants = isDifferent ? [text, normalized] : [text];
+// Build the variant set, deduplicating identical strings to skip redundant
+// pattern matching. Order: raw text, decoded, folded, decoded+folded.
+const variantSet = new Set([text]);
+if (normalized !== text) variantSet.add(normalized);
+if (folded !== text && folded !== normalized) variantSet.add(folded);
+if (foldedNormalized !== text && foldedNormalized !== normalized && foldedNormalized !== folded) {
+variantSet.add(foldedNormalized);
+}
+const variants = [...variantSet];
for (const variant of variants) {
for (const { pattern, label } of CRITICAL_PATTERNS) {