feat(injection): E16 — homoglyph NFKC fold before every pattern match
Critical-review §4 E16 finding: pre-v7.2.0 homoglyph normalization fired
ONLY for the MEDIUM-advisory "obfuscation present" signal. Pattern matchers
in scanForInjection compared against raw + decoded variants only; they did
NOT compare against a fold-normalized variant. As a result, "ignоre previous
instructions" (Cyrillic о, U+043E) bypassed the CRITICAL "ignore previous"
pattern.

Two coordinated edits:

scanners/lib/string-utils.mjs
- Adds HOMOGLYPH_MAP (frozen): surgical Cyrillic/Greek → Latin map. ~25
  entries focused on injection-vocabulary letters (a, e, o, c, p, x, y, i,
  j, s, l, A, E, O, C, P, X, Y, T).
- Adds foldHomoglyphs(s). Pipeline: NFKC → apply HOMOGLYPH_MAP. NFKC handles
  Mathematical Alphanumeric (U+1D400 block), fullwidth Latin (U+FF21 block),
  ligatures, width variants.

Excluded by design from HOMOGLYPH_MAP:
- Latin Extended (æ, ø, å, é, è, ñ, ü, ö, ä, ç, ß, þ, ð): legitimate
  Norwegian/German/French/Spanish letters. Map them and we false-positive on
  every non-English source file.
- Greek letters not visually overlapping (β, γ, δ, ...)
- Cyrillic letters not visually overlapping (б, г, д, ж, ...)

scanners/lib/injection-patterns.mjs
- scanForInjection now builds a 4-variant set: raw, normalized, folded(raw),
  folded(normalized). Set deduplication skips redundant identical variants.
  Existing dedup-by-label (seenLabels Set) prevents double-counts when the
  same pattern matches in multiple variants.
- foldHomoglyphs added to the imports.
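The fold pipeline described above (NFKC first, then a small confusables map) can be sketched roughly as follows. This is not the code from scanners/lib/string-utils.mjs, which this page does not show: the names HOMOGLYPH_MAP_SKETCH and foldHomoglyphsSketch are hypothetical, the table is an abbreviated subset chosen for illustration, and returning '' for non-string input is an assumption about the null/undefined edge case.

```javascript
// Hypothetical, abbreviated confusables table. The real HOMOGLYPH_MAP is
// described as having ~25 entries; only a few are shown here.
const HOMOGLYPH_MAP_SKETCH = Object.freeze({
  '\u0430': 'a', // Cyrillic а
  '\u0435': 'e', // Cyrillic е
  '\u043E': 'o', // Cyrillic о
  '\u0441': 'c', // Cyrillic с
  '\u0440': 'p', // Cyrillic р
  '\u0445': 'x', // Cyrillic х
  '\u0410': 'A', // Cyrillic А
  '\u03BF': 'o', // Greek ο (omicron)
  '\u0391': 'A', // Greek Α (Alpha)
});

function foldHomoglyphsSketch(s) {
  // Assumption: non-string or empty input folds to the empty string.
  if (typeof s !== 'string' || s.length === 0) return '';

  // Step 1: NFKC collapses compatibility forms (fullwidth Latin,
  // Mathematical Alphanumerics, ligatures) to their plain equivalents.
  const nfkc = s.normalize('NFKC');

  // Step 2: map the remaining cross-script lookalikes to Latin.
  // Iterating with for...of walks code points, not UTF-16 code units.
  let out = '';
  for (const ch of nfkc) {
    const mapped = Object.prototype.hasOwnProperty.call(HOMOGLYPH_MAP_SKETCH, ch)
      ? HOMOGLYPH_MAP_SKETCH[ch]
      : ch;
    out += mapped;
  }
  return out;
}

console.log(foldHomoglyphsSketch('ign\u043Ere previous instructions'));
// → "ignore previous instructions"
console.log(foldHomoglyphsSketch('bl\u00E5b\u00E6rsyltet\u00F8y'));
// Norwegian letters are not in the table, so the word comes back unchanged.
```

Keeping the table small and explicit is what lets å, æ, ø and similar letters pass through untouched; a broad confusables table would fold them as well.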
Tests: +27 cases in tests/lib/string-utils-homoglyph.test.mjs:
- 6 Cyrillic → Latin (lowercase, uppercase, multiple substitutions,
  Palochka U+04CF)
- 3 Greek → Latin
- 2 NFKC normalization (Math Bold, Fullwidth)
- 8 preserves-non-confusable (Norwegian æøå, German umlauts, French accents,
  Spanish ñ, emoji, CJK, Arabic/Hebrew)
- 3 edge cases (empty, null/undefined, idempotency)
- 5 scanForInjection integration (Cyrillic ignore, Cyrillic Assistant,
  Norwegian non-trigger, benign "ignore" comment, mixed Cyrillic+Greek)

Test-development found: U+1D5DC is "I" not "A" (test pin caught my codepoint
mistake; fixed during dev).

Suite: 1617 → 1644 (+27). All green.
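The code points named in this message can be checked directly against the built-in NFKC normalization. The snippet below is only an illustration using String.prototype.normalize; the variable names are made up here and it is not taken from the test file.

```javascript
// Code points cited in the commit message, normalized with built-in NFKC.
const samples = {
  mathBoldA: String.fromCodePoint(0x1D400),     // MATHEMATICAL BOLD CAPITAL A
  fullwidthA: String.fromCodePoint(0xFF21),     // FULLWIDTH LATIN CAPITAL LETTER A
  mathSansBoldI: String.fromCodePoint(0x1D5DC), // the "I, not A" code point
  cyrillicO: '\u043E',                          // CYRILLIC SMALL LETTER O
};

const nfkcResults = Object.fromEntries(
  Object.entries(samples).map(([name, ch]) => [name, ch.normalize('NFKC')])
);

console.log(nfkcResults.mathBoldA);     // → "A"
console.log(nfkcResults.fullwidthA);    // → "A"
console.log(nfkcResults.mathSansBoldI); // → "I"
// NFKC leaves Cyrillic о alone, which is why a separate map is needed.
console.log(nfkcResults.cyrillicO === '\u043E'); // → true
```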
parent 6cef80c640
commit ec4ae268da
3 changed files with 291 additions and 4 deletions
--- a/scanners/lib/injection-patterns.mjs
+++ b/scanners/lib/injection-patterns.mjs
@@ -6,7 +6,7 @@
 //
 // Zero external dependencies beyond ./string-utils.mjs.
 
-import { normalizeForScan, containsUnicodeTags, decodeUnicodeTags } from './string-utils.mjs';
+import { normalizeForScan, containsUnicodeTags, decodeUnicodeTags, foldHomoglyphs } from './string-utils.mjs';
 
 // ---------------------------------------------------------------------------
 // Critical patterns — direct injection attempts (should be blocked)
@@ -207,16 +207,30 @@ export function checkCognitiveLoadTrap(text) {
  */
 export function scanForInjection(text) {
   const normalized = normalizeForScan(text);
-  const isDifferent = normalized !== text;
+  // E16 (v7.2.0): homoglyph fold every variant before pattern matching, so
+  // attacks like "ignоre previous instructions" (Cyrillic о) trigger the
+  // same patterns as plain "ignore previous instructions". Always-on, not
+  // advisory-only — the existing MEDIUM_PATTERNS homoglyph-presence entry
+  // remains separate (different signal: presence vs. normalization).
+  const folded = foldHomoglyphs(text);
+  const foldedNormalized = foldHomoglyphs(normalized);
 
   const critical = [];
   const high = [];
   const medium = [];
 
-  // Deduplicate by label (same pattern may match in both raw and normalized)
+  // Deduplicate by label (same pattern may match in multiple variants)
   const seenLabels = new Set();
 
-  const variants = isDifferent ? [text, normalized] : [text];
+  // Build the variant set, deduplicating identical strings to skip redundant
+  // pattern matching. Order: raw text, decoded, folded, decoded+folded.
+  const variantSet = new Set([text]);
+  if (normalized !== text) variantSet.add(normalized);
+  if (folded !== text && folded !== normalized) variantSet.add(folded);
+  if (foldedNormalized !== text && foldedNormalized !== normalized && foldedNormalized !== folded) {
+    variantSet.add(foldedNormalized);
+  }
+  const variants = [...variantSet];
 
   for (const variant of variants) {
     for (const { pattern, label } of CRITICAL_PATTERNS) {
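The variant-set construction and label dedup in the hunk above can be exercised in isolation with a sketch like the one below. Everything in it is hypothetical: buildVariantsSketch, scanSketch, PATTERNS_SKETCH and the two stand-in functions are illustration only, since the real normalizeForScan, foldHomoglyphs and CRITICAL_PATTERNS are not shown on this page. The sketch also relies on the fact that a JavaScript Set already drops duplicate strings and keeps insertion order, so passing all four candidates to the constructor should yield the same variants, in the same order, as the explicit !== guards.

```javascript
// Hypothetical helper: normalizeFn and foldFn stand in for normalizeForScan
// and foldHomoglyphs, whose implementations are not part of this diff.
function buildVariantsSketch(text, normalizeFn, foldFn) {
  const normalized = normalizeFn(text);
  // Set dedups by string value and preserves insertion order:
  // raw, normalized, folded(raw), folded(normalized).
  return [...new Set([text, normalized, foldFn(text), foldFn(normalized)])];
}

// Hypothetical single-entry pattern table for the demo.
const PATTERNS_SKETCH = [{ pattern: /ignore\s+previous/i, label: 'ignore-previous' }];

function scanSketch(text, normalizeFn, foldFn) {
  const seenLabels = new Set(); // one finding per label across all variants
  const findings = [];
  for (const variant of buildVariantsSketch(text, normalizeFn, foldFn)) {
    for (const { pattern, label } of PATTERNS_SKETCH) {
      if (!seenLabels.has(label) && pattern.test(variant)) {
        seenLabels.add(label);
        findings.push(label);
      }
    }
  }
  return findings;
}

// Stand-ins for the demo only: strip zero-width spaces, fold Cyrillic о.
const stripZeroWidth = (s) => s.replace(/\u200B/g, '');
const foldCyrillicO = (s) => s.replace(/\u043E/g, 'o');

console.log(buildVariantsSketch('hello', stripZeroWidth, foldCyrillicO).length);
// → 1 (all four candidates are identical)
console.log(buildVariantsSketch('ign\u200B\u043Ere', stripZeroWidth, foldCyrillicO).length);
// → 4 (every candidate differs)
console.log(scanSketch('ign\u043Ere previous instructions', stripZeroWidth, foldCyrillicO));
// → [ 'ignore-previous' ] (only the folded variant matches)
```

When a pattern matches in more than one variant, for example both the raw and the normalized string, the seenLabels check keeps it to a single finding.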