feat(unicode): E1 — extend hidden-Unicode detection to PUA-A and PUA-B
Critical-review §4 E1 finding: before v7.2.0, the Unicode-stego detector (`containsUnicodeTags`) covered only U+E0001-E007F (Tag block). Private Use Areas, which are also invisible in most terminals and survive normalization, were not detected. Attackers could encode payloads in PUA codepoints that pass through `scanForInjection` undetected.

Coverage extended to:
- U+E0001-E007F Unicode Tag block (existing; DeepMind kat. 1)
- U+F0000-FFFFD Supplementary PUA-A (NEW, E1)
- U+100000-10FFFD Supplementary PUA-B (NEW, E1)

Detection-only for PUA: PUA characters have NO standard ASCII mapping, so `decodeUnicodeTags` leaves them unchanged. Detection alone is sufficient: `scanForInjection` emits HIGH on any presence, regardless of decoded content.

Function name `containsUnicodeTags` preserved for back-compat. All existing call sites (injection-patterns.mjs:259, etc.) work unchanged. Semantically the function is now "containsHiddenUnicode".

Tests: +21 cases in tests/lib/string-utils-hidden-unicode.test.mjs:
- 5 Tag-block regression guards
- 4 PUA-A range cases (start, just-inside, end, buried-in-ASCII)
- 3 PUA-B range cases
- 5 boundary cases (gap U+E0080-EFFFF, U+10FFFE noncharacter, emoji, CJK, Latin Extended; all must be FALSE)
- 4 decodeUnicodeTags passthrough cases (PUA-A unchanged, PUA-B unchanged, Tag block still decodes, mixed Tag+PUA)

Suite: 1596 → 1617 (+21). All green.
parent b0f1a9abfd
commit 6cef80c640
2 changed files with 166 additions and 3 deletions
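The boundary cases listed in the commit message can be probed with a small sketch. `classifyHiddenCodePoint` is a hypothetical helper, not part of this commit; it only restates the three ranges named in the message so that each edge can be checked by hand.

```javascript
// Hypothetical helper (not in the commit): names the range a code point falls in.
function classifyHiddenCodePoint(cp) {
  if (cp >= 0xE0001 && cp <= 0xE007F) return 'tag';     // Unicode Tag block
  if (cp >= 0xF0000 && cp <= 0xFFFFD) return 'pua-a';   // Supplementary PUA-A
  if (cp >= 0x100000 && cp <= 0x10FFFD) return 'pua-b'; // Supplementary PUA-B
  return null;                                          // not treated as hidden
}

// Probes chosen to mirror the test categories in the commit message.
const probes = [
  [0xE0001, 'tag'],    // first covered Tag-block code point
  [0xE007F, 'tag'],    // last covered Tag-block code point
  [0xE0080, null],     // start of the gap between Tag block and PUA-A
  [0xEFFFF, null],     // end of that gap
  [0xF0000, 'pua-a'],  // PUA-A start
  [0xFFFFD, 'pua-a'],  // PUA-A end
  [0x100000, 'pua-b'], // PUA-B start
  [0x10FFFD, 'pua-b'], // PUA-B end
  [0x10FFFE, null],    // noncharacter just past PUA-B
  [0x1F600, null],     // emoji
  [0x4E2D, null],      // CJK ideograph
];

for (const [cp, expected] of probes) {
  const got = classifyHiddenCodePoint(cp);
  const label = 'U+' + cp.toString(16).toUpperCase();
  console.log(label, got, got === expected ? 'ok' : 'MISMATCH');
}
```

The real test file may structure these cases differently; the point here is only that each range is closed at both ends and that the code points between and after them stay unflagged.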
@@ -292,6 +292,14 @@ export function collapseLetterSpacing(s) {
  * Unicode Tags (U+E0000 block) can encode invisible ASCII text inside
  * what appears to be empty or normal-looking strings.
  * E.g., U+E0069 U+E0067 U+E006E → "ign"
+ *
+ * **Note (E1, v7.2.0):** Tag-block characters decode to ASCII via the
+ * `cp - 0xE0000` mapping. Private Use Areas (PUA-A: U+F0000-FFFFD;
+ * PUA-B: U+100000-10FFFD) are also detected as hidden Unicode by
+ * `containsUnicodeTags`, but they have NO standard ASCII mapping;
+ * they pass through this function unchanged. Detection of PUA presence
+ * is sufficient (HIGH advisory in scanForInjection), no decode needed.
+ *
  * @param {string} s
  * @returns {string}
  */
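The body of `decodeUnicodeTags` is not shown in this diff, so the following is only a sketch of the behavior the note describes, under the assumption that every code point in U+E0001-E007F maps via `cp - 0xE0000` and everything else is copied through. `decodeTagsSketch` is a hypothetical name used to avoid implying this is the real implementation.

```javascript
// Sketch only: the real decodeUnicodeTags body is not part of this diff.
function decodeTagsSketch(s) {
  let out = '';
  // for...of iterates by code point, so surrogate pairs stay intact.
  for (const ch of s) {
    const cp = ch.codePointAt(0);
    if (cp >= 0xE0001 && cp <= 0xE007F) {
      // Tag block: subtract the block base to recover the ASCII character.
      out += String.fromCodePoint(cp - 0xE0000);
    } else {
      // Everything else, including PUA-A and PUA-B, passes through unchanged.
      out += ch;
    }
  }
  return out;
}

const hiddenIgn = String.fromCodePoint(0xE0069, 0xE0067, 0xE006E);
const puaA = String.fromCodePoint(0xF0000);

console.log(decodeTagsSketch(hiddenIgn)); // "ign"
console.log(decodeTagsSketch('a' + puaA + 'b') === 'a' + puaA + 'b'); // true
```

The second check is the passthrough case: a PUA character has no ASCII counterpart to recover, so the string comes back identical and the detector has to flag it on presence alone.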
@@ -323,15 +331,33 @@ export function decodeUnicodeTags(s) {
 }

 /**
- * Check if a string contains Unicode Tag characters (U+E0001-E007F).
- * Presence of these characters is suspicious regardless of decoded content.
+ * Check if a string contains hidden-Unicode characters that are commonly
+ * used for steganography in prompts and tool output.
+ *
+ * Covered ranges:
+ * - U+E0001-E007F Unicode Tag block (DeepMind traps kat. 1)
+ * - U+F0000-FFFFD Supplementary Private Use Area-A (E1, v7.2.0)
+ * - U+100000-10FFFD Supplementary Private Use Area-B (E1, v7.2.0)
+ *
+ * Presence of any of these characters is suspicious regardless of
+ * decoded content: they are invisible in most terminals and survive
+ * normalization. The function name `containsUnicodeTags` is preserved
+ * for back-compat (existing call sites in injection-patterns.mjs and
+ * elsewhere); semantically it is now "containsHiddenUnicode".
+ *
+ * Tag-block characters decode to ASCII via `decodeUnicodeTags`. PUA
+ * characters do NOT: they have no standard mapping and remain
+ * detection-only.
+ *
  * @param {string} s
  * @returns {boolean}
  */
 export function containsUnicodeTags(s) {
   for (const ch of s) {
     const cp = ch.codePointAt(0);
-    if (cp >= 0xE0001 && cp <= 0xE007F) return true;
+    if (cp >= 0xE0001 && cp <= 0xE007F) return true; // Tag block
+    if (cp >= 0xF0000 && cp <= 0xFFFFD) return true; // PUA-A (E1)
+    if (cp >= 0x100000 && cp <= 0x10FFFD) return true; // PUA-B (E1)
   }
   return false;
 }
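To illustrate the "HIGH on any presence" behavior described in the commit message, here is a minimal sketch of a caller. `scanSketch` and the shape of its findings are assumptions made for illustration; the real `scanForInjection` is not part of this diff. `containsUnicodeTags` is repeated from the hunk (without `export`) so the block runs on its own.

```javascript
// Repeated from the diff so the sketch is self-contained.
function containsUnicodeTags(s) {
  for (const ch of s) {
    const cp = ch.codePointAt(0);
    if (cp >= 0xE0001 && cp <= 0xE007F) return true;   // Tag block
    if (cp >= 0xF0000 && cp <= 0xFFFFD) return true;   // PUA-A
    if (cp >= 0x100000 && cp <= 0x10FFFD) return true; // PUA-B
  }
  return false;
}

// Hypothetical caller: flags on presence alone, without decoding anything.
function scanSketch(text) {
  const findings = [];
  if (containsUnicodeTags(text)) {
    findings.push({ severity: 'HIGH', rule: 'hidden-unicode' });
  }
  return findings;
}

console.log(scanSketch('hello')); // []
console.log(scanSketch('hello' + String.fromCodePoint(0x100000)));
console.log(scanSketch('hello \u{1F600}')); // []
```

Supplementary-plane characters are stored as surrogate pairs in JavaScript strings, which is why the detector iterates with `for...of` and reads `codePointAt(0)` rather than indexing by UTF-16 code unit.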