Tracing hidden manipulations in large language models through Unicode, encoding, and tokenization forensics.
This project investigates how special-character adversarial attacks exploit Unicode and encoding vulnerabilities in open-source language models. It focuses on forensic detection and analysis of such manipulations at the character and tokenization level.
Goal: This section is the heart of the forensic story, showing how each attack type works and what traces it leaves behind. Each card below is a mini forensic case: hover or click to reveal the manipulation, an example, and how to detect it.
Pаssword (the first "a" is Cyrillic, not Latin).
Detect by comparing character codes (U+0430 vs U+0061).
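One way to run that comparison is to print every character's code point and official Unicode name; the sketch below uses only Python's standard library, and the audit_characters helper is a name chosen here for illustration.

```python
# Minimal sketch: dump each character's code point and Unicode name
# so a Cyrillic look-alike stands out among Latin letters.
import unicodedata

def audit_characters(text):
    for ch in text:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}  {ch!r}")

audit_characters("Pаssword")
# The second line reads "U+0430  CYRILLIC SMALL LETTER A",
# while every other letter is LATIN.
```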
U29ja2V0IHBhc3N3b3Jk looks random, but when Base64-decoded it says "Socket password."
Detect by scanning for Base64/hex patterns and decoding safely.
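A scan of that kind can be sketched in a few lines of Python; the 16-character minimum and the regex below are illustrative assumptions, not a standard rule.

```python
# Minimal sketch: find plausible Base64 runs and decode them safely,
# keeping only payloads that turn out to be valid UTF-8 text.
import base64
import re

B64_RE = re.compile(r"\b[A-Za-z0-9+/]{16,}={0,2}")  # candidate Base64 runs

def find_decoded_payloads(text):
    hits = []
    for candidate in B64_RE.findall(text):
        try:
            decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            continue  # not real Base64 (or not text); skip it
        hits.append((candidate, decoded))
    return hits

print(find_decoded_payloads("ignore this U29ja2V0IHBhc3N3b3Jk please"))
# [('U29ja2V0IHBhc3N3b3Jk', 'Socket password')]
```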
he@@llo might be tokenized as two parts, changing interpretation.
Detect by re-tokenizing input and checking for unexpected splits.
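To make the check concrete, the sketch below re-tokenizes a clean string and its tampered twin; it assumes the Hugging Face transformers GPT-2 tokenizer purely for illustration, so substitute whatever tokenizer your model actually uses.

```python
# Minimal sketch: compare how a clean word and its tampered twin tokenize.
# Requires: pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def show_splits(text):
    tokens = tokenizer.tokenize(text)
    print(f"{text!r} -> {tokens} ({len(tokens)} tokens)")

show_splits("hello")    # a single, familiar token
show_splits("he@@llo")  # the @@ fragments the word into several tokens
```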
Zero-width spaces (\u200b), bidirectional markers, or other invisible Unicode can hide inside otherwise normal text.
Detect with cat -v or hexdump to reveal the hidden bytes.
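As a scripted companion to cat -v and hexdump, the sketch below flags characters in Unicode's invisible format and control categories; the category set is a simplifying assumption.

```python
# Minimal sketch: surface invisible characters by Unicode category
# (Cf = format, Cc = control, Co = private use).
import unicodedata

SUSPECT_CATEGORIES = {"Cf", "Cc", "Co"}

def reveal_hidden(text):
    for i, ch in enumerate(text):
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            print(f"index {i}: U+{ord(ch):04X} "
                  f"{unicodedata.name(ch, '<unnamed control>')}")

reveal_hidden("pass\u200bword\u202e!")  # zero-width space + RTL override
```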
The chart below shows how vulnerable each attack category is across the tested language models. The values come from the forensic evaluation cited below, measuring how often each type of manipulation (Unicode, encoding, homoglyph, structural) successfully bypassed safety filters.
Together, they reveal which forensic categories demand the most attention — with encoding attacks posing the highest risk and Unicode manipulation remaining a stealthy but frequent offender.
Data summarized from E. Sarabamoun, "Special-Character Adversarial Attacks on Open-Source Language Models," arXiv:2508.14070, 2025.
1. What it means (plain words)
Before the AI breaks text into small "tokens" it can understand, we clean and standardize that text —
like brushing dust off fingerprints before studying them.
2. Word-by-word breakdown
| Term | Simple meaning | Why it matters |
|---|---|---|
| Pre-tokenization | "Before tokenization." It's the stage where text is still raw. | Clean here, so later steps don't get confused. |
| Normalization (NFC/NFKC) | Makes sure all letters look and behave the same — for example, "é" can be stored in two ways; normalization turns them into one. | Prevents attackers from sneaking alternate letter forms. |
| Strip zero-width characters | Remove invisible symbols that can hide inside text (like zero-width spaces). | Stops "ghost" commands or fake word breaks. |
| Detect script mixing | Check if text mixes alphabets (Latin, Cyrillic, Greek, etc.). | Prevents look-alike swaps, like Cyrillic "а" replacing Latin "a." |
3. Real-world example
Pаssword — looks normal, but the first "a" is Cyrillic (U+0430), not Latin (U+0061).
A normal filter might miss it — but pre-tokenization normalization catches that mismatch.
4. Forensic purpose
This step is like a UV-light scan for hidden tampering: it reveals invisible characters, normalizes suspicious encodings,
and ensures all text follows one clean script.
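Putting the three table rows together, here is a minimal sketch of the pipeline; the zero-width list and the script check (taking the first word of each character's Unicode name) are simplified assumptions, not a production rule set.

```python
# Minimal sketch of pre-tokenization cleaning:
# 1) NFKC normalization, 2) strip zero-width characters, 3) detect script mixing.
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}  # illustrative list

def pre_tokenize_clean(text):
    text = unicodedata.normalize("NFKC", text)  # unify alternate letter forms
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    scripts = {unicodedata.name(ch, "?").split()[0] for ch in text if ch.isalpha()}
    if len(scripts) > 1:
        raise ValueError(f"mixed scripts detected: {sorted(scripts)}")
    return text

print(pre_tokenize_clean("héllo\u200bworld"))  # cleaned, single script
# pre_tokenize_clean("Pаssword") raises: mixed scripts ['CYRILLIC', 'LATIN']
```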
Tools for Pre-tokenization Normalization
Recommended Free Tool: Unicode Normalizer — free, browser-based, supports NFC/NFKC modes for text inspection.
What it means
Attackers often hide dangerous commands inside encoded text — for example Base64, hex, or ROT-13 — so filters don't recognize them.
Encoding validation checks for these patterns, safely decodes them, and then runs content-safety checks before the text is processed.
Example
U29ja2V0IHBhc3N3b3Jk → when Base64-decoded, it says "Socket password."
Proper validation would decode this string, flag it as sensitive, and block execution.
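That validate-then-block flow can be sketched as follows; the SENSITIVE_TERMS denylist is a hypothetical stand-in for a real content-safety policy.

```python
# Minimal sketch: decode suspected Base64, flag sensitive payloads, block them.
import base64
import binascii

SENSITIVE_TERMS = {"password", "secret", "token"}  # illustrative only

def validate_input(text):
    try:
        decoded = base64.b64decode(text, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return text  # not Base64 text; hand off to the normal safety checks
    if any(term in decoded.lower() for term in SENSITIVE_TERMS):
        raise PermissionError(f"blocked: decoded payload {decoded!r} is sensitive")
    return decoded

validate_input("U29ja2V0IHBhc3N3b3Jk")  # raises PermissionError
```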
Tools for Encoding Validation
Command-line decoders (base64 --decode, xxd -r -p) work well for quick offline checks.
Recommended Free Tool: CyberChef — browser-based, open-source, and secure; nothing is uploaded to servers.
What it means
Incorporate adversarial examples, Unicode tricks, and encoding attacks into training data so AI models learn to recognize and reject them.
Example
During model fine-tuning, include both normal and tampered samples (like Base64-encoded or homoglyph text) and teach the model to label them as unsafe.
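As a loose sketch of how such tampered samples might be generated for a fine-tuning set, the snippet below emits one clean and two tampered variants per input; the homoglyph map and the safe/unsafe labels are illustrative assumptions.

```python
# Minimal sketch: augment training data with encoded and homoglyph variants.
import base64

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Latin -> Cyrillic

def make_training_pairs(clean_text):
    yield clean_text, "safe"
    encoded = base64.b64encode(clean_text.encode("utf-8")).decode("ascii")
    yield encoded, "unsafe"                                # Base64 variant
    swapped = "".join(HOMOGLYPHS.get(c, c) for c in clean_text)
    yield swapped, "unsafe"                                # homoglyph variant

for sample, label in make_training_pairs("reset the password"):
    print(label, repr(sample))
```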
Tools for Security-Aware Training
Recommended Free Tool: TextAttack — Python toolkit for adversarial data generation and model retraining.
What it means
Observe model behavior and text flow in real time, detect suspicious inputs, and log anomalies for forensic analysis.
Example
If an AI system suddenly starts interpreting Base64 text as code or shows spikes in token anomalies, monitoring tools trigger alerts for investigation.
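A minimal sketch of that alerting hook, using the prometheus_client Python library; the metric name and the suspicion heuristic are illustrative assumptions.

```python
# Minimal sketch: count suspicious inputs so Prometheus can alert on spikes.
# Requires: pip install prometheus-client
import re
from prometheus_client import Counter, start_http_server

B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
SUSPECT_INPUTS = Counter(
    "llm_suspect_inputs_total",
    "Inputs flagged as possible encoding or Unicode attacks",
)

def record_input(text):
    if "\u200b" in text or B64_RUN.search(text):
        SUSPECT_INPUTS.inc()

start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
record_input("U29ja2V0IHBhc3N3b3Jk")
```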
Tools for Runtime Monitoring
Recommended Free Tool: Prometheus — a widely used, free monitoring system well suited to forensic anomaly detection.
Character-level adversarial attacks blur the boundary between data integrity and model behavior. Even when a model's dataset is clean, its outputs can still be distorted at the input level — by invisible spaces, swapped characters, or encoded phrases that reshape what the model "thinks" it reads.
Through a forensic lens, we uncover the fingerprints left by these manipulations. Each attack type — Unicode control, encoding obfuscation, homoglyph confusion, and structural fragmentation — leaves unique digital traces. Detecting and decoding these traces allows analysts to reconstruct how an AI was manipulated and why it responded abnormally.
This forensic approach bridges cybersecurity and model interpretability. It moves defense from reaction to prevention: applying normalization, validating encodings, and monitoring live interactions to create a continuous safety barrier between human intent and machine action.
Ultimately, character-level forensics is not just about catching adversarial tricks — it is about ensuring trust in how large language models read, reason, and respond across research, governance, and critical infrastructure.