Watermarking (Levels B + E)
A multi-layer steganographic watermarking system that embeds invisible fingerprints in knowledge packs. Each layer uses a different encoding technique, so stripping one layer leaves the others in place.
Not applied to client reports
B1 zero-width watermarks are not injected into pentest reports. Reports flow through the BeDefended report engine to generate the final DOCX deliverable for clients, so any invisible characters would propagate into client-facing documents. B1 is available as a library (watermark_service.py) for internal use only (e.g., watermarking internal documents or proposals).
Level B1: Zero-Width Characters (library only)
Invisible Unicode characters encode company_id:engagement_id:timestamp as a 96-bit sequence. The encoder is available in watermark_service.py for internal documents; it is not used in the pentest report pipeline.
Characters Used
| Character | Unicode | Represents |
|---|---|---|
| Zero Width Space | U+200B | Bit 0 |
| Zero Width Non-Joiner | U+200C | Bit 1 |
| Zero Width Joiner | U+200D | Field separator |
| Zero Width No-Break Space | U+FEFF | Start/end marker |
Encoding
- Pack: company_id (32-bit) + engagement_id (32-bit) + timestamp (32-bit) = 96 bits
- Convert each bit: 0 -> U+200B, 1 -> U+200C
- Wrap with U+FEFF markers, separate fields with U+200D
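The three steps above can be sketched as follows. This is a minimal illustration, not the code in watermark_service.py: the function names `encode_b1` and `decode_b1` are hypothetical, and the big-endian bit order within each 32-bit field is an assumption, since the source does not specify it.

```python
from typing import Optional, Tuple

ZERO = "\u200b"  # Zero Width Space: bit 0
ONE = "\u200c"   # Zero Width Non-Joiner: bit 1
SEP = "\u200d"   # Zero Width Joiner: field separator
MARK = "\ufeff"  # Zero Width No-Break Space: start/end marker


def encode_b1(company_id: int, engagement_id: int, timestamp: int) -> str:
    """Encode three 32-bit fields as an invisible string (assumed bit order: MSB first)."""
    fields = []
    for value in (company_id, engagement_id, timestamp):
        if not 0 <= value < 2 ** 32:
            raise ValueError("each field must fit in 32 bits")
        bits = format(value, "032b")
        fields.append("".join(ONE if b == "1" else ZERO for b in bits))
    return MARK + SEP.join(fields) + MARK


def decode_b1(text: str) -> Optional[Tuple[int, int, int]]:
    """Find the first marker-delimited sequence in text and decode it, or return None."""
    start = text.find(MARK)
    end = text.find(MARK, start + 1)
    if start == -1 or end == -1:
        return None
    fields = text[start + 1:end].split(SEP)
    if len(fields) != 3 or any(len(f) != 32 for f in fields):
        return None
    if any(ch not in (ZERO, ONE) for f in fields for ch in f):
        return None
    return tuple(int("".join("1" if ch == ONE else "0" for ch in f), 2) for f in fields)
```

With this layout the watermark is 100 characters long: 96 bit characters, two field separators, and two markers. It can be appended to or spliced into visible text, and `decode_b1` recovers the fields as long as the sequence survives intact.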
Level B2: Homoglyph Substitution
Visually identical Unicode characters replace ASCII characters at positions selected by HMAC(installation_id, filename:line).
Homoglyph Mapping
| ASCII | Unicode Replacement | Script |
|---|---|---|
| a (U+0061) | a (U+0430) | Cyrillic |
| c (U+0063) | c (U+0441) | Cyrillic |
| e (U+0065) | e (U+0435) | Cyrillic |
| o (U+006F) | o (U+043E) | Cyrillic |
| p (U+0070) | p (U+0440) | Cyrillic |
| x (U+0078) | x (U+0445) | Cyrillic |
| y (U+0079) | y (U+0443) | Cyrillic |
Capacity
- ~500 target words per file × 32 files = ~16,000 bits of fingerprint (one bit per target word)
- Deterministic: same installation always produces same pattern
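One way the HMAC-driven substitution could work is sketched below. It is an illustration under assumptions, not the logic of scripts/watermark-knowledge.py: the hash is assumed to be SHA-256, the decision is made per substitutable character rather than per target word, and the helper names (`watermark_line`, `normalize`, `matches_installation`) and the sample filename are hypothetical.

```python
import hashlib
import hmac

# ASCII -> Cyrillic look-alikes, as listed in the mapping table.
HOMOGLYPHS = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
}
REVERSE = {v: k for k, v in HOMOGLYPHS.items()}


def line_digest(installation_id: str, filename: str, line_no: int) -> bytes:
    """HMAC over 'filename:line', keyed by the installation ID (SHA-256 assumed)."""
    message = f"{filename}:{line_no}".encode()
    return hmac.new(installation_id.encode(), message, hashlib.sha256).digest()


def watermark_line(line: str, installation_id: str, filename: str, line_no: int) -> str:
    """Swap each substitutable character when the matching digest bit is 1."""
    digest = line_digest(installation_id, filename, line_no)
    out, k = [], 0
    for ch in line:
        if ch in HOMOGLYPHS:
            bit = (digest[(k // 8) % len(digest)] >> (k % 8)) & 1
            out.append(HOMOGLYPHS[ch] if bit else ch)
            k += 1
        else:
            out.append(ch)
    return "".join(out)


def normalize(line: str) -> str:
    """Map Cyrillic look-alikes back to ASCII."""
    return "".join(REVERSE.get(ch, ch) for ch in line)


def matches_installation(leaked: str, installation_id: str, filename: str, line_no: int) -> bool:
    """Re-derive the expected pattern for a known installation and compare."""
    return watermark_line(normalize(leaked), installation_id, filename, line_no) == leaked
```

Because the pattern depends only on the installation ID, filename, and line number, verification needs no stored state: the checker normalizes the leaked line, regenerates the watermark for a candidate installation, and compares the two.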
Level B3: Trailing Whitespace in Findings
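The first 20 lines of each FINDING-*.md carry a pattern of 0, 1, or 2 trailing spaces.
- 3^20 = ~3.5 billion unique combinations
- Invisible in most editors and in rendered output. Editors that display or strip trailing whitespace will expose or destroy the pattern, and in Markdown two trailing spaces before a line break produce a hard line break.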
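The trailing-space pattern amounts to a 20-digit base-3 number. The sketch below shows that arithmetic; how the real tooling derives the digits is not stated in this document, so the integer `fingerprint` input, the least-significant-digit-first order, and the function names are all assumptions.

```python
LINES = 20
CAPACITY = 3 ** LINES  # 3,486,784,401 distinct patterns


def embed_b3(text: str, fingerprint: int) -> str:
    """Write fingerprint as base-3 digits: line i gets 0, 1, or 2 trailing spaces."""
    if not 0 <= fingerprint < CAPACITY:
        raise ValueError("fingerprint must be in [0, 3**20)")
    lines = text.split("\n")
    if len(lines) < LINES:
        raise ValueError("need at least 20 lines")
    for i in range(LINES):
        fingerprint, digit = divmod(fingerprint, 3)  # least significant digit first
        lines[i] = lines[i].rstrip(" ") + " " * digit
    return "\n".join(lines)


def extract_b3(text: str) -> int:
    """Count trailing spaces on the first 20 lines and rebuild the base-3 number."""
    lines = text.split("\n")
    if len(lines) < LINES:
        raise ValueError("need at least 20 lines")
    value = 0
    for i in reversed(range(LINES)):
        digit = len(lines[i]) - len(lines[i].rstrip(" "))
        if digit > 2:
            raise ValueError(f"line {i + 1} has more than 2 trailing spaces")
        value = value * 3 + digit
    return value
```

Any tool that trims trailing whitespace (formatters, some editors, some copy-paste paths) resets every digit to 0, which is one reason this layer is combined with the others.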
Level E1: Payload Ordering
Numbered lists of 5+ items are permuted deterministically per installation.
- Seed: HMAC-SHA256(installation_id, filename:list:index)
- 10! = 3,628,800 permutations for a 10-item list (5! = 120 for the minimum 5-item list)
- 5+ lists per file = enormous combinatorial space
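A deterministic per-installation permutation can be sketched as below. The seed follows the HMAC-SHA256 construction given above, reading "filename:list:index" as the filename, the literal word "list", and the list's position in the file; that reading is an assumption. Deriving the order by sorting on per-item hashes is a swapped-in technique chosen because it does not depend on any particular random number generator, and the function names are hypothetical.

```python
import hashlib
import hmac
from typing import List, Sequence


def list_seed(installation_id: str, filename: str, list_index: int) -> bytes:
    """HMAC-SHA256 keyed by the installation ID over 'filename:list:index'."""
    message = f"{filename}:list:{list_index}".encode()
    return hmac.new(installation_id.encode(), message, hashlib.sha256).digest()


def permute(items: Sequence[str], installation_id: str, filename: str, list_index: int) -> List[str]:
    """Reorder a list of 5+ items deterministically; shorter lists are left alone."""
    if len(items) < 5:
        return list(items)
    seed = list_seed(installation_id, filename, list_index)

    def sort_key(position: int) -> bytes:
        # Hash the seed with the item's original position to get a stable sort key.
        return hashlib.sha256(seed + position.to_bytes(4, "big")).digest()

    order = sorted(range(len(items)), key=sort_key)
    return [items[j] for j in order]
```

To attribute a leak, the decoder would regenerate the expected order for each candidate installation from the canonical list and compare it with the order found in the leaked file.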
Level E2: Synonym Substitution
50 technical synonym pairs are chosen deterministically per installation.
- Each pair: bit 0 of HMAC(installation_id, pair_index) selects the variant
- 2^50 > 10^15 unique combinations
- Example: "endpoint" vs "URL path", "vulnerability" vs "security flaw"
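The per-pair selection can be illustrated with the two example pairs above. This is a sketch under assumptions: the hash is taken to be SHA-256, "bit 0" is read as the least significant bit of the first digest byte, and the function names are hypothetical. The real list holds 50 pairs.

```python
import hashlib
import hmac
from typing import List, Optional

# Only the two pairs named in this document; the full table has 50.
SYNONYM_PAIRS = [
    ("endpoint", "URL path"),
    ("vulnerability", "security flaw"),
]


def variant_bit(installation_id: str, pair_index: int) -> int:
    """Bit 0 of HMAC(installation_id, pair_index), assuming SHA-256 and LSB of byte 0."""
    digest = hmac.new(installation_id.encode(), str(pair_index).encode(), hashlib.sha256).digest()
    return digest[0] & 1


def choose_variants(installation_id: str) -> List[str]:
    """Pick one variant from each pair for this installation."""
    return [pair[variant_bit(installation_id, i)] for i, pair in enumerate(SYNONYM_PAIRS)]


def observed_bits(text: str) -> List[Optional[int]]:
    """Report which variant of each pair appears in text (None if neither or both)."""
    lowered = text.lower()
    bits: List[Optional[int]] = []
    for first, second in SYNONYM_PAIRS:
        has_first = first.lower() in lowered
        has_second = second.lower() in lowered
        bits.append(None if has_first == has_second else int(has_second))
    return bits
```

Comparing `observed_bits` for a leaked text against `variant_bit` for each known installation narrows the candidates; pairs that do not appear in the leaked excerpt simply contribute no evidence.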
Forensic Decoding
```bash
# Decode zero-width watermark from leaked report
python scripts/decode-watermark.py --file leaked-report.md

# Verify homoglyph pattern against known installation
python scripts/decode-watermark.py --file leaked-knowledge.md --known-id a1b2c3d4

# Decode from pasted text
echo "suspicious text" | python scripts/decode-watermark.py
```
Files
| File | Purpose |
|---|---|
| dashboard/backend/app/services/watermark_service.py | Zero-width encode/decode (B1) |
| scripts/watermark-knowledge.py | Homoglyphs (B2) + ordering (E1) + synonyms (E2) |
| scripts/decode-watermark.py | Forensic extraction tool |