Layered, defense-in-depth protection for LLM apps and agents — signatures, multilingual coverage, vector similarity, an ML classifier, canary tokens and agent-trace alignment auditing — fused into one engine with a single API. Honest, reproducible numbers, not marketing.

    pip install shadowshield

We publish results on a public dataset, not our own. On the deepset/prompt-injections test split, each layer adds recall, and every layer holds a 0% false-positive rate and therefore 100% precision. Over-defense (blocking benign prompts) is a common failure mode in this field; we measure against it on purpose.
*0% false-positive rate on the deepset test split, including NotInject-style hard negatives.

† A bundled offline benchmark scores 100%, but that is an in-distribution regression baseline, not a SOTA claim. The external number above is the one that counts. Full methodology and reproduction steps are in docs/BENCHMARKS.md.
Instruction-override, jailbreak, delimiter and exfiltration signatures — matched through zero-width, homoglyph, bidi and base64 normalization so evasions don't slip past.
Embedding similarity to a self-hardening attack corpus catches paraphrases & translations; an opt-in DeBERTa classifier recovers real-world recall.
Canary tokens prove a successful leak; the alignment auditor flags when an action drifts from the user's objective — goal-hijack detection, not just text.
Two-way scanning stops API keys, private keys and PII leaving in model output — Luhn-validated cards, optional Presidio backend. A jailbroken model is still caught at the exit.
Active defense, not just detection: redact the dangerous span, block with a safe fallback, throttle abusers, or spotlight untrusted text so the model can't be steered.
Drop-in integration for OpenAI-compatible clients and LangChain, plus an async API, an HTTP server and an AgentDojo defense adapter. Three modes, YAML config and a plugin system.
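
The normalization step described above (zero-width, homoglyph, bidi and base64 handling before signature matching) can be illustrated with a minimal standalone sketch. It is not ShadowShield's implementation: the helper names `normalize` and `decoded_views`, the character sets and the tiny confusables map are assumptions made for illustration, and real homoglyph coverage would need a much larger table.

```python
import base64
import binascii
import re
import unicodedata

# Characters to delete before matching (illustrative, non-exhaustive).
_INVISIBLE = dict.fromkeys(
    [0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF]   # zero-width characters
    + list(range(0x202A, 0x202F))              # bidi embeddings/overrides U+202A..U+202E
    + list(range(0x2066, 0x206A))              # bidi isolates U+2066..U+2069
)

# Hypothetical mini confusables map (Cyrillic look-alikes -> Latin).
_CONFUSABLES = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0441": "c", "\u0456": "i",
})

# Candidate base64 runs: at least 16 alphabet characters plus optional padding.
_B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def normalize(text: str) -> str:
    """Fold compatibility forms, drop invisible characters, map look-alikes."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_INVISIBLE)
    return text.translate(_CONFUSABLES)


def decoded_views(text: str) -> list[str]:
    """Return the normalized text plus any base64 runs that decode to UTF-8."""
    views = [normalize(text)]
    for run in _B64_RUN.findall(text):
        try:
            decoded = base64.b64decode(run, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64 or not text: ignore this run
        views.append(normalize(decoded))
    return views
```

A signature matcher would then run against every string in `decoded_views(text)` rather than the raw input, so an instruction hidden behind invisible characters or a base64 wrapper is judged in its decoded form.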

    import shadowshield as ss

    shield = ss.Shield.for_mode("strict")

    # fail-closed: raises on a block
    clean = shield.guard(user_prompt)
    reply = my_llm(clean)

    # two-way: catch leaks on the way out
    safe = shield.guard(reply, direction="output")

    # guard untrusted tool output (indirect injection)
    shield.scan_tool_result("fetch_url", page_html)

    # canary: detect a *successful* exfiltration
    mark = shield.issue_canary()
    if shield.scan_output(reply).blocked:
        handle_breach()

    # goal-hijack auditing across the trace
    with shield.session(objective=task) as s:
        s.scan_output(model_action)

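
For readers unfamiliar with canary tokens, the idea behind `issue_canary()` can be sketched independently of the library. The functions below (`new_canary`, `plant_canary`, `canary_leaked`) are hypothetical stand-ins, not ShadowShield's API: a random marker is planted in the system prompt, and its appearance in model output is evidence that the prompt was actually leaked.

```python
import secrets


def new_canary(prefix: str = "cnry") -> str:
    """Return a random marker that should never appear in legitimate output."""
    return f"{prefix}-{secrets.token_hex(8)}"


def plant_canary(system_prompt: str, canary: str) -> str:
    """Append the canary to the system prompt as an inert reference line."""
    return f"{system_prompt}\n# internal-ref: {canary}"


def canary_leaked(model_output: str, canary: str) -> bool:
    """True if the canary surfaced in the output, i.e. the prompt was exfiltrated."""
    return canary in model_output
```

Because the marker is random and carries no meaning, a match is a confirmed leak rather than a heuristic guess, which is why it complements pattern-based output scanning.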
"Ignore all previous instructions", new-instruction injection, authority spoofing — in 5 languages at the signature tier.
DAN-style personas, "developer mode", restriction-removal and fiction-wrapper laundering.
Poisoned web pages and tool results that try to steer the agent — scanned as untrusted input.
Actions that drift from the user's stated objective, audited across the execution trace.
Zero-width splits, homoglyphs, bidi overrides and base64/hex payloads — decoded and judged on meaning.
API/private keys, tokens, emails, SSNs and Luhn-valid cards leaving in model output — never echoed to logs.
System-prompt extraction, markdown-image beacons, pipe-to-shell and canary-token leaks.
Adaptive per-identity rate limiting and oversized-input guards built into the request path.
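
The Luhn validation mentioned for card numbers is a standard checksum, and a minimal version is easy to write. The function name `luhn_valid` and the 13 to 19 digit length bounds are assumptions for this sketch, not taken from ShadowShield's source.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    # Allow common separators, then require plain ASCII digits.
    compact = number.replace(" ", "").replace("-", "")
    if not (compact.isascii() and compact.isdigit() and 13 <= len(compact) <= 19):
        return False

    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, ch in enumerate(reversed(compact)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0
```

Checking the checksum before flagging a digit run helps keep order IDs and phone numbers from being reported as cards. For example, `luhn_valid("4111 1111 1111 1111")` returns `True` for the well-known test number, while changing its last digit to 2 makes the check fail.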
| Capability | LLM Guard | LlamaFirewall | Rebuff | ShadowShield |
|---|---|---|---|---|
| Input + output scanning | ✓ | ✓ | partial | ✓ |
| Multilingual signatures | — | — | — | ✓ |
| Canary tokens | — | — | ✓* | ✓ |
| Agent-trace alignment audit | — | ✓ | — | ✓ |
| Spotlighting as a response | — | — | — | ✓ |
| Self-hardening vector tier | — | — | ✓* | ✓ |
| Published external benchmark | partial | ✓ | — | ✓ |
| License | MIT | Meta | archived | MIT |

*Rebuff pioneered canary tokens and the self-hardening loop but was archived in 2025; ShadowShield carries those ideas forward in a maintained project. Full matrix in docs/COMPARISON.md.
MIT-licensed. No telemetry. Lightweight by default — the heavy ML is opt-in.