1. Why a gateway screen matters for llm jailbreak defense
The model’s own safety training is the first line, not the only line. Models are retrained on new attack corpora, but jailbreak phrases evolve faster than training cycles. A gateway rule fires deterministically — it does not depend on the model’s internal state — and it applies equally across every model behind your key, including fine-tunes and open weights that may have weaker built-in guardrails. Gateway screening also gives you an audit trail. Every rule that fires lands in the workspace Matches feed — rule type, action, detail, stage — independent of what the model ultimately returned.2. The two rule types for jailbreak screening
OrcaRouter’s guardrail engine offers two complementary approaches. Use them together for defense-in-depth.Semantic check — llm_judge
An llm_judge rule runs a semantic check against a model in your workspace.
You write a rubric that describes what counts as a jailbreak attempt; the
engine appends a JSON-schema appendix so the model returns a parseable verdict.
judge_fail_open: true (the default) means a judge timeout or error is
recorded as telemetry and the request continues — safety degrades, availability
is preserved. Set it to false to fail closed if a missed check is
unacceptable for your use case.
The judge call routes through your workspace channels; tokens are billed and
attributed as a judge sub-line.
Literal denylist — keyword and regex
For known jailbreak phrases and structural patterns, keyword and regex
rules are deterministic and add zero latency — they run on the hot path with no
network call.
keyword is a case-insensitive substring match. A term like do anything now
also matches Do Anything Now and you can do anything now.
regex accepts RE2 patterns (linear-time, no backreferences). Use it for
encoding-trick patterns or structural variants a literal list cannot cover.
3. Output-stage screening
Input screening catches the attempt. Output-stage screening catches a successful bypass — a response that should not have been produced regardless of why. Add a secondllm_judge or keyword rule at stage: "output" to flag or
block a response that contains disallowed content before it reaches the client.
Streaming vs. non-streaming
The action matters here:| Action | Non-streaming | Streaming |
|---|---|---|
block | Response is withheld; HTTP 400 guardrail_blocked | Scanner cuts the stream mid-flight and emits a replacement message — the blocked content never reaches the client |
mask | Match is redacted in the returned text | Currently applies to non-streaming responses only; in-band stream rewriting is on the roadmap |
block works correctly.
A blocked request costs no quota. An output-stage block refunds the
pre-consumed quota after the response is rejected. The caller receives HTTP 400
guardrail_blocked naming the guardrail and the rule that fired.4. The Jailbreak safety preset
The console ships a Jailbreak / Role-Play Blocker preset in the Safety template category alongside Prompt-Injection Basics. It is a singleregex
rule with action block, matching known jailbreak and role-play override
patterns as a ready-made starting point.
To apply it: open /console/guardrails → New guardrail → browse the
template library → Safety → Jailbreak / Role-Play Blocker. The preset is a
seed — extend the pattern, and add output-stage rules to match your
application’s needs.
5. Test your policy before shipping
Before attaching a jailbreak guardrail to a production key, validate it in the eval / red-team harness on the Eval tab inside the guardrail editor.- Bundled adversarial corpora — the gateway ships red-team sets including jailbreak variants, multilingual evasion, and encoding tricks. Run your policy against them to measure catch rate before it sees real traffic.
- Custom corpora — upload your own JSONL to test against phrases specific to your domain or threat model.
- False-positive corpora — benign sets ship alongside the adversarial ones. Run both to confirm you are not blocking legitimate traffic.
- Eval runs are listed with scores; open a run to inspect failures sample by sample and tune the rubric.
6. Recommended policy shape
A robust jailbreak policy layers three rules in a single guardrail:| # | Rule | Stage | Action | Why |
|---|---|---|---|---|
| 1 | keyword — known jailbreak phrases | input | block | Zero latency; catches known phrases deterministically |
| 2 | llm_judge — jailbreak intent rubric | input | block | Catches novel variants and encoding tricks the keyword list misses |
| 3 | llm_judge — disallowed-response rubric | output | block | Defense-in-depth: blocks a successful bypass before it reaches the client |
block only after an eval run shows acceptable false-positive
rate. See Enforcement modes for the
observe → shadow → enforce rollout pattern using flag actions and shadow mode.
7. Relationship to prompt injection
Jailbreaks and prompt injections are distinct but overlapping threats:- A jailbreak targets the model’s safety training — the attacker controls the direct user message and crafts it to suppress guardrails.
- A prompt injection targets instruction-following — untrusted content (a web page, a tool result, a document) carries instructions that the model treats as directives.
llm_judge and keyword rules catch both; the rubric differs. For
agent workloads that ingest untrusted documents or web content, run injection
screening alongside jailbreak screening. See
Prompt injection for the injection-specific
rule patterns.
Guardrails reference
Full reference for rule types, actions, stages, the LLM judge, the eval
harness, and the Matches feed.
Prompt injection
Screening for injected instructions from untrusted content in agent pipelines.
