1. Why screen for unsafe ai output at the output stage
Input screening catches a bad prompt. It can’t catch a bad answer: a model that’s coaxed off-policy, a fine-tune with weaker built-in guardrails, or a perfectly reasonable prompt that produced an unreasonable completion. The output stage is where you assert “regardless of why, this text does not leave the gateway.” A gateway rule fires deterministically and applies equally across every model behind your key. And every rule that fires lands in the workspace Matches feed — rule type, action, stage — so you have an audit trail of what was caught and what was let through.The defense lives in the gateway, not your app. Edit the guardrail
and the change takes effect on the next call for every key attached to
it — no redeploy, no SDK change. Your app keeps calling
/v1/chat/completions exactly as before.2. The two ways to catch it
Pair a deterministic denylist with a semantic judge for defense-in-depth.Literal — keyword / regex (zero latency)
Literal — keyword / regex (zero latency)
A
keyword rule is a case-insensitive substring match; a regex
rule is an RE2 pattern (linear-time, no backreferences). Both run on
the hot path with no network call — ideal for a known banned-words
list, a competitor denylist, or a structural pattern (a leaked
chat-template token, a definitive “you are entitled to damages”
phrase).Semantic — llm_judge (catches what no regex can)
Semantic — llm_judge (catches what no regex can)
An
llm_judge rule evaluates the response against a rubric you write
using a model in your workspace — toxicity, off-brand tone,
off-policy advice that no literal list captures. It carries a
judge_timeout_ms, is fail-open by default (a judge error is
logged and the response continues), and its tokens are billed as a
judge sub-line. See the
LLM judge reference.3. One concrete example — block toxic, mask off-brand
A single output-stage guardrail that blocks a toxic response semantically and masks banned brand terms in whatever’s left:/console/guardrails → New
guardrail, add the two rules, and attach it to a key from the Token
editor (the binding lives on the key as guardrail_id). Configuration
runs on your console session, not your relay key; only the /v1/* call
below uses an sk-orca-... key.
guardrail_blocked. If it’s clean but name-drops a banned term, that span
renders as a typed redaction and the rest flows through.
4. Start from a preset
The New guardrail template library ships ready-made starting points in the Safety, Brand, and Compliance categories. A preset is a seed — apply it, then edit freely.| Category | Output-stage preset to start from |
|---|---|
| Safety | System-Prompt Leak Detector (output), Strong System Prompt Leak — flag/block responses that echo system-prompt or chat-template tokens. |
| Brand | Profanity Filter (mask) — runs on both stages and masks denylisted words in the response. (The block-style Profanity / Brand Safety and Competitor Mentions presets are input-stage seeds; retarget a copy to output if you want them to screen the answer.) |
| Compliance | Legal Disclaimer Enforce — flag responses giving definitive legal/financial advice for team review. |
5. Streaming: the caveat that matters
Whether an output rule is enforced live depends on the action and on whether you stream.| Action | Non-streaming | Streaming |
|---|---|---|
block | Response withheld; HTTP 400 guardrail_blocked | Scanner cuts the stream mid-flight and emits a replacement message — blocked content never reaches the client |
mask | Match redacted in the returned text | Non-streaming only today; in-band stream rewriting is on the roadmap |
flag | Records a match, changes nothing | Records a match, changes nothing |
6. Recommended policy shape
Layer three rules in one guardrail
-
keyword/regexatoutput— zero-latency catch for known banned terms and structural patterns. -
llm_judgeatoutput— semantic toxicity / off-brand / off-policy catch for what the literal list misses. -
Roll out via
flagfirst, watch the Matches feed, then promote toblockonce the false-positive rate is acceptable. See Enforcement modes.
Guardrails reference
Full reference for rule types, actions, stages, the LLM judge, presets,
the eval harness, and the Matches feed.
Data exfiltration
Stopping sensitive data from leaving in a model’s response or a tool call.
