A model that passes its own safety training can still emit text you can’t ship: profanity in a customer reply, a competitor’s name in your branded assistant, a definitive legal claim your compliance team would never sign off on. The prompt looked fine; the response is the problem. OrcaRouter screens the model’s response at the gateway, on the output stage, before it reaches your client. The check is a guardrail rule that runs after the upstream model responds and folds into one verdict — block the response, mask the offending span, or flag it for review — independent of which model served the request.

1. Why screen for unsafe AI output at the output stage

Input screening catches a bad prompt. It can’t catch a bad answer: a model that’s coaxed off-policy, a fine-tune with weaker built-in guardrails, or a perfectly reasonable prompt that produced an unreasonable completion. The output stage is where you assert “regardless of why, this text does not leave the gateway.” A gateway rule fires deterministically and applies equally across every model behind your key. And every rule that fires lands in the workspace Matches feed — rule type, action, stage — so you have an audit trail of what was caught and what was let through.
The defense lives in the gateway, not your app. Edit the guardrail and the change takes effect on the next call for every key attached to it — no redeploy, no SDK change. Your app keeps calling /v1/chat/completions exactly as before.

2. The two ways to catch it

Pair a deterministic denylist with a semantic judge for defense-in-depth.
A keyword rule is a case-insensitive substring match; a regex rule is an RE2 pattern (linear-time, no backreferences). Both run on the hot path with no network call — ideal for a known banned-words list, a competitor denylist, or a structural pattern (a leaked chat-template token, a definitive “you are entitled to damages” phrase).
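To make the matching semantics concrete, here is a minimal local sketch in Python. It is not the gateway's implementation: the function names are hypothetical, the `[REDACTED]` placeholder is an assumption (the actual redaction format is not shown here), and Python's `re` module is a backtracking engine rather than RE2, so only the pattern style (no backreferences, no lookarounds) carries over.

```python
import re

def keyword_hits(text: str, keywords: list[str]) -> list[str]:
    """Return the keywords that occur in text as case-insensitive substrings."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]

def mask_keywords(text: str, keywords: list[str], placeholder: str = "[REDACTED]") -> str:
    """Replace every case-insensitive occurrence of each keyword.

    The placeholder is an illustrative assumption, not the gateway's format.
    """
    for kw in keywords:
        # re.escape makes the keyword a literal; the lambda avoids escape handling
        # in the replacement string.
        text = re.sub(re.escape(kw), lambda _m: placeholder, text, flags=re.IGNORECASE)
    return text

# A structural pattern written in RE2-compatible syntax (no backreferences).
ENTITLED = re.compile(r"(?i)\byou are entitled to damages\b")

reply = "Unlike Competitor-Name, we think you are entitled to damages."
print(keyword_hits(reply, ["competitor-name", "internal-codename"]))
print(mask_keywords(reply, ["competitor-name"]))
print(bool(ENTITLED.search(reply)))
```

Trying a denylist locally like this is a cheap way to spot over-broad substrings (a short keyword that also matches inside ordinary words) before the rule goes into a guardrail.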
An llm_judge rule evaluates the response against a rubric you write using a model in your workspace — toxicity, off-brand tone, off-policy advice that no literal list captures. It carries a judge_timeout_ms, is fail-open by default (a judge error is logged and the response continues), and its tokens are billed as a judge sub-line. See the LLM judge reference.
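The fail-open behavior can be pictured with a small Python sketch. Everything here is illustrative: `judge_verdict` and the `judge` callable are hypothetical stand-ins for the gateway's internal judge call, and treating a failure as a block when `fail_open` is false is an assumption about what the opposite setting would mean.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def judge_verdict(
    judge: Callable[[str], bool],  # hypothetical: True means the rubric answer is "yes"
    response_text: str,
    timeout_ms: int = 2000,
    fail_open: bool = True,
) -> str:
    """Return "block" or "allow" for one response, honoring timeout and fail-open."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(judge, response_text)
    try:
        violates = future.result(timeout=timeout_ms / 1000)
    except Exception as exc:  # covers both a timeout and an error raised by the judge
        print(f"judge failed: {exc!r}")  # stands in for the gateway logging the error
        return "allow" if fail_open else "block"  # fail-closed branch is an assumption
    finally:
        pool.shutdown(wait=False)  # do not wait for a judge call that overran
    return "block" if violates else "allow"

# Usage: a judge slower than the timeout counts as a failure, so fail-open allows.
slow_judge = lambda text: time.sleep(0.2) or True
print(judge_verdict(slow_judge, "draft reply", timeout_ms=50))
```

The design trade-off is visible in the `except` branch: fail-open keeps responses flowing when the judge is unavailable, at the cost of letting an unjudged response through.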

3. One concrete example — block toxic, mask off-brand

A single output-stage guardrail that blocks a toxic response semantically and masks banned brand terms in whatever’s left:
{
  "name": "safe-output",
  "rules": [
    {
      "type": "llm_judge",
      "stage": "output",
      "action": "block",
      "judge_model": "openai/gpt-4o-mini",
      "judge_format": "yes_no",
      "judge_rubric": "Does this response contain toxic, harassing, hateful, or otherwise unsafe content? Answer yes or no.",
      "judge_fail_open": true
    },
    {
      "type": "keyword",
      "stage": "output",
      "action": "mask",
      "keywords": ["competitor-name", "internal-codename"]
    }
  ]
}
Author this in the console: open /console/guardrails, select New guardrail, add the two rules, and attach it to a key from the Token editor (the binding lives on the key as guardrail_id). Configuration runs on your console session, not your relay key; only the /v1/* call below uses an sk-orca-... key.
curl https://api.orcarouter.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-orca-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Draft a reply to this angry customer"}]
  }'
If the model returns a toxic draft, the response is withheld with HTTP 400 guardrail_blocked. If it’s clean but name-drops a banned term, that span renders as a typed redaction and the rest flows through.
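On the client side, a blocked response is worth handling separately from other errors. The Python sketch below shows one way to do that; it makes two assumptions that are not confirmed here: that the string `guardrail_blocked` appears somewhere in the 400 error body, and that a successful body follows the OpenAI-style chat completion shape. The function names are hypothetical.

```python
import json

def classify_gateway_response(status: int, body: str) -> str:
    """Classify a /v1/chat/completions result as "blocked", "error", or "ok".

    Assumes the marker "guardrail_blocked" appears in the 400 error body.
    """
    if status == 400 and "guardrail_blocked" in body:
        return "blocked"
    if status >= 400:
        return "error"
    return "ok"

def reply_text(status: int, body: str, fallback: str) -> str:
    """Return the model's text, or a fallback when the guardrail withheld it."""
    kind = classify_gateway_response(status, body)
    if kind == "blocked":
        return fallback
    if kind == "error":
        raise RuntimeError(f"gateway error {status}")
    # Assumed OpenAI-style body: choices[0].message.content
    return json.loads(body)["choices"][0]["message"]["content"]

ok_body = json.dumps({"choices": [{"message": {"content": "Sorry for the trouble."}}]})
print(reply_text(200, ok_body, fallback="A teammate will follow up shortly."))
print(reply_text(400, '{"error": {"code": "guardrail_blocked"}}',
                 fallback="A teammate will follow up shortly."))
```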
Iterate before you attach. The Test tab inside the editor runs the current policy over a sample response at the output stage — no upstream call, no quota — and the Eval tab runs it against a corpus so you can prove catch rate and false-positive rate before production. See the eval harness.
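For reference, catch rate and false-positive rate are simple ratios over a labeled corpus. The Python sketch below computes them from hypothetical `(should_fire, did_fire)` pairs; the data shape is an assumption made for illustration and is not the Eval tab's export format.

```python
def eval_metrics(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compute catch rate and false-positive rate.

    Each item is (should_fire, did_fire) for one corpus sample.
    """
    tp = sum(1 for should, did in results if should and did)
    fn = sum(1 for should, did in results if should and not did)
    fp = sum(1 for should, did in results if not should and did)
    tn = sum(1 for should, did in results if not should and not did)
    return {
        # Share of unsafe samples the policy caught.
        "catch_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of clean samples the policy wrongly fired on.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }

# Hypothetical corpus: 2 unsafe samples (1 caught), 4 clean samples (1 wrongly flagged).
corpus = [(True, True), (True, False), (False, False),
          (False, True), (False, False), (False, False)]
print(eval_metrics(corpus))
```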

4. Start from a preset

The New guardrail template library ships ready-made starting points in the Safety, Brand, and Compliance categories. A preset is a seed — apply it, then edit freely.
| Category | Output-stage preset to start from |
| --- | --- |
| Safety | System-Prompt Leak Detector (output), Strong System Prompt Leak: flag/block responses that echo system-prompt or chat-template tokens. |
| Brand | Profanity Filter (mask): runs on both stages and masks denylisted words in the response. (The block-style Profanity / Brand Safety and Competitor Mentions presets are input-stage seeds; retarget a copy to output if you want them to screen the answer.) |
| Compliance | Legal Disclaimer Enforce: flag responses giving definitive legal/financial advice for team review. |
The Compliance category also packages framework-aligned policies; for audited rollouts driven by a framework, install a compliance pack and pair it with the audit trail (see Audit trail).

5. Streaming: the caveat that matters

Whether an output rule is enforced live depends on the action and on whether you stream.
| Action | Non-streaming | Streaming |
| --- | --- | --- |
| block | Response withheld; HTTP 400 guardrail_blocked | Scanner cuts the stream mid-flight and emits a replacement message; blocked content never reaches the client |
| mask | Match redacted in the returned text | Non-streaming only today; in-band stream rewriting is on the roadmap |
| flag | Records a match, changes nothing | Records a match, changes nothing |
Output mask is not yet live on streaming responses. If you stream and rely on masking to redact off-brand spans, the original chunk passes through unmasked. Either request non-streaming when masking the response, or use a block rule (enforced on streaming and non-streaming) for content that must never leave the gateway. The same caveat applies to the PII Shield preset, whose live masking is input-stage today.
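The action-by-mode behavior can be encoded as a small client-side lookup, so an application can decide whether to stream. This Python sketch mirrors the behavior described in this section; the names `ALTERS_RESPONSE`, `alters_response`, and `safe_stream_setting` are hypothetical helpers, not part of any OrcaRouter SDK.

```python
# (action, is_streaming) -> does the rule change what the client receives?
ALTERS_RESPONSE = {
    ("block", False): True,   # response withheld
    ("block", True): True,    # stream cut and replaced
    ("mask", False): True,    # match redacted in the returned text
    ("mask", True): False,    # masking is not yet live on streams
    ("flag", False): False,   # records a match only
    ("flag", True): False,
}

def alters_response(action: str, stream: bool) -> bool:
    """Look up whether an output-stage action changes the delivered response."""
    return ALTERS_RESPONSE[(action, stream)]

def safe_stream_setting(actions: set[str], wants_stream: bool) -> bool:
    """Return the stream flag to send: fall back to non-streaming when masking is relied on."""
    if wants_stream and "mask" in actions:
        return False
    return wants_stream

print(alters_response("mask", stream=True))
print(safe_stream_setting({"block", "mask"}, wants_stream=True))
```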
A blocked response costs no quota — the output-stage block refunds the pre-consumed quota after the response is rejected — and is marked skip-retry, since re-running the same prompt would just block again.

Layer three rules in one guardrail

  1. keyword / regex at output — zero-latency catch for known banned terms and structural patterns.
  2. llm_judge at output — semantic toxicity / off-brand / off-policy catch for what the literal list misses.
  3. Roll out via flag first, watch the Matches feed, then promote to block once the false-positive rate is acceptable. See Enforcement modes.
To screen the request as well — jailbreak and injection attempts that produce unsafe output in the first place — run an input-stage guardrail alongside this one. See Jailbreaks and Prompt injection.

Guardrails reference

Full reference for rule types, actions, stages, the LLM judge, presets, the eval harness, and the Matches feed.

Data exfiltration

Stopping sensitive data from leaving in a model’s response or a tool call.