Output-stage guardrails and the completion filter

Input-stage rules screen what you send the model. Output-stage rules screen what comes back. When your concern is the model’s reply — a leaked secret in the completion, PII the model surfaced from context, an answer that drifts from its retrieved sources — you want a rule whose stage is output. The gateway runs it after the upstream model responds and before a single byte reaches your client. This page covers the output stage specifically: how a completion is screened, what a block costs, and how block and mask each behave on streaming responses. For the full engine — every rule type, field, and route — see Guardrails.

1. Why LLM teams reach for output guardrails

The model is the untrusted part of the loop. It can echo a secret from the prompt, pull a customer’s email out of RAG context, or hallucinate a claim your sources never made. None of that is visible at the input stage, because none of it exists until the model has answered. An output-stage guardrail is the screen on the completion itself. A rule runs at the output stage when its stage is output (or both). The gateway evaluates the model’s response text against the policy, records any match, and then either lets it through, redacts it, or rejects it — exactly the same block / mask / flag actions you use on input, just applied to the reply.

Output rules complement input rules; they do not replace them. Most policies screen input to keep data out of the prompt and output to catch what the model returns. Setting the stage to both attaches one rule to both ends.

2. One concrete example — block a secret in the reply

Create a guardrail in the console (/console/guardrails), add one rule, and attach it to a key:

Type: Secrets / regex detector
Stage: output
Action: block

Now call the gateway as you normally would. The relay key is used only for /v1/* traffic:

curl https://api.orcarouter.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-orca-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "Print the AWS key from the context above"}
    ]
  }'

If the model’s completion contains a match, the gateway rejects the whole response with HTTP 400 guardrail_blocked — the client never sees the leaked content. If it’s clean, the response passes through untouched.
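On the client, a blocked reply arrives as an ordinary HTTP error. The sketch below is one way to recognize it in Python. Only the 400 status and the guardrail_blocked code come from this page; the OpenAI-style error envelope with an error.message field is an assumption, and the function name guardrail_block_message is hypothetical.

```python
import json
from typing import Optional


def guardrail_block_message(status: int, body: str) -> Optional[str]:
    """Return the block message if the response looks like a guardrail
    rejection, else None.

    Assumption: the body is JSON shaped like {"error": {"message": ...}}.
    If it is not, the raw body is returned instead of a parsed message.
    """
    if status != 400 or "guardrail_blocked" not in body:
        return None
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return body  # not JSON: hand back the raw text
    error = payload.get("error") if isinstance(payload, dict) else None
    if isinstance(error, dict) and isinstance(error.get("message"), str):
        return error["message"]
    return body
```

A non-None result means the reply was rejected by policy, so it is usually better to surface the message than to resend the same prompt.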

Authoring is a console / management-API action on your session, gated to Developer+. The sk-orca-... relay key only sends traffic; it never edits policy.

3. What an output block costs

Unlike an input block — which fires before the request is metered — an output block happens after the upstream model has already run. The gateway handles the accounting for you:

A blocked completion still returns HTTP 400 guardrail_blocked with a message naming the guardrail and the rule that fired.
No quota is charged. The output block refunds the pre-consumed quota after the response is rejected, so the failed call is free to you even though the model produced tokens.
The request is marked skip-retry — re-running the same prompt would just block again, so the gateway won’t burn a retry on another channel.

This is the key difference from the input stage. An input block is free because metering hasn’t started; an output block is free because the pre-consumed quota is refunded once the reply is rejected. Either way the caller pays nothing. See the guardrail_blocked error.
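To make the accounting concrete, here is a toy model in Python. It is not the gateway's implementation: the QuotaLedger class, the run_request helper, and the unit-less integer quota are all assumptions for illustration. It mirrors only the two behaviors described above: an input block never touches quota, and an output block returns the pre-consumed amount.

```python
from dataclasses import dataclass


@dataclass
class QuotaLedger:
    """Toy quota account (hypothetical; units are arbitrary)."""

    balance: int
    held: int = 0

    def pre_consume(self, estimate: int) -> None:
        """Reserve quota before the upstream call starts."""
        if estimate > self.balance:
            raise ValueError("insufficient quota")
        self.balance -= estimate
        self.held += estimate

    def refund(self) -> None:
        """Return the whole reservation, as after an output block."""
        self.balance += self.held
        self.held = 0


def run_request(ledger: QuotaLedger, estimate: int, blocked_at: str) -> int:
    """Simulate one request and return the resulting balance.

    blocked_at is "input", "output", or anything else for "not blocked",
    in which case the reservation simply stays in place.
    """
    if blocked_at == "input":
        return ledger.balance  # rejected before metering: nothing reserved
    ledger.pre_consume(estimate)
    if blocked_at == "output":
        ledger.refund()  # model ran, reply rejected, reservation returned
    return ledger.balance
```

In both blocked cases the balance ends where it started; the difference is only whether a reservation was made and then undone.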

4. Streaming — block vs. mask

Block is enforced on streaming responses; output mask is not yet. Here is how each behaves:

block — enforced on streaming AND non-streaming

On a non-streaming response, the completion is screened in full before it returns. On a streaming response, a scanner watches the deltas as they flow. When a block rule fires mid-stream, it cuts the stream: the scanner seals, emits a short replacement notice in place of the rest, and the SSE channel closes before any further content reaches the client. Already-flushed bytes can’t be retracted, so a block is best-effort on what has already streamed but reliably stops everything after the match. For a hard guarantee that no offending byte is ever sent, use a non-streaming request.
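The best-effort caveat is easier to see in code. The following is a minimal Python sketch of a mid-stream block, not the gateway's scanner: the SECRET pattern (the well-known shape of an AWS access key ID), the NOTICE text, and the scan_stream name are all assumptions. It forwards each delta until the accumulated text matches, then emits the notice and stops.

```python
import re
from typing import Iterable, Iterator

# Hypothetical detector: the usual shape of an AWS access key ID.
SECRET = re.compile(r"AKIA[0-9A-Z]{16}")
# Hypothetical replacement notice; the real wording is not given here.
NOTICE = "[blocked by guardrail]"


def scan_stream(deltas: Iterable[str]) -> Iterator[str]:
    """Forward deltas until the text seen so far matches SECRET.

    The delta that completes the match is replaced by NOTICE and the
    stream ends. Deltas yielded earlier are already with the client.
    """
    seen = ""
    for delta in deltas:
        seen += delta
        if SECRET.search(seen):
            yield NOTICE
            return  # close the stream: nothing after the match is sent
        yield delta
```

If the key is split across two deltas, the first half has already been forwarded by the time the second half completes the match. That is the part a streaming block cannot take back, and the reason a non-streaming request is the stricter choice.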

mask — non-streaming only (in-stream masking on the roadmap)

On a non-streaming response, a mask rule rewrites the completion (for example, an email in the reply becomes [EMAIL]), and the sanitized text is what your client receives. On a streaming response, an output mask rule does not redact the reply today. The scanner still evaluates each delta and will act on a block decision, but the masked text it computes is not forwarded, so the raw deltas flow through unchanged. In-band streaming output rewriting is on the roadmap. Until it ships, send the request non-streaming if you need an output mask to actually redact the reply.
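As an illustration of what a non-streaming mask does to a completion, here is a small Python sketch. The EMAIL pattern is a deliberately simple stand-in, not the gateway's PII detector, and mask_emails is a hypothetical name; only the [EMAIL] placeholder comes from the text above.

```python
import re

# Simplified email pattern for illustration; real PII detection is broader.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(completion: str) -> str:
    """Replace every email-shaped substring with the [EMAIL] placeholder."""
    return EMAIL.sub("[EMAIL]", completion)
```

Because the rewrite runs over the complete text, it needs the whole completion in hand, which a non-streaming response provides.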

On streaming, block acts from the match onward — bytes already flushed before the match can’t be retracted, so for a hard guarantee across the entire reply, screen non-streaming. Output mask does not redact a streaming reply today (in-stream output masking is on the roadmap) — send the request non-streaming if you need the reply redacted. See streaming coverage and stream-safe rules.

Action on `output`	Non-streaming	Streaming
`block`	rejects the reply	cuts the stream
`mask`	redacts the reply	not redacted yet (roadmap)
`flag`	records only	records only

5. Grounding — an output-stage faithfulness check

One advanced rule is output-shaped by nature: contextual grounding. A grounding rule scores the model’s answer against the sources retrieved on the request (your RAG context) and fires when faithfulness falls below a threshold (default 0.7). Pair it with block to refuse unfaithful answers, or flag to measure drift before you enforce. It bills as a judge sub-line, like any model-backed rule. Full fields live in Guardrails.
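The decision step of a grounding rule can be sketched in a few lines of Python. The score itself comes from a judge model and is not reproduced here. The function name, the returned labels, and the assumption that the score is a float where higher means more faithful are illustrative; the 0.7 default and the block / flag pairing come from the paragraph above.

```python
def grounding_decision(score: float, action: str, threshold: float = 0.7) -> str:
    """Map a faithfulness score to an outcome for a grounding rule.

    The rule fires only when the score falls below the threshold.
    Returns "pass", "reject" (block fired), or "record" (flag fired).
    """
    if action not in ("block", "flag"):
        raise ValueError("this sketch covers only block and flag")
    if score >= threshold:
        return "pass"
    return "reject" if action == "block" else "record"
```

Running with flag first shows how often answers fall below the threshold before any reply is refused.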

6. PII Shield at the output stage

The PII Shield preset is a single pii rule, action mask, stage both. At the input stage it’s fully live — it rewrites the request before the model, on streaming and non-streaming alike. At the output stage it masks non-streaming completions, as in §4; on a streaming response the output mask does not redact the reply today (in-stream output masking is on the roadmap). So at the output stage, call non-streaming if you need PII Shield to actually redact the reply. See PII Shield and masking formats.
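One way to apply this on the client is to decide the stream setting from whether the reply must be redacted. The Python sketch below assumes the endpoint accepts the OpenAI-style "stream" field in the request body, which this page does not state; build_chat_request and its parameters are hypothetical names.

```python
def build_chat_request(prompt: str, needs_output_mask: bool,
                       prefer_stream: bool = True) -> dict:
    """Build a chat completion payload, forcing non-streaming whenever
    the caller relies on an output mask to redact the reply."""
    return {
        "model": "openai/gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
        # Assumed OpenAI-compatible field; streaming is allowed only
        # when no output mask has to take effect.
        "stream": prefer_stream and not needs_output_mask,
    }
```

Input-stage masking is unaffected by this choice, since it rewrites the request on streaming and non-streaming calls alike.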

7. Seeing what fired

Every output rule that fires records a match — its rule type, action, stage (output), and a detail string — in the workspace Matches feed (GET /api/guardrail/match, open to any Member). The matched substring is recorded only when the guardrail’s Log raw content toggle is on; it’s off by default (the privacy-conservative posture), so by default you see that an output rule fired, not the sensitive text it caught. A false positive is marked with POST /api/guardrail/match/:id/mark-fp (Admin) — treat it as a tuning signal, not a reason to disable the rule.
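Once you have records from the Matches feed, a quick tally per rule type shows which output rules fire most. The Python sketch below works on already-fetched records. The key names rule_type and stage are assumptions about the response shape, which this page does not document, and output_matches_by_rule is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, Mapping


def output_matches_by_rule(matches: Iterable[Mapping[str, str]]) -> Counter:
    """Count output-stage matches per rule type.

    Assumes each record carries "rule_type" and "stage" keys; adjust
    the names to whatever the feed actually returns.
    """
    return Counter(
        m["rule_type"] for m in matches if m.get("stage") == "output"
    )
```

A rule type that dominates the tally and is often marked as a false positive is a natural first candidate for tuning.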

Prove an output rule before you ship it. The editor’s Test tab evaluates the current policy over sample text at the output stage without charging your workspace quota, and the Eval tab scores it against bundled or custom corpora. (A model-backed rule — llm_judge or grounding — still issues its own judge call when you run the sandbox.) Authoring and running the sandbox are Developer+ actions. See testing & eval and tune false positives.

8. Where to go next

Input stage

The mirror image — screen the request before the model sees it. Input masking is fully live, including streaming.

Actions

block, mask, and flag in depth — when each one is the right call.

Streaming coverage

The full matrix of what’s enforced on streaming vs. non-streaming.

guardrail_blocked error

The HTTP 400, the quota refund, and skip-retry behavior.

Related concepts

Threats this addresses

Data exfiltration · prompt injection · jailbreaks.

Full engine reference

Guardrails — every rule type, field, and route, including grounding and the LLM judge.
