Guardrails overview - OrcaRouter

A guardrail is the content-policy layer of the OrcaRouter gateway. You author one named policy in your workspace, attach it to an API key, and every /v1/* call that key makes is screened — before the model sees the prompt and after the model answers — with no redeploy and no SDK change. This page is the hub for the Guardrails section: what a guardrail is, the rule types, the stages and actions, and how a policy attaches to a key. Each spoke goes deeper. For the full engine reference, see Guardrails.

1. What ai guardrails do on the gateway

Most teams reach for guardrails to keep sensitive data out of prompts (PII, secrets), to gate unsafe content (jailbreaks, prompt-injection intent), or to satisfy a compliance control. A guardrail is the gateway’s answer: a workspace-scoped, named policy — an ordered list of rules the gateway runs against request input and model output. Because the binding lives on the API key in the gateway — not in your application — editing a guardrail shifts every attached key on the next call. Your code keeps calling /v1/chat/completions exactly as before.

Guardrails are content policy (text in, text out). The companion Agent Firewall is tool policy — it governs which tool calls an agent may make. The two compose; see Guardrails vs. firewall.

2. One concrete example

Create a guardrail named pii-shield in the console (/console/guardrails), add a single PII rule — stage input, action mask, entities email, ssn — and attach it to a key. From then on:

curl https://api.orcarouter.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-orca-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "Reply to jane@acme.com please"}
    ]
  }'

The gateway rewrites the prompt to Reply to [EMAIL] please before forwarding — the upstream model never sees the address. Flip that ssn entity to block and the next request carrying an SSN is rejected with HTTP 400. No application change.

Authoring is a console / management-API action on your session — the sk-orca-... relay key is only for /v1/* traffic, never for editing policy. Creating or editing a guardrail requires the Developer+ role.

3. Rules: type, stage, action

Every rule answers three questions. The engine runs all applicable rules and folds them into one decision.

Type — what to look for

Seven rule types. The built-ins are deterministic (pure string/regex, no network); the advanced ones call out to a model or vendor and run concurrently.

keyword — literal denylist, case-insensitive substring match.
regex — an RE2 pattern (linear-time, no backreferences).
pii — built-in entity detectors plus your own. See §5.
max_chars — caps the character count at a stage.
external — delegates to a connected vendor (Aporia, Averta, or your own webhook).
llm_judge — a semantic check against a model in your workspace.
grounding — scores answer faithfulness against the request’s retrieved sources (RAG).

Stage — where to look

input (the request), output (the model’s response), or both. Input rules run before the upstream call; output rules run after the model responds. See input stage and output stage.

Action — what to do

Five actions surface in the rule builder:

block — reject the call with HTTP 400.
mask — redact the match and let the sanitized text through.
flag — change nothing about the traffic; record the match only.
annotate — leave the text alone but inject a security note upstream (e.g. a CVE advisory before the model answers).
spotlight — wrap the matched untrusted text in delimiters and tell the model to treat it as data, not instructions.

See Actions. Use flag to measure a rule on live traffic before you enforce it.

4. How a guardrail attaches and resolves

A guardrail binds to a key via guardrail_id, or a workspace can mark one guardrail as its default. For any request the gateway resolves in this order:

Explicit attachment — if the key’s guardrail_id points at a guardrail that exists and is enabled, that one applies. An explicit attachment never falls back: disabling it is the off switch.
Workspace default — if the key has no attachment, the enabled default guardrail applies.
Neither — no enforcement; the request is byte-identical to a workspace that never turned the feature on.

This differs from the firewall. A disabled attached firewall policy falls back to the workspace default; a disabled attached guardrail goes to none. The off switch is literal for guardrails.

Walkthroughs: create your first guardrail, attach to a key, set an account default.

5. PII detectors

A pii rule ships a closed set of built-in detectors: email, phone, credit_card, ssn, ip, iban, mac_address, jwt, aws_access_key, api_key_openai, bitcoin_address — plus the regional jp_mynumber, kr_rrn, and cn_resident_id. On a mask action each match becomes a typed tag — an email renders as [EMAIL], an SSN as [SSN]. You can layer up to 25 custom entities per rule (a regex with optional Luhn checksum), and route different entities to different actions in one rule via per-entity overrides.

The turnkey starting point is the PII Shield preset — a single pii rule, mask, stage both. Input-stage masking rewrites the request before the model (streaming or not); output masking rewrites the response on non-streaming responses only — in-stream output rewriting is on the roadmap. See PII Shield, custom entities, and masking formats.

6. The preset picker

New guardrail opens into a template. Presets are authored server-side, so the console, the sandbox, and these docs describe the same behavior. The picker groups them into categories:

Category	Example presets	Spoke
pii / secrets	PII Shield, secret-credential blockers	block secrets
safety	prompt-injection, jailbreak, self-harm	prompt injection
compliance	GDPR, PCI, HIPAA, compliance logger	compliance logger
brand / cost	profanity, competitor mentions, size caps	brand safety · cost
agent	URL / shell-tool / SQL-in-output filters	agentic
code_security	secret-file blocks, copyleft-license review	code security

A preset is a seed, not a lock — apply it, then edit freely. More starting points live under templates.

7. When a guardrail blocks

A blocked request returns HTTP 400 with error code guardrail_blocked and a message naming the guardrail and rule that fired.

No quota is charged. An input-stage block fires before metering; an output-stage block refunds the pre-consumed quota.
The request is marked skip-retry — re-running the same prompt would just block again, so the gateway won’t waste a retry on another channel.

On streaming, block is enforced best-effort — a scanner buffers a small lookahead and cuts the stream when a rule fires, so already-flushed bytes can’t be retracted. Mask on output applies to non-streaming responses only — on a streaming response the gateway computes the mask but does not forward the redacted text; in-stream output rewriting is on the roadmap. (Input-stage masking is live on streaming and non-streaming alike.) See the guardrail_blocked error and streaming coverage.

8. After it’s live

Matches feed

Every rule that fires records type, action, stage, and detail. Group, filter, export, and drill into a single match.

Logging & privacy

The matched substring is recorded only when Log raw content is on — off by default, the privacy-conservative posture.

Versioning

Every change writes a history row. Diff any two versions and revert as a new version — history is never mutated.

Testing & eval

A sandbox Test tab evaluates the current policy with no upstream call, and an eval harness scores it against bundled or custom corpora.

A false positive is a tuning signal, not a reason to disable the rule. Mark it in the Matches feed and narrow the pattern — see tune false positives.

9. Where to go next

Pick the right rule type

Sensitive-word denylists · regex detectors · input stage · output stage · stream-safe rules.

Understand the model

Guardrails vs. firewall · how OrcaRouter inspects traffic · enforcement modes · scope: keys, policies, workspaces.

Map to threats

Prompt injection · jailbreaks · data exfiltration.

Full engine reference

Guardrails — every field, every route, the LLM-judge and grounding rules, and external vendors in depth.

​1. What ai guardrails do on the gateway

​2. One concrete example

​3. Rules: type, stage, action

​4. How a guardrail attaches and resolves

​5. PII detectors

​6. The preset picker

​7. When a guardrail blocks

​8. After it’s live

Matches feed

Logging & privacy

Versioning

Testing & eval

​9. Where to go next

1. What ai guardrails do on the gateway

2. One concrete example

3. Rules: type, stage, action

4. How a guardrail attaches and resolves

5. PII detectors

6. The preset picker

7. When a guardrail blocks

8. After it’s live

9. Where to go next