A guardrail is the content-policy layer of the OrcaRouter gateway. You author one named policy in your workspace, attach it to an API key, and every /v1/* call that key makes is screened — before the model sees the prompt and after the model answers — with no redeploy and no SDK change. This page is the hub for the Guardrails section: what a guardrail is, the rule types, the stages and actions, and how a policy attaches to a key. Each spoke goes deeper. For the full engine reference, see Guardrails.

1. What AI guardrails do on the gateway

Most teams reach for guardrails to keep sensitive data out of prompts (PII, secrets), to gate unsafe content (jailbreaks, prompt-injection intent), or to satisfy a compliance control. A guardrail is the gateway’s answer: a workspace-scoped, named policy made up of an ordered list of rules that the gateway runs against request input and model output. Because the binding lives on the API key in the gateway, not in your application, an edit to a guardrail takes effect for every attached key on the next call. Your code keeps calling /v1/chat/completions exactly as before.
Guardrails are content policy (text in, text out). The companion Agent Firewall is tool policy — it governs which tool calls an agent may make. The two compose; see Guardrails vs. firewall.

2. One concrete example

Create a guardrail named pii-shield in the console (/console/guardrails), add a single PII rule — stage input, action mask, entities email, ssn — and attach it to a key. From then on:
curl https://api.orcarouter.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-orca-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "Reply to jane@acme.com please"}
    ]
  }'
The gateway rewrites the prompt to Reply to [EMAIL] please before forwarding — the upstream model never sees the address. Flip that ssn entity to block and the next request carrying an SSN is rejected with HTTP 400. No application change.
Authoring is a console / management-API action on your session — the sk-orca-... relay key is only for /v1/* traffic, never for editing policy. Creating or editing a guardrail requires the Developer+ role.

3. Rules: type, stage, action

Every rule answers three questions: what to detect (type), where to look (stage), and what to do on a match (action). The engine runs all applicable rules and folds them into one decision.
Seven rule types. The built-ins are deterministic (pure string/regex, no network); the advanced ones call out to a model or vendor and run concurrently.
  • keyword — literal denylist, case-insensitive substring match.
  • regex — an RE2 pattern (linear-time, no backreferences).
  • pii — built-in entity detectors plus your own. See §5.
  • max_chars — caps the character count at a stage.
  • external — delegates to a connected vendor (Aporia, Averta, or your own webhook).
  • llm_judge — a semantic check against a model in your workspace.
  • grounding — scores answer faithfulness against the request’s retrieved sources (RAG).
Three stage settings: input (the request), output (the model’s response), or both. Input rules run before the upstream call; output rules run after the model responds. See input stage and output stage.
Five actions surface in the rule builder:
  • block — reject the call with HTTP 400.
  • mask — redact the match and let the sanitized text through.
  • flag — change nothing about the traffic; record the match only.
  • annotate — leave the text alone but inject a security note upstream (e.g. a CVE advisory before the model answers).
  • spotlight — wrap the matched untrusted text in delimiters and tell the model to treat it as data, not instructions.
See Actions. Use flag to measure a rule on live traffic before you enforce it.
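To make the type / stage / action model concrete, here is a minimal Python sketch of how the deterministic rule types could be evaluated and folded into one decision. It is an illustration, not the gateway's engine: the `Rule` class, the `evaluate` function, the `"allow"` label, and the severity order (block over mask over flag) are all assumptions, since this page does not state how the fold resolves precedence. Python's `re` module stands in for RE2 here, and `max_chars` is assumed to fire when the text exceeds the cap.

```python
import re
from dataclasses import dataclass
from typing import Any, List, Tuple

# Assumed precedence for the fold; the page does not specify one.
SEVERITY = {"flag": 0, "mask": 1, "block": 2}


@dataclass
class Rule:
    name: str
    type: str    # "keyword", "regex", or "max_chars"
    stage: str   # "input", "output", or "both"
    action: str  # "block", "mask", or "flag"
    value: Any   # keyword list, pattern string, or integer cap


def matches(rule: Rule, text: str) -> bool:
    if rule.type == "keyword":
        # Literal denylist, case-insensitive substring match.
        lowered = text.lower()
        return any(word.lower() in lowered for word in rule.value)
    if rule.type == "regex":
        # Python's re is a stand-in; the gateway uses RE2 syntax.
        return re.search(rule.value, text) is not None
    if rule.type == "max_chars":
        return len(text) > rule.value
    raise ValueError(f"unsupported rule type in this sketch: {rule.type}")


def evaluate(rules: List[Rule], stage: str, text: str) -> Tuple[str, List[str]]:
    """Run every rule that applies at `stage` and fold the results."""
    fired = [r for r in rules if r.stage in (stage, "both") and matches(r, text)]
    if not fired:
        return "allow", []
    strongest = max(fired, key=lambda r: SEVERITY[r.action])
    return strongest.action, [r.name for r in fired]


rules = [
    Rule("no-codename", "keyword", "input", "block", ["Project Falcon"]),
    Rule("ticket-ids", "regex", "both", "flag", r"TICKET-\d+"),
    Rule("size-cap", "max_chars", "input", "block", 50),
]
```

With these hypothetical rules, a short prompt that mentions only a ticket ID is flagged and passes through, while one that also names the codename is blocked and both rule names are reported.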

4. How a guardrail attaches and resolves

A guardrail binds to a key via guardrail_id, or a workspace can mark one guardrail as its default. For any request the gateway resolves in this order:
  1. Explicit attachment — if the key’s guardrail_id points at a guardrail that exists and is enabled, that one applies. An explicit attachment never falls back: disabling it is the off switch.
  2. Workspace default — if the key has no attachment, the enabled default guardrail applies.
  3. Neither — no enforcement; the request is byte-identical to a workspace that never turned the feature on.
This differs from the firewall. A disabled attached firewall policy falls back to the workspace default; a disabled attached guardrail goes to none. The off switch is literal for guardrails.
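The resolution order above can be sketched as a small Python function. The `Guardrail` class and the `resolve` signature are hypothetical names for illustration. The sketch also assumes that an attachment pointing at a guardrail that no longer exists resolves to none, in line with "an explicit attachment never falls back"; the page states this outright only for the disabled case.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Guardrail:
    id: str
    enabled: bool = True


def resolve(
    key_guardrail_id: Optional[str],
    guardrails: Dict[str, Guardrail],
    workspace_default_id: Optional[str],
) -> Optional[Guardrail]:
    """Return the guardrail that applies to a request, or None."""
    # 1. Explicit attachment: applies only if it exists and is enabled.
    #    It never falls back to the workspace default.
    if key_guardrail_id is not None:
        attached = guardrails.get(key_guardrail_id)
        if attached is not None and attached.enabled:
            return attached
        return None
    # 2. Workspace default: used only when the key has no attachment.
    if workspace_default_id is not None:
        default = guardrails.get(workspace_default_id)
        if default is not None and default.enabled:
            return default
    # 3. Neither: no enforcement.
    return None


workspace = {
    "pii-shield": Guardrail("pii-shield", enabled=False),
    "baseline": Guardrail("baseline", enabled=True),
}
```

Here a key attached to the disabled `pii-shield` gets no enforcement even though `baseline` is the enabled default, whereas a key with no attachment picks up `baseline`.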
Walkthroughs: create your first guardrail, attach to a key, set an account default.

5. PII detectors

A pii rule ships a closed set of built-in detectors: email, phone, credit_card, ssn, ip, iban, mac_address, jwt, aws_access_key, api_key_openai, bitcoin_address — plus the regional jp_mynumber, kr_rrn, and cn_resident_id. On a mask action each match becomes a typed tag — an email renders as [EMAIL], an SSN as [SSN]. You can layer up to 25 custom entities per rule (a regex with optional Luhn checksum), and route different entities to different actions in one rule via per-entity overrides.
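As a rough illustration of typed-tag masking and a Luhn-checked custom entity, consider the Python sketch below. The regular expressions are simplified stand-ins, not the gateway's built-in detectors, and the function names, the `LOYALTY_CARD` entity, and the tag format for custom entities are assumptions made for the example.

```python
import re

# Simplified stand-in patterns; the real detectors are not documented here.
DETECTORS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum over a string of decimal digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_builtin(text: str, entities) -> str:
    """Replace each match with a typed tag such as [EMAIL] or [SSN]."""
    for name in entities:
        text = DETECTORS[name].sub(f"[{name.upper()}]", text)
    return text


def mask_custom(text: str, pattern: "re.Pattern", tag: str, checksum: bool = False) -> str:
    """Mask a custom entity; with checksum=True only Luhn-valid matches are masked."""
    def replace(match):
        digits = re.sub(r"\D", "", match.group())
        if checksum and not luhn_ok(digits):
            return match.group()  # fails the checksum, leave untouched
        return f"[{tag}]"
    return pattern.sub(replace, text)
```

For example, `mask_builtin("Reply to jane@acme.com please", ["email", "ssn"])` yields the rewritten prompt from the example in section 2, and a 16-digit custom pattern with `checksum=True` masks `4111111111111111` but leaves `4111111111111112` alone because it fails the Luhn check.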
The turnkey starting point is the PII Shield preset — a single pii rule, mask, stage both. Input-stage masking rewrites the request before the model (streaming or not); output masking rewrites the response on non-streaming responses only — in-stream output rewriting is on the roadmap. See PII Shield, custom entities, and masking formats.

6. The preset picker

New guardrail opens into a template. Presets are authored server-side, so the console, the sandbox, and these docs describe the same behavior. The picker groups them into categories:
| Category | Example presets | Spoke |
| --- | --- | --- |
| pii / secrets | PII Shield, secret-credential blockers | block secrets |
| safety | prompt-injection, jailbreak, self-harm | prompt injection |
| compliance | GDPR, PCI, HIPAA, compliance logger | compliance logger |
| brand / cost | profanity, competitor mentions, size caps | brand safety · cost |
| agent | URL / shell-tool / SQL-in-output filters | agentic |
| code_security | secret-file blocks, copyleft-license review | code security |
A preset is a seed, not a lock — apply it, then edit freely. More starting points live under templates.

7. When a guardrail blocks

A blocked request returns HTTP 400 with error code guardrail_blocked and a message naming the guardrail and rule that fired.
  • No quota is charged. An input-stage block fires before metering; an output-stage block refunds the pre-consumed quota.
  • The request is marked skip-retry — re-running the same prompt would just block again, so the gateway won’t waste a retry on another channel.
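On the client side, the two points above suggest treating a guardrail block as a terminal outcome rather than a transient failure. The Python sketch below shows one way to do that. The JSON envelope `{"error": {"code": ..., "message": ...}}` is an assumption, as are the function names; this page confirms only the HTTP 400 status, the `guardrail_blocked` code, and that the message names the guardrail and rule.

```python
import json


def classify(status: int, body: str) -> str:
    """Classify a gateway response as 'ok', 'blocked', or 'error'."""
    if 200 <= status < 300:
        return "ok"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "error"
    # Assumed envelope: {"error": {"code": "...", "message": "..."}}
    error = payload.get("error") if isinstance(payload, dict) else None
    code = error.get("code") if isinstance(error, dict) else None
    if status == 400 and code == "guardrail_blocked":
        return "blocked"
    return "error"


def should_retry(status: int, body: str) -> bool:
    """Never retry a guardrail block: the same prompt would block again."""
    if classify(status, body) == "blocked":
        return False
    # Retrying only server-side failures is this sketch's own policy choice.
    return status >= 500
```

A caller would typically surface the block message to the user or log it, then move on, instead of looping on the same prompt.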
On streaming, block is enforced best-effort — a scanner buffers a small lookahead and cuts the stream when a rule fires, so already-flushed bytes can’t be retracted. Mask on output applies to non-streaming responses only — on a streaming response the gateway computes the mask but does not forward the redacted text; in-stream output rewriting is on the roadmap. (Input-stage masking is live on streaming and non-streaming alike.) See the guardrail_blocked error and streaming coverage.
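The best-effort nature of streaming block is easier to see in a toy scanner. The Python sketch below holds back a small lookahead window, flushes everything older, and cuts the stream when a banned phrase appears in the buffer. It is a simplified model under stated assumptions: `scan_stream`, its parameters, and the single-phrase match are hypothetical, and the gateway's real scanner and window size are not described on this page.

```python
from typing import Iterable, Iterator


def scan_stream(chunks: Iterable[str], banned: str, lookahead: int) -> Iterator[str]:
    """Yield text that is safe to flush; stop as soon as `banned` is seen.

    A lookahead of at least len(banned) - 1 characters catches a phrase
    that is split across chunk boundaries.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if banned in buffer:
            # Cut the stream. Text yielded earlier has already gone out.
            return
        keep_from = len(buffer) - lookahead
        if keep_from > 0:
            yield buffer[:keep_from]   # flush everything older than the window
            buffer = buffer[keep_from:]
    if buffer:
        yield buffer                   # stream ended cleanly; flush the tail


flushed = list(scan_stream(["The pass", "word is hun", "ter2 ok"], "hunter2", 6))
```

In this run the phrase straddles two chunks and is still caught, but `"The password "` has already been flushed by the time the rule fires, which is exactly why the block cannot retract earlier bytes.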

8. After it’s live

Matches feed

Every rule that fires records type, action, stage, and detail. Group, filter, export, and drill into a single match.

Logging & privacy

The matched substring is recorded only when Log raw content is on — off by default, the privacy-conservative posture.

Versioning

Every change writes a history row. Diff any two versions and revert as a new version — history is never mutated.

Testing & eval

A sandbox Test tab evaluates the current policy with no upstream call, and an eval harness scores it against bundled or custom corpora.
A false positive is a tuning signal, not a reason to disable the rule. Mark it in the Matches feed and narrow the pattern — see tune false positives.

9. Where to go next

Guardrails — every field, every route, the LLM-judge and grounding rules, and external vendors in depth.