1. The sensitive word filter ai use case
Akeyword rule is the simplest rule in the engine: you give it a list
of terms, and the gateway matches any of them against the text at a
stage. Matching is case-insensitive substring — BadWord,
badword, and BADWORD all match, and the term matches even when it’s
embedded in a longer word (so class also matches classic). Each term
is treated as a literal string, not a pattern; you don’t escape regex
metacharacters.
Save the rule once in the console, attach the guardrail to any API key
(or make it the workspace default), and every call on that key is
screened with no SDK change and no redeploy. The policy lives in the
gateway, not your application — your app keeps calling
/v1/chat/completions exactly as before.
2. Author the rule in the console
Every step here is a console action under your own session. Creating and editing guardrails requires Developer+ in the workspace. Only the final/v1/* call uses an sk-orca-... relay key.
Create a guardrail
In the console, open Guardrails and click New guardrail. Name
it (≤ 64 chars), e.g.
banned-terms.Add a keyword rule
Add one rule:
- Type: Keyword denylist (
keyword) - Stage: Both (request and response)
- Action: Block
- Keywords: your banned terms, one per row
Test it
Open the Test tab, paste a sample that contains a banned term,
pick a stage, and run the policy locally — no upstream call, no quota
(see §5).
Attach a key
Edit an API key and pick
banned-terms from the Guardrail
dropdown (sets guardrail_id on the key), or mark the guardrail the
workspace default. See
Attach to a key and
Account default.3. Pick the action
A keyword rule chooses one action per rule:Block — reject the call
Block — reject the call
Any match rejects the request with HTTP 400
guardrail_blocked.
A blocked request costs no quota — an input-stage block fires
before metering; an output-stage block refunds the pre-consumed quota
— and it’s marked skip-retry. Use it for terms that must never
pass in either direction. See the
guardrail_blocked error.Mask — redact the term
Mask — redact the term
Each match is replaced in place with a redaction tag and the request
continues with the sanitized text — the upstream model never sees the
original term. See Actions.
Flag — observe only
Flag — observe only
Records a match and changes nothing about the traffic. Use it to
measure how often a term appears before you switch to enforcement.
Spotlight — wrap as untrusted data (input)
Spotlight — wrap as untrusted data (input)
Wraps the matched text in delimiters (e.g.
⟦UNTRUSTED⟧…⟦/UNTRUSTED⟧) so the model treats it as data, not
instructions — an input-stage prompt-injection defense. The text
still reaches the model, just fenced off. See
Actions.Stage matters.
input scans the caller’s request, output scans
the model’s response, both scans each side independently. A banned term
your users type and one a model might emit are different problems — pick
the stage(s) that fit. See
Input-stage rules and
Output-stage rules.4. Streaming coverage
The action you pick interacts with whether the response streams:| Action | Non-streaming | Streaming |
|---|---|---|
block (output) | Enforced | Enforced — scanner cuts the stream |
mask (output) | Enforced | Not yet — block decision honored, masked text not forwarded (roadmap) |
5. Test before you attach
Prove the rule does what you expect before any key points at it. Open the Test tab inside the editor, paste a sample, pick the stage, and run:6. Send a request
Using a key bound tobanned-terms, call OrcaRouter exactly as before —
no new headers, no SDK change:
guardrail_blocked before it ever reaches the model. Swap the action to
mask and the term is redacted in place before forwarding instead.
7. See what fired
Every rule that fires records a match — rule type, action, stage, and a detail string (for keyword rules, how many terms matched) — surfaced in the workspace Matches feed. If a benign term keeps matching (a denylist entry that’s a substring of a common word), mark it a false positive from the Matches feed and tighten the entry. See Tune false positives.8. Where to go next
Regex detectors
Match structured patterns — SKUs, order numbers, formats — when a
literal denylist isn’t enough.
Brand safety
Profanity, competitor mentions, and child-safety presets built on
keyword rules.
Actions
How block, mask, and flag differ and when to use each.
Guardrails reference
The complete engine — every rule type, field, and route.
llm_judge rule runs a semantic check against a
workspace model.