Sensitive-word and banned-term filtering

You have a list of terms that must never reach a model or come back from one: a competitor’s name, an internal code name, a banned slur, a product that isn’t announced yet. The fastest control for that is a keyword denylist, a list of literal terms the gateway scans for on every call and then blocks, masks, or flags. This page focuses on the banned-term use case. For the full guardrail engine (every rule type, field, and route), see the Guardrails reference.

1. The sensitive-word filter use case

A keyword rule is the simplest rule in the engine: you give it a list of terms, and the gateway matches any of them against the text at the chosen stage. Matching is a case-insensitive substring test: BadWord, badword, and BADWORD all match, and a term matches even when it’s embedded in a longer word (so the term class also matches inside classic). Each term is treated as a literal string, not a pattern, so there are no regex metacharacters to escape. Save the rule once in the console, attach the guardrail to any API key (or make it the workspace default), and every call on that key is screened with no SDK change and no redeploy. The policy lives in the gateway, not in your application; your app keeps calling /v1/chat/completions exactly as before.
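The matching semantics described above can be approximated locally. The sketch below is a minimal illustration, not the gateway's implementation; find_matches is a hypothetical helper name.

```python
def find_matches(text: str, keywords: list[str]) -> list[str]:
    """Return the keywords found in text (case-insensitive substring match).

    Hypothetical helper: mirrors the documented behaviour, not the gateway's code.
    """
    haystack = text.casefold()
    # Each keyword is a literal string, so "a.b" needs no escaping.
    return [kw for kw in keywords if kw and kw.casefold() in haystack]


print(find_matches("Ask about BADWORD today", ["badword"]))  # ['badword']
print(find_matches("A classic example", ["class"]))          # ['class']
print(find_matches("Price is a.b", ["a.b", "axb"]))          # ['a.b']
```

The second call shows the substring behaviour that most often causes false positives: the entry class fires on classic.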

Reach for a keyword rule when your denylist is a finite set of literal terms. When you need wildcards, word boundaries, or structure (a SKU format, an order-number shape), use a regex detector instead.
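To see why word boundaries call for a regex, compare a substring check with a \b-anchored pattern in Python's re module. This illustrates the general technique only; the gateway's regex detector may use a different regex dialect.

```python
import re

term = "class"
samples = ["a classic example", "the class starts at nine"]

# Substring check, as a keyword rule does: both samples match.
substring_hits = [s for s in samples if term.lower() in s.lower()]

# Word-boundary pattern: only the standalone word matches.
# re.escape keeps the term literal inside the pattern.
pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
boundary_hits = [s for s in samples if pattern.search(s)]

print(substring_hits)  # ['a classic example', 'the class starts at nine']
print(boundary_hits)   # ['the class starts at nine']
```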

2. Author the rule in the console

Every step here is a console action under your own session. Creating and editing guardrails requires the Developer role or higher (Developer+) in the workspace. Only the final /v1/* call uses an sk-orca-... relay key.

Create a guardrail

In the console, open Guardrails and click New guardrail. Name it (≤ 64 chars), e.g. banned-terms.

Add a keyword rule

Add one rule:

Type: Keyword denylist (keyword)
Stage: Both (request and response)
Action: Block
Keywords: your banned terms, one per row

Save.

Test it

Open the Test tab, paste a sample that contains a banned term, pick a stage, and run the policy locally — no upstream call, no quota (see §5).

Attach a key

Edit an API key and pick banned-terms from the Guardrail dropdown (sets guardrail_id on the key), or mark the guardrail the workspace default. See Attach to a key and Account default.

The rule’s JSON is exactly what you’d expect:

{
  "type": "keyword",
  "stage": "both",
  "action": "block",
  "keywords": ["project-orca", "competitor-name", "unannounced-sku"]
}
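Before pasting a rule into the console, a quick local check can catch typos in the enumerated fields. The sketch assumes the stage and action values named on this page (input, output, both; block, mask, flag, spotlight) are the full set, which the Guardrails reference should confirm. validate_keyword_rule is a hypothetical helper, not part of any OrcaRouter SDK.

```python
import json

# Values taken from this page; treat them as an assumption, not a schema.
STAGES = {"input", "output", "both"}
ACTIONS = {"block", "mask", "flag", "spotlight"}


def validate_keyword_rule(rule: dict) -> list[str]:
    """Return a list of problems found in a keyword rule (empty if none)."""
    problems = []
    if rule.get("type") != "keyword":
        problems.append("type must be 'keyword'")
    if rule.get("stage") not in STAGES:
        problems.append(f"unknown stage: {rule.get('stage')!r}")
    if rule.get("action") not in ACTIONS:
        problems.append(f"unknown action: {rule.get('action')!r}")
    keywords = rule.get("keywords")
    if not isinstance(keywords, list) or not keywords:
        problems.append("keywords must be a non-empty list")
    elif any(not isinstance(k, str) or not k.strip() for k in keywords):
        problems.append("every keyword must be a non-empty string")
    return problems


rule = json.loads("""
{
  "type": "keyword",
  "stage": "both",
  "action": "block",
  "keywords": ["project-orca", "competitor-name", "unannounced-sku"]
}
""")
print(validate_keyword_rule(rule))  # []
```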

3. Pick the action

Each keyword rule carries exactly one action:

Block — reject the call

Any match rejects the request with HTTP 400 guardrail_blocked. A blocked request costs no quota — an input-stage block fires before metering; an output-stage block refunds the pre-consumed quota — and it’s marked skip-retry. Use it for terms that must never pass in either direction. See the guardrail_blocked error.

Mask — redact the term

Each match is replaced in place with a redaction tag and the request continues with the sanitized text — the upstream model never sees the original term. See Actions.

Flag — observe only

Records a match and changes nothing about the traffic. Use it to measure how often a term appears before you switch to enforcement.

Spotlight — wrap as untrusted data (input)

Wraps the matched text in delimiters (e.g. ⟦UNTRUSTED⟧…⟦/UNTRUSTED⟧) so the model treats it as data, not instructions — an input-stage prompt-injection defense. The text still reaches the model, just fenced off. See Actions.

Stage matters. input scans the caller’s request, output scans the model’s response, both scans each side independently. A banned term your users type and one a model might emit are different problems — pick the stage(s) that fit. See Input-stage rules and Output-stage rules.
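The effect of each action on a piece of text can be sketched as a pure function. This is an illustration under stated assumptions: the [REDACTED] tag is a placeholder (this page does not specify the gateway's redaction tag), the spotlight delimiters are the example ones quoted above, the sketch wraps only the matched term (the gateway may fence a wider span), and apply_action is a hypothetical name.

```python
import re
from typing import Optional


def apply_action(text: str, keywords: list[str], action: str) -> tuple[str, Optional[str]]:
    """Return (verdict, text) for one keyword rule.

    verdict is "pass", "blocked", "masked", "flagged", or "spotlighted";
    text is None when the call is blocked.
    """
    # Longest terms first so overlapping entries prefer the longer match.
    terms = sorted((k for k in keywords if k), key=len, reverse=True)
    if not terms:
        return "pass", text
    # Escape each term so it stays a literal; ignore case like the keyword rule.
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    if not pattern.search(text):
        return "pass", text
    if action == "block":
        return "blocked", None
    if action == "mask":
        # "[REDACTED]" is a placeholder tag, not the gateway's actual tag.
        return "masked", pattern.sub("[REDACTED]", text)
    if action == "spotlight":
        wrapped = pattern.sub(lambda m: f"⟦UNTRUSTED⟧{m.group(0)}⟦/UNTRUSTED⟧", text)
        return "spotlighted", wrapped
    if action == "flag":
        return "flagged", text  # traffic unchanged; the match is only recorded
    raise ValueError(f"unknown action: {action!r}")


sample = "Summarize Project-Orca for me"
for action in ("block", "mask", "flag", "spotlight"):
    print(action, apply_action(sample, ["project-orca"], action))
```

Only block withholds the text entirely; mask and spotlight rewrite it, and flag leaves it untouched.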

4. Streaming coverage

The action you pick interacts with whether the response streams:

Action	Non-streaming	Streaming
`block` (output)	Enforced	Enforced — scanner cuts the stream
`mask` (output)	Enforced	Not yet — block decision honored, masked text not forwarded (roadmap)

Input-stage rules run before the upstream call, so they’re unaffected by streaming — an input mask sanitizes the request whether or not the response streams. A banned-term block gets full coverage either way. An output mask, however, only redacts on non-streaming responses today: on a streaming reply the scanner still acts on the block decision, but in-band rewriting of the streamed text is on the roadmap, not live. See Streaming coverage.
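A toy scanner helps show why cutting a stream is simpler than rewriting it. The sketch below is a generic technique, not OrcaRouter's implementation: it keeps a short tail of already-seen text so a term split across two chunks is still detected, and it stops forwarding at the first match. scan_stream is a hypothetical name.

```python
from typing import Iterable, Iterator


def scan_stream(chunks: Iterable[str], keywords: list[str]) -> Iterator[str]:
    """Forward chunks until a keyword appears, then stop (cut the stream)."""
    terms = [k.casefold() for k in keywords if k]
    # Keep enough trailing text to catch a term split across chunk boundaries.
    keep = max((len(t) for t in terms), default=1) - 1
    tail = ""
    for chunk in chunks:
        window = tail + chunk.casefold()
        if any(t in window for t in terms):
            return  # block: nothing further is forwarded
        yield chunk
        tail = window[-keep:] if keep > 0 else ""


chunks = ["Let me tell you about Proj", "ect-Orca and more", " text"]
print(list(scan_stream(chunks, ["project-orca"])))
# ['Let me tell you about Proj']
```

Note that the first chunk was already forwarded before the match completed. Stopping the stream needs no rewriting of sent text, whereas masking in-band would require holding text back until it is known to be clean.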

5. Test before you attach

Prove the rule does what you expect before any key points at it. Open the Test tab inside the editor, paste a sample, pick the stage, and run:

Tell me about Project-Orca and our competitor-name

The sandbox evaluates the current policy locally and returns the verdict — nothing is sent upstream, nothing is metered. With a block action the sample is rejected; with mask the rendered text comes back with each term redacted. For an A/B grid against a corpus — to confirm a denylist catches what it should without flagging benign traffic — the Eval harness lives one tab over.
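Alongside the console's Test tab and Eval harness, a rough offline pass over a labelled corpus can surface substring false positives before you save the rule. The sketch reimplements the documented case-insensitive substring match locally; it is not the Eval harness, and the corpus and the evaluate helper are made up for illustration.

```python
def matches(text: str, keywords: list[str]) -> bool:
    """Case-insensitive literal substring match, as documented for keyword rules."""
    lowered = text.casefold()
    return any(k.casefold() in lowered for k in keywords if k)


def evaluate(corpus: list[tuple[str, bool]], keywords: list[str]) -> dict[str, list[str]]:
    """Split a labelled corpus into misses and false positives.

    Each corpus item is (text, should_match).
    """
    report: dict[str, list[str]] = {"missed": [], "false_positive": []}
    for text, should_match in corpus:
        hit = matches(text, keywords)
        if should_match and not hit:
            report["missed"].append(text)
        elif hit and not should_match:
            report["false_positive"].append(text)
    return report


corpus = [
    ("Tell me about Project-Orca and our competitor-name", True),
    ("What is the weather today?", False),
    ("Our orchestra plays tonight", False),
]
print(evaluate(corpus, ["project-orca", "orc"]))
# {'missed': [], 'false_positive': ['Our orchestra plays tonight']}
```

Here the overly short entry orc fires on orchestra, which is the kind of entry to tighten or move to a regex detector.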

6. Send a request

Using a key bound to banned-terms, call OrcaRouter exactly as before — no new headers, no SDK change:

curl https://api.orcarouter.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-orca-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "Summarize Project-Orca for me"}
    ]
  }'

With a block action the call is rejected with HTTP 400 guardrail_blocked before it ever reaches the model. Swap the action to mask and the term is redacted in place before forwarding instead.
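The same call can be made from Python using only the standard library. The endpoint, model, and headers are copied from the curl example; the shape of the error body is not documented on this page, so the sketch only checks the HTTP status and looks for the string guardrail_blocked in the raw body, which is an assumption. ORCA_API_KEY is a hypothetical environment variable name, and build_request and send are illustrative helpers.

```python
import json
import os
import urllib.error
import urllib.request

URL = "https://api.orcarouter.ai/v1/chat/completions"


def build_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build the same POST the curl example sends."""
    payload = {
        "model": "openai/gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(prompt: str) -> str:
    """Send the prompt; return the raw response body or a short block notice."""
    # ORCA_API_KEY is a hypothetical variable name for your sk-orca-... key.
    req = build_request(os.environ["ORCA_API_KEY"], prompt)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        body = err.read().decode("utf-8", errors="replace")
        # Assumption: the error code appears verbatim somewhere in the body.
        if err.code == 400 and "guardrail_blocked" in body:
            return "blocked by guardrail"
        raise
```

Calling send("Summarize Project-Orca for me") with a key bound to banned-terms should take the blocked branch when the rule's action is block. Treating the block as a normal outcome instead of retrying fits the skip-retry marking described in section 3.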

7. See what fired

Every rule that fires records a match — rule type, action, stage, and a detail string (for keyword rules, how many terms matched) — surfaced in the workspace Matches feed.

The matched term itself is recorded only when Log raw content is on. It is off by default, the privacy-conservative posture. With it off you still see that a keyword rule fired and how often, just not the literal term. Turn it on per guardrail when you need the substring for triage; the setting is not retroactive, so matches recorded while it was off stay without the term. See Matches feed and Logging & privacy.

If a benign term keeps matching (a denylist entry that’s a substring of a common word), mark it a false positive from the Matches feed and tighten the entry. See Tune false positives.

8. Where to go next

Regex detectors

Match structured patterns — SKUs, order numbers, formats — when a literal denylist isn’t enough.

Brand safety

Profanity, competitor mentions, and child-safety presets built on keyword rules.

Actions

How block, mask, flag, and spotlight differ and when to use each.

Guardrails reference

The complete engine — every rule type, field, and route.

A keyword denylist governs content. To govern an agent’s tool calls — deny destructive actions, redact tool-call arguments, require approval — use the Firewall. For fuzzy policies no literal list can express (toxicity, off-topic, injection intent), an llm_judge rule runs a semantic check against a workspace model.
