Brand and tone safety

You run AI in front of customers, and your brand is on the line. A support bot must never swear, a marketing copilot must never name a competitor, and nothing in your traffic should touch child-safety terms. Brand and tone safety is the fastest way to enforce all three: the Brand guardrail preset category ships keyword denylists you attach to a key, and the gateway screens every call against them before it reaches OpenAI, Anthropic, or Google. This is a focused landing page for the brand-safety use case. For the full engine, including every rule type, field, and route, see the Guardrails reference.

1. Brand safety AI in one preset

The Brand category in the guardrail template picker is a set of keyword denylists. Each preset is a single keyword rule you apply in one click and then edit — swap the seed terms for your own list. There is no model call, no network hop, and no SDK change: the policy lives in the gateway, and your app keeps calling /v1/chat/completions exactly as before.

Profanity

A denylist that blocks swearing or banned terms on the request — or a mask variant that redacts them instead.

Competitor mentions

Blocks (or flags) any mention of names you list — keep a copilot from talking up the competition.

Child safety

A conservative denylist for child-safety terms you populate from your own standards, blocked on the request.

All three are deterministic keyword matches: case-insensitive substring scans that run on the request before the upstream call. They cost nothing extra and never wait on a model call.
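To make "case-insensitive substring scan" concrete, here is a minimal local sketch of that kind of check. It is an illustration only, not the gateway's implementation: the function name first_denylist_hit is hypothetical, and the use of str.casefold for case folding is an assumption about how case is normalized.

```python
from typing import Optional


def first_denylist_hit(text: str, denylist: list[str]) -> Optional[str]:
    """Return the first denylist term found in text, ignoring case, or None.

    Hypothetical helper: a plain substring scan, with no tokenization
    and no word boundaries.
    """
    haystack = text.casefold()
    for term in denylist:
        needle = term.casefold()
        # Skip empty entries: an empty string is a substring of everything.
        if needle and needle in haystack:
            return term
    return None


if __name__ == "__main__":
    print(first_denylist_hit("Write a tweet praising ACME over us", ["Acme"]))
    print(first_denylist_hit("Write a tweet about our launch", ["Acme"]))
```

The first call finds Acme despite the different casing; the second finds nothing. Because the scan is pure string work, it needs no model and no network, which is why this style of rule is cheap to run on every request.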

2. The Brand presets, exactly as shipped

Open the New guardrail split-button in the console Guardrails view and pick the Brand template category. Five seeds live there:

Profanity / Brand Safety (block)

A single keyword rule, stage input, action block. Ships with placeholder terms — edit the list to your real banned words, competitor names, or off-limits phrases. A match returns HTTP 400 guardrail_blocked before the prompt leaves the gateway.

Profanity Filter (mask)

Same denylist, but action mask and stage both — denylisted words are replaced with [REDACTED] instead of rejecting the call. The softer alternative when you want the request to go through cleaned rather than refused.
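As a rough picture of what masking does to a prompt, the sketch below replaces every denylisted substring with [REDACTED], ignoring case. The helper name mask_terms and the regex-based approach are assumptions made for illustration; the source only states that denylisted words are replaced with [REDACTED].

```python
import re


def mask_terms(text: str, denylist: list[str], placeholder: str = "[REDACTED]") -> str:
    """Replace each case-insensitive occurrence of a denylist term.

    Hypothetical helper, not the gateway's code.
    """
    # Longest terms first, so a long term wins over a shorter one it contains.
    terms = sorted((t for t in denylist if t), key=len, reverse=True)
    if not terms:
        return text
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    # A function replacement keeps the placeholder literal (no backslash escapes).
    return pattern.sub(lambda _match: placeholder, text)


if __name__ == "__main__":
    print(mask_terms("That is a DARN shame", ["darn"]))
```

Note that this is still a substring replacement: with darn on the list, darnit would become [REDACTED]it, so specific entries matter for mask rules as much as for block rules.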

Profanity Multilingual

A keyword block rule seeded with per-market placeholders (zh, es, fr, de, ja, ar). Replace each with the region-specific terms your policy bans — the seed terms are deliberately generic.

Competitor Mentions

A keyword rule, stage input, action block, seeded with a single placeholder. Add your competitor names; switch the action to flag to monitor mentions without rejecting traffic.

Child Safety Keywords

A conservative keyword denylist, stage input, action block. The seed is an intentional placeholder — populate it with the exact terms from your own safety policy or standards before you rely on it.

A preset is a seed, not a lock. Every Brand preset ships with placeholder terms so the rule is valid out of the box, and you are expected to edit the denylist for your brand before attaching the guardrail to a key. The presets intentionally do not ship real banned-word or child-safety lists.

3. Apply a Brand preset in the console

Every step here is a console action under your own session. Creating and editing guardrails requires Developer+ in the workspace. Only the final /v1/* call uses an sk-orca-... relay key.

Open the template

In the console, open Guardrails, click the New guardrail split-button, and pick Competitor Mentions (or any Brand preset) from the Brand template category.

Edit the denylist

Replace the seed placeholder with your real terms — e.g. your competitors’ names. Give the guardrail a name (≤ 64 chars), like brand-safety, and save.

Test it

Open the Test tab, paste a sample at the input stage, and run the policy locally — no upstream call, no quota (see §5).

Attach a key

Edit an API key and pick brand-safety from the Guardrail dropdown (sets guardrail_id on the key), or mark it the workspace default. See Attach to a key and Account default.

4. One concrete example

A competitor-mention guardrail named brand-safety is attached to a key. The seed placeholder has been replaced with the real name Acme. Call the gateway exactly as before — no new headers:

curl https://api.orcarouter.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-orca-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "Write a tweet praising Acme over us"}
    ]
  }'

The keyword rule matches Acme on the request, and the gateway rejects the call with HTTP 400 guardrail_blocked — naming the guardrail and rule that fired — before anything reaches the upstream model.

A block verdict costs no quota. An input-stage block fires before usage is metered, and the request is marked skip-retry — re-running the same prompt against another channel would just block again. See the guardrail_blocked error.
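On the client side, the practical consequence is that a guardrail block should be treated as final rather than retried. The sketch below shows one way a caller might classify responses. It assumes only that the literal string guardrail_blocked appears somewhere in the 400 response body; the exact body shape is not given here, and the sample body, the function names, and the "retry on 429 or 5xx" convention are all illustrative assumptions.

```python
def is_guardrail_block(status: int, body: str) -> bool:
    """True when a response looks like a guardrail rejection.

    Assumption: the body of the HTTP 400 contains the text 'guardrail_blocked'.
    """
    return status == 400 and "guardrail_blocked" in body


def should_retry(status: int, body: str) -> bool:
    """Hypothetical client retry policy.

    A guardrail block is deterministic, so the same prompt would block again.
    Retrying 429 and 5xx responses is a common client convention, not
    something this page specifies.
    """
    if is_guardrail_block(status, body):
        return False
    return status == 429 or status >= 500


if __name__ == "__main__":
    # Hypothetical body shape, for illustration only.
    sample_body = '{"error": {"code": "guardrail_blocked"}}'
    print(should_retry(400, sample_body))
    print(should_retry(503, "upstream unavailable"))
```

A caller that wraps the gateway in its own retry loop can use a check like this to surface the block to the user (or to a fallback message) instead of resending the same prompt.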

Prefer mask over block for profanity when you'd rather clean the prompt than refuse it: denylisted words are replaced with [REDACTED] and the request goes through. Prefer flag for competitor mentions when you want to measure exposure before you start blocking. The Actions page covers the full block / mask / flag trade-off.

5. Test before you attach

Prove the denylist does what you expect before any key points at it. Open the Test tab inside the editor, paste a sample, pick the input stage, and run:

Write a tweet praising Acme over us

The sandbox evaluates the current policy locally and returns the verdict — nothing is sent upstream, nothing is metered. For a sweep against a corpus of phrasings, the Eval harness lives one tab over.

A keyword match is a case-insensitive substring scan, so class would also match inside classic. Keep denylist entries specific, and tune false positives from the Matches feed once you see real traffic.
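Before pasting samples into the Test tab, it can help to sweep a small corpus locally and see which phrasings a draft denylist would catch. The sketch below is a stand-in for that idea, not the Eval harness itself: the sweep function, the corpus, and the denylist entries are all hypothetical, and the matching follows the case-insensitive substring behavior described above.

```python
def sweep(corpus: list[str], denylist: list[str]) -> dict[str, list[str]]:
    """Map each sample to the denylist terms it contains (substring, any case).

    Hypothetical local check for drafting a denylist.
    """
    report: dict[str, list[str]] = {}
    for sample in corpus:
        folded = sample.casefold()
        report[sample] = [t for t in denylist if t and t.casefold() in folded]
    return report


if __name__ == "__main__":
    corpus = [
        "Write a tweet praising Acme over us",
        "A classic launch announcement",
        "Compare us with ACME",
    ]
    for sample, hits in sweep(corpus, ["Acme", "class"]).items():
        verdict = "block" if hits else "pass"
        print(f"{verdict:5} {hits} | {sample}")
```

The second sample is caught only because class appears inside classic, which is exactly the kind of false positive to remove from the list before a key points at the guardrail.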

6. See what fired

Every rule that fires records a match — rule type, action, stage, and a detail string — surfaced in the workspace Matches feed (GET /api/guardrail/match, Member). The matched substring itself (the banned word, the competitor name) is recorded only when Log raw content is on, which is off by default.

For a child-safety denylist, leaving Log raw content off is usually the point: you can see that a term was blocked, and how often, without copying the term back into your own telemetry. Turn it on per guardrail only when you need the substring for triage. The setting is not retroactive, so matches recorded while it was off stay without the substring. See Matches feed and Logging & privacy.
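The effect of Log raw content can be pictured as a match record whose matched substring is filled in only when the setting is on. The sketch below is purely illustrative: the MatchRecord class, its field names, and the record_match helper are hypothetical and do not describe the actual schema returned by the Matches feed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchRecord:
    """Hypothetical shape of one recorded match (field names are assumptions)."""
    rule_type: str
    action: str
    stage: str
    detail: str
    matched_text: Optional[str] = None  # stays None unless raw logging is on


def record_match(term: str, *, action: str, stage: str,
                 log_raw_content: bool = False) -> MatchRecord:
    """Build a record, keeping the matched substring only when opted in."""
    return MatchRecord(
        rule_type="keyword",
        action=action,
        stage=stage,
        detail="keyword denylist match",
        matched_text=term if log_raw_content else None,
    )


if __name__ == "__main__":
    print(record_match("Acme", action="block", stage="input"))
    print(record_match("Acme", action="block", stage="input", log_raw_content=True))
```

With the default, the record still says that a keyword rule blocked something at the input stage, which is enough to count and trend blocks without storing the term itself.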

Every edit to a Brand guardrail writes a versioned history row in the same transaction — diff any two versions and revert from the History view. See Versioning.

7. Where to go next

Sensitive-word filters

The keyword-denylist mechanics behind every Brand preset, in depth.

Block secrets

Catch API keys and credentials with the Secrets Blocker preset.

Tune false positives

Mark false positives and tighten denylists from the Matches feed.

Templates

The full preset library across every category.

Brand presets gate content. To stop a model that’s been steered off-brand by a malicious prompt, pair them with the Prompt-injection guardrail and the jailbreaks threat. For the complete engine — stages, advanced rules, and routes — read the Guardrails reference.
