sanitize verdict actually redacts, what it does not,
and which control governs the content a tool returns.
1. What “sanitize” means on the mcp surface
When an agent calls a tool through the MCP gateway, every tools/call is evaluated on the mcp surface before dispatch. A
matching rule can carry one of the authorable
firewall verdicts: allow, audit, deny, sanitize, pending_approval,
or cap_cost. The sanitize verdict is the redacting one:
- It runs a set of secret-shape detectors over the call’s arguments (the JSON the model passed into the tool).
- Each match is replaced with a canonical token like [redacted:openai_key], and the rewritten arguments are what get forwarded to the server.
- The tool still runs: sanitize is a non-blocking, let-through verdict. The agent doesn’t crash; it just never hands the raw secret to the tool.
The detectors cover common secret shapes (sk-prefixed API keys, Bearer tokens, US SSNs, Luhn-valid card numbers,
email addresses), and a rule can add custom regexes whose matches render as
[redacted:custom].
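The redaction step described above can be sketched in a few lines. This is a minimal illustration, not OrcaRouter's implementation: the `DETECTORS` patterns, the detector names other than `openai_key`, and the `sanitize_arguments` helper are all assumptions made for the example.

```python
import json
import re

# Hypothetical detector patterns, for illustration only; the gateway's
# real detectors are not shown in this document.
DETECTORS = {
    "openai_key": re.compile(r"sk-[A-Za-z0-9_-]{20,}"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def sanitize_text(text, custom_patterns=()):
    """Replace every detector match with a canonical [redacted:<name>] token."""
    for name, pattern in DETECTORS.items():
        text = pattern.sub(f"[redacted:{name}]", text)
    # Custom regexes all render with the same token.
    for pattern in custom_patterns:
        text = re.sub(pattern, "[redacted:custom]", text)
    return text


def sanitize_arguments(value, custom_patterns=()):
    """Walk a JSON-like structure and redact its string values."""
    if isinstance(value, str):
        return sanitize_text(value, custom_patterns)
    if isinstance(value, list):
        return [sanitize_arguments(v, custom_patterns) for v in value]
    if isinstance(value, dict):
        return {k: sanitize_arguments(v, custom_patterns) for k, v in value.items()}
    return value  # numbers, booleans, and null pass through unchanged


if __name__ == "__main__":
    args = {"note": "use sk-" + "a" * 24, "cc": ["ops@example.com"], "limit": 5}
    print(json.dumps(sanitize_arguments(args)))
```

Only the string leaves are rewritten, so the shape of the arguments is preserved and the tool still receives a payload it can parse.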
On the inbound surface — the advertised
tools[] a request declares,
before any tool is called — there are no call-time arguments to redact, so
a sanitize verdict there fails closed and escalates to deny. Sanitize
is meaningful only where there’s a live argument payload to rewrite: the
mcp and response surfaces.
2. One concrete rule
Say you want any tool call whose arguments contain an OpenAI-style key to be forwarded with the key scrubbed out, rather than blocked. Author a rule on the mcp surface with a sanitize verdict, configured to detect that
secret shape. Do this from the console (Firewall → policy → rules); the
write requires Developer+.
The rule, conceptually:
| Field | Value |
|---|---|
| Surface | mcp |
| tool_name_glob | * (or scope to one server, e.g. github.*) |
| Verdict | sanitize |
| Sanitize presets | the secret detectors to enable |
sanitize verdict, surface, and matched rule.
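As a rough model of how such a rule might be evaluated, the sketch below matches a call against the rule's surface and tool-name glob, and applies the fail-closed escalation described in section 1. The `Rule` class, its field names, the `openai_key` preset identifier, and the use of shell-style glob matching are assumptions for illustration; the actual rule format is whatever the console and the Firewall rules reference define.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    # Field names mirror the conceptual table; the real schema may differ.
    surface: str
    tool_name_glob: str
    verdict: str
    sanitize_presets: tuple = ()


def effective_verdict(rule, surface, tool_name):
    """Return the verdict this rule yields for a call, or None if it does not match."""
    if rule.surface != surface:
        return None
    if not fnmatchcase(tool_name, rule.tool_name_glob):
        return None
    # On the inbound surface there are no call-time arguments to rewrite,
    # so sanitize fails closed and escalates to deny.
    if rule.verdict == "sanitize" and surface == "inbound":
        return "deny"
    return rule.verdict


if __name__ == "__main__":
    rule = Rule("mcp", "github.*", "sanitize", ("openai_key",))
    print(effective_verdict(rule, "mcp", "github.create_issue"))
    print(effective_verdict(rule, "mcp", "slack.post_message"))
```

A glob of `*` matches every tool, while `github.*` limits the rule to tools whose names start with `github.`.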
3. Tool results are untrusted — govern them on the model reply
Here is the part most “sanitize the output” setups get wrong. The sanitize verdict touches arguments only. A tool’s result (the text
or JSON an MCP server hands back) is never rewritten by a firewall verdict.
OrcaRouter treats tool-result content as untrusted input to the model.
A compromised or poisoned MCP server can return a secret, a PII record, or a
prompt-injection payload dressed up as data. The control for that content is
a guardrail on the output stage — the model’s
reply, evaluated after the model has incorporated the tool result.
Catch secrets that surface in the reply
Attach a guardrail with the Secrets & API-Key Blocker preset
(category
secrets). It blocks AWS / OpenAI / GitHub-style credentials;
pair it with Private Keys & Cloud Tokens for PEM keys, Slack/Stripe
tokens, Google keys, and JWTs. An output-stage block returns
guardrail_blocked (HTTP 400) and refunds the request’s quota.
Redact PII in the reply
The PII Shield preset masks typed entities —
[EMAIL], [SSN],
[CREDIT_CARD], … — rendering matched values as tags. Input-stage
masking is live on every request (streaming or not): it masks the
request before the model sees it. Output-stage masking rewrites the
model reply on non-streaming responses only; in-band rewriting
of a streaming reply is on the roadmap, so a mask rule does not yet
redact a streamed reply.
Neutralize injection riding in tool results
A poisoned result can carry “ignore previous instructions”-style text.
The Prompt-Injection Basics safety preset (keyword/regex) plus an
llm_judge rule that scores for injection intent are the controls here.
See MCP tool poisoning and
Prompt injection.
Output enforcement and streaming. Output-stage block is enforced on
both streaming and non-streaming replies — on a stream, a block cuts the
stream when it matches and emits a generic block notice. Output-stage
mask applies to non-streaming replies only; in-band rewriting of a
streaming reply is on the roadmap, so a mask rule does not yet redact a
streamed reply.
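The block and mask behavior on streaming versus non-streaming replies can be summarized as a small decision function. This is a sketch under stated assumptions: the `PII_PATTERNS` regexes, the `enforce_output` helper, and the status strings other than `guardrail_blocked` are hypothetical, and real guardrail matching is richer than two patterns.

```python
import re

# Hypothetical patterns standing in for typed PII entities.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def mask_pii(text):
    """Render each matched value as its entity tag, e.g. [EMAIL]."""
    for tag, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text


def enforce_output(action, reply, streaming):
    """Return (status, text) for an output-stage rule applied to a model reply."""
    matched = any(p.search(reply) for p in PII_PATTERNS.values())
    if not matched:
        return "pass", reply
    if action == "block":
        # Block is enforced on both streaming and non-streaming replies.
        return "guardrail_blocked", ""
    if action == "mask" and not streaming:
        # Mask rewrites the reply on non-streaming responses only.
        return "masked", mask_pii(reply)
    # Mask on a streamed reply: the reply is not rewritten in-band.
    return "pass", reply


if __name__ == "__main__":
    print(enforce_output("mask", "contact a@b.io", streaming=False))
    print(enforce_output("mask", "contact a@b.io", streaming=True))
```

The practical consequence is that if a streamed reply must never carry a given entity, a block rule is the control that currently applies to it.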
4. Where each control lives
A compact map of the two surfaces, so you wire the right knob to the right risk:
| You want to govern… | Control | Where |
|---|---|---|
| Secrets in a tool call’s arguments | Firewall sanitize verdict (mcp surface) | Firewall rules |
| Secrets / PII / injection in a tool’s result | Guardrail on the output stage | Guardrails |
5. Attaching and observing
Both controls are workspace-scoped, named, and ordered, and both attach the same two ways:
- Per key: set firewall_policy_id (for the sanitize rule) and guardrail_id (for the output policy) on the key the agent uses.
- Workspace default: mark a policy / guardrail as the workspace default so every key inherits it.
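One way to picture the two attachment paths is as a lookup with a fallback. The sketch below assumes that a value set on the key takes precedence over the workspace default; the text above only says keys inherit the default, so that precedence is an assumption. The `ApiKey` and `Workspace` classes and the `default_*` field names are hypothetical; only `firewall_policy_id` and `guardrail_id` come from the document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApiKey:
    firewall_policy_id: Optional[str] = None
    guardrail_id: Optional[str] = None


@dataclass
class Workspace:
    # Hypothetical field names for whatever is marked as the workspace default.
    default_firewall_policy_id: Optional[str] = None
    default_guardrail_id: Optional[str] = None


def resolve_controls(key, workspace):
    """Pick the policy and guardrail that apply to a key.

    Assumption: an ID set on the key wins; otherwise the key inherits
    the workspace default (which may itself be unset).
    """
    return {
        "firewall_policy_id": key.firewall_policy_id or workspace.default_firewall_policy_id,
        "guardrail_id": key.guardrail_id or workspace.default_guardrail_id,
    }


if __name__ == "__main__":
    ws = Workspace(default_firewall_policy_id="fw_default", default_guardrail_id="gr_default")
    print(resolve_controls(ApiKey(guardrail_id="gr_agent"), ws))
```

Resolving the two IDs independently reflects that the sanitize rule and the output guardrail are separate controls: a key can pin one and inherit the other.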
6. Where to go next
Allow-list MCP tools
Default-deny a server and permit only the tools you’ve reviewed.
Firewall rules
The full rule DSL — verdicts, globs, args-match, sanitize config.
Guardrails
Content policies, presets, PII entities, and output-stage enforcement.
MCP tool poisoning
The threat that makes tool results untrusted in the first place.
