Sanitize MCP tool outputs

You connected an MCP server, and now you want the gateway to strip a leaked secret out of a tool call before it reaches the real server — and to keep whatever that tool returns from smuggling a credential (or an injection payload) back into the model. Those are two different jobs, handled by two different controls, and the honest version matters: if you assume one knob covers both, you’ll ship a gap. This page is the focused guide to sanitize mcp output on OrcaRouter — what the firewall sanitize verdict actually redacts, what it does not, and which control governs the content a tool returns.

The sanitize verdict redacts tool-call arguments, never the result a tool returns. It rewrites what your agent sends into a tool. To govern what a tool sends back, you use an output-stage guardrail on the model’s reply — see §3.

1. What “sanitize” means on the mcp surface

When an agent calls a tool through the MCP gateway, every tools/call is evaluated on the mcp surface before dispatch. A matching rule can carry one of the authorable firewall verdicts — allow, audit, deny, sanitize, pending_approval, or cap_cost. The sanitize verdict is the redacting one:

It runs a set of secret-shape detectors over the call’s arguments (the JSON the model passed into the tool).
Each match is replaced with a canonical token like [redacted:openai_key], and the rewritten arguments are what get forwarded to the server.
The tool still runs — sanitize is a non-blocking, let-through verdict. The agent doesn’t crash; it just never hands the raw secret to the tool.

Built-in detectors cover well-known secret shapes (AWS access keys, sk--style API keys, Bearer tokens, US SSN, Luhn-valid card numbers, email), and a rule can add custom regexes whose matches render as [redacted:custom].

On the inbound surface — the advertised tools[] a request declares, before any tool is called — there are no call-time arguments to redact, so a sanitize verdict there fails closed and escalates to deny. Sanitize is meaningful only where there’s a live argument payload to rewrite: the mcp and response surfaces.

2. One concrete rule

Say you want any tool call whose arguments contain an OpenAI-style key to be forwarded with the key scrubbed out, rather than blocked. Author a rule on the mcp surface with a sanitize verdict, configured to detect that secret shape. Do this from the console (Firewall → policy → rules); the write requires Developer+. The rule, conceptually:

Field	Value
Surface	`mcp`
`tool_name_glob`	`` (or scope to one server, e.g. `github.`)
Verdict	`sanitize`
Sanitize presets	the secret detectors to enable

At call time, an argument payload like:

{ "note": "use key sk-AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHH for the upstream" }

is forwarded to the server as:

{ "note": "use key [redacted:openai_key] for the upstream" }

The call succeeds; the secret never reaches the server. The firewall event records the sanitize verdict, surface, and matched rule.

Reach for sanitize when a tool legitimately needs most of an argument but a secret occasionally rides along in free text. When the whole call is dangerous, use deny (or pending_approval) instead — see Allow-list MCP tools.

3. Tool results are untrusted — govern them on the model reply

Here is the part most “sanitize the output” setups get wrong. The sanitize verdict touches arguments only. A tool’s result — the text or JSON an MCP server hands back — is never rewritten by a firewall verdict. OrcaRouter treats tool-result content as untrusted input to the model. A compromised or poisoned MCP server can return a secret, a PII record, or a prompt-injection payload dressed up as data. The control for that content is a guardrail on the output stage — the model’s reply, evaluated after the model has incorporated the tool result.

Catch secrets that surface in the reply

Attach a guardrail with the Secrets & API-Key Blocker preset (category secrets). It blocks AWS / OpenAI / GitHub-style credentials; pair it with Private Keys & Cloud Tokens for PEM keys, Slack/Stripe tokens, Google keys, and JWTs. An output-stage block returns guardrail_blocked (HTTP 400) and refunds the request’s quota.

Redact PII in the reply

The PII Shield preset masks typed entities — [EMAIL], [SSN], [CREDIT_CARD], … — rendering matched values as tags. Input-stage masking is live on every request (streaming or not): it masks the request before the model sees it. Output-stage masking rewrites the model reply on non-streaming responses only; in-band rewriting of a streaming reply is on the roadmap, so a mask rule does not yet redact a streamed reply.

Neutralize injection riding in tool results

A poisoned result can carry “ignore previous instructions”-style text. The Prompt-Injection Basics safety preset (keyword/regex) plus an llm_judge rule that scores for injection intent are the controls here. See MCP tool poisoning and Prompt injection.

Output enforcement and streaming. Output-stage block is enforced on both streaming and non-streaming replies — on a stream, a block cuts the stream when it matches and emits a generic block notice. Output-stage mask applies to non-streaming replies only; in-band rewriting of a streaming reply is on the roadmap, so a mask rule does not yet redact a streamed reply.

4. Where each control lives

A compact map of the two surfaces, so you wire the right knob to the right risk:

You want to govern…	Control	Where
Secrets in a tool call’s arguments	Firewall `sanitize` verdict (mcp surface)	Firewall rules
Secrets / PII / injection in a tool’s result	Guardrail on the output stage	Guardrails

Don’t try to make sanitize cover tool results — it can’t see them. And don’t assume an input-stage guardrail will catch what a tool returns mid-conversation; tool-result content is governed on the model’s reply, which is the output stage.

5. Attaching and observing

Both controls are workspace-scoped, named, and ordered, and both attach the same two ways:

Per key — set firewall_policy_id (for the sanitize rule) and guardrail_id (for the output policy) on the key the agent uses.
Workspace default — mark a policy / guardrail as the workspace default so every key inherits it.

Configure all of this from the console with your session/access token (the management routes use UserAuth, not the relay key). Firewall writes require Developer+; guardrail writes require Developer+. Once live, sanitize matches show up as firewall events (verdict, surface, matched rule), and guardrail matches show up in the guardrail match feed. The two have different read gates: the firewall events feed requires Developer+, while the guardrail match feed is readable by any workspace member. By default a match records its type, action, and stage but not the raw matched content; turn on Log raw content only when you need the substring for triage.

6. Where to go next

Allow-list MCP tools

Default-deny a server and permit only the tools you’ve reviewed.

Firewall rules

The full rule DSL — verdicts, globs, args-match, sanitize config.

Guardrails

Content policies, presets, PII entities, and output-stage enforcement.

MCP tool poisoning

The threat that makes tool results untrusted in the first place.

New to the split between these two layers? Read Guardrails vs. firewall, then Data exfiltration for the leak path sanitize and output guardrails close together.

​1. What “sanitize” means on the mcp surface

​2. One concrete rule

​3. Tool results are untrusted — govern them on the model reply

​4. Where each control lives

​5. Attaching and observing

​6. Where to go next

Allow-list MCP tools

Firewall rules

Guardrails

MCP tool poisoning

1. What “sanitize” means on the mcp surface

2. One concrete rule

3. Tool results are untrusted — govern them on the model reply

4. Where each control lives

5. Attaching and observing

6. Where to go next