Stop data exfiltration end-to-end

An agent that can reach the network can be turned into a data pipe. Injected instructions tell it to gather secrets, rows, or PII with the tools it already holds and POST them to an attacker host, or to probe internal services (SSRF). The agent never “decides” to exfiltrate; it executes what looks, to it, like a legitimate instruction.

This recipe wires up three controls that together cover the exfiltration path: an egress allow-list that restricts where outbound calls can go, the Secrets Blocker guardrail that stops credentials before they reach a model, and an argument sanitizer that strips secrets out of the tool calls a model does emit. All of it lives in the gateway, so you configure it once in the console without changing your agent code. For the full attack anatomy, read Data exfiltration over the network; this page covers the build steps.

Everything here binds to your workspace and is configured from the console. Your agent keeps calling https://api.orcarouter.ai/v1/... with the same sk-orca-... key; only the policy in the gateway changes. Configuration actions need the roles called out per step; relay calls use the scoped key. The firewall sees egress only for destinations routed through the gateway (the MCP dispatch path or the evaluate hook), so route your network-bound tool calls through it. Calls that bypass the gateway are not governed.

1. The three layers that prevent AI data exfiltration

Each layer catches the attack at a different point in the request lifecycle. Stack all three — they’re independent and complementary.

Credentials in the prompt

A secret pasted into (or pulled into) the request is caught at the input stage by the Secrets Blocker guardrail — before any model sees it.

Secrets in tool args

A model that emits a tool call carrying a credential is cleaned by a sanitize firewall rule, which redacts the matched argument.

Outbound destination

The actual network step is bounded by an egress allow-list — only enumerated hosts pass; everything else is denied.

This recipe uses both planes: Guardrails for the text in the request, the Firewall for the actions and the network. See guardrails vs firewall for where the line sits.

2. Stop credentials at the prompt — the Secrets Blocker guardrail

The first thing to lock down is the credential itself. The Secrets & API-Key Blocker guardrail runs at the input stage and scans the request for credential patterns (AWS access keys, OpenAI-style keys, GitHub tokens, and similar tokens) before the request leaves the gateway. On a match the request is blocked: the credential never reaches a model and never lands in a tool call.

In the console, open Guardrails → New guardrail (requires the Developer role; reads and the Test sandbox are open to any member), name it exfil-shield, and apply the Secrets & API-Key Blocker preset from the template library (category secrets). The preset seeds three input-stage regex block rules, one per credential shape: AWS access keys, OpenAI-style keys, and GitHub tokens.

[
  { "type": "regex", "stage": "input", "action": "block", "pattern": "AKIA[0-9A-Z]{16}" },
  { "type": "regex", "stage": "input", "action": "block", "pattern": "sk-[A-Za-z0-9]{20,}" },
  { "type": "regex", "stage": "input", "action": "block", "pattern": "ghp_[A-Za-z0-9]{36}" }
]
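Before pasting samples into the Test tab, it can help to check locally what the three patterns do and do not match. The following is a minimal Python sketch, not the gateway's implementation: the patterns are copied from the rules above, while the dictionary keys, the function name `find_secret_shapes`, and the sample strings are hypothetical.

```python
import re

# Patterns copied from the three preset rules above.
# The dictionary keys are illustrative labels, not gateway identifiers.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "openai_style_key": re.compile(r"sk-[A-Za-z0-9]{20,}"),
    "github_token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
}


def find_secret_shapes(text: str) -> list:
    """Return the label of every pattern that matches somewhere in text."""
    return [
        label
        for label, pattern in SECRET_PATTERNS.items()
        if pattern.search(text)
    ]


if __name__ == "__main__":
    # Synthetic samples only; none of these are real credentials.
    samples = [
        "deploy with AKIAIOSFODNN7EXAMPLE please",
        "token: ghp_" + "a" * 36,
        "nothing sensitive here",
    ]
    for sample in samples:
        print(find_secret_shapes(sample) or "no match", "<-", sample[:30])
```

A request would be blocked if any one pattern matches, so an empty list here corresponds to a request the preset would let through.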

To extend coverage, add a pii rule on the built-in entities — the detector set covers email, phone, credit_card, ssn, ip, iban, mac_address, api_key_openai, aws_access_key, jwt, and bitcoin_address. Choose mask (redact to a typed tag like [EMAIL]) or block per entity via entity_actions. Input-stage masking is live; it rewrites the request before the model sees it.

A blocked request returns HTTP 400 guardrail_blocked, costs no quota (an input-stage block fires before metering), and is marked skip-retry. Prove it in the Test tab — paste a sample AWS key, pick the input stage, and confirm the verdict — before you attach a key.
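Because a block is marked skip-retry, a client that retries on failure should treat it as final. The sketch below shows one way to do that in Python, under stated assumptions: it assumes the error code string appears somewhere in the response body (the exact body shape is not specified on this page), and the retry policy for other statuses (429 and 5xx) is a common client convention rather than documented gateway behavior. The function names are hypothetical.

```python
# Error codes this page says the gateway returns on a policy block.
POLICY_BLOCK_CODES = ("guardrail_blocked", "firewall_blocked")


def is_policy_block(status_code: int, body_text: str) -> bool:
    """True when a response looks like a gateway policy block.

    Assumption: the code string appears verbatim in the response body.
    """
    return status_code == 400 and any(
        code in body_text for code in POLICY_BLOCK_CODES
    )


def should_retry(status_code: int, body_text: str) -> bool:
    """Decide whether a failed relay call is worth retrying."""
    if is_policy_block(status_code, body_text):
        # Resending the same content would be blocked again.
        return False
    # Assumed convention: retry rate limits and server-side errors only.
    return status_code == 429 or status_code >= 500
```

Surfacing a policy block to the operator instead of retrying keeps the Matches feed free of repeated hits for the same request.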

3. Sanitize secrets out of tool-call arguments

A guardrail screens the prompt; it doesn’t see the tool calls a model emits. When the model produces a tool_call whose arguments carry a credential, a firewall sanitize rule catches it. Sanitize redacts the matched substrings from the tool-call arguments and forwards the cleaned call: the tool runs, but with the secret stripped out.

In Firewall → Policies → New policy (Developer role), name it exfil-firewall and add a sanitize rule on the response surface, meaning the tool_calls the model emits in its reply:

{
  "priority": 10,
  "label": "Redact secrets from tool args",
  "stage": "response",
  "tool_name_glob": "*",
  "verdict": "sanitize",
  "sanitize": {
    "presets": ["aws_access_key", "openai_key"],
    "custom": ["sk-[A-Za-z0-9]{20,}"]
  }
}

Sanitize redacts tool-call arguments only — never the content a tool returns. It’s a defense on the outbound call shape, not on inbound tool results. On the inbound surface (where there are no call-time args yet) a sanitize verdict escalates to a deny. See the full matching language in Firewall rules.
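To make the redaction behavior concrete, here is a minimal Python sketch of substring redaction over a tool call's arguments. It is an illustration under assumptions, not the firewall's code: the `[REDACTED]` placeholder, the function name `sanitize_args`, and the sample arguments are hypothetical, and the two patterns are the credential shapes shown earlier on this page.

```python
import re
from typing import Any

# Credential shapes from this page; the gateway's preset contents may differ.
REDACTION_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),
    re.compile(r"sk-[A-Za-z0-9]{20,}"),
]

# Hypothetical placeholder; the real replacement text is not documented here.
PLACEHOLDER = "[REDACTED]"


def sanitize_args(value: Any) -> Any:
    """Return a copy of value with matched substrings replaced in every string."""
    if isinstance(value, str):
        for pattern in REDACTION_PATTERNS:
            value = pattern.sub(PLACEHOLDER, value)
        return value
    if isinstance(value, dict):
        return {key: sanitize_args(item) for key, item in value.items()}
    if isinstance(value, list):
        return [sanitize_args(item) for item in value]
    # Numbers, booleans, and None pass through unchanged.
    return value


if __name__ == "__main__":
    # Synthetic tool-call arguments carrying a fake key.
    call_args = {
        "url": "https://example.com/upload",
        "headers": {"Authorization": "Bearer sk-" + "A" * 24},
        "retries": 3,
    }
    print(sanitize_args(call_args))
```

Only the matched substring is replaced, so the rest of the argument (here the `Bearer ` prefix and the URL) is forwarded intact and the tool call still runs.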

4. Lock outbound destinations — the egress allow-list

The most durable defense is the network boundary itself: enumerate the hosts your agents are legitimately allowed to reach and deny everything else. An egress rule uses stage: egress and the egress field, and the verdict sets polarity: an allow rule passes the listed destinations, and a lower-priority deny catch-all (priority 20 after priority 10 in the rules below) blocks the rest. Add these rules to the same exfil-firewall policy:

[
  {
    "priority": 10,
    "label": "Allow known API endpoints",
    "stage": "egress",
    "tool_name_glob": "*",
    "verdict": "allow",
    "egress": {
      "allow": ["api.openai.com", "api.anthropic.com", "api.orcarouter.ai"]
    }
  },
  {
    "priority": 20,
    "label": "Deny all other outbound destinations",
    "stage": "egress",
    "tool_name_glob": "*",
    "verdict": "deny"
  }
]

Entries match as a CIDR, an IP literal, or a case-insensitive hostname. To stop SSRF toward internal services without an explicit allow-list, author your own egress deny rule listing the cloud metadata endpoint (169.254.169.254) and the RFC-1918 private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). A denied call returns HTTP 400 firewall_blocked.
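The matching described above (CIDR, IP literal, or case-insensitive hostname) can be sketched with Python's standard `ipaddress` module. This is an illustration only: it assumes rules are evaluated in ascending priority order with the first match winning, which this page implies but does not state outright, and the names `entry_matches`, `egress_verdict`, and `RULES` are hypothetical. It also does no DNS resolution, so it says nothing about hostnames that resolve to private addresses.

```python
import ipaddress


def entry_matches(entry: str, destination: str) -> bool:
    """Match one list entry against a destination hostname or IP literal."""
    try:
        dest_ip = ipaddress.ip_address(destination)
    except ValueError:
        dest_ip = None  # destination is a hostname, not an IP literal
    try:
        # A bare IP literal parses as a single-address network (/32 or /128).
        network = ipaddress.ip_network(entry, strict=False)
    except ValueError:
        network = None  # entry is a hostname
    if network is not None:
        return dest_ip is not None and dest_ip in network
    # Hostname entry: compare case-insensitively against hostname destinations.
    return dest_ip is None and entry.lower() == destination.lower()


# (priority, verdict, entries); entries=None models the deny catch-all.
RULES = [
    (10, "allow", ["api.openai.com", "api.anthropic.com", "api.orcarouter.ai"]),
    (20, "deny", None),
]


def egress_verdict(destination: str, rules=RULES) -> str:
    """Assumed semantics: ascending priority, first matching rule wins."""
    for _priority, verdict, entries in sorted(rules, key=lambda rule: rule[0]):
        if entries is None or any(entry_matches(e, destination) for e in entries):
            return verdict
    return "no_match"  # behavior with no matching rule is not covered here


if __name__ == "__main__":
    for host in ["API.OpenAI.com", "evil.example", "169.254.169.254"]:
        print(host, "->", egress_verdict(host))
```

The same `entry_matches` helper works for an SSRF deny list: an entry such as `10.0.0.0/8` matches any destination IP literal inside that range.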

No preset ships CIDR egress rules — you author the host/CIDR allow and deny entries yourself. The tight autonomy level is the adjacent fast path: it denies the fetch-shaped tool names (http_fetch, web_search, fetch_url, request) outright, removing the network capability before a destination is ever evaluated. Use it when your agent doesn’t need those tools at all.

5. Attach one scoped key

A policy only enforces on keys that resolve to it. Give the agent its own key, scoped to the minimum it needs — never your account-wide key. In API Keys → New key (Developer role):

Attach both policies

Pick exfil-shield from the Guardrail dropdown (sets guardrail_id) and exfil-firewall from the Firewall policy dropdown (sets firewall_policy_id). Both bindings live on the key in the gateway. The two behave differently when disabled: an explicitly attached guardrail never silently falls back to another one, so disabling it turns guardrail enforcement off for the key, whereas a disabled firewall policy falls back to the workspace default policy.

Cap the blast radius

Set credit_limit_usd to a sane ceiling (0 = unlimited) so a compromised key can’t drain quota, and allow_ips to your backend’s egress IPs if the agent calls from a fixed server. Set an expired_time for temporary keys (-1 = never expires).

The key is masked on display after creation, so copy it once. Your agent now runs every request through exfil-shield and every tool call through exfil-firewall, with no change to its code.

6. Roll out with shadow mode, then watch

If you don’t yet know every host your agent legitimately reaches, don’t enforce blind — observe first. See enforcement modes for the full observe → shadow → enforce path.

Shadow the egress rules

Set shadow_mode: true on exfil-firewall. Every enforcing verdict is downgraded to audit and logged as [shadow] would deny with the destination. No traffic is blocked while shadow mode is on.

Watch the feeds

Firewall → Events / Runs (Developer+) shows every tool call and egress destination your agent hit and what would have been denied. Guardrails → Matches (any Member) shows every secret the input guardrail caught. Tune the egress allow-list until the only destinations that would be denied are hosts your agent has no legitimate reason to reach.

Enforce

Turn off shadow_mode. The very next request is governed — credentials blocked at the prompt, secrets stripped from tool args, outbound calls confined to your allow-list. No application change.
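The shadow-mode downgrade used in the rollout steps above can be pictured as a small wrapper around the verdict. The Python sketch below is an assumption-laden illustration: it treats deny and sanitize as the "enforcing" verdicts, which this page does not enumerate, and the function name and log format are hypothetical apart from the `[shadow] would deny` prefix quoted above.

```python
from typing import Optional, Tuple

# Assumption: these are the verdicts that change traffic when enforced.
ENFORCING_VERDICTS = {"deny", "sanitize"}


def effective_verdict(
    verdict: str, shadow_mode: bool, destination: str
) -> Tuple[str, Optional[str]]:
    """Return (verdict actually applied, log line or None)."""
    if shadow_mode and verdict in ENFORCING_VERDICTS:
        # Nothing is blocked; the would-be outcome is only recorded.
        return "audit", f"[shadow] would {verdict} {destination}"
    return verdict, None


if __name__ == "__main__":
    print(effective_verdict("deny", True, "evil.example"))
    print(effective_verdict("deny", False, "evil.example"))
```

With shadow mode on, the first call is logged but allowed through; with it off, the same rule returns the enforcing verdict unchanged.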

The Matches feed records the matched substring only when Log raw content is on for the guardrail (off by default — the privacy-conservative posture). Mark a false positive (Admin) to tune the policy. Every guardrail change writes a version-history row you can diff and revert; firewall policy changes are recorded in the audit trail.

7. Coverage at a glance

Exfiltration step	Layer that stops it
Credential enters the request	Secrets Blocker guardrail (input)
Model emits a tool call carrying a secret	`sanitize` firewall rule (response surface)
Tool dials an attacker host	Egress `allow` / `deny` rule
Agent reaches cloud metadata or RFC-1918	Egress deny rule listing those CIDRs
Fetch-shaped tool offered to the model	`tight` autonomy level (tool-name deny)

8. Where to go next

Firewall rules reference

The full matching language — egress lists, CIDRs, sanitizers, and all verdicts.

Data exfiltration threat

The attack anatomy this recipe defends against, end to end.

Harden an MCP agent

Govern every tools/call an agent dispatches through an MCP server.

PII-safe logging

Keep sensitive data out of your request logs and the Matches feed.
