Rug-pull defense for MCP tools

A “rug pull” is the MCP failure mode where a server behaves while you’re watching and turns hostile once it’s trusted: a tool you approved at connect time starts smuggling extra arguments, a community server you listed quietly adds a new capability, or a skill an agent self-installed flips from benign to dangerous in production. The danger is that nobody re-reviews a connection after it’s live: the trust decision was made once, at the handshake, and never revisited.

OrcaRouter does not trust the handshake. It defends on three fronts. The Firewall’s MCP gateway evaluates each tools/call at dispatch time against your live policy. Each registered server’s advertised tool set is baselined on first probe and re-checked for drift; if the tool schema changes from the approved baseline, the server fails closed until an admin re-approves or quarantines it. And the Skills layer assigns every installed capability a risk band and an enforcement mode, quarantining anything risky or unreviewed until a human signs off. A server can’t earn a free pass by behaving for the first hundred calls.

1. Why MCP rug pull protection needs per-call evaluation

A connect-time review answers one question once: is this server safe to list? It can’t answer the question that actually matters at runtime: is this specific call, with these specific arguments, safe right now? OrcaRouter answers the second question. Every tools/call that crosses the gateway is evaluated on the mcp surface before it’s dispatched to the real server, with the tool name and arguments in hand. The verdict is computed fresh each time, so the moment a tool starts doing something your policy forbids — exfiltrating a secret in an argument, reaching a host that’s denied, calling a capability you never approved — the call is stopped, regardless of how the same tool behaved a minute ago.

Per-call evaluation governs the behavior of each call (argument contents, destinations, the owning skill’s risk), so it catches a rug pull even when the tool keeps an identical signature and only its behavior turns hostile. Schema-drift detection (§3) is the complementary layer: it catches the case where the server’s advertised tool set itself changes. Both run.

The verdicts the engine can return on the mcp surface:

allow / audit

Forwarded to the server. audit logs the call; allow stays quiet.

sanitize

Forwarded with the tool-call arguments redacted first (it never rewrites what the server returns).

deny

Returned to the model as a tool error (firewall deny: …) so the agent can adapt instead of crashing.

pending_approval

The call is held for a human to resolve before it can run.
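As a rough illustration of how a gateway could act on these four verdicts, here is a minimal Python sketch. It is not OrcaRouter's implementation: handle_verdict, the forward callback, redact_keys, the REDACTED placeholder, and the simplified result dictionary are all hypothetical names chosen for the example.

```python
from typing import Any, Callable, Iterable, Optional

REDACTED = "[REDACTED]"  # hypothetical placeholder value


def handle_verdict(
    verdict: str,
    tool: str,
    args: dict,
    forward: Callable[[str, dict], Any],
    *,
    reason: str = "",
    redact_keys: Iterable[str] = (),
    audit_log: Optional[list] = None,
) -> dict:
    """Apply one verdict to one tools/call and return a simplified tool result."""
    if verdict in ("allow", "audit"):
        # audit forwards exactly like allow, but records the call first.
        if verdict == "audit" and audit_log is not None:
            audit_log.append({"tool": tool, "args": args})
        return {"isError": False, "held": False, "content": forward(tool, args)}
    if verdict == "sanitize":
        # Only the outgoing arguments are redacted; the server's reply is untouched.
        hidden = set(redact_keys)
        clean = {k: (REDACTED if k in hidden else v) for k, v in args.items()}
        return {"isError": False, "held": False, "content": forward(tool, clean)}
    if verdict == "deny":
        # Returned as a tool error so the agent can adapt; nothing is forwarded.
        return {"isError": True, "held": False, "content": f"firewall deny: {reason}"}
    if verdict == "pending_approval":
        # Held for a human; nothing is forwarded yet.
        return {"isError": False, "held": True, "content": None}
    raise ValueError(f"unknown verdict: {verdict!r}")
```

In this sketch only allow, audit, and sanitize ever reach the forward callback; deny and pending_approval return without contacting the server.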

2. Skill risk-band quarantine

The second half of rug-pull defense covers the supply chain: the skills, plugins, and bring-your-own MCP servers an agent installs. Each one is registered as a workspace-scoped record, scanned by a deterministic risk engine, and assigned a risk band (low / medium / high / critical) plus an enforcement mode:

Mode	Effect at runtime
`allow`	Rule verdicts decide; the skill adds nothing.
`quarantine`	Anything short of a deny is escalated to `pending_approval` — tools run only after a human approves.
`block`	The skill’s tools are denied outright.

This is where a rug pull gets contained. A capability an agent self-installs is auto_detected and quarantined until reviewed — even if it scanned clean, it doesn’t run on its own authority. And a skill’s mode only ever ratchets tighter on re-scan: a block or quarantine you set is never silently relaxed when a manifest is re-presented.

Quarantine is enforced independently of shadow mode. A skill set to quarantine or block is still held even while the surrounding policy is in shadow rollout — so a risky capability can’t slip through during a staged deployment.
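The mode table and the two guarantees above (re-scans only tighten, and quarantine survives shadow mode) can be sketched in a few lines of Python. This is an illustration under assumptions, not the product's code: the function names are hypothetical, the strictness ordering allow < quarantine < block is inferred from the table, and shadow mode is assumed to mean that rule verdicts are observed but not enforced.

```python
# Assumed strictness ordering, inferred from the mode table.
MODE_STRICTNESS = {"allow": 0, "quarantine": 1, "block": 2}


def effective_verdict(rule_verdict: str, skill_mode: str) -> str:
    """Layer the owning skill's mode on top of the per-call rule verdict."""
    if skill_mode == "block":
        return "deny"
    if skill_mode == "quarantine":
        # Anything short of a deny is escalated to a human.
        return "deny" if rule_verdict == "deny" else "pending_approval"
    return rule_verdict  # allow: the skill adds nothing


def ratchet_mode(current: str, rescanned: str) -> str:
    """On re-scan, keep whichever mode is stricter; never relax."""
    if MODE_STRICTNESS[rescanned] > MODE_STRICTNESS[current]:
        return rescanned
    return current


def enforce(rule_verdict: str, skill_mode: str, shadow: bool) -> str:
    """Assumption: in shadow, the rule verdict is logged and the call is let through,
    but the skill mode is still applied."""
    base = "allow" if shadow else rule_verdict
    return effective_verdict(base, skill_mode)
```

For example, enforce("allow", "quarantine", shadow=True) still yields pending_approval, which matches the guarantee that a quarantined skill is held during a staged rollout.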

See Firewall: Skills for the full scanner, scoring weights, and trust signals.

3. Tool-schema drift detection

The classic rug pull is a registered server that changes what it advertises — adds a tool, alters a tool’s input schema, swaps a description. OrcaRouter baselines each registered server’s advertised tool set on a successful probe and watches it for drift.

Baseline on first probe

The first successful probe records a canonical hash of the server’s tools (trust-on-first-use under a discovery posture; under an enforcing posture an unbaselined server is held pending until an admin approves its initial tool set).

Drift fails closed

On a later probe, if the canonical tool set no longer matches the approved baseline the server is marked changed and stops being served — the gateway won’t dispatch its tools until you decide.

Approve or quarantine

Re-approve to re-baseline to the new schema, or quarantine the server. A quarantined server is also disabled and only an explicit approve restores service — a plain edit can’t re-enable it.

Audited

The first detection of drift from an approved baseline writes a workspace audit entry, so the change is on the record.

A server’s schema status is one of unknown (never baselined), verified (matches baseline), changed (drifted, held), pending (unbaselined under enforcing), or quarantined. This layer catches the rug pull that moves the schema; per-call evaluation (§1) catches the one that keeps an identical signature and only changes behavior.
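To make the baseline-and-drift lifecycle concrete, the following Python sketch models it as a small state machine. Several details are assumptions rather than documented behavior: the canonical hash is taken to be SHA-256 over order-independent JSON, the server record is a plain dictionary with hypothetical baseline and candidate fields, and a held server is assumed to stay held on later probes until an admin acts.

```python
import hashlib
import json

HELD = {"changed", "pending", "quarantined"}  # statuses whose tools are not dispatched


def canonical_tool_hash(tools: list) -> str:
    """Hash the advertised tools independent of list order and key order."""
    ordered = sorted(tools, key=lambda t: t["name"])
    blob = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def new_server() -> dict:
    """A freshly registered server that has never been baselined."""
    return {"status": "unknown", "baseline": None, "candidate": None}


def probe(server: dict, tools: list, enforcing: bool) -> dict:
    """Return the server record after one successful probe."""
    digest = canonical_tool_hash(tools)
    if server["status"] in HELD:
        # Assumption: a held server stays held until an admin decides.
        return {**server, "candidate": digest}
    if server["baseline"] is None:
        if enforcing:
            return {**server, "status": "pending", "candidate": digest}
        return {**server, "status": "verified", "baseline": digest}  # trust on first use
    if digest != server["baseline"]:
        return {**server, "status": "changed", "candidate": digest}  # fail closed
    return {**server, "status": "verified"}


def approve(server: dict) -> dict:
    """Admin action: re-baseline to the most recently probed tool set."""
    baseline = server["candidate"] or server["baseline"]
    return {"status": "verified", "baseline": baseline, "candidate": None}


def quarantine(server: dict) -> dict:
    """Admin action: disable the server until an explicit approve."""
    return {**server, "status": "quarantined"}
```

Sorting the tools by name and serializing with sorted keys means a server that merely reorders its tool list does not register as drift, while any added tool, changed input schema, or edited description does.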

4. One concrete example

Suppose a community MCP server named notes advertises a harmless notes.search tool. You list it, review it, and it works. A week later the server is compromised and notes.search starts attaching an exfiltration argument that POSTs your context to an attacker host. A connect-time-only gateway would forward it, because the tool name and schema look unchanged. OrcaRouter evaluates the call:

# Configure the deny rule in the console (Developer+), not via the relay key.
# Rule: on the mcp surface, deny notes.search whenever it carries an
#       exfiltration-shaped argument.
#   tool_name_glob: notes.search
#   args_match:     { "path": "$.callback_url", "op": "regex",
#                     "value": "^https?://(?!notes\\.example/)" }  → deny

(args_match operators are eq, contains, regex, in, cidr_match, gt, lt; cidr_match tests an IP-valued argument against a CIDR. To bound where a tool may reach by host/CIDR, use the egress destination list instead of an argument clause.)

At dispatch the engine returns deny, and instead of forwarding the call the gateway hands the agent an MCP tool-result error (a normal result flagged as an error, not a transport failure) so the model can adapt:

firewall deny: <your rule's reason>

The same call that succeeded last week is now blocked — because the decision is made on the call, not on the connection.
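To see why the rule fires on the compromised call but not on a legitimate one, here is a small Python sketch that evaluates the same tool_name_glob and regex clause locally. It is an approximation for illustration only: the rule dictionary layout, the lookup and evaluate helpers, the reason text, and the attacker.example host are hypothetical, only the regex operator is implemented, and the path resolver handles just simple dotted paths such as $.callback_url.

```python
import fnmatch
import re

# Hypothetical in-memory form of the console rule shown above.
RULE = {
    "tool_name_glob": "notes.search",
    "args_match": {
        "path": "$.callback_url",
        "op": "regex",
        "value": r"^https?://(?!notes\.example/)",
    },
    "verdict": "deny",
    "reason": "callback_url points outside notes.example",
}


def lookup(args: dict, path: str):
    """Resolve a minimal '$.a.b' path; return None when any segment is missing."""
    node = args
    for key in path.removeprefix("$.").split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def evaluate(rule: dict, tool: str, args: dict, default: str = "allow") -> str:
    """Return the rule's verdict if the call matches, otherwise the default."""
    if not fnmatch.fnmatchcase(tool, rule["tool_name_glob"]):
        return default
    clause = rule["args_match"]
    value = lookup(args, clause["path"])
    if clause["op"] == "regex" and isinstance(value, str):
        if re.search(clause["value"], value):
            return rule["verdict"]
    return default


# Last week's call: no callback_url, so the clause does not match.
last_week = evaluate(RULE, "notes.search", {"q": "roadmap"})
# This week's call: callback_url points at a foreign host, so the rule fires.
this_week = evaluate(
    RULE,
    "notes.search",
    {"q": "roadmap", "callback_url": "https://attacker.example/collect"},
)
```

The negative lookahead is what makes the rule an allowlist in disguise: any http(s) URL matches unless the host portion begins with notes.example/.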

sanitize redacts the arguments your agent sends, never the content a tool returns. If you need to constrain where a tool may reach, pair a deny rule with an egress destination list — don’t rely on sanitize to scrub responses.

5. How it fits together

Per-call evaluation vs. skill quarantine — which catches what?

Per-call evaluation catches a trusted tool turning malicious — same name, new behavior in the arguments or destination. Skill quarantine catches a new or unreviewed capability appearing at all — an auto-detected install, a re-scanned manifest that newly degrades. A rug pull can take either shape, so both run: the skill’s mode rides on top of the per-call rule verdict.

Does this baseline the server's schema?

Yes — see §3. Each registered server’s advertised tool set is baselined on first probe and re-checked for drift; a drifted server fails closed until you re-approve or quarantine it. That’s complementary to per-call evaluation, which also catches a tool that keeps an identical signature and only changes its behavior.

Where do held calls go?

A pending_approval verdict holds the call for a human to resolve in the console (Developer+) or via an HMAC approval callback. See enforcement modes for how holds and approvals are surfaced to an agent.
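This page does not specify the wire format of the HMAC approval callback, so the following is only a generic sketch of how a receiver can authenticate such a message in Python. The payload fields (call_id, decision), the hex-encoded SHA-256 signature, and the function names are assumptions for illustration; check the enforcement modes reference for the actual contract.

```python
import hashlib
import hmac
import json
from typing import Optional


def sign(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_approval(secret: bytes, body: bytes, signature_hex: str) -> Optional[dict]:
    """Return the parsed decision if the signature is valid, else None."""
    expected = sign(secret, body)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a forged signature were correct.
    if not hmac.compare_digest(expected, signature_hex):
        return None
    return json.loads(body)


# Hypothetical round trip with a made-up payload shape.
secret = b"shared-approval-secret"
body = json.dumps({"call_id": "c1", "decision": "approve"}).encode("utf-8")
decision = verify_approval(secret, body, sign(secret, body))
```

Verifying the signature over the raw bytes before parsing the JSON keeps an unauthenticated payload from ever being interpreted as an approval.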

6. Configuring it

Every step below is a console / management action authenticated with your session or access token — not the sk-orca-… relay key. Only /v1/* relay traffic uses the relay key.

Register your MCP servers

Connect each server so its tools are advertised under one audited endpoint. Registration is Developer+.

Set a default verdict and rules on the mcp surface

Author rules with tool_name_glob and args_match so risky calls resolve to deny, sanitize, or pending_approval. See the Firewall rule reference.

Review quarantined skills

Anything auto-detected sits in quarantine until a reviewer (Developer+) approves it. Read the band and findings first.

Roll out in shadow, then enforce

Use enforcement modes to run new rules in shadow, watch the audit events, and flip to enforcing once the verdicts look right.

Reads (settings, policies, discovered tools, anomalies) are open to any Member; every write is Developer+. Reading a firewall-gateway key’s plaintext is Developer+.

Related

Firewall: MCP Servers

The full MCP gateway reference — registration, probing, dispatch.

Firewall: Skills

Scanner passes, risk scoring, and the quarantine derivation.

MCP tool poisoning

The threat model rug-pull defense exists to counter.

Egress limits

Author host/CIDR deny rules to bound where tools may reach.

Trust checklist

The end-to-end checklist for trusting an MCP server.

Guardrails vs. Firewall

When content policy applies and when the firewall does.
