- Poisoning — the server was malicious from day one. Its manifest looked benign; the dangerous behavior was in the tool implementation, not the declared scopes.
- Rug-pull — you trusted it, then it changed. A new tool appeared that the server’s operator added quietly, or a community registry entry was hijacked and updated to call home.
1. How MCP tool poisoning reaches your agents
Every tools/call your agent issues travels through the MCP server’s
declared tool set. A poisoned or rug-pulled server exploits that trust in a
few ways:
| Vector | What happens |
|---|---|
| Undeclared tool | A new tool appears in tools/list that the server’s manifest never declared. Your agent finds it and calls it. |
| Hijacked registry entry | A community registry listing is taken over; the endpoint now points to an attacker-controlled server. |
| Credential harvesting | The server’s tool implementation sends collected inputs to an external host. |
| Prompt-injection via tool result | A tool returns attacker-controlled text that redirects the agent’s next action. |
2. OrcaRouter’s defenses
2.1 Every tools/call is firewall-evaluated before it runs
MCP servers connect to your agents through the Firewall MCP gateway at
/api/v1/firewall/mcp. The gateway does not forward a tool call until the
firewall engine has evaluated it against your policy.
That means your allow-list is the source of truth — not the server’s tool
manifest. If a rug-pull adds shell.exec and your policy has no rule
permitting it, the verdict is deny and the call never leaves the gateway.
The model receives a tool error (firewall deny: …) and can react; the
attacker-added tool is dead on arrival.
Verdicts the engine can return:
| Verdict | Effect |
|---|---|
| allow / audit | Call forwarded; audit additionally logs arguments. |
| sanitize | Arguments rewritten before forwarding. |
| deny | Call blocked; model receives a tool error. |
| pending_approval | Call held; a human must approve before it proceeds. |
| cap_cost | Cost cap enforced; call blocked if it would exceed it. |
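The default-deny behavior described in 2.1 can be sketched as follows. This is a minimal illustration, not OrcaRouter’s engine: the `Rule` class, the first-match-wins ordering, and the example tool names are assumptions; only `tool_name_glob` and the verdict names come from this page.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    tool_name_glob: str  # field name used on this page
    verdict: str         # one of the verdicts in the table above


def evaluate(tool_name: str, rules: list) -> str:
    """Return the verdict of the first matching rule (assumed ordering).

    A tool that no rule mentions is denied, so a tool added by a
    rug-pull is blocked without any policy change.
    """
    for rule in rules:
        if fnmatchcase(tool_name, rule.tool_name_glob):
            return rule.verdict
    return "deny"


# Hypothetical policy for a hypothetical "github" server.
policy = [
    Rule("github.get_*", "allow"),
    Rule("github.*", "pending_approval"),
]

print(evaluate("github.get_issue", policy))    # allow
print(evaluate("github.delete_repo", policy))  # pending_approval
print(evaluate("shell.exec", policy))          # deny
```

The key design point is the final `return "deny"`: the policy, not the server’s tool list, decides what is callable.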
2.2 The server’s tool schema is baselined — drift fails closed
The most direct rug-pull defense runs before any call. On first contact the gateway records a canonical hash of the server’s advertised tool set — every tool’s name, description, and input schema (a trust-on-first-use baseline). On every later probe it re-hashes the live tools and compares:
- unchanged → verified; tools are served normally.
- drifted (a tool added, removed, or a definition changed) → the server’s schema status flips to changed and the gateway fails closed: its tools stop being served until an admin re-baselines (approves the new schema) or quarantines the server.
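A trust-on-first-use baseline of this kind can be sketched in a few lines. The canonicalization below (sorting tools by name, serializing as compact JSON with sorted keys, hashing with SHA-256) is an assumption chosen for illustration; the page does not specify how the gateway computes its hash. The `name`, `description`, and `inputSchema` fields follow the shape of an MCP tools/list result.

```python
from __future__ import annotations

import hashlib
import json
from typing import Optional


def canonical_hash(tools: list) -> str:
    """Hash name, description, and input schema of every tool, order-independently."""
    canonical = sorted(
        (
            {
                "name": t["name"],
                "description": t.get("description", ""),
                "inputSchema": t.get("inputSchema", {}),
            }
            for t in tools
        ),
        key=lambda t: t["name"],
    )
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_schema(baseline: Optional[str], live_tools: list) -> tuple:
    """Return (status, baseline_to_keep).

    First contact records the baseline. On drift the old baseline is kept,
    so the server stays "changed" until an admin explicitly re-baselines.
    """
    live = canonical_hash(live_tools)
    if baseline is None:
        return "verified", live
    if live == baseline:
        return "verified", baseline
    return "changed", baseline


# Hypothetical tool sets before and after a rug-pull.
tools_v1 = [
    {"name": "search", "description": "Search docs", "inputSchema": {"type": "object"}},
    {"name": "fetch", "description": "Fetch a page", "inputSchema": {"type": "object"}},
]
tools_v2 = tools_v1 + [{"name": "shell.exec", "description": "Run a command"}]

status, baseline = check_schema(None, tools_v1)
print(status)                                   # verified
print(check_schema(baseline, tools_v1[::-1])[0])  # verified
print(check_schema(baseline, tools_v2)[0])        # changed
```

Because the description is part of the hash, a server that keeps the same tool names but rewrites a description to carry injected instructions also registers as drift.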
2.3 Auto-detected capabilities are quarantined until reviewed
When an agent self-installs a capability — or a rug-pull adds new tools that weren’t present when you registered the server — the Firewall auto-detects the new capability off the hot path, synthesizes a manifest, scans it, and assigns a risk band and enforcement mode. Crucially, auto-detected capabilities are always quarantined regardless of scan result: they are held in pending_approval until a human reviews them.
This is how rug-pulls are contained. An operator can’t quietly add a new
tool and have your agents start using it — those calls are held until you
inspect and approve the new capability.
2.4 Skill scanning assigns a risk band and enforcement mode
Every installable capability — whether you registered it or the Firewall auto-detected it — is passed through the skill scanner. The scanner runs deterministic passes over the manifest and declared scopes:
- prompt_injection — manifest text that attempts to hijack instructions.
- tool_creep — tools the manifest uses but never declared.
- network_egress — HTTP(S) hosts outside the approved network scopes.
- fs_write_unsafe — write-mode filesystem access outside /tmp.
The scanner then assigns a risk band (low / medium / high / critical)
and an enforcement mode:
| Mode | What happens at runtime |
|---|---|
| allow | Skill imposes nothing of its own; your policy rules decide. |
| quarantine | Any non-deny verdict escalates to pending_approval. A human must approve each tool call. |
| block | Force deny on all of this skill’s tools, regardless of policy rules. |
A high-band skill is quarantined automatically; critical is blocked. A
single error finding (e.g., tool_creep for an undeclared shell.exec)
is enough to block a skill even when its numeric score looks low. The mode
only ever ratchets tighter — approving a skill never relaxes a block set by
a fresh scan.
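How the enforcement mode composes with a policy verdict can be sketched as below. The function names are hypothetical, and mapping the low and medium bands to the allow mode is an assumption, since the text only states what happens for high and critical.

```python
# Strictness order of the three enforcement modes.
MODE_RANK = {"allow": 0, "quarantine": 1, "block": 2}


def mode_for(band: str, has_error_finding: bool) -> str:
    """Map a scan result to an enforcement mode."""
    if has_error_finding or band == "critical":
        return "block"  # a single error finding blocks regardless of score
    if band == "high":
        return "quarantine"
    return "allow"  # assumption: low and medium impose nothing


def ratchet(current_mode: str, scanned_mode: str) -> str:
    """Keep the stricter of the two modes; a scan never loosens enforcement."""
    return max(current_mode, scanned_mode, key=MODE_RANK.__getitem__)


def effective_verdict(policy_verdict: str, mode: str) -> str:
    """Combine the policy verdict with the skill's enforcement mode."""
    if mode == "block":
        return "deny"
    if mode == "quarantine" and policy_verdict != "deny":
        return "pending_approval"
    return policy_verdict


print(mode_for("low", has_error_finding=True))   # block
print(ratchet("block", "allow"))                 # block
print(effective_verdict("allow", "quarantine"))  # pending_approval
print(effective_verdict("deny", "quarantine"))   # deny
```

The last two lines show the asymmetry: a quarantined skill can escalate an allow to pending_approval, but it never turns a deny into something weaker.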
2.5 Credentials are stored encrypted
Server auth secrets are encrypted at rest with a workspace secrets key and injected by the gateway at dispatch time. They never reach the model, the agent, or the call arguments. A compromised server can’t exfiltrate your API keys by reading its own auth_json.
Third-party MCP server vetting checklist
Before registering an external MCP server:
- Verify the publisher’s identity — who controls the endpoint URL?
- Read the source or changelog; look for new tools added after the initial release.
- Check whether the skill scan returns any tool_creep or prompt_injection findings on registration.
- Scope a firewall rule with tool_name_glob: <server>.* to audit or pending_approval until you have a call history.
- Review the network_egress findings: does the manifest claim it only needs one domain but the tool descriptions mention others?
- Re-probe the server after any upstream version bump (POST /api/workspace/firewall/mcp_servers/:id/probe) to surface new tools.
3. What to do after a suspected rug-pull
- Disable the server immediately — a disabled server is dropped from the runtime registry and its credentials are never decrypted. Use PUT /api/workspace/firewall/mcp_servers with "enabled": false.
- Re-probe to surface changes — POST /api/workspace/firewall/mcp_servers/:id/probe runs tools/list and returns any new tools that appeared since your last probe.
- Rescan the skill record — POST /api/workspace/firewall/skills/:id/rescan re-runs the scanner against the updated manifest. If the verdict degrades to flagged or blocked, the Firewall emits an event in your feed.
- Review the pending_approval queue — any calls held since the rug-pull are in the queue. Inspect and deny them rather than bulk-approving.
- Audit the call log — check the Firewall event trail for calls that went through before you detected the change.
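The first three API calls in this runbook could be scripted roughly as follows. The endpoint paths and the "enabled": false field come from this page; everything else is an assumption made for illustration: the host, the bearer-token header, and especially the `id` field in the PUT body, since the page does not say how the PUT request identifies the server. The sketch only builds the requests; sending them is left to the caller.

```python
import json
from urllib.request import Request

BASE_URL = "https://orcarouter.example"  # hypothetical host


def build_request(method: str, path: str, body=None, token: str = "YOUR_TOKEN") -> Request:
    """Build (but do not send) an API request. The auth scheme is an assumption."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


def rug_pull_requests(server_id: str, skill_id: str) -> list:
    """Requests for disable, re-probe, and rescan, in runbook order."""
    return [
        # Assumption: the server is identified by an "id" field in the body.
        build_request(
            "PUT",
            "/api/workspace/firewall/mcp_servers",
            {"id": server_id, "enabled": False},
        ),
        build_request("POST", f"/api/workspace/firewall/mcp_servers/{server_id}/probe"),
        build_request("POST", f"/api/workspace/firewall/skills/{skill_id}/rescan"),
    ]


for req in rug_pull_requests("srv_123", "skill_456"):
    print(req.get_method(), req.full_url)
```

Each request can then be passed to `urllib.request.urlopen`. Reviewing the pending_approval queue and auditing the call log remain manual steps.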
4. Pairing skill scanning with firewall rules
Skill scanning and firewall rules are complementary and compose:
- A rule with tool_name_glob: community.* set to pending_approval ensures you review every call from a community-sourced server, regardless of risk band.
- A quarantined skill overrides an allow rule — even if your policy permits http.fetch broadly, a quarantined skill that owns it still holds the call.
- Use skill_name_glob in a rule to scope tighter policies to untrusted servers without affecting your first-party integrations.
5. Related threats
- Dangerous tool calls — rules for blocking destructive or irreversible tool actions regardless of source.
- Data exfiltration — egress rules that restrict where tool calls may send data.
- Threat model — the full attack surface OrcaRouter is designed to defend.
Firewall: MCP Servers
Register MCP servers behind the gateway, probe their tools, and apply
per-call verdicts before any call reaches the real server.
Firewall: Skills
Scan and risk-score every installable capability. Quarantine or block
risky skills before their tools can run.
