You enabled a response-surface rule — deny or sanitize on the tool calls your model emits — and your agent calls the gateway with "stream": true. The question that actually matters: can a streaming response leak a blocked tool call before the firewall decides? It can’t, and this page explains the one mechanism that makes that true so you can reason about latency and the chunks your client receives. This is a focused look at the SSE behavior. For the verdicts themselves see Verdicts; for the rule grammar see the rule reference.

1. The streaming firewall SSE problem

A non-streaming response is one JSON body — the firewall sees the whole thing, evaluates the tool_calls, and returns the cleaned result. A stream is different: a model emits a tool call as dozens of tool_call deltas across many SSE frames, and once a frame is forwarded, your agent already has it — there is no retracting a token you’ve sent. Evaluate too early and you don’t have the complete call (name + full arguments) to judge; forward as you go and a deny is already too late. The gateway resolves this with a simple, observable contract:

Content streams live

Normal text and reasoning deltas pass through unchanged, in real time — zero added latency on the tokens your user reads.

Tool-call frames are held

Any frame carrying a tool_call (or legacy function_call) delta is withheld from the live stream until the call is complete and evaluated.
The firewall is a security gate, so it parses every frame. It does not guess a frame is content-only from the raw bytes — a JSON-escaped tool_calls member has no literal substring to match on, so a substring shortcut would forward an unevaluated tool call. SSE frames are small; the gate parses each one.
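To see why a substring shortcut is unsafe, here is a minimal sketch in Python. It is not the gateway's code: the helper name carries_tool_call is hypothetical, and the frame is a hand-written chat-completions chunk whose tool_calls key uses a JSON unicode escape for the underscore.

```python
import json


def carries_tool_call(sse_data: str) -> bool:
    """Return True if the parsed chunk holds a tool_calls or function_call delta.

    Hypothetical helper for illustration; the gateway's real parser is not shown
    in this page.
    """
    chunk = json.loads(sse_data)
    for choice in chunk.get("choices", []):
        delta = choice.get("delta") or {}
        if "tool_calls" in delta or "function_call" in delta:
            return True
    return False


# A tool-call frame whose key is written with a JSON escape (\u005f is "_").
# The raw text never contains the literal substring "tool_calls".
escaped_frame = '{"choices":[{"delta":{"tool\\u005fcalls":[{"index":0}]}}]}'

naive_match = "tool_calls" in escaped_frame        # substring check on raw text
parsed_match = carries_tool_call(escaped_frame)    # check after JSON parsing
```

The substring check misses the frame while the parsed check catches it, which is the gap the paragraph above describes.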

2. The hold-assemble-evaluate sequence

For a streaming chat-completions response with a response-surface policy active, each frame the upstream emits takes one of two paths:
  • Content and reasoning frames: stream through to your client immediately, byte-for-byte. These never carry a tool call, so the firewall has nothing to decide.
  • Tool-call frames: buffered out of the live stream. The closing finish_reason frame of a tool turn is held alongside them, because emitting it early would tell your client the turn is over before the firewall has ruled.
At end-of-stream, the gateway assembles the held frames into complete tool calls (joining each call’s streamed arguments fragments), evaluates every one against your policy on the response surface — the same verdict and rule semantics as the non-streaming path — and emits only the survivors:
Held call’s verdict | What your client receives
allow / audit | The original held frames, unchanged: a delayed pass-through, not a re-batched chunk.
sanitize | The call with its arguments rewritten (matched secrets/PII replaced with a typed token), re-emitted.
deny | The call is dropped. If it was the turn’s only call, the turn closes with finish_reason: "stop", so the stream looks like the model made no tool call.
If nothing matched, you pay only the buffering delay on the tool-call frames — content already streamed live. The firewall reconstructs frames only when it actually acts (a deny or a sanitize); a clean allow forwards your upstream’s exact bytes.
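The assembly step can be sketched as follows. This is an illustrative Python sketch, not the gateway's implementation: AssembledCall and assemble_calls are hypothetical names, and the held chunks are hand-written examples in the chat-completions delta shape, where each fragment carries an index and a piece of the arguments string.

```python
import json
from dataclasses import dataclass


@dataclass
class AssembledCall:
    """One complete tool call rebuilt from streamed fragments (hypothetical type)."""
    index: int
    name: str = ""
    arguments: str = ""


def assemble_calls(held_chunks):
    """Join held tool_call deltas into complete calls, keyed by their index."""
    calls = {}
    for chunk in held_chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            for frag in delta.get("tool_calls", []):
                call = calls.setdefault(frag["index"], AssembledCall(frag["index"]))
                fn = frag.get("function") or {}
                call.name += fn.get("name") or ""
                call.arguments += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


# Example held window: one call whose arguments arrive in two fragments,
# followed by the held finish_reason frame.
held = [
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"name": "db.query", "arguments": ""}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": "{\"sql\":"}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": "\"SELECT 1\"}"}}]}}]},
    {"choices": [{"delta": {}, "finish_reason": "tool_calls"}]},
]

calls = assemble_calls(held)
parsed_args = json.loads(calls[0].arguments)  # only valid JSON once fully joined
```

No single fragment above is valid JSON on its own, which is why evaluation has to wait for the complete call.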

3. One concrete example

A response policy with a deny rule on *.delete (author it in the console rule editor) and a streaming request whose model decides to call both db.query and db.delete:
SSE timeline (what your agent receives)
───────────────────────────────────────
data: {"choices":[{"delta":{"content":"Looking that up…"}}]}   ← live
data: {"choices":[{"delta":{"content":" one moment."}}]}        ← live
                                                                ← db.query + db.delete
                                                                  tool_call frames HELD
─── end of stream ───
data: {"choices":[{"delta":{"role":"assistant",
        "tool_calls":[{"index":0,"function":{"name":"db.query",…}}]}}]}
data: {"choices":[{"finish_reason":"tool_calls"}]}
Your agent reads the assistant text in real time, then receives only db.query. The db.delete call was assembled, evaluated, denied, and never emitted. The surviving call is re-indexed from 0, and the firewall event for the denied call lands in your events log with the rule that fired.
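The drop-and-re-index behavior in this example can be modeled with a short Python sketch. It is an assumption-laden stand-in: DENY_PATTERNS and enforce are hypothetical, and fnmatchcase glob matching is used only to imitate the *.delete rule, not to reproduce the actual rule grammar.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory stand-in for the deny rule authored in the console.
DENY_PATTERNS = ["*.delete"]


def enforce(call_names):
    """Drop denied calls, re-index survivors from 0, and pick the finish_reason."""
    survivors = [
        name for name in call_names
        if not any(fnmatchcase(name, pattern) for pattern in DENY_PATTERNS)
    ]
    emitted = [
        {"index": i, "function": {"name": name}}
        for i, name in enumerate(survivors)
    ]
    # A turn with no surviving calls closes as if the model made no tool call.
    finish_reason = "tool_calls" if emitted else "stop"
    return emitted, finish_reason


mixed_turn = enforce(["db.query", "db.delete"])
denied_only_turn = enforce(["db.delete"])
```

Here mixed_turn keeps db.query at index 0 with finish_reason "tool_calls", and denied_only_turn is empty with finish_reason "stop".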
Roll a streaming response policy out under shadow mode first. In shadow mode every enforcing verdict is downgraded to audit (reason prefixed [shadow] would …) and all tool-call frames pass through — so you can confirm the policy matches what you expect on real streamed traffic before it starts dropping calls.

4. Inbound blocks short-circuit before the stream starts

The held-frame dance is only for the response surface — calls the model emits. An inbound deny (a tool an agent advertises) fires before the upstream model call, so a streaming request that trips an inbound rule never opens an SSE stream at all: it returns a plain HTTP 400 with error code firewall_blocked, marked skip-retry. No frames, no held window — the block lands like any non-streaming error.
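On the client side, this means an inbound block can be handled before any stream-reading code runs. The Python sketch below assumes an OpenAI-style error body of the form {"error": {"code": ...}}; that body shape and the helper name is_firewall_block are assumptions, since this page only states the status (400) and the error code (firewall_blocked).

```python
def is_firewall_block(status: int, body: dict) -> bool:
    """Return True for an inbound firewall block.

    Assumes the error code sits at body["error"]["code"]; the exact body
    shape is not specified on this page.
    """
    error = body.get("error") or {}
    return status == 400 and error.get("code") == "firewall_blocked"


def should_retry(status: int, body: dict) -> bool:
    """Skip retries for firewall blocks: resending the same request trips the same rule."""
    if is_firewall_block(status, body):
        return False
    # Everything else is left to the caller's normal retry policy (assumed here
    # to retry only on server-side errors).
    return status >= 500


blocked = should_retry(400, {"error": {"code": "firewall_blocked"}})
upstream_failure = should_retry(502, {})
```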

5. Guardrails on the same stream

A streaming response can carry a Guardrail output policy and a firewall response policy at once. They act on different things — guardrails screen the text the model streams; the firewall governs the tool calls — and they compose:
  • Output guardrail block (streaming): the output scanner cuts the stream the moment a rule trips, forwards a single generic replacement chunk — [Response blocked by content policy.] with finish_reason: "content_filter" — and stops. The message is deliberately generic (no rule category) so a prober can’t enumerate your policy. A firewall hold in flight when this happens is discarded, so a withheld tool call can’t slip out after the block.
  • Output guardrail mask (streaming): masking of the request before it reaches the model is live today; in-band masking of streamed output is on the roadmap. On a stream, a mask rule records the match but currently forwards the original chunk, so author it knowing the redaction isn’t yet rewritten on the wire. Output block is fully enforced on streams.
This page describes the OpenAI chat-completions SSE shape. The same hold-evaluate-emit contract is wired per format — native Anthropic Messages, Gemini, xAI, and the OpenAI Responses stream each carry it in their own event shape — so the customer-observable behavior is identical regardless of which provider served the request.

6. What this means for your client

A few practical consequences of the held-frame model:
  • A turn whose only tool call was denied closes with finish_reason: "stop" instead of "tool_calls". To your agent it reads as “the model chose not to call a tool.” A turn where some calls survived closes with "tool_calls", carrying only the survivors.
  • When an upstream bundles token usage onto the same terminal chunk the firewall held, the gateway re-attaches it to the final reconstructed frame, so a client that requested stream usage still gets it.
  • If the model emitted content and a tool call in the same frame, the content is recovered and re-emitted even when the tool call is stripped. Blocking one call never drops your assistant text.
You don’t opt a stream into any of this. Attach a policy to the key (or set a workspace default) and keep streaming exactly as before — the enforcement is at the gateway.
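A client that already branches on finish_reason needs no changes to cope with held frames. The Python sketch below shows one way such a consumer might look; the function name consume and the sample lines are illustrative, and it assumes the chat-completions SSE shape with a data: [DONE] sentinel closing the stream.

```python
import json


def consume(lines):
    """Collect assistant text, tool calls, and the final finish_reason from SSE lines."""
    text, calls, finish = [], {}, None
    for line in lines:
        if not line.startswith("data: "):
            continue  # ignore blank separators and comment lines
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("content"):
                text.append(delta["content"])
            for frag in delta.get("tool_calls", []):
                slot = calls.setdefault(frag["index"], {"name": "", "arguments": ""})
                fn = frag.get("function") or {}
                slot["name"] += fn.get("name") or ""
                slot["arguments"] += fn.get("arguments") or ""
            finish = choice.get("finish_reason") or finish
    return "".join(text), [calls[i] for i in sorted(calls)], finish


# A turn whose only tool call was denied: text arrives, then the turn closes with "stop".
denied_turn = [
    'data: {"choices":[{"delta":{"content":"Looking that up"}}]}',
    "",
    'data: {"choices":[{"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]

text, tool_calls, finish_reason = consume(denied_turn)
# With no tool calls and finish_reason "stop", the agent simply replies with the text.
```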

Where to go next

Stages & surfaces

inbound, response, mcp, egress — where each rule evaluates.

Verdicts

allow, audit, deny, sanitize, pending_approval, cap_cost.

Sanitize arguments

Redact secrets from a tool call’s arguments — argument layer only.

Shadow mode

Downgrade enforcing verdicts to audit while you measure impact.
For where this sits in the request path, see how OrcaRouter inspects and enforcement-path latency. For the threats response-surface enforcement contains, see dangerous tool calls and data exfiltration. For the full rule grammar, see the firewall rule reference.