Streaming

Set "stream": true on any gateway request. The response is a server-sent events (SSE) stream in the ingress format’s native streaming protocol: the same events that format’s official SDKs already parse.
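Any SSE client library will handle these streams. If you read the raw response body yourself, a minimal parser is enough. The Python sketch below is illustrative: iter_sse_events is a hypothetical name, and it handles only the event: and data: fields. Comment lines starting with ":" are skipped, which covers the gateway's keep-alives.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (event_name, data) pairs from decoded SSE lines (illustrative sketch).

    A blank line dispatches the buffered event. Lines starting with ":" are
    comments (for example keep-alives) and are skipped. Multiple data: lines
    in one event are joined with newlines.
    """
    event_name: Optional[str] = None
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: dispatch the buffered event, if it has any data.
            if data_lines:
                yield event_name, "\n".join(data_lines)
            event_name, data_lines = None, []
            continue
        if line.startswith(":"):
            continue  # comment line, e.g. a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # SSE strips a single leading space from the value
        if field == "event":
            event_name = value
        elif field == "data":
            data_lines.append(value)
        # Other fields (id, retry) are ignored in this sketch.
    # Per the SSE rules, an event not terminated by a blank line is discarded.
```

Feeding it the decoded lines of any of the streams below yields one pair per event; for Chat Completions the event name is None, because those streams use bare data: lines.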

Per-format protocols

Chat Completions

chat.completion.chunk objects. The last chunk carries usage with an empty choices array, and the stream is terminated by data: [DONE]:


data: {"id":"gen-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}

data: {"id":"gen-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"}}]}

data: {"id":"gen-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: {"id":"gen-…","object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":7,"total_tokens":19}}

data: [DONE]
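On the client, these chunks fold into a single result. The sketch below is a hypothetical helper, not an SDK function. It takes the text after each "data: " prefix, stops at the [DONE] sentinel (which is not JSON), and assumes a single choice at index 0.

```python
import json
from typing import Any, Dict, Iterable, List, Optional


def accumulate_chat_stream(data_payloads: Iterable[str]) -> Dict[str, Any]:
    """Fold Chat Completions stream payloads into text, finish_reason, and usage."""
    parts: List[str] = []
    finish_reason: Optional[str] = None
    usage: Optional[Dict[str, int]] = None
    for payload in data_payloads:
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            if choice.get("index", 0) != 0:
                continue  # this sketch tracks only the first choice
            delta = choice.get("delta", {})
            if delta.get("content"):
                parts.append(delta["content"])
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
        # The usage chunk has an empty choices list and a top-level usage object.
        if chunk.get("usage"):
            usage = chunk["usage"]
    return {"text": "".join(parts), "finish_reason": finish_reason, "usage": usage}
```

Given the four chunks above, it returns the text "Hello", finish_reason "stop", and a usage object with total_tokens 19.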

Responses

Semantic events (response.created, response.output_text.delta, …), ending with response.completed, which carries usage:


event: response.created
data: {"type":"response.created","response":{"id":"gen-…","status":"in_progress",…}}

event: response.output_text.delta
data: {"type":"response.output_text.delta","delta":"Hello"}

event: response.completed
data: {"type":"response.completed","response":{"id":"gen-…","status":"completed","usage":{"input_tokens":12,"output_tokens":7,…}}}
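Because each data payload repeats its event name in a type field, a client can dispatch on type alone. The following is a minimal sketch under that reading of the example above. accumulate_responses_stream is a hypothetical name, and it handles only the three event types shown, ignoring all others.

```python
import json
from typing import Any, Dict, Iterable, List, Optional, Tuple


def accumulate_responses_stream(
    events: Iterable[Tuple[Optional[str], str]],
) -> Dict[str, Any]:
    """Fold (event_name, data) pairs from a Responses stream into text, status, usage."""
    parts: List[str] = []
    status: Optional[str] = None
    usage: Optional[Dict[str, Any]] = None
    for _event_name, data in events:
        payload = json.loads(data)
        kind = payload.get("type")  # mirrors the SSE event: line
        if kind == "response.output_text.delta":
            parts.append(payload["delta"])
        elif kind == "response.created":
            status = payload["response"]["status"]
        elif kind == "response.completed":
            # The terminal event carries the final status and usage.
            status = payload["response"]["status"]
            usage = payload["response"].get("usage")
        # Other event types are ignored in this sketch.
    return {"text": "".join(parts), "status": status, "usage": usage}
```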

Anthropic Messages

Anthropic event stream (message_start → content_block_delta events → message_delta with usage → message_stop):


event: message_start
data: {"type":"message_start","message":{"id":"gen-…","role":"assistant","content":[],…}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":7}}

event: message_stop
data: {"type":"message_stop"}
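The same folding approach works for the Anthropic stream. In this hypothetical sketch, text comes from text_delta deltas, stop_reason and usage come from message_delta, and message_stop ends the loop. One assumption goes beyond the example above: message_start may also carry a usage object, as it does in Anthropic's own API. The sketch merges it when present, and later message_delta values override it.

```python
import json
from typing import Any, Dict, Iterable, List, Optional, Tuple


def accumulate_messages_stream(
    events: Iterable[Tuple[Optional[str], str]],
) -> Dict[str, Any]:
    """Fold (event_name, data) pairs from an Anthropic Messages stream."""
    parts: List[str] = []
    stop_reason: Optional[str] = None
    usage: Dict[str, int] = {}
    for event_name, data in events:
        payload = json.loads(data)
        kind = payload.get("type", event_name)
        if kind == "message_start":
            # Assumption: initial usage may appear here; absent in the example above.
            usage.update(payload["message"].get("usage", {}))
        elif kind == "content_block_delta":
            delta = payload["delta"]
            if delta.get("type") == "text_delta":
                parts.append(delta["text"])
        elif kind == "message_delta":
            stop_reason = payload["delta"].get("stop_reason", stop_reason)
            usage.update(payload.get("usage", {}))
        elif kind == "message_stop":
            break
        # Other event types are ignored in this sketch.
    return {"text": "".join(parts), "stop_reason": stop_reason, "usage": usage}
```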

Usage in streams — always

Token usage is reported on every request, streamed or not. Every stream always emits its final usage: the usage chunk before data: [DONE] in Chat Completions, response.completed in Responses, and message_delta in Anthropic Messages. You do not need to send stream_options: {"include_usage": true}. The Chat Completions ingress accepts that option for compatibility, but usage is included either way. Usage covers prompt, completion, and reasoning tokens, with cached and uncached prompt tokens broken out; see Usage & Credits.
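To read usage the same way across ingress formats, a client can look for the usage-bearing event in each protocol. The sketch below is hypothetical. The format labels and the normalized input_tokens/output_tokens keys are our own choices. It reads only the field names visible in the stream examples on this page. Reasoning and cached-token fields are left out; see Usage & Credits for those.

```python
from typing import Any, Dict, Mapping, Optional


def extract_usage(
    ingress_format: str, payload: Mapping[str, Any]
) -> Optional[Dict[str, Optional[int]]]:
    """Return normalized usage if this decoded event carries it, else None.

    ingress_format is one of the hypothetical labels "chat_completions",
    "responses", or "anthropic_messages".
    """
    if ingress_format == "chat_completions":
        # Final chunk before [DONE]: empty choices plus a usage object.
        usage = payload.get("usage")
        if usage:
            return {"input_tokens": usage["prompt_tokens"], "output_tokens": usage["completion_tokens"]}
        return None
    if ingress_format == "responses":
        if payload.get("type") == "response.completed":
            usage = payload["response"]["usage"]
            return {"input_tokens": usage["input_tokens"], "output_tokens": usage["output_tokens"]}
        return None
    if ingress_format == "anthropic_messages":
        if payload.get("type") == "message_delta" and "usage" in payload:
            usage = payload["usage"]
            # The message_delta example shows only output_tokens, so
            # input_tokens may be None here.
            return {"input_tokens": usage.get("input_tokens"), "output_tokens": usage["output_tokens"]}
        return None
    raise ValueError(f"unknown ingress format: {ingress_format!r}")
```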

Keep-alives, timeouts, aborts

The gateway sends SSE keep-alive comments every 15 seconds so that load balancers and proxies do not idle-close a healthy stream.
Timeouts: up to 5 minutes to the first token, 120 seconds of idle time between chunks, and a 60-minute absolute cap per stream. See Limits & Timeouts for details.
If your client aborts mid-stream, the tokens generated up to that point are still metered, billed, and recorded — the partial usage settles when the stream closes.
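If you also want the client to give up on its own, a small watchdog can mirror the limits listed above. This is a hypothetical client-side helper; StreamWatchdog is not part of any SDK. The constants copy the values stated on this page. One design choice is an assumption of this sketch: only data events reset the idle clock, not keep-alive comments. Remember that aborting still bills the tokens generated so far.

```python
import time
from typing import Callable, Optional

# Values as stated on this page; confirm current limits in Limits & Timeouts.
FIRST_TOKEN_TIMEOUT_S = 5 * 60
INTER_CHUNK_IDLE_S = 120
STREAM_CAP_S = 60 * 60


class StreamWatchdog:
    """Hypothetical client-side tracker of the gateway's stream deadlines."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._started = clock()
        self._last_chunk: Optional[float] = None

    def on_chunk(self) -> None:
        # Call for data events only; keep-alive comments are not counted.
        self._last_chunk = self._clock()

    def expired(self) -> Optional[str]:
        """Return the name of the exceeded limit, or None if all are within bounds."""
        now = self._clock()
        if now - self._started > STREAM_CAP_S:
            return "stream cap"
        if self._last_chunk is None:
            if now - self._started > FIRST_TOKEN_TIMEOUT_S:
                return "first token"
            return None
        if now - self._last_chunk > INTER_CHUNK_IDLE_S:
            return "inter-chunk idle"
        return None
```

Injecting the clock keeps the helper testable; in production the default time.monotonic is unaffected by wall-clock adjustments.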

If an error occurs before any byte has been streamed, you get a normal HTTP error response in your ingress format’s error shape instead of an event stream. Once streaming has started, the gateway never retries the request.
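Because pre-stream failures arrive as ordinary HTTP error responses, a client should check the status code before handing the body to an SSE parser. A minimal sketch follows. GatewayHTTPError and check_stream_response are hypothetical names, and read_body stands in for however your HTTP library reads a complete response body. It is called only on failure, so a successful stream is never consumed.

```python
import json
from typing import Any, Callable


class GatewayHTTPError(Exception):
    """Hypothetical exception for a failure reported before streaming began."""

    def __init__(self, status: int, body: Any) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status
        self.body = body


def check_stream_response(status: int, read_body: Callable[[], bytes]) -> None:
    """Raise GatewayHTTPError for a non-2xx status; otherwise leave the stream unread."""
    if 200 <= status < 300:
        return
    raw = read_body()
    try:
        # The error body uses the ingress format's error shape (JSON).
        body: Any = json.loads(raw.decode("utf-8"))
    except ValueError:
        body = raw.decode("utf-8", errors="replace")
    raise GatewayHTTPError(status, body)
```

Since the gateway does not retry once bytes have streamed, the client decides whether to resend a request that fails mid-stream. Keep in mind that the partial output has already been billed.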