Streaming
Set "stream": true on any gateway request. The response is a server-sent events
stream in the ingress format’s native protocol — the same events the format’s official
SDKs already parse.
Per-format protocols
Chat Completions
chat.completion.chunk objects, terminated by data: [DONE]:
data: {"id":"gen-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}
data: {"id":"gen-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"}}]}
data: {"id":"gen-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: {"id":"gen-…","object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":7,"total_tokens":19}}
data: [DONE]
Usage in streams: always
Token usage is reported on every request, streamed or not. For streaming, the final
usage chunk is always emitted — you do not need to send
stream_options: {"include_usage": true} (Chat Completions ingress accepts it for
compatibility; the behavior is always on). Usage includes prompt, completion, and
reasoning tokens, with cached vs. uncached prompt tokens broken out — see
Usage & Credits.
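As a rough illustration of consuming the Chat Completions stream described above, the sketch below accumulates the delta content and picks up the final usage chunk. It assumes the SSE body has already been split into text lines; the function name collect_stream and the sample list are hypothetical, and the sample simply replays the events from the example above.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[str], Optional[dict]]:
    """Return (text, finish_reason, usage) from Chat Completions SSE lines."""
    text_parts = []
    finish_reason = None
    usage = None
    for raw in lines:
        line = raw.strip()
        # Ignore blank separators and anything that is not a data field.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        # The usage chunk arrives with an empty choices list.
        if chunk.get("usage"):
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            text_parts.append(delta.get("content") or "")
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(text_parts), finish_reason, usage


# Sample input: the events shown in the Chat Completions example.
sample = [
    'data: {"id":"gen-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    'data: {"id":"gen-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"}}]}',
    'data: {"id":"gen-…","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    'data: {"id":"gen-…","object":"chat.completion.chunk","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":7,"total_tokens":19}}',
    "data: [DONE]",
]

text, finish_reason, usage = collect_stream(sample)
print(text, finish_reason, usage["total_tokens"])  # Hello stop 19
```

Because the usage chunk is always emitted, a client written this way can read usage without setting stream_options on the request.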
Keep-alives, timeouts, aborts
- The gateway sends SSE keep-alive comments every 15 seconds, so load balancers and proxies do not idle-close a healthy stream.
- The gateway allows up to 5 minutes for the first token, 120 seconds of idle time between chunks, and 60 minutes in total for any stream. Details in Limits & Timeouts.
- If your client aborts mid-stream, the tokens generated up to that point are still metered, billed, and recorded — the partial usage settles when the stream closes.
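If you parse the stream by hand rather than through an SDK, keep-alive comments have to be skipped. In the SSE format, a line that starts with a colon is a comment and carries no event data. The sketch below is one minimal way to filter them; the function name data_payloads, the stats dictionary, and the ": keep-alive" comment text are assumptions made for illustration, since the exact comment text the gateway sends is not specified here.

```python
from typing import Dict, Iterable, Iterator


def data_payloads(lines: Iterable[str], stats: Dict[str, int]) -> Iterator[str]:
    """Yield the value of each SSE data line, counting comment lines in stats."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            # SSE comment (used here as a keep-alive): count it and move on.
            stats["keepalives"] = stats.get("keepalives", 0) + 1
            continue
        if line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, one leading space after the colon is dropped.
            if value.startswith(" "):
                value = value[1:]
            yield value
        # Blank lines and other fields are ignored in this sketch.


# Hypothetical sample: two keep-alive comments around one data event.
sample_lines = [
    ": keep-alive",
    "",
    'data: {"x": 1}',
    ": keep-alive",
    "data: [DONE]",
]

stats: Dict[str, int] = {}
payloads = list(data_payloads(sample_lines, stats))
print(payloads, stats["keepalives"])
```

Counting the comments is optional; it is included only to make the keep-alive traffic visible when debugging an intermediary that closes idle connections.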
Errors that occur before any byte has streamed return a normal HTTP error response in your ingress format’s error shape. We never retry mid-stream.
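One practical consequence is that a client can check the HTTP status before it starts parsing events. The sketch below shows that ordering with a hypothetical helper, ensure_stream_ok, and a hypothetical exception type, GatewayHTTPError. It keeps the error body as raw text, because the error shape depends on the ingress format and is not assumed here.

```python
class GatewayHTTPError(Exception):
    """Hypothetical error type for failures reported before streaming starts."""

    def __init__(self, status: int, body: str) -> None:
        super().__init__(f"gateway returned HTTP {status}")
        self.status = status
        self.body = body  # raw error body in the ingress format's error shape


def ensure_stream_ok(status: int, body: str = "") -> None:
    """Raise if the response is an HTTP error; otherwise the caller may parse SSE."""
    if status >= 400:
        raise GatewayHTTPError(status, body)


# Hypothetical usage: a 429 is surfaced as an exception, a 200 passes through.
try:
    ensure_stream_ok(429, '{"error": "..."}')
    outcome = "stream"
except GatewayHTTPError as exc:
    outcome = f"error {exc.status}"

ensure_stream_ok(200)
print(outcome)  # error 429
```

Whether to retry a request that failed this way is left to the caller in this sketch.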